Use LangChain to Keep Up with Tech News Easily

In today’s fast-paced tech world, staying updated with the latest news is crucial. However, manually reading through numerous articles can be time-consuming. LangChain is an open-source framework for building applications with large language models, offering tools for development, production, and deployment. In this blog post, we’ll explore how to use LangChain to automate the summarization of tech news scraped from a website. Specifically, we’ll focus on summarizing the top 10 tech news articles from Lobsters.

Prerequisites

Before we begin, ensure you have the following:

  1. Python Environment: Make sure you have Python installed.

  2. Required Libraries: Install the necessary libraries using pip:

     pip install langchain langchain-community langchain-openai requests beautifulsoup4 validators
    
  3. DeepSeek API Key: Sign up for a DeepSeek API key and store it in a .env file. (Don’t commit this file to git; keeping the key out of the source code is more secure than hardcoding it.)
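
For reference, the .env file needs just one line. The variable is named OPENAI_API_KEY because we will call DeepSeek through an OpenAI-compatible endpoint (see Step 2); the value below is a placeholder, so use your own key:

OPENAI_API_KEY=your-deepseek-api-key-here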

Step-by-Step Guide

Step 1: Load Environment Variables

First, load the environment variables from the .env file. Note that the DeepSeek key is stored under the name OPENAI_API_KEY: we will talk to DeepSeek through an OpenAI-compatible endpoint, and LangChain’s OpenAI client reads that variable.

from dotenv import load_dotenv
import os

load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    print("OPENAI_API_KEY was not loaded from the .env file. Please check the .env file.")
else:
    print("OPENAI_API_KEY has been successfully loaded.")

Step 2: Set Up the DeepSeek Model

Set the DeepSeek API key as an environment variable and initialize the model. Here I use an open-source distilled model, DeepSeek-R1-Distill-Qwen-7B, served through SiliconFlow’s OpenAI-compatible endpoint (note the base_url below).

from langchain_openai import OpenAI

os.environ["OPENAI_API_KEY"] = api_key
llm = OpenAI(
    temperature=0.7,
    base_url="https://api.siliconflow.cn/v1",
    model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
)

Step 3: Define a URL Validation Function

Ensure that the URLs we scrape are valid.

from validators import url as url_validator

def validate_url(url):
    # Alias the import so this wrapper does not shadow it and recurse.
    # validators.url returns True for a valid URL and a falsy failure
    # object otherwise, so normalize the result to a bool.
    return bool(url_validator(url))
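
As a quick sanity check, validate_url returns True for a well-formed URL and False otherwise:

print(validate_url("https://lobste.rs/"))  # True
print(validate_url("not-a-url"))           # False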

Step 4: Load Data from a Web Page

Create a function to load data from a given URL using WebBaseLoader.

from langchain_community.document_loaders import WebBaseLoader

def load_data_from_web(url):
    documents = []
    if not validate_url(url):
        print(f"Invalid URL: {url}")
        return documents

    loader = WebBaseLoader(url)
    try:
        docs = loader.load()
        documents.extend(docs)
    except Exception as e:
        print(f"Error loading data from {url}: {e}")
    return documents

The load_data_from_web function loads the page at the given URL using WebBaseLoader from the langchain_community package and returns a list of LangChain Document objects (or an empty list if the URL is invalid or the fetch fails).
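
If you want to inspect what the loader returns, each Document carries a page_content string and a metadata dict. A quick sketch, using the Lobsters front page as an arbitrary example URL:

docs = load_data_from_web("https://lobste.rs/")
if docs:
    print(docs[0].metadata.get("source"))  # the URL the text came from
    print(docs[0].page_content[:200])      # first 200 characters of page text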

Step 5: Split Text into Chunks

Split the loaded text into smaller chunks so that each piece fits comfortably within the summarization model’s context window.

from langchain.text_splitter import CharacterTextSplitter

def split_text(documents):
    text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
    texts = text_splitter.split_documents(documents)
    return texts
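
The 100-character overlap keeps a little shared context across chunk boundaries, which helps the partial summaries stay coherent. To get a feel for the chunking, you can count the resulting pieces (a quick sanity check, reusing load_data_from_web from Step 4):

docs = load_data_from_web("https://lobste.rs/")
chunks = split_text(docs)
print(f"Split {len(docs)} document(s) into {len(chunks)} chunk(s)")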

Step 6: Generate Summaries

Use LangChain’s summarization chain to generate summaries from the text chunks.

from langchain.chains.summarize import load_summarize_chain

def generate_summary(texts):
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    try:
        summary = chain.invoke({"input_documents": texts})
        return summary["output_text"]
    except Exception as e:
        print(f"Error generating summary: {e.__class__.__name__}: {str(e)}")
        return None
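
The map_reduce chain type summarizes each chunk independently (the map step) and then merges those partial summaries into a final one (the reduce step). If you want more control over the tone or length, load_summarize_chain also accepts custom map and combine prompts; the sketch below is one possible variant, not part of the original pipeline:

from langchain.prompts import PromptTemplate

map_prompt = PromptTemplate(
    input_variables=["text"],
    template="Summarize this part of a tech article in 2-3 sentences:\n\n{text}",
)
combine_prompt = PromptTemplate(
    input_variables=["text"],
    template="Combine these partial summaries into one concise summary:\n\n{text}",
)

chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
)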

Step 7: Scrape Top News URLs from Lobsters

Scrape the top news URLs from the Lobsters website.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_lobsters_urls(base_url="https://lobste.rs/"):
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(base_url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        links = soup.select('.story_liner .u-url')[:10]
        urls = []
        for link in links:
            href = link.get('href')
            if not href:
                continue
            full_url = urljoin(base_url, href)
            if validate_url(full_url):
                urls.append(full_url)
        print(f"Fetched {len(urls)} URLs: {urls}")
        return urls
    except requests.RequestException as e:
        print(f"Error fetching news from {base_url}: {e}")
        return []

Here I select the top 10 stories from the lobste.rs front page. The CSS selector .story_liner .u-url matches each story’s title link. See the following line of code.

links = soup.select('.story_liner .u-url')[:10]

Step 8: Process News Articles

Combine the functions to load data, split text, and generate summaries for each news article.

def process_news(url):
    documents = load_data_from_web(url)
    if not documents:
        return None

    split_docs = split_text(documents)
    summary = generate_summary(split_docs)
    return summary

Step 9: Automate the Summarization Process

Finally, automate the summarization process for each of the top news articles.

if __name__ == "__main__":
    news_urls = get_lobsters_urls()
    for index, url in enumerate(news_urls, start=1):
        print(f"Processing news {index}: {url}")
        try:
            summary = process_news(url)
            if summary:
                print(f"Summary of news {index}:")
                print(summary)
            else:
                print(f"Failed to generate summary for news {index}")
        except Exception as e:
            print(f"Error processing news {index} from {url}: {e.__class__.__name__}: {str(e)}")
        print("-" * 80)

This main entry point loops over the scraped URLs and calls process_news for each article, which in turn calls generate_summary to produce the summary.

Conclusion

In this blog post, we’ve demonstrated how to automate the summarization of tech news using LangChain. By scraping news articles from Lobsters and leveraging DeepSeek’s language models, you can see how easy it is to use LangChain modules to load web data and generate summaries.

If you are keen to explore how an AI framework like LangChain can add a bit of intelligence or automation to your daily life, this is a good starting point. I hope you have enjoyed it so far! If you have any comments or feedback, please leave me a message.