Use LangChain to Keep Up with Tech News Easily
In today’s fast-paced tech world, staying updated with the latest news is crucial. However, manually reading through numerous articles can be time-consuming. LangChain is an open-source framework for building applications with large language models, offering tools for development, production, and deployment. In this blog post, we’ll explore how to use LangChain to automate the summarization of tech news scraped from a website. Specifically, we’ll focus on summarizing the top 10 tech news articles from Lobsters.
Prerequisites
Before we begin, ensure you have the following:
Python Environment: Make sure you have Python installed.
Required Libraries: Install the necessary libraries using pip:
pip install langchain langchain-community langchain-openai requests beautifulsoup4 validators python-dotenv
DeepSeek API Key: Sign up for a DeepSeek API key and store it in a .env file. (Don't push this file to git; keeping the key out of the code is more secure than hardcoding it.)
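For reference, a minimal .env file might look like this (the value is a placeholder; the variable is named OPENAI_API_KEY because that is what LangChain's OpenAI-compatible client reads by default):

# .env (never commit this file to version control)
OPENAI_API_KEY=sk-your-api-key-here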
Step-by-Step Guide
Step 1: Load Environment Variables
First, load the environment variables, including the API key, from a .env file. Note that the key is stored under the name OPENAI_API_KEY because LangChain's OpenAI-compatible client looks for that variable by default, even though the model we call is DeepSeek.
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
print("OPENAI_API_KEY was not loaded from the .env file. Please check the .env file.")
else:
print("OPENAI_API_KEY has been successfully loaded.")
Step 2: Setup the DeepSeek API Key
Set the DeepSeek API key as an environment variable and initialize the model. Here I use an open-source distilled model, DeepSeek-R1-Distill-Qwen-7B, served through SiliconFlow's OpenAI-compatible endpoint.
from langchain_openai import OpenAI

os.environ["OPENAI_API_KEY"] = api_key
llm = OpenAI(
    temperature=0.7,
    base_url="https://api.siliconflow.cn/v1",
    model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
)
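As an optional sanity check before wiring the model into the pipeline, and assuming the key and endpoint are valid, you can send it a trivial prompt:

print(llm.invoke("Say hello in one short sentence."))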
Step 3: Define a URL Validation Function
Ensure that the URLs we scrape are valid.
import validators

def validate_url(url):
    # validators.url returns a falsy ValidationFailure object on bad input,
    # so coerce the result to a plain bool
    return bool(validators.url(url))
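A quick check of the helper against a valid and an invalid URL:

print(validate_url("https://lobste.rs/"))  # True
print(validate_url("not-a-url"))           # False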
Step 4: Load Data from a Web Page
Create a function to load data from a given URL using WebBaseLoader.
from langchain_community.document_loaders import WebBaseLoader
def load_data_from_web(url):
documents = []
if not validate_url(url):
print(f"Invalid URL: {url}")
return documents
loader = WebBaseLoader(url)
try:
docs = loader.load()
documents.extend(docs)
except Exception as e:
print(f"Error loading data from {url}: {e}")
return documents
The load_data_from_web function loads the content of the web page at the given URL using the WebBaseLoader module from the LangChain community package.
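Each item returned is a LangChain Document holding the page text in page_content plus metadata such as the source URL. A quick way to inspect what was loaded (output varies by page):

docs = load_data_from_web("https://lobste.rs/")
if docs:
    print(docs[0].metadata)            # e.g. {'source': 'https://lobste.rs/', ...}
    print(docs[0].page_content[:200])  # first 200 characters of the page text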
Step 5: Split Text into Chunks
Split the loaded text into smaller chunks so it is easier for the summarization model to process.
from langchain.text_splitter import CharacterTextSplitter
def split_text(documents):
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
return texts
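The chunk_size of 1024 and chunk_overlap of 100 are reasonable defaults rather than tuned values; the overlap keeps some context shared across neighboring chunks. Continuing from the docs loaded above, you can inspect the result:

texts = split_text(docs)
print(f"Split into {len(texts)} chunks")
if texts:
    print(texts[0].page_content[:100])  # peek at the first chunk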
Step 6: Generate Summaries
Use LangChain’s summarization chain to generate summaries from the text chunks. The map_reduce chain type summarizes each chunk independently (map) and then combines the partial summaries into a final one (reduce), which works well for long articles.
from langchain.chains.summarize import load_summarize_chain
def generate_summary(texts):
chain = load_summarize_chain(llm, chain_type="map_reduce")
try:
summary = chain.invoke({"input_documents": texts})
return summary["output_text"]
except Exception as e:
print(f"Error generating summary: {e.__class__.__name__}: {str(e)}")
return None
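If an article is short enough to fit in the model's context window, the simpler "stuff" chain type, which sends all of the text in a single prompt, is an alternative (a sketch, not used in the pipeline above):

chain = load_summarize_chain(llm, chain_type="stuff")
summary = chain.invoke({"input_documents": texts})["output_text"]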
Step 7: Scrape Top News URLs from Lobsters
Scrape the top news URLs from the Lobsters website.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
def get_lobsters_urls(base_url="https://lobste.rs/"):
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(base_url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        # Each story on the Lobsters front page exposes its link via .story_liner .u-url
        links = soup.select('.story_liner .u-url')[:10]
        urls = []
        for link in links:
            href = link.get('href')
            if not href:
                continue
            full_url = urljoin(base_url, href)
            if validate_url(full_url):
                urls.append(full_url)
        print(f"Fetched {len(urls)} URLs: {urls}")
        return urls
    except requests.RequestException as e:
        print(f"Error fetching news from {base_url}: {e}")
        return []
Here I select the top 10 stories from the lobste.rs front page. See the following line of code.
links = soup.select('.story_liner .u-url')[:10]
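Note that the '.story_liner .u-url' selector depends on Lobsters' current front-page markup, so it may need updating if the site changes. A quick way to sanity-check what it matches (a standalone sketch, reusing the imports above):

response = requests.get("https://lobste.rs/", headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.select('.story_liner .u-url')[:3]:
    print(link.get('href'), '->', link.get_text(strip=True))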
Step 8: Process News Articles
Combine the functions to load data, split text, and generate summaries for each news article.
def process_news(url):
documents = load_data_from_web(url)
if not documents:
return None
split_docs = split_text(documents)
summary = generate_summary(split_docs)
return summary
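To try the pipeline end to end on a single article before automating it, you can call process_news directly (the URL below is only a placeholder):

summary = process_news("https://example.com/some-article")
print(summary if summary else "No summary produced")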
Step 9: Automate the Summarization Process
Finally, automate the summarization process for each of the top news articles.
if __name__ == "__main__":
    lobsters_urls = get_lobsters_urls()
    for index, url in enumerate(lobsters_urls, start=1):
print(f"Processing news {index}: {url}")
try:
summary = process_news(url)
if summary:
print(f"Summary of news {index}:")
print(summary)
else:
print(f"Failed to generate summary for news {index}")
except Exception as e:
print(f"Error processing news {index} from {url}: {e.__class__.__name__}: {str(e)}")
print("-" * 80)
This “main” entry point calls the process_news function for each URL, which is responsible for generating the summary of the news article; process_news in turn calls the generate_summary function to produce the final text.
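Assuming you saved all of the above as a single script, say summarize_news.py (the filename is arbitrary), run it from the shell:

python summarize_news.py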
Conclusion
In this blog post, we’ve demonstrated how to automate the summarization of tech news using LangChain. By scraping news articles from Lobsters and leveraging DeepSeek’s language models, you can see how easy it is to use LangChain modules to load web data and generate summaries.
If you are keen on using an AI framework like LangChain to do something intelligent or automate part of your daily life, this is a good starting point. I hope you enjoyed it! If you have any comments or feedback, please leave me a message.