Introduction

In this tutorial, we will build a Python program that searches Wikipedia for a given topic, extracts the text of the top result with Trafilatura, and generates a summary with OpenAI. We will use the LangChain library to connect these steps into a single pipeline.

Required Libraries

We will be using the following external libraries:

  • loguru
  • trafilatura
  • openai
  • langchain

Before we dive in, make sure all of these libraries are installed on your system, for example via pip install loguru trafilatura openai langchain.

# standard library
import os
import json
import re
from typing import Dict, List

# third-party
from loguru import logger
import trafilatura

from langchain.chains import LLMChain, SimpleSequentialChain
from langchain.chains.base import Chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts.prompt import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

Authenticating with OpenAI

We will be using OpenAI’s API to generate a summary of the text. To access the API, you will have to create an account and obtain an API key. Once you have the key, you can store it in a secrets.json file and read it into the program as shown below:

if __name__ == "__main__":
    logger.info("Starting app...")
    # read settings
    with open("secrets.json", "r") as jsonfile:
        secrets = json.load(jsonfile)

    # set openai key
    os.environ["OPENAI_API_KEY"] = secrets["openai_api_key"]

The secrets.json file should contain your key under the openai_api_key field, for example: {"openai_api_key": "sk-..."}.

Creating a URL Chain

Using the LangChain library, we can create a chain that takes the topic the user wants to look up and asks OpenAI to produce the URL of the most relevant Wikipedia article.

    # init chat object
    chat = ChatOpenAI(temperature=0)

    # set up the url chain
    url_chain_human_prompt_template = HumanMessagePromptTemplate(
        prompt=PromptTemplate(
            template=URL_TEMPLATE["text"],
            input_variables=URL_TEMPLATE["input_variables"],
        )
    )
    url_chain_chat_prompt_template = ChatPromptTemplate.from_messages(
        [url_chain_human_prompt_template])

    url_chain = LLMChain(
        llm=chat,
        prompt=url_chain_chat_prompt_template,
        output_key="url",
        verbose=True,
    )

This chain takes the topic as its input, asks OpenAI for the URL of the matching Wikipedia article, and exposes the result under the url output key so the next chain can consume it.
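The URL_TEMPLATE dictionary referenced above is not defined anywhere in this tutorial. A minimal sketch might look like the following; the exact prompt wording is an assumption and can be adjusted to taste:

```python
# Hypothetical definition of URL_TEMPLATE, assumed by the url chain above.
# The chain fills in {topic} with whatever the user typed.
URL_TEMPLATE = {
    "text": (
        "Return only the URL of the English Wikipedia article that best "
        "matches this topic: {topic}"
    ),
    "input_variables": ["topic"],
}
```

Because SimpleSequentialChain passes a single value between chains, the template must declare exactly one input variable.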

Creating a custom Chain for web scraping

We will use Trafilatura to extract the text from the Wikipedia page corresponding to the URL generated in the previous step. The TrafilaturaChain defined in the following code block uses the trafilatura library to extract and clean the text, splits it into paragraphs of at most 500 characters, and keeps the first ten paragraphs.

class TrafilaturaChain(Chain):
    """Custom chain that downloads a URL and returns its cleaned main text."""

    input_key: str = "url"  # :meta private:
    output_key: str = "text"  # :meta private:

    @property
    def input_keys(self) -> List[str]:
        return [self.input_key]

    @property
    def output_keys(self) -> List[str]:
        return [self.output_key]

    def _get_text(self, url):
        # download the page and extract the main content as JSON
        downloaded = trafilatura.fetch_url(url)
        extracted = trafilatura.extract(
            downloaded, output_format="json", include_links=False)
        if extracted is None:
            raise ValueError(f"Could not extract text from {url}")
        result = json.loads(extracted)
        text = result["text"].replace("\n", " ")
        # keep only the first ten 500-character paragraphs
        paragraphs = self._get_paragraphs_from_text(text, 500)
        return " ".join(paragraphs[0:10])

    def _get_paragraphs_from_text(self, text, max_length):
        # split into sentences, then group sentences into paragraphs
        # of at most max_length characters
        sentences = re.findall(r"[^.!?]+[.!?]", text)
        current_paragraph = ""
        current_length = 0
        paragraphs = []
        for sentence in sentences:
            if current_length + len(sentence) <= max_length:
                current_paragraph += sentence.strip() + " "
                current_length += len(sentence)
            else:
                paragraphs.append(current_paragraph.strip())
                current_paragraph = sentence.strip() + " "
                current_length = len(sentence)
        paragraphs.append(current_paragraph.strip())
        return paragraphs

    def _call(self, inputs: Dict[str, str]) -> Dict[str, str]:
        return {self.output_key: self._get_text(inputs["url"])}

trafilatura_chain = TrafilaturaChain()

The TrafilaturaChain takes in the URL generated by the url_chain and outputs a cleaned version of the corresponding Wikipedia page text.
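To see the paragraph-grouping logic in isolation, without any network access, the helper can be exercised on a short made-up string (the sample text and the standalone function name here are illustrative, not part of the tutorial's code):

```python
import re

def get_paragraphs_from_text(text, max_length):
    """Group sentences into paragraphs of at most max_length characters,
    using the same logic as TrafilaturaChain._get_paragraphs_from_text."""
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    current_paragraph = ""
    current_length = 0
    paragraphs = []
    for sentence in sentences:
        if current_length + len(sentence) <= max_length:
            current_paragraph += sentence.strip() + " "
            current_length += len(sentence)
        else:
            paragraphs.append(current_paragraph.strip())
            current_paragraph = sentence.strip() + " "
            current_length = len(sentence)
    paragraphs.append(current_paragraph.strip())
    return paragraphs

print(get_paragraphs_from_text("One. Two! Three? Four.", 10))
# → ['One. Two!', 'Three?', 'Four.']
```

Note that a single sentence longer than max_length still ends up in a paragraph of its own, so the limit is a target rather than a hard guarantee.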

Creating a Summary Chain

The summary chain defined in the following code block uses OpenAI to generate a summary of the cleaned text produced by the TrafilaturaChain.

    # set up the summary chain
    summary_chain_human_prompt_template = HumanMessagePromptTemplate(
        prompt=PromptTemplate(
            template=SUMMARY_TEMPLATE["text"],
            input_variables=SUMMARY_TEMPLATE["input_variables"],
        )
    )
    summary_chain_chat_prompt_template = ChatPromptTemplate.from_messages(
        [summary_chain_human_prompt_template])

    logger.info("Composing chain...")
    summary_chain = LLMChain(
        llm=chat,
        prompt=summary_chain_chat_prompt_template,
        output_key="summary",
        verbose=True,
    )

The summary chain takes the cleaned text produced by the TrafilaturaChain, asks OpenAI to summarize it, and exposes the result under the summary output key.
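Like URL_TEMPLATE, the SUMMARY_TEMPLATE dictionary is referenced but never defined in this tutorial. A minimal sketch, with assumed prompt wording, might be:

```python
# Hypothetical definition of SUMMARY_TEMPLATE, assumed by the summary chain.
# {text} is filled in with the cleaned article text from the previous step.
SUMMARY_TEMPLATE = {
    "text": "Summarize the following text in a few sentences:\n\n{text}",
    "input_variables": ["text"],
}
```

Again, exactly one input variable is required so that SimpleSequentialChain can pipe the TrafilaturaChain's output straight into it.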

Creating the Sequential Chain

Finally, we use the SimpleSequentialChain class to connect the individual chains (url_chain, trafilatura_chain, and summary_chain) into a single chain.

    simple_sequential_chain = SimpleSequentialChain(
        chains=[url_chain, trafilatura_chain, summary_chain],
        verbose=True
    )
    logger.info("Fetching user input...")
    user_prompt = input("\n What do you want to search in Wikipedia?: ")
    simple_sequential_chain.run(user_prompt)

The SimpleSequentialChain takes in the user’s input, runs it through each of the individual chains in order, and produces the final summary.

Conclusion

In this tutorial, we learned how to create a Python program that searches for a topic on Wikipedia, extracts the corresponding text, and generates a summary. We used OpenAI and Trafilatura to perform the text processing and LangChain to connect the different steps into a single pipeline. By following this tutorial and tweaking the prompts and parameters to your liking, you can create your own customized summary generator for any topic on Wikipedia.
