Blog posts are a key component in learning new technologies across the web. With OpenAI’s
GPT, we can take this a step further by summarizing blog posts into a few sentences,
allowing quicker consumption while retaining key concepts.
In this article, we will learn how to scrape blog post content and summarize it using OpenAI’s
GPT via TypeScript.
Before getting started, please ensure you have an OpenAI account. If not, you can sign up on their
website here. Once signed up, take note of your API key, as we’ll need it later.
This project will be using Node.js 18.x. You can check your Node.js version by
running node -v within the terminal.
Let’s begin by initializing our project and installing dependencies:
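A minimal setup might look like the following. The exact package list is an assumption based on the libraries this article references (openai, cheerio, user-agents, gpt-3-encoder), plus dotenv for loading the .env file:

```shell
# Initialize the project
npm init -y

# Install runtime dependencies
npm install openai cheerio user-agents gpt-3-encoder dotenv

# Install TypeScript tooling as dev dependencies
npm install --save-dev typescript ts-node @types/node
```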
Now that we have our project set up, let’s create a .env file and place our secret OpenAI API key
in it.
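For example (the value below is a placeholder — use your real key):

```
OPENAI_API_KEY=your-openai-api-key
```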
Next, create a tsconfig.json file at the root of our project. This will allow us to leverage
TypeScript within our project.
You can read about each of these options in more detail here.
Lastly, create an index.ts file at the root of our project. Within this file, we can load our
.env file and initialize the OpenAI API client.
In order to summarize a blog post, we must first capture its contents. Start by using the Fetch
API to get all HTML content from the blog post’s webpage.
We must also pass a User-Agent header with the request, as certain websites will block requests
that omit one. We can generate one via the user-agents library.
Using Cheerio, the blog post content must be extracted from the HTML. This is
essential, as we only want to summarize the blog post content and not the entire page.
Web scraping can be tricky, as each website is built differently. You may need to adjust the
selectors below to fit your needs.
Also, OpenAI’s GPT has a maximum number of tokens that can be passed into it. This
means we must truncate our content to fit within that limit. Thankfully, we can leverage
gpt-3-encoder to do all the heavy lifting.
For this example, we will capture the first 8000 tokens. The maximum token count will vary based
on the GPT model you’re using. I recommend GPT-4, as it has a much higher limit.
In very few lines of code, we were able to pull all content from a blog post and present a detailed
summary of that content. As mentioned earlier, this allows quick consumption while retaining key
concepts.
You can see this in action on my personal project, feedjoy. Each blog post on the site includes a
summary of its content.