Finding The Most Important Sentences Using NLP and TF-IDF
We’re going to create a summary of BBC News Articles and place them at the top using a Firefox extension. This article is about the gnarly algorithm Term Frequency-Inverse Document Frequency (TF-IDF). We’re going to create a real-world usage in the form of a Firefox extension. I know what you’re thinking. “TF-IDF? Yawn 😴” but bare with me, it’s quite interesting!
When we’re done, it’ll look like this:
News article found here.
Red box highlights the most important sentences. The red box is not in the final product, is only used for illustration purposes.
Term frequency * Inverse Document Frequency
Don’t worry, the name of the algorithm makes me fall asleep every time I hear it said out loud too. This algorithm is 2 algorithms multiplied together. Let’s see how both of these work:
Term Frequency
Term frequency (TF) is how often a word appears in a document, divided by how many words there are.
Let’s say you’re reading a news article on Brexit. The word “Brexit” will appear a lot, so the term frequency of the word “Brexit” is high.
Quite often, we would want to build a dictionary (hashmap) of term frequencies alongside the term. Like {word: term frequency of that word} and then iterate through this dictionary to find out which word appears the most times.
Now, what if I told you that the term frequency dictionary would look a little like this:
{"and": 0.87, "the": 0.73}
You can see how these common English words aren’t useful to us. Of course, most English texts have these words in them, but we call English words like these stopwords. Stopwords usually refer to the most common words in a language, although there isn’t one singular definition. You have to choose stopwords per usage. You have to decide on what words to use. Before processing some text you’ll usually want to remove stopwords to better process the text.
Words with capital letters in them differ from words without capitals. In programming, “Africa” and “africa” are two different things. Because of this, we want to turn everything into lowercase or uppercase to better process our text. We’re going to turn all words into lowercase.
Given a string, we want to remove stop words and turn it into lowercase. Our extension will give us a string of all text on a BBC news article. Don’t worry about where we get the text from yet, that’s done later in the Firefox extension section. For now, assume we have text that looks like this:
… These are external links and will open in a new windowA neat feature of many modern laptops is the ability to power them up through the USB port. Unlike the rectangular USB ports of old, the newer type - USB-C - can carry enough power to charge your machine. That’s great news: it means …
The above text is shortened to prevent the reader from falling asleep. GitHub freaks out on this page when I try to insert the stopwords_array, so I had to split it up like this.
function prettify(document){
// Turns an array of words into lowercase and removes stopwords
const stopwords = ["a", "share", "linkthese", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any","", "are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves", "this"];
// turn document into lowercase words, remove all stopwords
var document = document.replace(/[.,]/g, '');
let document_in_lowercase = document.split(" ").map(function(x){ return x.toLowerCase() });
return document_in_lowercase.filter( x => !stopwords.includes(x) );
}
This is the function which will “prettify” our documents. Line 3 is an array of stopwords I found on StackOverflow. I added “share” and “linkthese” since these are commons words in the news article we don’t want.
This is the function which will “prettify” our documents. Line 3 is an array of stopwords I found on StackOverflow. I added “share” and “linkthese” since these are commons words in the news article we don’t want.
Line 5 is Regex. The square brackets mean or. [,.] means “activate on a comma or a full stop ”. /g means global. Once you find one *‘,’ *or *‘.’ *don’t stop, continue searching the string. The empty string is what we’re replacing it with. If we find a full stop or a comma, replace it with nothing — delete it. This is because the words “Africa.” and “Africa” would be classified as two different words without this.
Line 4 splits the document up into separate words. The map function applies a function to every element in an array. Once the string is split up into an array of words we apply the toLowerCase() method to every element. It makes every word lowercase.
The arrows are Arrow Functions (Lambda / Anonymous functions).
We then return the lowercase words once we’ve filtered out stop words. Filter() creates a new array with only the elements for which the function inside returns True.
If a word is a stop word, it’ll result in True which means we’ll get a new array of only the stopwords in the document. We use the negation operator “!” to get the opposite, which is what we want. To return a list of words with no stopwords in it.
Now we want to count how many times each word appears in the document. This will be useful for both Term Frequency and Inverse Document Frequency. First, we want to get all the unique words from an array of words.
function uniqueWords(words){
const unique_words_set = new Set(words);
return unique_words = Array.from(unique_words_set);
}
We convert the array into a set because sets have no repetitions. This lets us get only the unique words in the array. Sets also don’t have an order, so we can’t use array indices to access elements. We need to turn it straight back into an array. For more about Set Theory, check out this article I wrote.
Okay, now time to count how many times a word appears in the words array.
function countWords(words){
// returns a dictionary of {WORD: COUNT} where count is
// how many times that word appears in "words".
const unique_words = uniqueWords(words);
let dict = {};
// for every single unique word
for (let i = 0; i <= unique_words.length - 1; i++){
dict[unique_words[i]] = 0
// see how many times this unique word appears in all words
for (let x = 0; x <= words_without_stopwords.length -1; x++){
if (unique_words[i] == words[x]){
dict[unique_words[i]] = dict[unique_words[i]] + 1;
}
}
}
return dict;
}
This function goes through every single unique word and counts how many times that word appears in the array of words. The Term frequency function is quite long, so I’m going to break it down.
function termFrequency(document){
// calculates term frequency of each sentence
words_without_stopwords = prettify(document);
// gets rid of trailing spaces
const sentences = document.split(".").map(item => item.trim());
sentences[0] = sentences[0].substring(146);
const TFVals = countWords(words_without_stopwords)
const unique_words = uniqueWords(words_without_stopwords);
// actually makes it TF values according to formula
for (const [key, value] of Object.entries(TFVals)){
TFVals[key] = TFVals[key] / words_without_stopwords.length;
}
Line 6 divides the document up into sentences. Sometimes sentences have white space before them. “Brandon. Dogs.” Has whitespace before “Dogs.” we apply the trim() method to each item to get rid of these trailing whitespace.
Regarding line 7, The first 146 characters of the first word is social media links. The rest of that word is a title or sub-title. Here, look:
Share this withEmailFacebookMessengerMessengerTwitterPinterestWhatsAppLinkedInCopy this linkThese are external links and will open in a new windowRyanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding".
This is annoying, as the title is an essential part of the story and needs to be taken into account. So we remove the first 146 characters of the first word to get:
Ryanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding"
Remember this formula?
The Term Frequency (TF) of a term, t, and a document, d.
The variable “TFVals” is calculating this formula. If we run the sentence “Hello, my name is Brandon. Brandon Brandon. The elephant jumps over the moon” through the term frequency function, we’ll get something that looks like this:
TF only returns 1 “Brandon”, so not every Brandon has a TF score.
For this example, bare in mind that TF only returns 1 Brandon because we are calculating the singular term (word) frequency. We have the term frequencies of words, but we want to calculate the most important sentences, not words. To do that, we go through every single sentence and see what words come up in that sentence which are in TFVals.
TF only returns 1 “Brandon”, so not every Brandon has a TF score (like in the image above). For this example, bare in mind that TF only returns 1 Brandon because we are calculating the singular term (word) frequency.
We just need to add them all up and divide by how many words we have. Since we’re only adding up the TF values of non stop words, it’s only fair if we divide by how many non stop words there are, instead of how many words there are in a sentence. If we don’t divide by how many words we have, long sentences have an advantage over shorter ones.
This is what line 20 onwards does below. We go through every single sentence and calculate the TF values of each sentence, just as we did above. New code starts at line 20.
function termFrequency(document){
// calculates term frequency of each sentence
words_without_stopwords = prettify(document);
// gets rid of trailing spaces
const sentences = document.split(".").map(item => item.trim());
sentences[0] = sentences[0].substring(146);
const TFVals = countWords(words_without_stopwords)
const unique_words = uniqueWords(words_without_stopwords);
// actually makes it TF values according to formula
for (const [key, value] of Object.entries(TFVals)){
TFVals[key] = TFVals[key] / words_without_stopwords.length;
}
// splits it up into sentences now
var TFSentences = {};
// for every sentence
for (let i = 0; i <= sentences.length - 1; i ++){
// for every word in that sentence
let sentence_split_words = sentences[i].split(" ");
// get the assiocated TF values of each word
// temp.add is the "TF" value of a sentence, we need to divide it at the end
let temp_add = 0.0;
let words_no_stop_words_length = prettify(sentences[i]).length;
for (let x = 0; x <= sentence_split_words.length - 1; x++){
// get the assiocated TF value and add it to temp_add
if (sentence_split_words[x].toLowerCase() in TFVals){
// adds all the TF values up
temp_add = temp_add + TFVals[sentence_split_words[x].toLowerCase()];
}
else{
// nothing, since it's a stop word.
}
}
// TF sentences divide by X number of items on top
TFSentences[sentences[i]] = temp_add / words_no_stop_words_length;
}
return TFSentences;
}
And that’s it. But we have a problem with using only Term Frequency. As you may have seen earlier, “Brandon Brandon” was the highest scoring TF out of all 3 sentences we looked at.
Popularity isn’t enough. We don’t want sentences that have the most keywords in them as they may not make sense, or they may be repetitions of each other. Such as in the “Brandon” Brandon” sentence. It has a high TF value but doesn’t hold much content.
It doesn’t contain much information and isn’t useful. We want a sentence that is both rare, unique and contains keywords common in the article. This is where inverse document frequency comes in.
Inverse document frequency
Term frequency is how common a word is, inverse document frequency (IDF) is how unique or rare a word is. The formula for IDF is:
t is the term and d is the documents. We have multiple documents, we’re treating each sentence as its own document.
IDF used over many documents, whereas TF is built for one document. You can decide what a document is. In this article, each sentence is its own document.
The first few steps of IDF is the same as TF. We prettify the document, count the words in the document and get all the unique words.
function inverseDocumentFrequency(document){
// calculates the inverse document frequency of every sentence
const words_without_stopwords = prettify(document);
const unique_words_set = uniqueWords(words_without_stopwords);
const sentences = document.split(".").map(item => item.trim());
sentences[0] = sentences[0].substring(146);
const lengthOfDocuments = sentences.length;
// prettifys each sentence so it doesn't have stopwords
const wordCountAll = countWords(words_without_stopwords);
// counts words of each sentence
// as each sentence is a document
wordCountSentences = [];
for (let i = 0; i <= lengthOfDocuments - 1; i ++){
wordCountSentences.push(countWords(prettify(sentences[i])));
}
// calculate TF values of all documents
let IDFVals = {};
// how many times that word appears in all sentences (documents)
wordCountSentencesLength = wordCountSentences.length;
// for every unique word
for (let i = 0; i <= unique_words_set.length - 1; i++){
let temp_add = 0;
// count how many times unique word appears in all sentences
for (let x = 0; x <= wordCountSentencesLength - 1; x++){
if (unique_words_set[i] in wordCountSentences[x]){
temp_add =+ 1;
}
}
IDFVals[unique_words_set[i]] = Math.log10(wordCountAll[unique_words_set[i]] / temp_add);
}
Lines 1–6 is nothing new. The for loop on line 17 loops through every sentence in the document. Since each sentence is a new ‘document’, we need to count the words of each sentence individually. We have to prettify them to get rid of the stopwords and turn them into an array of words. We push the wordcount object of each new sentence into wordCountSentences.
There are 3 sentences here. We don’t care how many times Brandon appears over the whole document. I tried to pick the words to give you something interesting without making the maths too large, but in the process I developed narcissism.
We’re now going to go through every single word and count how many times that word appears in every sentence and calculate the IDF score using the below formula.
I use log 10 here. Note: all logarithms are constantly scaled, so it doesn’t matter much on what logarithm you use.
Now we just do this for every non stop word.
New code starts on line 25.
function inverseDocumentFrequency(document){
// calculates the inverse document frequency of every sentence
const words_without_stopwords = prettify(document);
const unique_words_set = uniqueWords(words_without_stopwords);
const sentences = document.split(".").map(item => item.trim());
sentences[0] = sentences[0].substring(146);
const lengthOfDocuments = sentences.length;
// prettifys each sentence so it doesn't have stopwords
const wordCountAll = countWords(words_without_stopwords);
// counts words of each sentence
// as each sentence is a document
wordCountSentences = [];
for (let i = 0; i <= lengthOfDocuments - 1; i ++){
wordCountSentences.push(countWords(prettify(sentences[i])));
}
// calculate TF values of all documents
let IDFVals = {};
// how many times that word appears in all sentences (documents)
wordCountSentencesLength = wordCountSentences.length;
// for every unique word
for (let i = 0; i <= unique_words_set.length - 1; i++){
let temp_add = 0;
// count how many times unique word appears in all sentences
for (let x = 0; x <= wordCountSentencesLength - 1; x++){
if (unique_words_set[i] in wordCountSentences[x]){
temp_add =+ 1;
}
}
IDFVals[unique_words_set[i]] = Math.log10(wordCountAll[unique_words_set[i]] / temp_add);
}
Now we want to get the IDF Values of all the sentences, we use the same code from TF here but replace some things to make it work.
If I’m being truthful with you, I did a simple “find and replace” the variables. Instead of “TF” in the comments, I replaced them with IDF. Instead of “TFVals”, I replaced it with “IDFVals”. Nothing important has happened here, so feel free to skip this part. New code starts on line 38.
function inverseDocumentFrequency(document){
// calculates the inverse document frequency of every sentence
const words_without_stopwords = prettify(document);
const unique_words_set = uniqueWords(words_without_stopwords);
const sentences = document.split(".").map(item => item.trim());
sentences[0] = sentences[0].substring(146);
const lengthOfDocuments = sentences.length;
// prettifys each sentence so it doesn't have stopwords
const wordCountAll = countWords(words_without_stopwords);
// counts words of each sentence
// as each sentence is a document
wordCountSentences = [];
for (let i = 0; i <= lengthOfDocuments - 1; i ++){
wordCountSentences.push(countWords(prettify(sentences[i])));
}
// calculate TF values of all documents
let IDFVals = {};
// how many times that word appears in all sentences (documents)
wordCountSentencesLength = wordCountSentences.length;
// for every unique word
for (let i = 0; i <= unique_words_set.length - 1; i++){
let temp_add = 0;
// count how many times unique word appears in all sentences
for (let x = 0; x <= wordCountSentencesLength - 1; x++){
if (unique_words_set[i] in wordCountSentences[x]){
temp_add =+ 1;
}
}
IDFVals[unique_words_set[i]] = Math.log10(wordCountAll[unique_words_set[i]] / temp_add);
}
let IDFSentences = {};
// for every sentence
for (let i = 0; i <= lengthOfDocuments - 1; i ++){
// for every word in that sentence
let sentence_split_words = sentences[i].split(" ");
// get the assiocated IDF values of each word
// temp.add is the "IDF" value of a sentence, we need to divide it at the end
let temp_add = 0.0;
let words_no_stop_words_length = prettify(sentences[i]).length;
for (let x = 0; x <= sentence_split_words.length - 1; x++){
// if the word is not a stopword, get the assiocated IDF value and add it to temp_add
if (sentence_split_words[x].toLowerCase() in IDFVals){
// adds all the IDF values up
temp_add = temp_add + IDFVals[sentence_split_words[x].toLowerCase()];
}
else{
// nothing, since it's a stop word.
}
}
IDFSentences[sentences[i]] = temp_add / words_no_stop_words_length;
}
return IDFSentences;
}
Now that we have written code to calculate the IDF value of sentences; we have completed the formula from earlier. “is a dog” has the highest IDF value.
We now know how unique or rare a sentence is. This isn’t that useful since we want the sentence to be information-rich too. We want some way to combine the popularity of TF with the uniqueness of IDF. This leads us to our next section…
TF-IDF revisited
“Are we done yet? Can I wake up now?” Yes, very nearly done!
We now have TF and IDF functions implemented. The only thing left to do is to multiply them together.
The objects TF and IDF both stem from the same data, so TF isn’t going to contain something that isn’t in IDF. Because of this, we can iterate through one object and use the same key. We multiply the value in TFVals by the value from in IDFVals.
Our next step is to calculate the 3 most important sentences in our TF-IDF Object. Iterating over the [key, value] of the object with a couple of if statements works perfectly.
You’ll see at the bottom we return the formatted string. We format it so it looks nice when we insert it into the webpage. Each
is a line break, a space in the text. The black dots are bullet points. We’re now going to implement this algorithm into a Firefox extension. 🔥🦊
Getting & changing text in a BBC news article
Go to any BBC news article, right-click and press “inspect element”. You’ll see a nice box at the bottom of the screen. Use the element selector tool in the top left-hand corner and hover over the article. We can see that the whole article is encompassed within a CSS class of ‘story-body’.
If we go further in, we can see that all the actual text in the article is encompassed by paragraph tags, inside this CSS class.
We’re going to use JQuery to select the text.
This line selects all the
tags within the story-body class. Now we want to get the text, we do this by applying the method .text().
We want to add our text to the top of the article. JQuery has a method called prepend which lets us prepend data to the top of an object.
And we’re done! We can now identify the most important sentences in a BBC News article and display them right at the top. Just time to turn it into an extension.
Firefox extension basics
Firefox extensions have 2 main parts. The Javascript you wrote and the manifest.json file which tells Mozilla what your extension does. We’re going over manifest.json now.
manifest_version tells Firefox what version of manifest you are using. *Name *tells Firefox what the name of your extension is. *Version *tells Firefox what version number your extension is. These 3 are mandatory.
*description *tells Firefox what your extension does.
*content_scripts *tells Firefox what scripts to load when the URL matches what you have inputted. In order for the scripts you have specified to run, the current URL must match at least one of the URLs you have specified. You can use 2 special characters here:
- “*****” Matches zero or more characters. In this instance, I don’t know whether the user will load HTTP or HTTPS so I have it step to load both. I also don’t know what exact article the user will look at, so I have it set to activate on any article.
- “?” matches exactly one character.
For example: "*na?i"
would match "illuminati"
and "annunaki"
, but not "sagnarelli"
.
Since we’re going to be using jQuery, we’ll import the jQuery JS file as well into the website, before our script executes. You can grab the jQuery file from here. Copy and paste into a file named “jquery.js”.
Enter “about:debugging” into your Firefox URL to load this page:
From here, click *“Load Temporarary Add-on…” *and then click any of the files in the extension. Once you do, you should see this:
Mozilla have a nice article on the basics of Firefox extensions, here.
Now load any BBC news article to play with it!
Conclusion
You’ve now seen the awesome power of TF-IDF and a real-world application for it. This idea came to me because I have email anxiety. I get so nervous about reading emails that I wanted a quick summary of them to calm my thoughts. Alas, this is my first time ever writing Javascript. I started out on something easier like BBC news articles.
Here are some ways you can improve upon this code, if you so wish:
- Dynamically select how many sentences in a summary you want. You can find out the average TF*IDF value in the whole article and anything over X you can include in the summary. This makes it so long articles are treated as equally as shorter articles.
- Extending this to work on any other websites you wish.