A nice challenge over on Reddit: summarising text. It’s a simple idea. Take in some text, one or more paragraphs, and automatically pull out the two or three key sentences that give a good overall summary of its contents.
The key is to ignore a small list of common words like “and”, “or”, “I”, “of”, etc., count the occurrences of each remaining unique word, and then score each sentence based on the words it contains. Higher-scoring sentences naturally contain more high-scoring words, and are therefore more relevant to summarising the text than sentences made up of lower-scoring words.
Here the Dictionary class built in to C# is a major boon. Previously I’d have had to manipulate multi-dimensional arrays and handle conversion between text and integers in order to represent the link between a word and its score:
//wordScores[index, 0] = word, wordScores[index, 1] = score int as string
String[,] wordScores = new string[maxWords, 2]; //maxWords is an assumed upper bound on unique words
One of the benefits of the Dictionary is that its two parts, the index (key) and the value, can be of different types. So in this problem the array above can be better represented as
Dictionary<String, int> wordScores = new Dictionary<String, int>();
This allows us to look up a word by its key and simply add one to its value every time we find it:
So here my Dictionary is called parseWords. My input text is split into individual words. Each word is compared against the ignore list and then checked against the dictionary. If it’s in the dictionary we increase its count, and if it’s not we simply add it to the dictionary.
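As a minimal sketch of the counting loop just described, something like the following would work. The class name WordCounter, the method CountWords, the separator list and the case-insensitive comparison are all assumptions made for illustration, and the ignore list holds only the four example words mentioned earlier.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Hypothetical ignore list: only the example words from the text.
    private static readonly HashSet<string> IgnoreWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "or", "I", "of" };

    public static Dictionary<string, int> CountWords(string text)
    {
        // Assumption: "The" and "the" are treated as the same word.
        var parseWords = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // Assumed tokenisation: split on whitespace and basic punctuation.
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };
        string[] words = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            if (IgnoreWords.Contains(word))
                continue;                 // common word: skip it

            if (parseWords.ContainsKey(word))
                parseWords[word] += 1;    // seen before: increase the count
            else
                parseWords.Add(word, 1);  // first occurrence: add it with a count of one
        }

        return parseWords;
    }
}
```

Calling WordCounter.CountWords on the input text returns the word-to-count dictionary that the scoring step can then read from.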
This then allows us to quickly parse each sentence against the created parseWords dictionary and score it based on the found word count:
So, this is a method in a class that stores the sentences, where each sentence has been broken down into an array of words. All we’re saying here is: find each word of the sentence in the dictionary and add that word’s score to the total sentence score.
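A rough sketch of such a scoring method might look like this. The Sentence class, its members and the ScoreAgainst method are hypothetical names chosen for the example, and the word splitting is an assumed rule rather than anything specified above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one sentence and its score.
public class Sentence
{
    public string Text { get; }
    public string[] Words { get; }
    public int Score { get; private set; }

    public Sentence(string text)
    {
        Text = text;
        // Assumed tokenisation: split on whitespace and basic punctuation.
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };
        Words = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Adds up the dictionary count of every word in this sentence.
    public int ScoreAgainst(Dictionary<string, int> parseWords)
    {
        Score = 0;
        foreach (string word in Words)
        {
            // Ignored words were never added to the dictionary,
            // so they are simply not found and contribute nothing.
            if (parseWords.TryGetValue(word, out int count))
                Score += count;
        }
        return Score;
    }
}
```

Whether “The” matches “the” depends on the comparer the dictionary was created with, so the dictionary and the sentences should be built with the same casing rules.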
This then allows us to re-sort the list of sentences based on their scores and pull out the top 2-3 scoring sentences:
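One possible way to do the sort-and-pick step is with LINQ, sketched below. The Summariser class and TopSentences method are hypothetical names, and the sentences are assumed to arrive as sentence/score pairs rather than in whatever class actually holds them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Summariser
{
    // Returns the text of the 'count' highest-scoring sentences.
    public static List<string> TopSentences(
        IEnumerable<KeyValuePair<string, int>> scoredSentences, int count)
    {
        return scoredSentences
            .OrderByDescending(pair => pair.Value) // highest score first; ties keep input order
            .Take(count)                           // keep only the top few
            .Select(pair => pair.Key)              // drop the scores, keep the text
            .ToList();
    }
}
```

Note that this returns the sentences in score order. If the summary should read in the order of the original text, the selected sentences would need to be re-ordered by their original position afterwards.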