What is it?
Among the many text-processing tasks possible with Natural Language Processing, there is text summarization. As the name says, the main idea is to extract the fundamental information from a text and present it to users so that they have a summary with the main ideas. This is particularly interesting, and it is bound to have a high impact on our lives.
For example, imagine you need to browse a large number of articles to find a specific piece of information, as is the case in my current Kaggle competition about Covid.
There are two kinds of summarization methods. Let's have a look at them.
Abstractive summarization
This method generates words based on semantic understanding, even words that did not appear in the source documents. It aims at reproducing the important material in a new way. It interprets and examines the text using advanced Natural Language Processing techniques in order to generate a new, shorter text that conveys the most critical information from the original. It can be compared to the way humans read an article or blog post and then summarize it in their own words.
INPUT DOCUMENT -> UNDERSTAND CONTEXT -> SEMANTICS -> CREATE OWN SUMMARY
Extractive summarization
This method, on the other hand, selects a subset of words that retain the most important points. The approach weights the important parts of the sentences and uses them to form the summary. Different algorithms and techniques are used to define weights for the sentences and then rank them based on importance and similarity to one another.
INPUT DOCUMENT -> SENTENCE SIMILARITY -> WEIGHT SIMILARITIES -> SELECT SENTENCES WITH HIGHER RANK
Only limited research is available on abstractive summarization, as it requires a deeper understanding of the text compared to the extractive approach.
Purely extractive summaries often give better results than automatic abstractive summaries. This is because abstractive summarization methods have to cope with problems such as semantic representation, inference and natural language generation, which are harder than data-driven approaches such as sentence extraction. Many techniques are available to generate extractive summaries.
Unsupervised learning approach to find sentence similarity and rank it
This is an interesting idea that was developed by a co-worker on the Kaggle competition (link to the notebook, here).
The benefit is that there is no need to train and build a model before starting to use it for the project, which is very characteristic of unsupervised learning.
The mathematical approach
It's good to understand cosine similarity to make the best use of this kind of code.
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space; it measures the cosine of the angle between them.
The angle will be zero (and the cosine equal to 1) if the sentences are similar.
Check here for more information about it.
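As a quick sketch of the idea, cosine similarity between two sentences can be computed by turning each sentence into a bag-of-words count vector and applying the formula above. The tokenizer and example sentences below are my own, for illustration:

```python
import math
import re

def sentence_vector(sentence, vocabulary):
    """Bag-of-words count vector for a sentence over a fixed vocabulary."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [words.count(term) for term in vocabulary]

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); 1.0 means the same direction (angle 0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

s1 = "the cat sat on the mat"
s2 = "the cat sat on the mat"
s3 = "dogs chase cars"
# Shared vocabulary built from all sentences
vocab = sorted(set(re.findall(r"[a-z']+", f"{s1} {s2} {s3}".lower())))
print(cosine_similarity(sentence_vector(s1, vocab), sentence_vector(s2, vocab)))  # 1.0: identical sentences, angle 0
print(cosine_similarity(sentence_vector(s1, vocab), sentence_vector(s3, vocab)))  # 0.0: no shared words
```

Identical sentences get a cosine of 1 (angle 0), while sentences with no words in common get 0.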
Therefore the code schematics will be:
INPUT ARTICLE -> SPLIT INTO SENTENCES -> REMOVE STOPWORDS -> BUILD A SIMILARITY MATRIX -> GENERATE RANKS BASED ON THE MATRIX -> PICK TOP N SENTENCES FOR SUMMARY
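The schematic above can be sketched in plain Python. This is a minimal illustration, not the competition notebook's code: the stopword list and tokenizer are my own, and summed similarity is used as a simple stand-in for a PageRank-style ranking step:

```python
import math
import re

# Small illustrative stopword list (a real project would use a fuller one)
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "in", "is", "it", "of", "on", "that", "the", "to", "was", "with"}

def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(sentences):
    """Pairwise cosine similarity between bag-of-words sentence vectors."""
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    vectors = []
    for s in sentences:
        tokens = tokenize(s)
        vectors.append([tokens.count(term) for term in vocab])
    n = len(sentences)
    return [[cosine(vectors[i], vectors[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def summarize(article, top_n=2):
    """Rank sentences by total similarity to all others and keep the top N."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s.strip()]
    matrix = similarity_matrix(sentences)
    scores = [sum(row) for row in matrix]  # simple centrality stand-in for PageRank
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_n])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)

article = ("Natural language processing powers text summarization. "
           "Text summarization extracts the key sentences from a document. "
           "My dog likes to chase the ball in the park.")
# Keeps the two related sentences and drops the unrelated one
print(summarize(article, top_n=2))
```

The two sentences that share vocabulary score highest and survive into the summary, while the off-topic sentence is dropped.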
As you can see, this method not only has the advantages of unsupervised learning, but it also does a thorough job: by building a full similarity matrix it compares every pair of sentences, so the resulting summary is more complete.