Step 3 in detail: Processing


As explained in the previous article, processing is a complex and very important element of the project architecture, so I want to explain it in detail.

What are the different elements of the processing? 

As mentioned, there are four main elements:

Keyword extraction

Keyword extraction is the automated process of extracting the most relevant words and expressions from a text. In this step we decide which words we want to keep in our summaries. Identifying the keywords correctly matters because every later step builds on them: if we fail here, the errors carry through the rest of the process. Generally speaking, it is better to go slowly at first than to rush. Every part of the process matters.
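To make the idea concrete, here is a minimal sketch of keyword extraction based on plain word frequency. It is only an illustration under stated assumptions, not the method used in the project: the function name `extract_keywords`, the tiny `STOP_WORDS` list, and the frequency-based ranking are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {
    "the", "a", "an", "of", "and", "in", "to", "is", "are", "for",
    "on", "with", "that", "this", "we", "by", "as", "it",
}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent tokens that are not stop words."""
    # Lowercase the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count tokens, skipping stop words and very short tokens.
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


sample = "The vaccine trial showed the vaccine is safe. The trial of the vaccine continues."
print(extract_keywords(sample, top_n=2))
```

Raw frequency is the simplest possible ranking. A weighting such as TF-IDF would usually separate relevant words from merely common ones better, at the cost of needing the whole collection of papers.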

Keyword matching

Once the correct keywords are selected, we must make sure that they match the target keywords we are looking for. Because we want to create targeted summaries, matching keywords is particularly important for making the summaries as specialized as possible.
The matching criteria are based on the queries the user will use to find the summaries.

We try to make the best match possible so that the search is accurate.
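One simple way to score how well a paper's keywords match the keywords of a user query is set overlap. The sketch below uses the Jaccard index (shared keywords divided by all distinct keywords). The names `match_score` and `rank_documents` and the choice of Jaccard are assumptions for illustration; the article does not say which matching criterion is actually used.

```python
def match_score(doc_keywords, query_keywords):
    """Jaccard overlap between two keyword collections, from 0.0 to 1.0."""
    doc = {k.lower() for k in doc_keywords}
    query = {k.lower() for k in query_keywords}
    if not doc or not query:
        return 0.0
    return len(doc & query) / len(doc | query)


def rank_documents(docs, query_keywords):
    """Sort (name, score) pairs from best to worst match.

    docs maps a document name to its list of extracted keywords.
    """
    scored = [(name, match_score(kws, query_keywords)) for name, kws in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


papers = {
    "paper_a": ["vaccine", "trial"],
    "paper_b": ["protein", "structure"],
}
print(rank_documents(papers, ["vaccine"]))
```

A score of 1.0 means the two keyword sets are identical and 0.0 means they share nothing, so a threshold on this value could decide which papers feed a targeted summary.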

Named Entity Recognition

Once the keywords match correctly, we apply a very common technique in Natural Language Processing: Named Entity Recognition (NER), which identifies and labels named entities in the text, such as people, organizations, and places.

This makes the summarization more precise.
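As a rough illustration of what entity recognition produces, the sketch below tags entities by dictionary lookup (a gazetteer). This is a deliberately simple stand-in for a trained NER model, which is what an NLP library would normally provide. The `GAZETTEER` entries, the labels, and the function name `find_entities` are all hypothetical.

```python
import re

# Hypothetical gazetteer: known entity names mapped to a label.
GAZETTEER = {
    "stanford university": "ORG",
    "paris": "LOC",
    "marie curie": "PERSON",
}


def find_entities(text):
    """Return (matched text, label, start offset) tuples in order of appearance."""
    hits = []
    for name, label in GAZETTEER.items():
        # Whole-word, case-insensitive match of the known entity name.
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.group(0), label, m.start()))
    return sorted(hits, key=lambda hit: hit[2])


for entity in find_entities("Researchers at Stanford University met in Paris."):
    print(entity)
```

A lookup like this only finds names it already knows. A statistical model can also recognize unseen names from context, which is the main reason to prefer one in practice.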

Duplicate elimination

There will probably be many duplicates because of the large number of papers, so it is important to apply a duplicate-elimination strategy.
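A minimal sketch of one possible strategy is to normalize each text and keep only the first occurrence of every normalized form. The function names `normalize` and `drop_duplicates` and the normalization rules (lowercasing, collapsing punctuation and whitespace) are assumptions for this example, not necessarily what the project does.

```python
import re


def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def drop_duplicates(items):
    """Keep the first occurrence of each item, comparing normalized forms."""
    seen = set()
    unique = []
    for item in items:
        key = normalize(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


titles = ["Deep Learning for NLP", "deep learning for  nlp.", "Graph Methods"]
print(drop_duplicates(titles))
```

This catches exact duplicates and trivial variations in case, spacing, and punctuation. Near-duplicates with different wording would need a similarity measure instead of an exact key.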

The architecture of this part of the system can be seen below: