Personal description of the project. 

In this notebook, we try to answer questions about the genetics, origins, and evolution of the virus that causes COVID-19 (SARS-CoV-2, initially called 2019-nCoV, with "n" for "novel"). The idea is to use Natural Language Processing tools to organize the documents better, so that researchers can find the relevant information more easily and quickly. We will try to address each of the proposed subtasks and, where possible, to bring new approaches and tools. We find visual tools more helpful, so we will try to develop some scripts in that direction if time allows. Finally, we will try to contribute a personal point of view on new problems and how to handle them.

Programming Language: We will focus on Python.
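
As a rough illustration of what "organizing the documents" could mean in practice, the sketch below ranks a few texts by how often they mention terms related to genetics, origin, and evolution. It is only a minimal baseline under stated assumptions: the term list, the function names, and the toy abstracts are all hypothetical and are not taken from CORD-19 or from any existing notebook.

```python
import re
from collections import Counter

# Hypothetical query terms for the genetics/origin/evolution task.
QUERY_TERMS = {"genome", "genetic", "origin", "evolution", "phylogenetic"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(document, terms=QUERY_TERMS):
    """Count how many tokens of the document are query terms."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in terms)


def rank(documents, terms=QUERY_TERMS):
    """Return (score, index) pairs for matching documents, best first."""
    scored = [(score(doc, terms), i) for i, doc in enumerate(documents)]
    matching = [pair for pair in scored if pair[0] > 0]
    return sorted(matching, key=lambda pair: (-pair[0], pair[1]))


# Toy abstracts invented for illustration; they are not real CORD-19 records.
docs = [
    "Phylogenetic analysis of the viral genome suggests a zoonotic origin.",
    "Hospital capacity planning during an outbreak.",
    "Genome evolution in coronaviruses.",
]
ranking = rank(docs)
for doc_score, index in ranking:
    print(doc_score, docs[index])
```

Plain term counting ignores synonyms and document length, so it should be read as a starting point rather than the method this notebook commits to.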

Disclaimer: The author is still learning natural language processing with Python, so this notebook draws heavily on earlier notebooks developed by the community.

Below is the description of this project as it appears in the Kaggle repository.

Dataset Description

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

Call to Action

We are issuing a call to action to the world's artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high-priority scientific questions. The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This allows the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide. There is a growing urgency for these approaches because of the rapid increase in coronavirus literature, making it difficult for the medical community to keep up.

A list of our initial key questions can be found under the Tasks section of this dataset. These key scientific questions are drawn from the NASEM's SCIED (National Academies of Sciences, Engineering, and Medicine's Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats) research topics and the World Health Organization's R&D Blueprint for COVID-19.

Many of these questions are suitable for text mining, and we encourage researchers to develop text mining tools to provide insights on these questions.

We are maintaining a summary of the community's contributions. For guidance on how to make your contributions useful, we're maintaining a forum thread with the feedback we're getting from the medical and health policy communities.


Kaggle is sponsoring a $1,000 per task award to the winner whose submission is identified as best meeting the evaluation criteria. The winner may elect to receive this award as a charitable donation to COVID-19 relief/research efforts or as a monetary payment. More details on the prizes and timeline can be found in the discussion post.

Accessing the Dataset

We have made this dataset available on Kaggle. Watch out for periodic updates.

The dataset is also hosted on AI2's Semantic Scholar. And you can search the dataset using AI2's new COVID-19 explorer.

The licenses for each dataset can be found in the all_sources_metadata CSV file.
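
A minimal sketch of how the license information could be summarized once the metadata CSV file is available locally. It assumes the file has a column named `license`; that column name, the function name `count_licenses`, and the in-memory sample rows are assumptions made for illustration, so check the header of the actual file before relying on it.

```python
import csv
import io
from collections import Counter


def count_licenses(csv_file, column="license"):
    """Count rows per license value in an open CSV file.

    `column` is an assumed column name; rows with a missing or empty
    value are grouped under "unknown".
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        counts[value or "unknown"] += 1
    return counts


# Invented sample rows standing in for the real metadata file.
sample = io.StringIO(
    "sha,title,license\n"
    "a1,Paper one,cc-by\n"
    "b2,Paper two,cc-by\n"
    "c3,Paper three,\n"
)
license_counts = count_licenses(sample)
for name, total in license_counts.most_common():
    print(name, total)
```

With the real file, the same function can be called on a handle opened with `open(path, newline="")`, where `path` is wherever the metadata CSV was downloaded.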


This dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University's Center for Security and Emerging Technology, Microsoft Research, IBM, and the National Library of Medicine - National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.