What can data scientists contribute to the COVID-19 effort?

Health care professionals are the real heroes of the moment, but data scientists also have an important role to play in the fight against the pandemic that’s shaking the globe right now.

Your first port of call should be the COVID-19 Open Research Dataset, released on 20th March by the White House and a coalition of leading research groups. It contains over 29,000 academic papers related to COVID-19, SARS-CoV-2, and other coronaviruses. I recommend heading over to the associated Kaggle competition, where you can collaborate with and build upon the work of other data scientists who have already made progress on cleaning the dataset and understanding its potential.

To data scientists this looks like a natural language understanding problem, but it is essential that medical professionals are also involved to direct and sense-check the ongoing work.


Current tasks defined in the Kaggle competition include:

To prioritise tasks, it’s important to understand both which tasks would have the most impact for the health care professionals (HCPs) on the ground and which tasks the data can adequately support. Some work has been done on this already by scoring the tasks on a five-point scale:

The problem here is that the people voting seem to have quite divergent opinions on these ratings. It seems likely that each voter has only partial knowledge, and answering the question of whether the data is present is a significant task in itself.
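One way to make that divergence visible is to look at the spread of the votes for each task as well as the average. The snippet below is a minimal sketch only: the task names, the scores, and the `spread_threshold` cut-off are all made up for illustration and are not the real votes.

```python
from statistics import mean, pstdev

# Hypothetical votes: task name -> scores (1 to 5) given by different voters.
# These numbers are invented for illustration, not taken from the real poll.
votes = {
    "task_a": [5, 4, 5, 4],
    "task_b": [1, 5, 2, 5],
    "task_c": [3, 3, 3],
}


def summarise_votes(votes, spread_threshold=1.0):
    """Return (task, mean, spread, contested) tuples, highest mean first.

    A task is flagged as contested when the population standard deviation
    of its scores exceeds spread_threshold (an arbitrary, assumed cut-off).
    """
    rows = []
    for task, scores in votes.items():
        avg = mean(scores)
        spread = pstdev(scores)
        rows.append((task, avg, spread, spread > spread_threshold))
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = summarise_votes(votes)
for task, avg, spread, contested in summary:
    label = "contested" if contested else "agreed"
    print(f"{task}: mean={avg:.2f} spread={spread:.2f} ({label})")
```

Tasks with a high average and a low spread are the safest bets to prioritise, while a large spread suggests the voters are working from different information and the rating deserves a closer look.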


Decentralised coordination is currently happening in a few different groups. I’ll point to one group in particular, as I see momentum there and I think there is value in centralising coordination to some extent to avoid duplication of effort. There are currently over 230 members on the Slack.

If you have data science or visualisation skills, or if you have a medical background and can contribute some time, please stop by the Slack and introduce yourself.

Other Data Sources

Some excellent work is being done all over the place at the moment and so I’ll link to a few other data sources that might be interesting.

Day-level case data by geography:

Stay tuned for a follow-up post where I’ll set out what I see as the roadmap based on what we currently know, and point to the most useful work that’s been done so far.