🔍 WANTED: We are looking for data and data curators #69

Open
ivan-aksamentov opened this issue Mar 21, 2020 · 24 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed s:data Scope: related to data retrieval, parsing, transformation, storage, update t:talk Type: discussion of the application or the science behind it

Comments

@ivan-aksamentov
Member

ivan-aksamentov commented Mar 21, 2020

We are currently looking for case count data and other statistical information from different countries, as well as for people who can maintain this data (add, curate, update).

The entire process should be automated as much as possible. The README in the directory covid19_scenarios/data contains some information on how to get started:

https://github.com/neherlab/covid19_scenarios/data

It also contains the preprocessed data, ready for consumption by the app's build system.

If you think you know where to find relevant data for a country, please let us know in this thread or open an issue. If you are ready to contribute, feel free to open a pull request.

Don't hesitate to ask if you have any questions or if you need something to get started!

cc @nnoll @rneher

@ivan-aksamentov ivan-aksamentov added help wanted Extra attention is needed t:talk Type: discussion of the application or the science behind it s:data Scope: related to data retrieval, parsing, transformation, storage, update labels Mar 21, 2020
@ivan-aksamentov ivan-aksamentov pinned this issue Mar 21, 2020
@ivan-aksamentov ivan-aksamentov changed the title WANTED: We are looking for data sources and data curators WANTED: We are looking for data and data curators Mar 21, 2020
@ivan-aksamentov
Member Author

ivan-aksamentov commented Mar 21, 2020

One way might be to crowdsource the search for data.

There are many COVID-19 and SARS-CoV-2-related projects on the web. Some of them may contain data, APIs or just interesting ideas that can help us to make our application better.

Here are some examples:

@noleti
Collaborator

noleti commented Mar 21, 2020

https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide seems like a good data source? Example data: https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-20.xlsx The data seems to be global and well structured. It only counts cases and deaths, though (no hospitalized, ICU, or recovered).

@fazouane-marouane

fazouane-marouane commented Mar 21, 2020

Would this be enough? It's data that is refreshed daily at 9am EST: https://www.tableau.com/covid-19-coronavirus-data-resources

csv
google sheets

@nonotest
Contributor

https://coronadatascraper.com/ there's a fair bit of data available there as well

@mserranom
Contributor

mserranom commented Mar 21, 2020

For Spain, this is a good data source, containing national and regional cases, deaths, ICU and recovered, updated on a daily basis: https://github.com/datadista/datasets/tree/master/COVID%2019

@noleti
Collaborator

noleti commented Mar 21, 2020

I have a finished pull request for the ECDC dataset pending now, replacing the WHO data and parser.

@noleti
Collaborator

noleti commented Mar 21, 2020

> https://coronadatascraper.com/ there's a fair bit of data available there as well

There is an amazing amount of data on that API, but I guess it is not an official source. Should be easy to write a parser for, if required.

@nonotest
Contributor

Good point!

I have checked a few of their scrapers. They all seem to point either at government pages, e.g. https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/coronavirus-covid-19-current-situation-and-case-numbers for Australia and
https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html for Canada,
or at GitHub repos that are official sources, like https://github.com/opencovid19-fr/data for France.

If we were to go down this road, it shouldn't take too long to vet each source, I guess.

@mserranom
Contributor

Spain's data: neherlab/covid19_scenarios_data#11

@camjc

camjc commented Mar 21, 2020

Hey all,

Have been working on https://coronadatascraper.com/ aka https://github.com/lazd/coronadatascraper in my own time, and also am a Sanity user professionally+personally.

We've been building scrapers over there only from official sources. No news, no aggregates, just governments directly (yes this is a pain since many governments like to have free-text press releases, sometimes with useful numbers written out like thirty-five).

If there are any sources on there that aren't primary sources (government depts), please raise an issue on that github and we'll work to sort it out.

There's a slack for that project too if anyone wants to jump on and chat with us that are working on it.
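
As a rough illustration of the "numbers written out like thirty-five" problem mentioned above, here is a minimal sketch of turning English number words into integers. It is not taken from the coronadatascraper code base; the function name `words_to_int` and the supported vocabulary (up to the thousands) are assumptions made for this example.

```python
import re

# Vocabulary for this sketch only; real press releases would need more cases.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_int(text):
    """Convert a phrase such as 'thirty-five' to an int.

    Returns None if the phrase is empty or contains an unknown word.
    """
    tokens = [t for t in re.split(r"[\s-]+", text.lower().strip())
              if t and t != "and"]
    if not tokens:
        return None
    total = 0    # completed thousands
    current = 0  # value being built below the next scale word
    for token in tokens:
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            return None
    return total + current


print(words_to_int("thirty-five"))  # 35
```

Returning `None` for unknown words lets a scraper flag the release for manual review rather than silently recording a wrong count.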

@noleti
Collaborator

noleti commented Mar 21, 2020

I have now written a first parser for coronadatascraper.com (in my forked repo). In the latest version, it should also contain correct entries for regions such as USA-OK-Love County. Everything is stored in a global .tsv (and JSON as well).

Re: source quality of coronadatascraper: Germany's numbers are pulled from the app of a tabloid... I don't think it will be possible to vet the sources of such an API, as they can change things as they see fit.
/edit: To be more precise: Germany's numbers are themselves aggregated from the official source (RKI) and from newspapers, at least one of which is more or less a tabloid (Morgenpost).
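
For anyone wondering what "stored in a global .tsv" could look like, here is a minimal sketch, not the actual parser from the fork. The column names (`location`, `time`, `cases`, `deaths`, `hospitalized`, `icu`, `recovered`) and the helpers `region_key` and `to_tsv` are assumptions for illustration; only the key style `USA-OK-Love County` comes from the comment above.

```python
import csv
import io

# Assumed column layout for this sketch; the real files may differ.
COLUMNS = ["location", "time", "cases", "deaths",
           "hospitalized", "icu", "recovered"]


def region_key(country, state=None, county=None):
    """Join the available parts into a key such as 'USA-OK-Love County'."""
    return "-".join(part for part in (country, state, county) if part)


def to_tsv(records):
    """Serialise dict records to TSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        restval="",             # fields absent from a record stay blank
        extrasaction="ignore",  # drop keys that are not in COLUMNS
    )
    writer.writeheader()
    # Sort by location, then date, so diffs between releases stay readable.
    for record in sorted(records, key=lambda r: (r["location"], r["time"])):
        writer.writerow(record)
    return buffer.getvalue()


records = [
    {"location": region_key("USA", "OK", "Love County"),
     "time": "2020-03-21", "cases": 1, "deaths": 0},
]
print(to_tsv(records))
```

Sources that only report cases and deaths simply leave the remaining columns empty, so one file layout can serve every source.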

@aschelch
Contributor

Hi, thank you all for the work.
Here is my small contribution: I added data for France (neherlab/covid19_scenarios_data#18).
Take care

@ivan-aksamentov ivan-aksamentov added help wanted Extra attention is needed and removed help wanted Extra attention is needed labels Mar 23, 2020
@ManuelB
Contributor

ManuelB commented Mar 27, 2020

I have collected a lot of data for Germany:

https://github.com/ManuelB/covid-19-vis/tree/gh-pages/germany

It is used to run a full simulation for 417 districts in Germany and runs on the command line.

Details of what I am doing are described here:
https://youtu.be/lwUDvNfVeEo

If the data were integrated into the data repository, the select box would show more than 400 items. I think that is too many.

@nnoll
Collaborator

nnoll commented Mar 27, 2020

That's really cool @ManuelB. Thanks for sharing.

@ShubhamPandey-Engineer

I can point you to an API that provides COVID-19 data for all countries. It is also updated frequently.

fetch('https://corona.lmao.ninja/all')

Hope it will help you guys.

@ivan-aksamentov ivan-aksamentov changed the title WANTED: We are looking for data and data curators 🔍 WANTED: We are looking for data and data curators Apr 8, 2020
@pauloangelo

pauloangelo commented May 4, 2020

For Brazil, I saw that the data available at https://brasil.io/dataset/covid19/ have been used. Great!
However, some data is outdated. For example, today, the last record in "BRA-Distrito Federal.tsv" is for 2020-04-30.
Who is working with the Brazilian data? I'm willing to help if needed.
I would also like to take this opportunity to thank the project's team! We used Covid Scenarios in a publication [1] that had a significant local impact.

[1] https://1b9b1300-1a94-40d8-b9ca-402057f9520f.filesusr.com/ugd/c4c6aa_762877bf2fc54d1e94aa60dd8ea7a074.pdf

@noleti
Collaborator

noleti commented May 4, 2020

> For Brazil, I saw that the data available at https://brasil.io/dataset/covid19/ have been used. Great!
> However, some data is outdated. For example, today, the last record in "BRA-Distrito Federal.tsv" is for 2020-04-30.
> Who is working with the Brazilian data? I'm willing to help if needed.

Hi @pauloangelo, thanks for highlighting this. The data needs to be updated manually by the maintainers of this project, and that has just not been done in the last 3 days. I am sure they will do this soon!
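
Since updates are manual, a small staleness check could help spot files like this one. The following is only a sketch: it assumes each TSV has a `time` column holding ISO dates (`YYYY-MM-DD`) and no comment lines, and the function names `last_record_date` and `days_stale` are made up for this example.

```python
import csv
import io
from datetime import date


def last_record_date(tsv_text, time_column="time"):
    """Return the most recent date in the file, or None if it has no rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    dates = [date.fromisoformat(row[time_column])
             for row in reader if row.get(time_column)]
    return max(dates) if dates else None


def days_stale(tsv_text, today, time_column="time"):
    """Number of days between the last record and `today` (None if empty)."""
    last = last_record_date(tsv_text, time_column)
    return None if last is None else (today - last).days


# Illustrative rows only; the counts are placeholders, not real data.
sample = "time\tcases\tdeaths\n2020-04-29\t10\t1\n2020-04-30\t12\t1\n"
print(days_stale(sample, date(2020, 5, 4)))  # 4
```

Running something like this over every region file and listing those above a threshold would make stale regions visible without anyone having to open each file.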

@pauloangelo

Thank you @noleti . I'm available to help, if needed. Thank you all for this remarkable initiative!

@nnoll
Collaborator

nnoll commented May 4, 2020

Hey @pauloangelo, sorry for the delay. We will update the data now and re-release soon!

@pauloangelo

Thank you @nnoll !

@rneher
Member

rneher commented May 4, 2020

If you compile population sizes for Brazilian regions and their hospital capacities, we can add them as presets.

@ivan-aksamentov ivan-aksamentov added the good first issue Good for newcomers label May 20, 2020
@pauloangelo

Hi all,

The counts for "BRA-Distrito Federal" include cases from other regions that were detected in Distrito Federal. The Brasil.io dataset registers these external cases as "Importados/Indefinidos". I suggest counting only the local cases. For example, for 29-May-2020 there are 142 local deaths, while the TSV counts 154.

Best regards,

PA
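
To make the suggestion above concrete, here is a minimal sketch of excluding the "Importados/Indefinidos" bucket when summing a state's counts. The row layout (a `city` field next to a `deaths` field) and the function name `local_total` are assumptions for illustration, not the actual Brasil.io schema or the project's parser. The example rows are built from the figures quoted above, assuming the whole difference (154 - 142 = 12) sits in the imported bucket.

```python
IMPORTED_LABEL = "Importados/Indefinidos"


def local_total(rows, field="deaths", place_field="city"):
    """Sum `field` over rows that are not in the imported/undefined bucket."""
    return sum(int(row[field]) for row in rows
               if row[place_field] != IMPORTED_LABEL)


# Illustrative rows for 2020-05-29, derived from the numbers in this comment.
rows = [
    {"city": "Brasília", "deaths": 142},
    {"city": IMPORTED_LABEL, "deaths": 12},
]
print(local_total(rows))                   # 142 (local only)
print(sum(int(r["deaths"]) for r in rows)) # 154 (what the TSV currently shows)
```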

@pauloangelo

> If you compile population sizes for Brazilian regions and their hospital capacities, we can add them as presets.

Hi @rneher, I will have a look at it. For the hospital capacities, unfortunately, we don't have reliable data; the government keeps changing this information. For the population sizes, R0, etc., I believe we can provide them, at least for "BRA-Distrito Federal". Below are the link and data that we have been using in our weekly reports.

Weekly reports created by our observatory (the parameter choices are also explained there):
https://www.prepidemia.org/boletins-quinzenais-prepidemia

Link/data for "BRA-Distrito Federal":
https://covid19-scenarios.org?q=~%28ageDistributionData~%28data~%28~%28ageGroup~%270-9~population~389784%29~%28ageGroup~%2710-19~population~439454%29~%28ageGroup~%2720-29~population~514225%29~%28ageGroup~%2730-39~population~465517%29~%28ageGroup~%2740-49~population~344853%29~%28ageGroup~%2750-59~population~218714%29~%28ageGroup~%2760-69~population~118042%29~%28ageGroup~%2770-79~population~56949%29~%28ageGroup~%2780%2A2b~population~22622%29%29~name~%27Custom%29~scenarioData~%28data~%28epidemiological~%28hospitalStayDays~8~icuStayDays~10~infectiousPeriodDays~2.2~latencyDays~5.2~overflowSeverity~1~peakMonth~0~r0~%28begin~3.7~end~5.55%29~seasonalForcing~0%29~mitigation~%28mitigationIntervals~%28~%28color~%27%2A23c9f4e5~name~%27D40509~timeRange~%28begin~%272020-03-12T15%2A3a00%2A3a00.000Z~end~%272020-05-31T15%2A3a00%2A3a00.000Z%29~transmissionReduction~%28begin~10~end~10%29%29~%28color~%27%2A23b98d4d~name~%27D40539~timeRange~%28begin~%272020-03-19T15%2A3a00%2A3a00.000Z~end~%272020-04-15T15%2A3a00%2A3a00.000Z%29~transmissionReduction~%28begin~60~end~60%29%29~%28color~%27%2A2332cac5~name~%27H%2Ae1bitos%2A20de%2A20higiene%2A20e%2A20distanciamento~timeRange~%28begin~%272020-03-19T15%2A3a00%2A3a00.000Z~end~%272020-12-31T15%2A3a00%2A3a00.000Z%29~transmissionReduction~%28begin~40~end~40%29%29~%28color~%27%2A230a1ab5~name~%27Impacto%2A20equivalente%2A20ao%2A20atual~timeRange~%28begin~%272020-05-17T15%2A3a00%2A3a00.000Z~end~%272020-12-31T15%2A3a00%2A3a00.000Z%29~transmissionReduction~%28begin~50~end~50%29%29~%28color~%27%2A2339984a~name~%27Impacto%2A20equivalente%2A20ao%2A20D40509~timeRange~%28begin~%272020-05-31T15%2A3a00%2A3a00.000Z~end~%272020-12-31T15%2A3a00%2A3a00.000Z%29~transmissionReduction~%28begin~10~end~10%29%29~%28color~%27%2A2384c772~name~%27D40539%2A20com%2A20flexibiliza%2Ae7%2Af5es~timeRange~%28begin~%272020-04-15T15%2A3a00%2A3a00.000Z~end~%272020-05-10T15%2A3a00%2A3a00.000Z%29~transmissionReduction~%28begin~55~end~55%29%29~%28color~%27%2A2346a750~name~%27D40539%2A20com%2A20mais%2A20flexibiliza%2Ae7%2Af5es~timeRange~%28begin~%272020-05-10T15%2A3a00%2A3a00.000Z~end~%272020-05-17T15%2A3a00%2A3a00.000Z%29~transmissionReduction~%28begin~50~end~50%29%29%29%29~population~%28ageDistributionName~%27Custom~caseCountsName~%27BRA-Distrito%2A20Federal~hospitalBeds~2570160~icuBeds~2570160~importsPerDay~0~initialNumberOfCases~20~populationServed~2570160%29~simulation~%28numberStochasticRuns~20~simulationTimeRange~%28begin~%272020-02-27T15%2A3a00%2A3a00.000Z~end~%272020-12-31T15%2A3a00%2A3a00.000Z%29%29%29~name~%27Distrito%2A20Federal%29~schemaVer~%272.0.0~severityDistributionData~%28data~%28~%28ageGroup~%270-9~confirmed~5~critical~5~fatal~30~isolated~0~severe~1%29~%28ageGroup~%2710-19~confirmed~5~critical~10~fatal~30~isolated~0~severe~3%29~%28ageGroup~%2720-29~confirmed~10~critical~10~fatal~30~isolated~0~severe~3%29~%28ageGroup~%2730-39~confirmed~15~critical~15~fatal~30~isolated~0~severe~3%29~%28ageGroup~%2740-49~confirmed~20~critical~20~fatal~30~isolated~0~severe~6%29~%28ageGroup~%2750-59~confirmed~25~critical~25~fatal~40~isolated~0~severe~10%29~%28ageGroup~%2760-69~confirmed~30~critical~35~fatal~40~isolated~0~severe~25%29~%28ageGroup~%2770-79~confirmed~40~critical~45~fatal~50~isolated~0~severe~35%29~%28ageGroup~%2780%2A2b~confirmed~50~critical~55~fatal~50~isolated~0~severe~50%29%29~name~%27China%2A20CDC%29%29&v=1

@ivan-aksamentov
Member Author

@pauloangelo I created a separate issue for this; let's continue there:
#718
