🔎 What is data integration?

Data integration means assessing the overlap between CoronaNet and external data, and recoding data that is currently missing from CoronaNet into the CoronaNet taxonomy, while giving proper attribution to the original coders. To this end, we have compiled a pool of external data to integrate from the following datasets up until September 10, 2021:

In total, there are around 150,000 policies that we have identified in the external data and could potentially integrate into the CoronaNet dataset. We have done our best to map this external data to the CoronaNet taxonomy, but there are some things that we cannot control and that we need your help to work through:

Why are we integrating external data?
How much external data are we planning to integrate?

While there are around 150,000 observations in the external data, that doesn’t mean we aim to increase the CoronaNet database by 150,000 observations! This is for a variety of reasons:

How do I get started?

📌 You can access this external data here

📌 To integrate data from other datasets, please follow the steps below (or check out the video for the same instructions in video format):

1 Identify data that other datasets have documented but which are not currently in the CoronaNet dataset

You should fill in this information in the overlap_assess column in the CoronaNet Data Integration sheet. The following explains how to code this variable and what to do with it:

How to code the overlap_assess variable, and what to do with the information in the overlap_assess variable:

  • If the overlap_assess value reads ‘Yes’, then this observation is already in the CoronaNet data and you can find the corresponding CoronaNet record_id in the match_record_id column next to it.
    • What to do: In the matched_record_id column in the CoronaNet Data Integration sheet, please copy and paste the CoronaNet record_id that matches this record. Then you’re done! No need to move on to Step 2.
  • If the overlap_assess value reads ‘No’, then this observation is not in the CoronaNet data yet but should be included.
    • What to do: Please recode this data into the CoronaNet taxonomy; see Step 2 below for more info.
  • If the overlap_dum value reads ‘NA’, then based on our automated assessment of the overlap between the two datasets, it was not possible to make a judgement as to whether this observation is in CoronaNet or not.
    • What to do: You should manually check whether this observation should be coded as ‘Yes’ or ‘No’. If ‘Yes’, then please also record the corresponding CoronaNet ID in the match_record_id column. If ‘No’, then please recode this data into the CoronaNet taxonomy; see Step 2 below for more info.
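The coding rules above can be sketched as a small function. This is only an illustration: `next_action` is a hypothetical helper, not part of any CoronaNet tooling, and the returned strings are made-up labels rather than values used in the sheet.

```python
def next_action(overlap_value):
    """Return the next step for a row, given its overlap assessment value.

    Hypothetical helper mirroring the rules above; the return values are
    illustrative labels, not values that appear in the sheet.
    """
    value = overlap_value.strip()
    if value == "Yes":
        # Already in CoronaNet: record the matching record_id and stop.
        return "paste matching record_id, then done"
    if value == "No":
        # Not yet in CoronaNet: continue with Step 2.
        return "recode in Step 2"
    if value == "NA":
        # The automated assessment was inconclusive: a person has to decide.
        return "check manually, then code as Yes or No"
    raise ValueError(f"unexpected overlap value: {value!r}")


for value in ("Yes", "No", "NA"):
    print(value, "->", next_action(value))
```

Raising an error on anything other than the three documented values is a design choice in this sketch; it makes typos in the column visible instead of silently skipping rows.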

How to get credit for overlap assessment:

  • Please make sure to write your RA id number in the ra_id_overlap column in order to get credit for your work in assessing the overlap between the CoronaNet and external data. You can find your RA id number here.
    • Feel free to also fill out the ra_name_overlap column, since we aren’t robots and it makes it easier for your regional manager to see what you’ve been up to! However, because it can be very difficult to give people proper credit based on just their name (e.g. computers are case-sensitive and very unforgiving with alternative spellings), we can only give credit when the ra_id_overlap column is properly filled out.

Note before moving onto Step 2:

  • Make sure that you have at least skimmed through all of the descriptions for the entire external dataset for your country or subnational region.
  • To the extent possible, it is better to assess the overlap for the entire external dataset for your country or subnational region before recoding the data.
    • This is because the external data is not completely clean and it is better to have an entire overview first in order to identify potential miscodings before moving on to the second step.
2 Recode the data that is currently not in CoronaNet

In order to recode this data, RAs should follow the following steps:

    🗹 2.1. Identify policies that are not currently in the CoronaNet dataset by searching for rows that have the value ‘No’ in the overlap_assess column
    🗹 2.2. Click on the ‘link’ or ‘pdf_link’ for that observation and read through the information in the raw data source.
    🗹 2.3. Code the information that you find in that link into the CoronaNet Qualtrics survey. You can use the other columns in the CoronaNet Data Integration sheet, which map the external data to the CoronaNet taxonomy, as a guide for how to code certain fields.

    🗹 2.4. At the end of the CoronaNet Qualtrics survey, you will be asked two questions:

    • [collab] ‘If (one of) the sources that you used to document this policy came from another dataset, please note which dataset’. Information about the dataset that you are integrating can be found in the ‘data’ column in the CoronaNet Data Integration sheet; if you used a source/link that you found yourself, please choose ‘I found this source myself’ instead.
    • [collab_id] ‘Please copy and paste the unique id of the record that you used from the other dataset in the text entry below’. If you used a source from an external dataset, copy and paste into this field the value in the ‘unique_id’ column for this observation in the CoronaNet Data Integration sheet.

    🗹 2.5. In the ‘integrated’ column in the CoronaNet Data Integration sheet, please choose one of the following:

    • ‘Integrated’; this means you have identified a policy that was in another dataset and recoded it into the CoronaNet taxonomy.

    • ‘Integrated with additional original research’: You may have to do some additional research for any number of reasons. E.g. the information that you receive from the link or pdf_link of the external dataset may be unclear or require additional context/knowledge to code well. In such cases, please note what additional research you had to do in the ‘Notes’ column and click this option.

    • ‘Integrated with additional work to find a new link’: this means that the original link for the policy as noted in the CoronaNet Data Integration sheet is dead but that you were able to find a new link that corroborates the information described in the ‘description’ column. In this case, please choose the ‘I found this source myself’ option in 2.4 and click this option in the data integration sheet. Note that if you were able to find the information from the original link using the Wayback Machine, then choose the ‘Integrated’ or ‘Integrated with additional original research’ option as appropriate.

    • ‘Integrated with additional original research AND with additional work to find a new link’: this means that you fulfilled the criteria under both ‘Integrated with additional original research’ and ‘Integrated with additional work to find a new link’. See above for more information.

    • ‘Duplicated policy’: this means that there were multiple external policies that were duplicates of each other. In this case, please only integrate one of them (and choose one of the ‘integrated’ options above). When you click this option, this means you do not integrate this particular policy because it is a duplicate. If the data is already in CoronaNet, pick one policy to mark as ‘Yes’ in the overlap_assessment and find the corresponding record_id to paste into the matched_record_id column. In general, for duplicated data, the overlap_assessment should be ‘No’.

    • ‘Not a relevant policy’: this means that, after having taken a closer look at the link for the observation, you found that the policy is not one that we would code in CoronaNet. The corresponding overlap_assessment should be ‘No’ in this case.

    • ‘Link dead, no other link found’: this means that the original link for the policy as noted in the CoronaNet Data Integration sheet is dead and that you were unable to either i) use the Wayback Machine to find the original data or ii) find another link to corroborate this information. In this case, please do not recode this data.
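As a rough illustration of steps 2.1 and 2.5, the sketch below filters the rows that still need recoding and flags ‘integrated’ values that look inconsistent. It assumes the sheet has been exported to a list of rows keyed by column name, that the cell values match the option labels exactly as written above, and that the overlap assessment lives in the overlap_assess column. The helper names and the example rows (including the `ext_…` ids) are made up.

```python
# Option labels copied from step 2.5; exact cell values in the sheet are assumed to match.
INTEGRATED_OPTIONS = {
    "Integrated",
    "Integrated with additional original research",
    "Integrated with additional work to find a new link",
    "Integrated with additional original research AND with additional work to find a new link",
    "Duplicated policy",
    "Not a relevant policy",
    "Link dead, no other link found",
}


def rows_to_recode(rows):
    """Step 2.1: keep only rows whose overlap_assess value is 'No'."""
    return [row for row in rows if row.get("overlap_assess") == "No"]


def check_integrated(row):
    """Return a list of problems with one row's 'integrated' value (empty if none)."""
    problems = []
    status = row.get("integrated") or ""
    if status and status not in INTEGRATED_OPTIONS:
        problems.append(f"unknown 'integrated' value: {status!r}")
    # Per step 2.5, a non-relevant policy should be assessed as 'No'.
    if status == "Not a relevant policy" and row.get("overlap_assess") != "No":
        problems.append("'Not a relevant policy' should have overlap_assess 'No'")
    return problems


# Made-up example rows; the unique_id values are placeholders.
rows = [
    {"unique_id": "ext_001", "overlap_assess": "No", "integrated": ""},
    {"unique_id": "ext_002", "overlap_assess": "Yes", "integrated": ""},
    {"unique_id": "ext_003", "overlap_assess": "NA", "integrated": "Not a relevant policy"},
]

print([row["unique_id"] for row in rows_to_recode(rows)])
for row in rows:
    print(row["unique_id"], check_integrated(row))
```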

How to get credit for integration:

You can get credit for integration when you:

  1. Make sure to write your RA id number in the ra_id_integrate column in order to get credit for your work in integrating the external data. You can find your RA id number here.

    • Feel free to also fill out the ra_name_integrate column, since we aren’t robots and it makes it easier for your regional manager to see what you’ve been up to! However, because it can be very difficult to give people proper credit based on just their name (e.g. computers are case-sensitive and very unforgiving with alternative spellings), we can only give credit when the ra_id_integrate column is properly filled out.
  2. Enter the integrated data through Qualtrics. We’ll pull your information from the survey directly!

Note:

  • If you only fill out the ra_id_integrate column but do not actually enter the external data through Qualtrics, you will not get any credit for integration. Make sure that you’ve copied and pasted the correct unique_id of the integrated data into Qualtrics to get credit for your work!
  • If you enter the external data through Qualtrics but do not fill out the ra_id_integrate column, you will get credit for the Qualtrics integration, but not for the ra_id_integrate column. Be sure to fill out both!
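The two rules above can be read as a simple check on each row. The following is a minimal sketch under stated assumptions: `integration_credit` is a hypothetical helper, the set of Qualtrics `collab_id` values is assumed to be available from a survey export, and the returned labels are illustrative only. It is not how credit is actually computed.

```python
def integration_credit(row, qualtrics_collab_ids):
    """Describe the credit a sheet row would earn under the two rules above.

    row: one sheet row as a dict with 'unique_id' and 'ra_id_integrate' keys.
    qualtrics_collab_ids: set of collab_id values entered in Qualtrics
        (assumed to come from a survey export).
    """
    has_ra_id = bool(str(row.get("ra_id_integrate") or "").strip())
    in_qualtrics = row.get("unique_id") in qualtrics_collab_ids
    if has_ra_id and in_qualtrics:
        return "full credit"
    if in_qualtrics:
        # Entered in Qualtrics, but the ra_id_integrate column is empty.
        return "Qualtrics credit only"
    # Nothing was entered in Qualtrics for this unique_id.
    return "no credit"


# Made-up ids for illustration.
entered = {"ext_001", "ext_002"}
print(integration_credit({"unique_id": "ext_001", "ra_id_integrate": "123"}, entered))
print(integration_credit({"unique_id": "ext_002", "ra_id_integrate": ""}, entered))
print(integration_credit({"unique_id": "ext_003", "ra_id_integrate": "123"}, entered))
```

Matching on the unique_id rather than on a name is the point of the sketch: a mistyped or missing unique_id in Qualtrics means the row cannot be linked to the survey entry at all.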
Worried you might have made a mistake?

Check out the Data Integration Errors sheet to see if you might have made an error and get credit for your work when the mistake is fixed!

Problems that can occur include:

💡 Tips and things to watch out for:

We have spent the summer piloting data integration¹ for a number of countries and, as a result, have come up with a list of issues to watch out for when integrating data and strategies for how to deal with them:


OXCGRT-specific tips:


1 Big shout out to the following people for their work on piloting our data integration efforts: Marco Waldbauer, Rohan Bhavikatti, Isaac Bravo, Joseph Shim, Natalia Filkina Spreizer, Audrey Firrone, Silvia Biagioli, Mayuiri A., Maanya Cheekati, Laura Eckoff, Fiona Valad, Katelyn Thomas, Humza Q, Amy Nguyen, Rawaf al Rawaf, Sella Devita, Paula Ganga, Tim Bishop, Jaimi Plater, Rose Pasty, Natalie Ellis, Maryam AlHammadi, Shreeya Mhade, Shrajit Jain, Kyle Oliver, Shaila Sarathy, Alisher Shariyazdanov, Emma Baker, Jurgen Kadriaj, Celine Heng and Augusto Teixeira.

2 Technically OXCGRT also collects subnational data for Canada but since their data largely comes from CIHI, we do not include their subnational data for Canada in our pool of policies.