Data Diving in Washington DC

René Clausen Nielsen
Apr 7, 2013


130 volunteers from the DataKind network, several experts from international organizations and the team from Global Pulse swarmed the World Bank Headquarters in Washington DC last month to take part in a "Data Dive" co-hosted by the World Bank, UNDP, the Qatar Computing Research Institute (QCRI), and UN Global Pulse. Our collective goal with this event was to unearth key questions related to programme efficiency, poverty and anti-corruption, raise awareness of the possibilities for using new data sets to answer those questions, and give global development practitioners the opportunity to sit side-by-side with data scientists to actually "dive in" and explore the utility of data science for their lines of work.

A DataDive is an event that brings together field experts and socially conscious data scientists, surrounding them with their two favorite things: research questions, and data sets containing answers. The experts bring contextual understanding of the problem and the data, while the data scientists bring their understanding of the latest analytical methods and technology tools to reveal insights from the data.

As Jake Porway, co-founder of DataKind, explained at the beginning of the event, data-diving is a bit like scuba-diving. In both, you should:

  • Go with a friend - because it's an adventure you want to share;
  • Go slowly - that way, you won't get the bends;
  • Be careful - it's exciting, but serious stuff too, and needs to be treated with caution;
  • Be ready to explore - you are in unknown territory, go find the hidden treasures!

With this in mind, the divers approached an intensive weekend of work. Experts from the participating organizations helped frame several questions focusing on programme efficiency, poverty, and anti-corruption (see the video below for more context).

The data scientists divided into teams and analyzed data sets provided by the World Bank and the United Nations to find answers to those questions. Scroll down to review the different questions which were tackled and some of the insights which emerged:

Measuring Socioeconomic Indicators in Arabic Tweets

This team had a unique dataset at hand: 100 million Arabic-language tweets from 2012. Not only was the dataset one of the biggest of the weekend, but it was also in Arabic, so it surely wasn't your typical UN or World Bank dataset to work with. The challenge was to dig into the tweets to determine whether useful socioeconomic indicators could be extracted. We have undertaken similar exploratory projects here at Global Pulse - one of them being "Twitter and Perceptions of Crisis-Related Stress" - and it is a question we continue to explore in Pulse Lab Jakarta, so we were eager to see what the team could come up with to enhance our understanding.

The team started by looking at subsets of tweets containing words indicative of common goods, economic terms, and positive/negative sentiments, and later moved on to see when people tweeted, which places they tweeted about, and who was related to whom. The tweets were analyzed by people interested in truly big data sets with both volume and velocity, social media analytics, and not least - Hadoop clusters!

The group found that Twitter should indeed be investigated further, as it does seem to have potential as a tool for measuring socio-economic indicators.

Read notes and documentation about the "Measuring Socioeconomic Indicators in Arabic Tweets" project team here.

Can we predict small-scale poverty measures from night illumination?

This project needed a lot of different skill sets, so the team working on it turned out to be very big and varied: data visualization, predictive modeling, GIS expertise -- they had it all.

Through comparing the 2001 and 2005 poverty levels of Bangladesh at every country administration level provided by the World Bank, and satellite imagery freely available from the National Oceanic and Atmospheric Administration (NOAA), the group sought to explore whether average nighttime illumination could serve as a reasonable poverty measurement proxy. At the end of the DataDive, the jury was still out, but the team found that average light intensity was a fairly good predictor of poverty level based on the Bangladesh data from 2001. The data from 2005 was less convincing, which could potentially mean that comparatively more poor people got access to electricity between 2001 and 2005.
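To make the idea of a "poverty proxy" concrete, here is a minimal sketch of the kind of check involved: fit a straight line of poverty rate against average night light and see how much of the variation it explains. This is not the team's code or data. The numbers are made up purely for illustration, and a simple least-squares fit is an assumption about method, not a description of what the team actually ran.

```python
from statistics import mean

# Hypothetical, made-up values for illustration only (NOT the DataDive data):
# average nighttime light intensity and poverty rate (%) for five districts.
light = [2.0, 4.0, 6.0, 8.0, 10.0]
poverty = [60.0, 52.0, 41.0, 35.0, 22.0]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    my = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_line(light, poverty)
r2 = r_squared(light, poverty, intercept, slope)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```

A negative slope with a high R^2 would mean brighter areas tend to be less poor in the sample. Fitting the same line separately for each survey year is one simple way to see whether the relationship holds up over time.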

Read notes and documentation from the 'Predicting Small-Scale Poverty Measures from Night Illumination' project team here.

Is it possible to get real-time food price information by scraping data from websites?

This is a project near and dear to our hearts at Global Pulse, as it builds on some research we have been working on for a while with our partners. The DataDiving team found a great dataset of crowdsourced daily price data from Kenya through a service called m-farm, as well as data from an Indonesian supermarket chain, to work with. The exploration showed a lot of potential in scraping current rice prices from the Indonesian supermarket's pages, and historical prices from archived copies of those pages on the Wayback Machine.
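For readers curious what such scraping looks like in practice, here is a small sketch under stated assumptions. The HTML snippet, the "price" class name and the example.com address are all hypothetical, since real supermarket pages will be structured differently, and this is not the team's code. The only real-world detail relied on is the Wayback Machine's snapshot address pattern (web.archive.org/web/, then a timestamp, then the original address). No network request is made.

```python
import re
from html.parser import HTMLParser


def wayback_url(page_url, timestamp):
    """Build a Wayback Machine snapshot URL; timestamp is YYYYMMDD[hhmmss]."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


class PriceParser(HTMLParser):
    """Collect prices from elements whose class list includes 'price'.

    The class name is an assumption about the page layout.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._in_price = "price" in classes

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            # Assumes dots are thousands separators, e.g. "Rp 62.500".
            digits = re.sub(r"[^\d]", "", data)
            if digits:
                self.prices.append(int(digits))


# Hypothetical page fragment, standing in for a downloaded product listing.
SAMPLE_HTML = """
<div class="item"><span class="name">Beras 5 kg</span><span class="price">Rp 62.500</span></div>
<div class="item"><span class="name">Beras 5 kg (promo)</span><span class="price">Rp 58.900</span></div>
"""

parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.prices)
print(wayback_url("http://example.com/rice", "20130301"))
```

Running the same parser over snapshots from different dates would, in principle, give a historical price series, provided the page layout stayed stable enough for one parser to handle.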

The findings from the analysis were posted on a GitHub page, and Max Richman - the group's Data Ambassador - has written a great post about the work in the group on InterMedia, "Data Volunteering with International Organizations at the World Bank".

Read notes and documentation from the "Scraping Websites to Collect Consumption and Price Data" project team here.

What can the World Bank's supplier profiles tell us about risk management?

The World Bank contracts with an enormous number of companies, and is always looking for better and more efficient ways to do background checks. Can the World Bank and other agencies analyze publicly available data to gain a broader, more comprehensive understanding of their suppliers and use the information as proxies for risk management? This group seemed to suggest that the answer is yes.

The team also found some interesting trends that could be explored further. One example was that the World Bank Board of Directors is busiest approving projects in June, at the end of the fiscal year. The team had a theory that the board may not have the necessary time to research each proposal, opening the possibility for more corruption to slip through the cracks. An extra incentive to look into this was found in the data visualization below, which shows that approvals have become increasingly likely to take place in June in recent years:

Francis Gagnon, who worked on this exercise, wrote two great blog posts about the efforts of this team: "It starts with the data" & "Diving with a view".

Read notes and documentation from the project teams here and here.

Can budgets reveal what collection of staff skill sets and resources work best for UNDP?

UNDP brought a unique data set to the data dive, and asked the volunteers whether an analysis of staffing and programme budget data could help reveal what combination of skill sets make the organization's country offices most effective. After a bit of data munging (converting data into the appropriate format) and definition exploration, the team visualized some of the basic data (using R and SAS) that was necessary to get on with the task at hand. The most impressive visualization was probably this one:

The team found that key drivers of efficiency were mostly related to characteristics of the project, and not of the staff working on it. However, analyzing the staff characteristics in the data, the team found that the probability of a project's staff spending more than their allocated budget was directly proportional to the staff's average number of years of service, the total number of staff involved, and the novelty of the project.

Read notes and documentation from the "UNDP Resource Allocation" project team here.

Can mobile surveys provide useful data about poverty in Latin America?

Looking at data from both traditional household surveys and from mobile surveys conducted in Honduras and Peru, this group tried to identify whether the two types of data could be compared in terms of quality. Household surveys are the only truly recognized method for data collection by many development organizations, so the possibility of comparing the results collected through them with data collected through the (often competing) alternative method - mobile surveys - was intriguing. At Global Pulse we've looked at the feasibility of conducting rapid mobile surveys to get periodic "snapshot" results about well-being from communities, and thus were eager to see what the volunteers came up with in their analysis.

The group found that mobile surveys do provide interesting results. For instance, more respondents answer when approached by a real person, but there are differences in the answers: around three times more people admit that something bad has recently happened in their life in a follow-up mobile survey than in a face-to-face interview.

The group mentioned two possible explanations:

  1. Selection bias: people who have had a bad thing happen to them are more likely to answer.
  2. Shame/Privacy concern: people under-report in a face-to-face interview.


The team also found that mobile, impersonal follow-up surveys may provide more frequent and better data than face-to-face interviews. However, people who had bad experiences may be more likely to answer those surveys than others. Ideally, further testing should be done, asking identical questions via different methods and then analyzing the answers.
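As an illustration of what "analyzing the answers" could involve, here is a sketch of a standard two-proportion z-test, which asks whether the share of "yes" answers differs between two survey modes by more than sampling noise would explain. The counts are hypothetical, chosen only to mirror the roughly threefold gap mentioned above, and the choice of test is an assumption, not the group's method.

```python
from math import erfc, sqrt


def two_proportion_z(yes_a, n_a, yes_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts (NOT the survey data): respondents reporting that
# something bad recently happened, by survey mode.
mobile_yes, mobile_n = 90, 300        # 30% in the mobile follow-up
face_yes, face_n = 100, 1000          # 10% in the face-to-face interview

z, p_value = two_proportion_z(mobile_yes, mobile_n, face_yes, face_n)
print(f"z={z:.2f}, p={p_value:.3g}")
```

Note that a test like this only shows that the two modes give different answers. It cannot tell selection bias apart from under-reporting, which is why asking identical questions of comparable groups through both methods would be needed.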

Read notes and documentation from the "Latin America Poverty Analysis from Mobile Surveys" project team here.

Can you use social networking analysis tools to forecast project risk?

This project team was focused on diving into World Bank project data. They started by defining what elements make a project successful, and then divided the projects according to sectors and themes. The projects were then visualized in a way that displayed their size and their perceived success, suggesting that social network analysis tools could indeed provide useful insights to forecast project risk.

Read more about the analysis and findings from this team in renoust's slideshow, Network Analysis Applied to Project Risks Identification, below. 


Can You Use Simple Heuristic Auditing to Sniff Out Discrepancies in Expenditure Data?

What do you do when you have the information but don’t know if it contains signals about potential fraud- and corruption-related risk? You ask a team of data scientists to look into it - and they will develop, in two days, a prototype tool to match documents to conduct the fraud analysis at the field level!

Read notes and documentation from the "Can You Use Simple Heuristic Auditing to Sniff Out Discrepancies in Expenditure Data?" project team here.

All in all, the DataDive was inspiring, and the insight provided was invaluable. We at Global Pulse are keen to follow up on some of the newly unearthed ideas and methodologies and test them further in our labs. As noted by one World Bank staffer at the end of the weekend, after all the final presentations were made: what was achieved in this one weekend was equal to a year's worth of work.

We are truly grateful to all those who participated in this exploration with us and our colleagues from the World Bank and UNDP. In closing, we'll again share the words of Jake Porway from DataKind and Neil Fatom:


René Clausen Nielsen is a big data analytics strategist, working with Global Pulse and UNDP/Millennium Campaign on a social media analysis project.
