A Big Data Exploration Project: Mining online data for insights on child marriage
This is Part II of a series of posts on a collaboration between UN Global Pulse and the Packard Foundation. The collaboration emphasized the “learning by doing” process of designing a big data project, and tested the value of leveraging new sources of digital data to understand public perceptions. Stay tuned for Part III of this series, which will recount the key learnings and reflections from the experience.
UN Global Pulse recently completed a project, with the support and guidance of the Packard Foundation, to explore the value of various online data sources as a way of gaining new insights on child marriage in Ethiopia and India. The project also benefitted from valuable contributions from colleagues in UNICEF and UNFPA as subject matter experts.
The goal was to determine to what extent digital data analyses can shed new light onto traditional practices, such as child marriage. The collaboration focused not only on the data analysis and results, but also on distilling the process of designing a “big data for development” project from start to finish.
This blog outlines the design process: the steps taken to identify the topic area for exploration and to develop the structure for executing the project.
I. Choosing a Topic
When deciding to embark on a big data for development project, a few questions need to be answered. What are the existing data gaps related to the issue you are tackling? What are possible alternative data sources that could fill those gaps or serve as proxies? As guidance, it can also be helpful to look at examples of what has been previously researched.
We started with an ideation workshop together with program staff from the Packard Foundation, and shared different inspirational examples related to their fields of activity. Individual sessions were organized to understand and pinpoint issues such as pain points, data gaps or data-driven aspirations.
We explored the following questions in each individual session:
- What are your programmatic priorities? What questions keep you awake at night?
- What are the data sources and metrics currently used in your programmatic area(s)? What are your data gaps? Is there information you wish you could gather/monitor faster, cheaper or more accurately?
- What kinds of quantitative data would be helpful to your work/programmes? What is the ideal timeline / granularity level for your data to be useful? (Examples: transaction/purchase records, mobile phone calling patterns, financial transactions)
- What additional social information or qualitative data would be helpful to your work/programmes? (Examples: Awareness of a safe sex campaign by young women, perceptions of stability in a specific area, impressions about level of service at a clinic)
- How could this potentially benefit your programmes?
To conclude the ideation session, each team worked on drafting the skeleton of several project ideas. Here is an example we shared with them:
Figure 1: Template for quick conceptualization of a problem
The workshop yielded 17 proposals and we set out to determine which project to pursue based on their feasibility, potential utility and innovation potential. Selection criteria included:
- Solves real problems defined or requested by development partners
- Access to, and availability of, data and partners
- Strong probability of near-term success
- Strong probability of adoption
- High potential impact
- Breakthrough potential (i.e. testing something new)
Based on the above-mentioned criteria, we chose the topic of child marriage. One of the main reasons was the limited availability of existing official data, including rates, public perceptions, and policy change impacts. While there is a great deal of digital data worth exploring on the subject, from news media content to Wikipedia entries and websites with legislative documents, digital information on child marriage is not always easy to access or to analyse effectively in ways that could be relevant for programme staff.
With these considerations in mind, the project was framed as exploratory, trying to answer the question, “Can easily accessible digital data provide new insights about child marriage?” Data innovation, particularly when testing the utility of big data for gaining new insights on a global development topic, is an exploratory process that may not always yield concrete results.
Initial discussions are crucial in shaping expectations and structuring ideas for a big data for development project. This first step creates an understanding of the specific needs of partners and provides an opportunity to give initial feedback based on the lessons learned from previous data innovation projects. For this collaboration, we selected the topic of child marriage because the project focused on testing something new that could pave the way for inexpensive, easy-to-execute data innovation projects in almost any programmatic area in the future.
II. Planning the Project
Once a topic is selected, it is recommended to a) conduct a quick assessment of the landscape of digital data relevant to the topic at hand, and b) perform a short feasibility test to determine whether there is any relevant “signal” in the chosen data source(s).
In our case, we did a cursory overview of the digital landscape on child marriage. How many tweets are sent on the topic? How many Facebook users discuss issues related to child marriage in the two selected countries? What are the trend lines from Google Trends for keywords like "child marriage" and "Ethiopia"? We found that both globally and for India, there was more information on the topic than expected. However, in the case of Ethiopia, data was scarce, which could in part be attributed to the lack of information on the topic in Amharic, the country's official language.
Once the digital landscaping phase was completed, we began planning the project. The overall process of planning a project like this includes the following broad stages:
- Engage topical experts who understand the problem thoroughly
- Create lists of relevant keywords to look for in text-based data sources
- Explore the data
- Assess potential utility of each source.
III. Engaging Stakeholders
Developing a big data project typically engages three types of roles or profiles:
- Domain Experts: The domain expert has sound sectorial expertise, but might not have a pre-existing understanding of the possibilities in working with big data. The domain expert is critical in the analysis phase of a project, as their understanding of the topic will be crucial in deciphering any trends or insights revealed in the data.
- Research Managers, or “Translators”: The research manager understands the constraints and opportunities faced by both the domain expert and the data scientist and is able to ensure that a project is progressing smoothly.
- Data Scientists and Engineers: The data scientist understands how to use and maximize the analytical tools available to work with big data, but might not have a strong background in the particular sector being investigated.
Although the aim of the project was to explore the potential of new data sources in providing insights into child marriage, getting child marriage experts involved was essential in understanding the issue and learning the vocabulary of the subject.
Our go-to domain expertise came from the Packard Foundation’s program staff. We also reached out to UN colleagues in UNFPA and UNICEF. In doing so, we engaged some of the foremost child marriage experts in the world.
Throughout the consultations, we asked the domain experts the following questions to help refine the project and home in on the type of data we should be exploring:
- In your opinion, what are the 3-5 most important countries and regions when it comes to gaining more information/insight about child marriage? Is it possible to prioritize or rank these countries/regions?
- In your experience, what medium do girls, and more importantly child brides, use to find or share information regarding their situation?
- Who are the main stakeholders in the dynamic system that either enable child marriage or work to prevent it (e.g., politicians, religious leaders, non-profit organizations)?
- What 3 insights and/or new datasets would help you the most in your work?
- What 3 insights and/or new datasets would help the most at the policy level?
- Do you know of popular online sources in our chosen countries (India and Ethiopia), such as online news sites or discussion forums, that publicly discuss the practice of child marriage?
In addition to domain expertise, it can also be valuable to identify data science and analytical support partners. A few ways to source this type of extra capacity include:
- Partnering with a university course or team of students
- Pro-bono data science or data visualization expert volunteers through networks such as DataKind or the Tableau Service Corps
- Interns and volunteers
For this project, through the UN Online Volunteer platform, we received assistance from a great independent partner, Akash Shetye, on much of the coding for analysing Wikipedia data.
Even relatively simple big data innovation projects need thorough planning and contingency plans. It is important to remember that many different stakeholders may need to be involved in one or more of the steps of a project. Ongoing consultations are necessary to evaluate whether changes in approach, data sources or expectations are needed. Even a well-drafted timeline may encounter delays. Therefore, when planning a project and identifying the stakeholders to involve, make it clear from the outset that timeframes and methods might need to be adapted along the way.
IV. Digging into the Data: Create a List of Keywords
Because this particular project was focused on exploring text-based data, building a keyword taxonomy was a critical step. One of the main obstacles in text analysis is language, meaning:
- The linguistic system of language. The project intended to analyse four languages: English, Hindi, Marathi, and Amharic.
- The descriptive language. What words do people use in talking about child marriage? Do experts use the same terminology as non-experts? Are there local variations?
Thanks to our group of experts, a baseline of relevant, topical words for English was quickly created. Expert translations for Marathi, spoken predominantly in Western India, and Amharic, the official language in Ethiopia, took longer to obtain.
Figure 2: Child marriage keywords in English, Hindi, Marathi and Amharic
Preparing for text analysis, that is, building the queries needed to search a large database for documents, posts or articles about a particular issue, may take half of the time allocated for a project. Querying requires topical knowledge, local knowledge, and linguistic proficiency.
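As a rough illustration of how a keyword taxonomy can be turned into a text query, the sketch below stores terms per language and checks a document against them. It is not the code used in the project: the English terms are illustrative examples, the empty lists stand in for expert translations, and the names `KEYWORDS`, `normalize`, `build_pattern` and `match_keywords` are hypothetical.

```python
import re
import unicodedata

# Hypothetical taxonomy. The English terms are illustrative only, not the
# project's actual keyword list; the other languages are left empty as
# placeholders for expert translations.
KEYWORDS = {
    "en": ["child marriage", "child bride", "early marriage", "forced marriage"],
    "hi": [],
    "mr": [],
    "am": [],
}


def normalize(text):
    """Apply Unicode NFKC normalization, case-fold, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def build_pattern(terms):
    """Compile one regex that matches any term as a whole word or phrase."""
    if not terms:
        return None
    # Longest terms first, so longer phrases win over shorter overlapping ones.
    escaped = sorted((re.escape(normalize(t)) for t in terms), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")


def match_keywords(text, lang="en"):
    """Return the sorted, de-duplicated taxonomy terms found in `text`."""
    pattern = build_pattern(KEYWORDS.get(lang, []))
    if pattern is None:
        return []
    return sorted(set(pattern.findall(normalize(text))))
```

The matching is exact, so inflected forms such as plurals would have to be listed as separate terms. The word-boundary check is also tuned for English; scripts such as Devanagari or Ethiopic would likely need their own tokenization, which is one reason linguistic proficiency matters at this stage.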
V. Identifying Data Sources and Running Analyses
Keywords in hand, the next step was to identify what data sources were accessible, available and relevant. The following sources of data were explored for this particular project:
- Google Trends: Google Trends provided information on ‘when’ most searches were made regarding child marriage. Because it is easy to use, it was selected as the starting point.
- Wikipedia: Being one of the richest text-based datasets in the world, Wikipedia was analyzed through a programmatic approach, meaning specific code was written to reveal insights about child marriage from the Wikipedia pages in the four languages mentioned above. As an Internet encyclopaedia, Wikipedia is one of the most visited websites in the world for knowledge gathering. The child marriage experts involved in the project also confirmed that most child protection officers use Wikipedia. Because it is crowd-sourced and changes are logged, its edit history can also be analysed to pinpoint times when a topic is publicly debated or negotiated.
- News Media: For an analysis of news media content, we leveraged Quid, a platform that looks at English-language news and automatically groups news articles by topic. While regional or local newspapers are not always captured in Quid, the global English-language media was a valuable resource.
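To make the idea of reading Wikipedia's logged changes more concrete, here is a minimal sketch that counts revisions per month for one page and flags unusually busy months. It is an assumption about how such an analysis could look, not the code written for the project. The fetch step uses the public MediaWiki API (`action=query`, `prop=revisions`); the function names, the `User-Agent` string and the "twice the median" threshold are all hypothetical choices.

```python
import json
import statistics
from collections import Counter
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_revision_timestamps(title, lang="en", limit=500):
    """Fetch up to `limit` recent revision timestamps for a page (needs network)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    url = f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"
    # Hypothetical User-Agent; replace with one that identifies your project.
    request = Request(url, headers={"User-Agent": "revision-activity-sketch/0.1"})
    with urlopen(request) as response:
        data = json.load(response)
    page = data["query"]["pages"][0]
    return [rev["timestamp"] for rev in page.get("revisions", [])]


def edits_per_month(timestamps):
    """Count revisions per calendar month from ISO 8601 timestamps."""
    # The first seven characters of e.g. "2014-03-12T09:30:00Z" are "2014-03".
    return Counter(ts[:7] for ts in timestamps)


def busiest_months(timestamps, factor=2.0):
    """Return months with at least `factor` times the median monthly edit count.

    The median is taken over months that have at least one edit; the
    threshold is an illustrative assumption, not a validated rule.
    """
    counts = edits_per_month(timestamps)
    if not counts:
        return []
    median = statistics.median(counts.values())
    return sorted(month for month, n in counts.items() if n >= factor * median)
```

In use, something like `busiest_months(fetch_revision_timestamps("Child marriage"))` would return a list of `YYYY-MM` strings to inspect by hand. A burst of edits only suggests that a topic was being debated; a domain expert would still need to check what happened in those months.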
Social media information was not prioritized for this particular project as a key data source. While sources like Facebook and Twitter can be good data sources for measuring the effectiveness of advocacy campaigns (there are plenty of social media campaigns dedicated to raising the issue of child marriage that have been headed by our colleagues at UNFPA, UNICEF and others), the aim of this project was to gain new insights on the causes and instances of child marriage, and to test less popular data sources.
The process of accessing, preparing and analysing different data sources requires time and expertise. Therefore, when starting a big data exploration project, it is recommended to rank the expected utility of data sources based on the needs of the project and then prioritize the ones with the most potential value. Often, ease of access, geographic scope, and value of information will be central. For this project, ease of access and global scope were the main prerequisites, while value of information was the desired outcome.
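One simple way to make such a ranking explicit is a weighted score per source. The sketch below is only an illustration under stated assumptions: the three criteria come from the paragraph above, but the 1-5 rating scale, the weights and all names (`SourceAssessment`, `WEIGHTS`, `utility_score`, `rank_sources`) are hypothetical and would need to be agreed with the project team.

```python
from dataclasses import dataclass


@dataclass
class SourceAssessment:
    """Ratings for one candidate data source, each on an assumed 1-5 scale."""
    name: str
    ease_of_access: int
    geographic_scope: int
    information_value: int


# Hypothetical weights: access and scope are weighted more heavily here
# because they act as prerequisites; adjust them to the project's needs.
WEIGHTS = {
    "ease_of_access": 0.4,
    "geographic_scope": 0.4,
    "information_value": 0.2,
}


def utility_score(source, weights=WEIGHTS):
    """Weighted sum of the ratings for one source."""
    return sum(getattr(source, criterion) * weight
               for criterion, weight in weights.items())


def rank_sources(sources, weights=WEIGHTS):
    """Return the sources ordered from highest to lowest expected utility."""
    return sorted(sources, key=lambda s: utility_score(s, weights), reverse=True)
```

For example, `rank_sources([SourceAssessment("source A", 5, 5, 2), SourceAssessment("source B", 2, 3, 5)])` places "source A" first under these weights. The value of the exercise lies less in the numbers than in forcing the team to state its priorities before committing time to any one source.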