Accurate estimates of population demographics are vital in order to understand social and economic inequalities, and are essential to UNICEF’s work, as knowing where the most vulnerable children and families live is important for resource allocation. Traditional methods of collecting such estimates, however, are both time-consuming and expensive. Here, we explore a complementary approach.
How is data traditionally gathered?
The traditional method for collecting data is the survey. Say we want data on how wealth is distributed in a country: where do poor families live, and where do rich families live? The classical solution is to send out investigators who go from household to household, typically visiting thousands of them, and interview the inhabitants about their spending and consumption patterns, the assets they own, and many other factors.
This data collection procedure takes a lot of time. First, you have to train your interviewers to ask the right questions. For example, “how much money do you earn?” is a bad question, as people rarely disclose the correct amount. A much better question is: “how much money did you spend on food last week?” Second, you have to decide which households the interviewers should visit; visiting too few poor households will bias your estimate. The survey team has to settle on a correct mix of poor, middle-class, and rich households. Naturally, we would get the best estimates if we visited every household in a country, but this is infeasible. Third, you need to wait for the interviewers to visit all these households, which usually takes a couple of weeks or months. Eventually, after waiting a couple of months, you get an answer to your question: where do poor families live?
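To see why the mix of households matters, here is a minimal sketch in Python. All numbers are made up for illustration, and the function names are hypothetical; it is not the procedure any particular survey team uses. It compares a naive average over all sampled households with an average that weights each group by its assumed share of the population.

```python
def naive_mean(strata):
    """Average over every sampled household, ignoring group sizes."""
    values = [v for _, sample in strata for v in sample]
    return sum(values) / len(values)


def stratified_mean(strata):
    """Average of group means, weighted by each group's population share."""
    total_share = sum(share for share, _ in strata)
    weighted = sum(share * (sum(sample) / len(sample)) for share, sample in strata)
    return weighted / total_share


# Hypothetical weekly food spending, as (population share, sampled values).
# Poor households are 60% of the population but only 2 of the 10 interviews.
survey = [
    (0.6, [10, 12]),                    # poor
    (0.3, [30, 34, 32]),                # middle class
    (0.1, [100, 90, 110, 100, 100]),    # rich
]

print("naive estimate:     ", naive_mean(survey))
print("stratified estimate:", stratified_mean(survey))
```

With too few poor households in the sample, the naive estimate lands far above the weighted one, which is the bias described above.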
This gives us a static snapshot of the country, but say we want to know how it changes from year to year. Then we have to repeat the procedure all over again. Surveys generate amazingly rich data, but there must be a different, faster, and less invasive approach. Can we use existing technology to bridge the gap between surveys and keep development estimates up to date?
One solution is to use mobile phones to collect data: instead of physically going from household to household, we could call people and ask them questions over the phone (phone surveys). However, think of the last time you actually took the time to complete one. If most people do not respond, this idea isn’t viable. As such, at UNICEF Innovation we are exploring alternative approaches.
Experimenting with new data sources
At the Office of Innovation, we are testing approaches that will allow us to keep development estimates up to date. We are currently testing three different methodologies to estimate and map development indicators. The first approach uses data from mobile phones. This has previously been documented to work well in specific settings (see more here); however, mobile phone data can be biased and is difficult to get access to. The second approach focuses on satellite images, which give us a complete picture of entire countries. Nonetheless, analyzing images with deep learning algorithms can sometimes leave you with more questions than answers (understanding why a deep learning algorithm returns a specific value is not always easy). The third approach is based on some of our previous work with social media data (see more here). Social media data is generally easy to get access to but is typically biased towards the wealthiest. Nevertheless, we have previously shown that it can be applied to map unemployment. All three approaches are currently being tested in our MagicBox initiative.
The question we are asking here is: can we use social media traces to map the wealth of regions? Partnering with researchers from the MIT Media Lab, we set out to explore whether aggregated and anonymized social media data from Twitter can be applied to estimate the Human Development Index (HDI) of nine countries. More specifically, can we use Twitter to estimate the subnational HDI of the following set of highly diverse countries: Brazil, Colombia, Costa Rica, Indonesia, Mexico, Nepal, Nigeria, Pakistan, and Poland?
We recently presented this work at the International Conference on Computational Social Science, showcasing how a few key variables extracted from social media data can be used to estimate HDI to a good degree. The variables we looked into are very straightforward. We use the popularity of the technology, i.e. how large a fraction of the residents living within a region are Twitter users. We complement this with activity data describing when people from different regions use the platform, and with regional mobility data quantifying travel between municipalities (see below figure for an example). Our results further show that this approach works even for countries where Twitter is not popular. For instance, in Nepal, where Twitter has few users, the most important variable is popularity: more users means more smartphones, which in turn means more wealth. For a country like Poland, it is not only popularity but also patterns of usage (when and how you tweet) that are important. Overall, we see many cultural differences.
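As a rough illustration of the popularity variable, the sketch below computes the fraction of residents in each region who are Twitter users and fits a straight line relating that fraction to HDI. This is only an assumption-laden example: the region figures are invented, the function names are hypothetical, and ordinary least squares on a single variable stands in for whatever model the study actually used, which is not described here.

```python
def twitter_popularity(users, residents):
    """Fraction of a region's residents who are Twitter users."""
    return users / residents


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented regions: (Twitter users, residents, survey-based HDI).
regions = [
    (1_000, 100_000, 0.60),
    (4_000, 200_000, 0.70),
    (1_500, 50_000, 0.80),
]

popularity = [twitter_popularity(u, r) for u, r, _ in regions]
hdi = [h for _, _, h in regions]

intercept, slope = fit_line(popularity, hdi)

# Estimate HDI for a region without a recent survey (hypothetical numbers).
new_region_popularity = twitter_popularity(2_500, 100_000)
print("estimated HDI:", intercept + slope * new_region_popularity)
```

In practice, activity and mobility variables would enter as additional features, and a model fitted in one country would need to be checked carefully before being applied to another.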
As previously mentioned, this is only one approach to estimating socio-economic indicators; we are experimenting with multiple others (see more here). In the future, we plan to test the robustness of these methods, compare the various approaches to each other, and test the possibility of merging them into one large model.
UNICEF needs to understand where the most vulnerable families and children live so that we can direct resources to where they are most needed. If we are successful, operationalizing such models will enable us to bridge the data gap between surveys, giving policy-makers and humanitarian organizations access to continuously up-to-date data.