Thoughts on the Google Flu Trends debate

Alex Rutherford
Apr 3, 2014
Back in 2008, Google launched Google Flu Trends, a tool that used records of how often Google search engine users searched for certain terms to predict outbreaks of the flu. This was remarkable for several reasons. First of all, this was one of the first high-profile uses of Big Data (Google search data easily qualifying under the 3 V’s: velocity, variety and volume) for public health. What’s more, the signal from Google’s search queries had the advantage of giving a more timely warning of a flu outbreak than the usual detection method of the Centers for Disease Control and Prevention (CDC), which is to collect the number of physician visits. The developers of the tool published their method and findings in Nature and developed an online tool for others to see flu predictions. Follow-up work has used Google search data to understand Dengue fever outbreaks, patterns in drug use, economic indicators (and also here) and unemployment.

Fast forward five years, however, and a recent article in Science reappraised the Google Flu Trends methodology and found that it displayed a seasonal inaccuracy, consistently overestimating flu prevalence. Many, especially within the Public Health community, have used the failings of Google Flu Trends to criticise Big Data more generally for its biases and lack of rigor (The Atlantic listed some of the more hyperbolic headlines in their Defense of Google Flu Trends).


Changing conditions + static modeling = less accuracy

Is this a timely riposte to Big Data hubris? Will this be the beginning of the end for Big Data? I don't think so - it is clear that data science is here to stay. But we can use this opportunity to learn some valid lessons, as practitioners of data science. 
To gather some insights we should examine what went wrong. Firstly, given the seeming robustness of the methodology, why did the recent model predictions fare so badly?
The answer is that while the model stayed the same, the reality it sought to describe did not. To understand this, we have to look more closely at exactly which variables the model predicts and which variables it uses to do so. 
The input is the proportion of all Google searches that match the 45 flu-related search terms which best predict the data, and the output is the proportion of patients visiting physicians who display influenza-like illnesses. So if, over time, the total number of physician visits or the total number of Google searches changes dramatically, without a corresponding change in the visits or searches that are actually due to flu in flu season, the model will start to make false predictions.
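To make this concrete, here is a minimal sketch of how a fixed model reacts when the ratio it is fed shifts for reasons unrelated to flu. Everything in it is an assumption made for illustration: the straight-line model, the slope and intercept, and the search counts are invented and are not the published Google Flu Trends model or real data.

```python
def query_share(matching_searches: int, total_searches: int) -> float:
    """Model input: share of all searches that match the flu-related terms."""
    return matching_searches / total_searches


def predict_ili_share(share: float, slope: float = 10.0, intercept: float = 0.0) -> float:
    """Hypothetical static model: maps query share to share of physician visits."""
    return intercept + slope * share


# Calibration period: 200 matching searches out of 100,000 gives about 0.02.
baseline = predict_ili_share(query_share(200, 100_000))

# Same amount of real illness, but more people search symptoms ('Dr Google'):
# the numerator grows, so the unchanged model now overestimates.
more_symptom_searches = predict_ili_share(query_share(300, 100_000))

# Same amount of real illness, but overall search traffic doubles:
# the denominator grows, so the unchanged model now underestimates.
more_total_searches = predict_ili_share(query_share(200, 200_000))

print(baseline, more_symptom_searches, more_total_searches)
```

In both altered scenarios the true level of illness is the same as in the calibration period; only search behaviour has changed, yet the prediction moves.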
Of course these kinds of systematic changes are likely to crop up. Over time, as people rely on 'Dr Google', the related traffic through its search engine increases. As our original post on this subject back in 2011 noted, Google Flu Trends was better at predicting nonspecific flu-like respiratory illnesses than flu, so increasing public awareness of symptoms might prompt changes that create biases. Or perhaps with changes in healthcare legislation, it becomes easier for people to visit their doctor. New guidelines might lead doctors to diagnose the same symptoms differently. The point is that with any of these drifts in the underlying behaviour of the population over time, the model’s accuracy also drifts.
Herein lies another common way that biases and inaccuracies can creep into data-driven modeling. Often an input or output variable is not a direct measure of something but rather a proxy for it. Whether it is a good proxy or not is a question of judgement and context. Many traditional surveying methods implicitly make similar assumptions, but the distinction between direct and indirect measures must always be borne in mind. The Google Flu Trends model attempted to predict the proportion of people visiting physicians displaying flu-like symptoms. As long as the underlying propensities for doctors to diagnose symptoms as flu and for people to visit a doctor remain the same, the model will continue to make accurate predictions.

Reproducibility matters

These shortcomings were somewhat exacerbated by the fact that the exact terms chosen to be used in the model were not described in the paper, meaning that researchers wishing to reproduce, verify and potentially build upon the work were not able to do so. Currently, reproducibility of scientific results is a hot topic (see The Economist’s leader from last year) and several recent high profile cases of misleading and even fraudulent results have led to more rigorous standards for transparency and reproducibility.
While Lazer et al definitively show that current predictions from Google Flu Trends are best not trusted, does this mean that the emergent promise of Big Data should be abandoned? Resolutely no, and by the authors’ own admission Google Flu Trends and Big Data more generally are of great societal value. What is more, using Google Flu Trends alongside traditional CDC data restores predictive accuracy.

Lessons to learn & a research code of ethics

There are a number of lessons to learn from Google Flu Trends, which also dovetail with the research code of ethics to which we subscribe at Global Pulse.
Firstly, we want our research to be implemented and put into action by development practitioners, but ongoing monitoring and evaluation still needs to be performed to assess efficacy. Rigor is crucial when critical decisions need to be made. Had Google Flu Trends been seen as a tool to be used indefinitely, with a dedicated team to monitor it, rather than as a one-off deployment representing a proof of concept, it is likely that the modeling would have been updated and it would have stayed out of the headlines.
Secondly, there need to be more open-source outputs. New tools and methodologies should be as accessible as possible to the people who can make the best operational use of their insights, and to do this software and materials should be easily shared. Transparency in data analysis should also be paramount, since users of data analysis tools must have full confidence in any conclusions or insights drawn.
Of course transparency in raw data only extends as far as individual privacy can be protected and we advocate strongly for protection of personally identifiable data. Likewise, private companies that are taking part in data philanthropy or working with NGOs or UN agencies as part of a CSR program can reasonably be expected to be wary of sharing commercially sensitive data.
Finally, big data is no panacea. Just because social media data exhaust is available does not mean that it can or should replace traditional sources of data. 
As the authors of 'The Parable of Google Flu' note, the most effective use of socially generated data is in supplementing existing official sources, in this case the CDC data. They write: ‘If you are 90% of the way there, at most you can gain that last 10%, what is more useful is to understand the prevalence of flu at very local levels, which is not practical for the CDC to widely produce.’ 
In this example and in many other instances, the granularity of social data can be a helpful addition to official sources of information.
We recognise this at Global Pulse and so, before starting a big data innovation project with a partner or UN agency, we conduct a feasibility study to appraise the suitability of Big Data sources in tackling the particular problem statement. In most cases the expectation is not that the big data will replace the official sources of information, but that it may provide an opportunity to collect information more cheaply or faster, or altogether provide a new layer of insight on the topic.
So these new sources of information should not automatically be proposed as a replacement for traditional data; rather, big data should complement traditional sources (Patrick Meier addresses this very well in the disaster response context). And Lazer et al show that the combination of lagged CDC data and Google Flu Trends data outperforms either one by itself.
Big Data continues to show great promise in the development sector, as well as in other sectors. Global Pulse welcomes such studies offering vital assessments of how and when it should be used, and practical lessons for best practice.

Google Flu Trends, three questions to ask:


1. Was the original Google Flu Trends model flawed?

The original Google Flu Trends methodology and validation were sound. The method began by examining nine different locations in the US and comparing all Google searches from those areas with the proportion of physician visits from patients displaying influenza-like illnesses, as reported by the CDC with a lag of some weeks. By examining the time series of weekly search data and physician data from 2003-2007, the researchers were able to learn which of the 50 million terms that people searched for correlated strongly with the flu. The model findings could fairly be judged robust since they are based on a long time period and on nine different locations. The more widely applicable a finding is, the better.
The selection of the search terms to go into the model represents one of the first stumbling blocks in Big Data research: when confronted with a large number of variables, it is possible to find seemingly significant correlations which simply occur by chance (a point made very well by Nassim Taleb). However, the authors were careful to compile a list of terms which predicted the data well, but were also logically connected with the flu outbreak. Crucially, however, the exact terms were not specified, making the findings unreproducible.
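The chance-correlation problem is easy to reproduce with a toy experiment. The sketch below is an illustration under invented assumptions, not the Google analysis: the "flu signal" and every candidate "search term" are pure random noise, and the sizes (20 weeks, 2,000 candidates) are arbitrary and far smaller than the 50 million terms mentioned above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
weeks = 20

# A made-up weekly "flu signal" and 2,000 unrelated random "search terms".
flu_signal = [rng.gauss(0, 1) for _ in range(weeks)]
candidates = [[rng.gauss(0, 1) for _ in range(weeks)] for _ in range(2000)]

abs_correlations = sorted(abs(pearson(c, flu_signal)) for c in candidates)
typical = abs_correlations[len(abs_correlations) // 2]  # median candidate
strongest = abs_correlations[-1]                        # best of the bunch

print(f"typical |r| = {typical:.2f}, strongest |r| = {strongest:.2f}")
```

None of the candidates has any real connection to the signal, yet the strongest one looks like a respectable predictor. This is why insisting that terms also be logically connected with flu is a sensible safeguard.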

2. Could the problems with Google Flu Trends have been avoided with better testing of the model?

The model testing was robust. Once the Google researchers had found the search terms which best fit the data, they tested their model on new, unseen data for 2007-8; the data used for this type of ‘blind test’ is known as a validation set in machine learning. This avoided another pitfall for models that learn from data: fitting too closely to known data, so that the model doesn’t perform well when applied to new data that looks slightly different.
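The idea of a blind test can be sketched in a few lines: fit on the earlier part of a time series only, then measure the error on later observations the fit never saw. The numbers below are made up and the model is a plain one-variable least-squares line, chosen for brevity; neither is taken from the original paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between predictions and observations."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)


# Invented observations in time order (arbitrary units, not real data).
query_share = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ili_share = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

split = 6  # fit on the first six periods, hold out the last two
slope, intercept = fit_line(query_share[:split], ili_share[:split])

train_mae = mean_abs_error(query_share[:split], ili_share[:split], slope, intercept)
holdout_mae = mean_abs_error(query_share[split:], ili_share[split:], slope, intercept)

print(f"train MAE = {train_mae:.2f}, held-out MAE = {holdout_mae:.2f}")
```

A held-out error that is much larger than the training error would be a warning sign of overfitting. Note that the split is chronological rather than random, which matches how a forecasting tool is actually used.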

3. How could Google Flu Trends have been deployed differently to continue to make accurate predictions?

One mistake that was made was to leave the Google Flu Trends algorithm unchecked for several years. The online landscape is a dynamic one; the way online services are used is liable to change considerably in the space of a few years and Google constantly makes changes to its platform. Therefore such findings shouldn’t be considered fundamental and unchanging in the way that those discovered in physics or mathematics are.
Instead, the model should have been systematically re-calibrated to ensure it was still making accurate predictions (tools such as the Kalman filter allow for continuous updating of the parameters). Perhaps the continuous underlying changes over time, such as growth in overall searches, could have been incorporated into the model with an extra linear term. However, it is likely that single one-off changes to the underlying search engine algorithm or platform would require a complete recalibration.
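As a rough illustration of what continuous updating could look like, the sketch below applies a scalar Kalman filter to a single regression coefficient that is assumed to drift as a random walk. This is one possible setup chosen for brevity, not a description of anything Google did; the noise variances, the drift from 2.0 to 3.0 and the synthetic inputs are all invented.

```python
import math


def kalman_slope_update(slope, variance, x, y, process_var, obs_var):
    """One predict/update step for y = slope * x + noise, with a random-walk slope."""
    variance = variance + process_var                 # predict: slope may have drifted
    gain = variance * x / (x * x * variance + obs_var)
    slope = slope + gain * (y - slope * x)            # correct using the new observation
    variance = (1.0 - gain * x) * variance            # uncertainty shrinks after the update
    return slope, variance


steps = 200
static_slope = 2.0                   # calibrated once, then never revisited
slope_est, var_est = static_slope, 1.0

for t in range(steps):
    true_slope = 2.0 + t / (steps - 1)       # relationship drifts from 2.0 to 3.0
    x = 1.0 + 0.5 * math.sin(0.3 * t)        # synthetic input signal
    y = true_slope * x                       # noise-free observation, for clarity
    slope_est, var_est = kalman_slope_update(
        slope_est, var_est, x, y, process_var=1e-3, obs_var=1e-2
    )

static_error = abs(static_slope - true_slope)
filtered_error = abs(slope_est - true_slope)
print(f"static error = {static_error:.2f}, filtered error = {filtered_error:.3f}")
```

The filtered estimate follows the drifting relationship while the static coefficient is left behind. A gradual drift like this suits the random-walk assumption; an abrupt platform change would likely still call for the complete recalibration mentioned above.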
Alex Rutherford is a data scientist at Pulse Lab New York.
Image: A flu shot by Greg Hinson, via creative commons
