Forecasting Word Model: Twitter-based Influenza Surveillance and Prediction

Abstract

With the growing popularity of social media, a large amount of information is shared online, allowing social media users to follow various real-world events. In particular, social media-based infectious disease surveillance has attracted increasing attention. In this work, we focus on influenza, a common topic of communication on social media. The underlying hypothesis of this work is that several words, such as symptom words (fever, headache, etc.), appear in advance of a flu epidemic. Consequently, past word occurrences can contribute to estimating the current number of patients. To employ such forecasting words, we first estimate the optimal time lag for each word based on its cross-correlation with the number of patients. We then build a linear model consisting of word frequencies at different time points for nowcasting and forecasting influenza epidemics. In experiments using 7.7 million tweets from August 2012 to January 2016, the proposed model achieved the best nowcasting performance to date (correlation ratio 0.93) and practically sufficient forecasting performance (correlation ratio 0.91 for 1-week-ahead prediction and 0.77 for 3-week-ahead prediction). These results demonstrate the effectiveness of word time shifts for predicting future epidemics using Twitter.
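The two steps described in the abstract (per-word lag estimation by cross-correlation, then a linear model on the time-shifted word frequencies) can be illustrated with a minimal sketch. This is not the released implementation: the function names `best_lag` and `shift_columns`, the `max_lag` search window, the synthetic "fever" and "cough" series, and the use of plain least squares are all assumptions made for illustration.

```python
import numpy as np

def best_lag(word_freq, patients, max_lag=8):
    """Return (lag, r): the lag in 0..max_lag that maximizes the Pearson
    correlation between word_freq[t - lag] and patients[t]."""
    word_freq = np.asarray(word_freq, dtype=float)
    patients = np.asarray(patients, dtype=float)
    chosen_lag, chosen_r = 0, -np.inf
    for lag in range(max_lag + 1):
        x = word_freq[: len(word_freq) - lag]  # word series, shifted back by `lag`
        y = patients[lag:]                     # patient counts aligned with it
        r = np.corrcoef(x, y)[0, 1]
        if r > chosen_r:
            chosen_lag, chosen_r = lag, r
    return chosen_lag, chosen_r

def shift_columns(freqs, lags):
    """Build X with X[i, w] = freqs[start + i - lags[w], w], where
    start = max(lags), so every row only uses observed past values."""
    freqs = np.asarray(freqs, dtype=float)
    start = max(lags)
    n_weeks = freqs.shape[0]
    cols = [freqs[start - lag : n_weeks - lag, w] for w, lag in enumerate(lags)]
    return np.column_stack(cols), start

# Synthetic weekly data (hypothetical): patients follow "fever" by 3 weeks
# and "cough" by 1 week.
rng = np.random.default_rng(0)
fever = rng.random(200)
cough = rng.random(200)
patients = np.roll(fever, 3) + np.roll(cough, 1)

# Step 1: estimate one lag per word.
lags = [best_lag(word, patients)[0] for word in (fever, cough)]

# Step 2: fit a linear model (with intercept) on the shifted frequencies.
X, start = shift_columns(np.column_stack([fever, cough]), lags)
y = patients[start:]
A = np.column_stack([X, np.ones(len(y))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(lags, coef)
```

On real data the model would be fitted on one period and evaluated on a held-out one; the sketch fits and inspects the same synthetic series only to show the alignment of lagged features.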

COLING 2016 paper (acceptance rate: 134/1,039 = 12.9 % (oral))


Hayate ISO, Shoko WAKAMIYA, Eiji ARAMAKI

Dataset

click here to download dataset

Code

See our code release on GitHub (coming soon). Our code is a wrapper around scikit-learn's CountVectorizer, so you can easily apply our method to other tasks.
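Until the code is released, the following sketch shows one way weekly word frequencies could be derived with scikit-learn's CountVectorizer. It is an assumption about the general approach, not the authors' wrapper: the `weekly_tweets` dictionary, the week labels, the fixed two-word vocabulary, and the normalization by tweet count are all hypothetical, and the toy English tweets rely on the default tokenizer, which a real corpus may need to replace.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: tweets grouped by week.
weekly_tweets = {
    "2015-W01": ["I have a fever", "fever and headache today"],
    "2015-W02": ["feeling fine", "headache again"],
}

# Restrict counting to candidate forecasting words (assumed list).
vocabulary = ["fever", "headache"]
vectorizer = CountVectorizer(vocabulary=vocabulary)

weeks = sorted(weekly_tweets)
freqs = []
for week in weeks:
    tweets = weekly_tweets[week]
    counts = vectorizer.fit_transform(tweets)          # sparse (n_tweets, n_words)
    totals = np.asarray(counts.sum(axis=0)).ravel()    # occurrences per word
    freqs.append(totals / len(tweets))                 # normalize by tweet volume

freq_matrix = np.vstack(freqs)  # rows: weeks, columns: words in `vocabulary`
print(weeks)
print(freq_matrix)
```

The resulting week-by-word matrix is the kind of input on which per-word time lags can then be estimated.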

Forecasting Words

Fever

Social Computing Lab.
8916-5 Takayama-cho, Ikoma, Nara 630-0192, JAPAN
Hayate ISO
hyate.iso@gmail.com