In a widely cited study, Johan Bollen, Huina Mao and Xiao-Jun Zeng claim
that
“… collective mood states derived from large-scale Twitter feeds are correlated to the value of
the Dow Jones Industrial Average (DJIA) over time. … We find an accuracy of 87.6% in predicting
the daily up and down changes in the closing values of the DJIA …”
The media have responded to this study with a mix of adulation and credulity. Perhaps the
narrative presented by Bollen et al. is appealing because it assuages our suspicions
that Twitter is a frivolous waste of time;
or perhaps it fits with the ‘needle in the haystack’ technophiliac fantasy;
or perhaps it empowers the hoi polloi, who now claim the mantle of controlling the Dow Jones Index by
the vagaries of their mood.
Whatever the appeal of the paper, the story continues to resonate in the internet echo chamber, with ripples
still appearing now, some 18 months after the original publication. Among those reporting on this paper
without any hint of skepticism are
The Telegraph,
The Daily Mail,
USA Today,
The Atlantic,
Wired Magazine,
Time Magazine,
CNBC,
CNN,
All Things Considered,
On the Media,
and a long tail of blogs, newspapers, etc.
Given the entirely unskeptical reception the Bollen paper
has received, there is a clear need for a critical evaluation of it, expressed in terms that
can be understood by those with no formal statistical training.
The principal problems with this paper are:
- The authors exhibit a level of sloppiness that taints the integrity of the results. From
basic accounting mistakes to appalling methodological flaws, these errors call into question whether
any of their results can be trusted.
- The advertised results, e.g. the purported forecast accuracy of their system, are biased ‘by
selection.’ Effectively, the authors have picked winners after the race is run, citing the results
of the race as unbiased estimates of true merit, without untangling the effects of luck.
- The advertised 86.7% forecast accuracy is suspect in vacuo, since it would
yield the greatest quantitative strategy ever discovered. That this would have
been discovered by newcomers to the world of quantitative finance, and that the strategy
depends only on two- to six-day old public information beggars the imagination. There is no
sensible physical model of how such a large effect could exist, nor any reason it
would have passed undetected until now.
There are roughly three parts of this paper beyond the introductory material:
- The sentiment analysis tools employed are not insane, in that they correctly detect that
people are, in general, happy around Thanksgiving, and uneasy before an election, for example.
- Some of the mood scores found by the sentiment analysis tools are purportedly correlated
with changes in the DJIA, according to a ‘Granger causality’ analysis.
- The raw mood scores are turned into forecasts of daily DJIA movements. Accuracy of the
system is ‘confirmed’ by looking at some extra data (a ‘hold out’ set) which
was not used in the training of the predictive models.
I will tackle the last two findings in turn; the first finding is mostly devoid of specific predictive
claims.1
Because the paper gives only loose technical details, and the data used are not widely available
(collecting all Twitter feeds over a 1 year period is a technically challenging feat),
it is impossible to definitively refute the claims; rather they can only be cast into serious doubt.
the Granger Causality tests and Table II
The second ‘finding’ of Bollen et al. is that of purported statistical significance in a
Granger causality test. This is supposed to establish
the ability of raw Twitter mood data to forecast changes in the DJIA index. There are
numerous technical reasons why such an analysis might malfunction. However, none of them need be
invoked here because the authors make a much more basic statistical blunder, that of not correcting
for multiple hypothesis testing.
A classical statistical test spits out a ‘p-value’, which is, roughly, the probability of seeing
evidence at least as strong as that observed, assuming
some condition that you would like to rule out. A p-value balances the amount of evidence and its
strength. If the resultant p-value is indeed small, smaller than some ‘sacred value’, usually taken to
be 0.05, one claims that the ‘null hypothesis’, the condition assumed as part of the test, is unlikely,
or is ‘rejected’. In the Granger causality analysis being performed here, the ‘null hypothesis’, the
hypothesis the authors wish to reject, is that the Twitter based signal has no forecasting ability on
DJIA, and is effectively independent from it.
Bollen et al., in Table II of their paper, commit the statistical sin of performing
many such tests (49 of them), and then attributing statistical significance to those that have a
small p-value
(they display the p-values in boldface if they are less than 0.10, attach two stars to those
less than 0.05, etc.).
If one were to perform ten million such tests, and the null hypothesis were true
(i.e. if Twitter did not predict DJIA in any way), one would expect to have one million
resultant p-values less than 0.1, printed in boldface in one’s enormous table. Similarly,
one would expect to have one million p-values between 0.317 and 0.417, a hundred thousand
between 0.8349 and 0.8449, etc. The presence of many small p-values in this scenario is
simply due to chance ‘bad luck’ under the null hypothesis.
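The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the assumption that null p-values are independent and uniform on [0, 1]; the function names `expected_count` and `simulated_count` are mine, not anything from the paper. Applied to the 49 tests of Table II, it suggests one should expect about five p-values below 0.10 by luck alone.

```python
import random


def expected_count(n_tests, lo, hi):
    """Expected number of p-values in [lo, hi) when every null is true."""
    return n_tests * (hi - lo)


def simulated_count(n_tests, lo, hi, seed=0):
    """Count uniform draws in [lo, hi); the draws stand in for null p-values."""
    rng = random.Random(seed)
    return sum(lo <= rng.random() < hi for _ in range(n_tests))


# Ten million tests: about one million p-values below 0.1.
print(expected_count(10_000_000, 0.0, 0.1))
# A band of width 0.01 holds about a hundred thousand.
print(expected_count(10_000_000, 0.8349, 0.8449))
# Forty-nine tests, as in Table II: about 4.9 below 0.1 by chance.
print(expected_count(49, 0.0, 0.1))
# One simulated table of 49 null p-values (count varies with the seed).
print(simulated_count(49, 0.0, 0.1))
```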
For comparison, here is a plot of the
empirical distribution of the
p-values from Table II.
Under the null hypothesis, as one performs more and more statistical tests one expects the
p-values to be ‘uniformly distributed’, and thus the empirical CDF plot would fall on the \(y=x\) line,
plotted in red here. If the null were violated, i.e. if the Twitter mood data exhibited
‘causality’ on the DJIA movement, we should see a lot of p-values on the left side of the plot,
and the empirical CDF would hug the left side and top of the plot, bowing away from the diagonal.
However, by my eye, the data are consistent with the null hypothesis, and the
7 p-values less than 0.10 are no more remarkable
than the 13 that are greater than 0.90.

Performing a Bonferroni correction
for multiple tests, none of the p-values from Bollen’s Table II are considered
statistically significant at the 0.10 level. For the layman, the conclusion to be drawn
is that the evidence is not inconsistent with all the Twitter moods and lags being independent
from movements of the DJIA, and some of them looking better than others due to chance.
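For those who want to check the Bonferroni claim, here is a minimal sketch. It uses only two numbers from the discussion of Table II (49 tests, smallest p-value 0.013); the helper name `bonferroni_significant` is hypothetical.

```python
def bonferroni_significant(p_value, n_tests, alpha):
    """True if p_value is below the Bonferroni-adjusted threshold alpha / n_tests."""
    return p_value < alpha / n_tests


n_tests = 49        # number of tests in Table II
smallest_p = 0.013  # smallest p-value reported in Table II
alpha = 0.10        # the loosest significance level used in the table

threshold = alpha / n_tests
print(f"per-test threshold: {threshold:.5f}")
print("smallest p-value survives:", bonferroni_significant(smallest_p, n_tests, alpha))
```

Since even the smallest p-value fails the adjusted threshold, every other entry in the table fails it as well.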
the Forecast Model
The third, and perhaps most galvanizing, ‘finding’ of Bollen et al. is of an
“accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA.”
This is formulated in terms of cross-validation of a Neural Net model,
using training and test (or ‘hold out’) sets of data.
The goal is to simulate how this model would be used in the real world, trading real money:
- Train the model using all the data you have up to this very minute;
- Going forward, input each day’s new Twitter data into the model to get predictions to make trades.
- Repeat this process, retraining the model as is expedient or necessary, and trading the forecasts
every day.
Typically when one trains a model on data, the model’s own estimate of how well it understands or
can predict that data is optimistic. This is why one tests a model methodology by training a
model, then validating its predictive ability on data that was not used in building the model.
This is commonly accepted practice. However, Bollen’s finding is broken in so many ways:
They got the number wrong. They report an accuracy of 87.6% in the abstract and twice in
the paper; they report the same figure as 86.7% twice, including in Table III.
Since the accuracy estimates are based on 15 (!) days of test data, the correct value
is the smaller one, 86.7% corresponding to the fraction 13 / 15.
The incorrect figure is widely quoted in the media, and was used by Johan Bollen during his
interview with CNN. Not that it
matters, because …
The forecast accuracy is reported with far too many significant figures. If the model had
correctly predicted 12 or 14 days’ directions, instead of
the 13 it did, the number would change by plus or minus
7 percent. For the technically minded, the
standard error on the accuracy figure
is around 9%, and a one-sided 95% lower confidence bound on the
accuracy figure is 72%. For the layman, the upshot is that
it is not inconceivable that the accuracy of the system is as small as
72%, but it looked better in this experiment simply due to
random luck. In all, reporting two significant figures is unwarranted, much
less three. The effect is perhaps minor, but it does not instill confidence in the
authors’ attention to detail.
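The figures above can be roughly reproduced with a normal approximation to the binomial. This is a sketch under that assumption (the approximation is crude for only 15 observations, and other interval methods would give somewhat different bounds); `accuracy_summary` is a hypothetical helper name.

```python
import math


def accuracy_summary(hits, n, z=1.645):
    """Sample accuracy, its standard error, and a one-sided lower bound.

    z = 1.645 is the one-sided 95% quantile of the standard normal.
    The bound uses the simple normal (Wald) approximation.
    """
    p_hat = hits / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    lower = p_hat - z * se
    return p_hat, se, lower


p_hat, se, lower = accuracy_summary(13, 15)
print(f"accuracy {p_hat:.3f}, standard error {se:.3f}, lower bound {lower:.3f}")
# One more or one fewer correct day moves the accuracy by 1/15.
print(f"step size {1 / 15:.3f}")
```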
The accuracy figure is biased upward.
The reported 86.7% accuracy is
the maximal accuracy achieved for the 8 models listed in Table III of the
paper. As in Part II, where the smallest p-values were reported as ‘significant’, when they could
be explained due to chance, here there is an (upward) bias in the sample accuracy numbers when
selecting based on those same quantities.2
As an analogy, imagine if the 8 models listed in Table III truly
had no predictive ability, and thus a forecast accuracy of 50%. You can view them as fair coins.
The probability that a single fair coin would land heads 13 or more times out
of 15 is 0.37%. This probability is so small it makes us
doubt the assumption that the models in Table III are really non-predictive.
However, if one were to flip 8 fair coins, the probability that
at least one of them would land heads 13 or more times out of
15 is 2.9%.
While this is still small, it is less damning of the assumption of non-predictive models.
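The coin-flip probabilities can be computed exactly. The sketch below assumes the coins are fair and independent of one another, as in the analogy; the names `tail_prob` and `prob_any` are mine.

```python
from math import comb


def tail_prob(n_flips, min_heads):
    """P(a fair coin shows at least min_heads heads in n_flips flips)."""
    favourable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favourable / 2 ** n_flips


def prob_any(n_coins, n_flips, min_heads):
    """P(at least one of n_coins independent fair coins reaches min_heads)."""
    return 1.0 - (1.0 - tail_prob(n_flips, min_heads)) ** n_coins


print(f"one coin:    {tail_prob(15, 13):.2%}")
print(f"eight coins: {prob_any(8, 15, 13):.2%}")
```

The same function can be called with a larger number of coins to see how quickly the 'best of many' result becomes unremarkable.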
A similar problem exists with selecting the ‘best’ model based on some sample
statistic, then using that same sample statistic as an estimate of a population parameter.3
Here the forecast accuracy of 86.7% is inflated by the fact that we selected the model based
on the estimate.
And this is only the bias that we can observe from the paper. There is the very real possibility
of unobservable bias, i.e. datamining bias and
publication bias. That is, the authors might have
tried numerous different data treatments and algorithms, evaluating the purported out-of-sample
accuracy, before settling on one where the results were considered sufficiently ‘interesting.’
Continuing the coin flip analogy, if one were to
flip each of 50 fair coins 15 times, the probability that at least one of them
would land heads 13 or more times is
17%. Now the results seem much less interesting.
One cannot prove that the authors biased their results in this way. It just provides a
plausible alternative explanation for the observed ‘effect.’ The authors also did themselves
no favors by using such a tiny sample size: had their model correctly predicted the
direction of the DJIA on 130 out of 150 days instead,
the possible effect of this kind of bias would have been lessened.
The model accuracy seems high compared to the Granger causality results.
The forecast accuracy of 86.7% seems rather high compared to the unconvincing p-values
reported in Bollen’s Table II.
To test this, I perform some
Monte Carlo experiments. For one realization of
the Monte Carlo experiment, I take the returns of the DJIA index
over the period February 28, 2008 to November 3, 2008, and spawn a -1/+1 random variable which
has the sign of the next day’s DJIA log return with probability \(13 / 15\).
I then feed it to R’s grangertest function, with 2 lags, and record the p-value. I repeat this experiment
200 times. The point of this experiment is to get some kind of feeling for what a binary signal
with the purported accuracy would yield in a Granger analysis.
The maximum p-value from 200 Monte Carlo realizations is 2e-05.
Compare this to the smallest of the 49 p-values reported in Table II,
0.013. This is something of an apples-to-oranges comparison because, in
general, you cannot just compare p-values, and the Neural Net model can capture non-linear relationships
that the inherently linear Granger model does not. However, it is very suspicious to me that such an
accurate forecast could be made from raw data about which the Granger tests were so ambivalent.4
An 86.7% forecast accuracy on DJIA’s daily movement would represent the
greatest quantitative strategy ever discovered.
As an illustration, here I perform a Monte Carlo simulation of the historical performance of a
system with the purported forecasting ability. With probability
\(13 / 15\), the strategy gains the absolute
return of DJIA, and otherwise loses that amount. It trades at 1x leverage on the DJIA
from 1970-01-02 to 2012-04-13. Here are the performance plots showing, respectively,
the cumulative return, the daily return, and the drawdown from peak.

Note that under the random seed chosen here, the simulation is on the wrong side of
Black Monday, and thus the results are mildly
pessimistic. However, the annualized Sharpe ratio
of this backtest is \(9.2\mbox{yr}^{-1/2}\), with 95% confidence interval
\([8.9\mbox{yr}^{-1/2},9.5\mbox{yr}^{-1/2}]\).
It doubles its money every 26 weeks.
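Readers without the DJIA history at hand can get a feel for the order of magnitude with a toy version of this simulation. The sketch below is not the backtest described above: it assumes synthetic Gaussian daily returns with 1% volatility and 252 trading days per year, both of which are my assumptions, and `simulate_sharpe` is a hypothetical name. Because the returns are synthetic, it will not reproduce the 9.2 figure; under these assumptions it comes out at about 11.

```python
import math
import random
import statistics


def simulate_sharpe(n_days=10_000, accuracy=13 / 15, daily_vol=0.01, seed=1):
    """Annualized Sharpe ratio of a toy strategy with the given hit rate.

    On each day the strategy gains |market return| with probability
    `accuracy` and loses it otherwise. Market returns are synthetic
    Gaussian draws (an assumption), not actual DJIA returns.
    """
    rng = random.Random(seed)
    strategy_returns = []
    for _ in range(n_days):
        market = rng.gauss(0.0, daily_vol)
        correct = rng.random() < accuracy
        strategy_returns.append(abs(market) if correct else -abs(market))
    mean = statistics.fmean(strategy_returns)
    sd = statistics.stdev(strategy_returns)
    # Scale the daily ratio to annual units, assuming 252 trading days a year.
    return mean / sd * math.sqrt(252)


print(f"annualized Sharpe ratio: {simulate_sharpe():.1f}")
```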
For the layman, the Sharpe ratio is the metric (other than ex post returns!) by which
trading strategies are measured. To put these figures into context, an achieved
(i.e. in real trading, not backtesting) Sharpe ratio of
\(1\mbox{yr}^{-1/2}\) is considered ‘good’; an achieved
value of \(2\mbox{yr}^{-1/2}\) is ‘excellent’; anything north of \(3\mbox{yr}^{-1/2}\)
is the stuff of legend.5
I have read dozens of papers on quantitative strategies and market timing6, and, to
the best of my recollection, have never seen one claim a Sharpe ratio higher than \(4\mbox{yr}^{-1/2}\).
Shen’s analysis of timing strategies, for example,
lists ‘successful’ market-timing strategies with Sharpe ratios on the order of
\(0.5\mbox{yr}^{-1/2}\) to
\(0.7\mbox{yr}^{-1/2}\).
None of the tin-foil hat purveyors of market timing signals one can find on the web claim Sharpe ratios
higher than \(2\mbox{yr}^{-1/2}\), nor do they promise 100% returns in 26 weeks.
Bollen was apparently unaware he had found the philosopher’s stone when he was
quoted as saying:
“… we are hopeful to find … better improvements for more sophisticated market models,” i.e. we
hope to make the model even better.
The putative mechanism for the forecast defies all common sense.
Part of the authors’ argument is that the ‘Calm’ signal from Twitter is predictive of the DJIA
at two to six day lag, and thus they use lagged data from this signal as input to their
forecast model. Somewhat paradoxically, the one day lag of ‘Calm’ does not give significant
Granger p-values in Table II. Somehow, we are to believe, the information content
‘skips a day’ (or more). This is contrary to common sense, and common practice of downweighting
older observations as less relevant. It is particularly hard to imagine how using
two- or three-day old tweets would give one the best market timing model of all time.
Furthermore, because the daily movements of the DJIA are ‘high frequency’
(autocorrelation would be ‘arbed out’), a gap such as this could cause the signal to appear
‘out of sync’.
For example, let P and C stand for ‘panic’ and ‘calm’ in the Twitter ‘Calm’ signal, and let
+ and - mean up and down days for the DJIA. Imagine the following stream of days, where the
DJIA moves exactly as suggested by the ‘Calm’ signal two (market) days prior.7
Calm: P P C C P C P P C ...
DJIA: ... - - + + - + - ...
Because of the delay effect and DJIA’s high frequency nature, in this example the DJIA often
has down days when the ‘Calm’ signal is calm, and up days when it panics, meaning market participants pay more
attention to how the Twitterverse felt two or three days ago than how it feels today.
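The toy stream above can be checked mechanically. The sketch below rebuilds the DJIA row from the ‘Calm’ row at a two-day lag, then counts how often the same-day signal agrees with the market; the variable and function names are mine.

```python
calm = "PPCCPCPPC"  # the example 'Calm' stream: P = panic, C = calm
LAG = 2             # the DJIA follows the signal from two days earlier


def djia_from_lagged_calm(signal, lag):
    """DJIA direction on each day from `lag` onward: up after calm, down after panic."""
    return "".join("+" if mood == "C" else "-" for mood in signal[:-lag])


djia = djia_from_lagged_calm(calm, LAG)
print("DJIA:", djia)

# Compare each DJIA day with the signal on that same day.
same_day = calm[LAG:]
agreements = sum((mood == "C") == (move == "+") for mood, move in zip(same_day, djia))
print(f"same-day agreement: {agreements} of {len(djia)} days")
```

On this stream, the same-day signal agrees with the market on only 2 of the 7 days, even though the lagged signal is right every time.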
Are we to believe that Twitter users are trading on how they felt two or three (but not one) days prior,
and thus moving the market? Or are they predicting the state of the world two or three days ahead of time,
without being able to predict tomorrow? Both of these models are nonsensical.
A more reasoned interpretation of the results is that the two- to six-day lags in the
‘Calm’ signal looked better due to datamining bias,
and any justification for their existence (I have seen none) is ex post story telling.
Moreover, given that the putative effect leads to the best market timing model of all time, and
the signal is based on people’s expression of mood, one would think that people, in general, would
be good at market timing, i.e. do significantly better than random. There is no evidence that this
is the case.
The form of the accuracy claim is almost impossibly general.
The forecast accuracy is quoted in terms of the predictive accuracy of the
“daily up and down changes in the closing values of the DJIA,” full stop.
Are we to accept that this accuracy claim holds both in bear and bull markets?
In periods of high volatility and low? Regardless of whether tomorrow’s DJIA
return is, in absolute value, 2 percent or 0.05 percent? It is not clear how
such a broad claim could be extrapolated from performance during
15 trading days in December 2008.
Employing Hanlon’s Razor, I am led to conclude that
Bollen, Mao and Zeng are statistical naifs. This is consistent with the egregious methodological
flaws evidenced in their paper. If it were merely a matter of the authors’ reputation, we could
agree that mistakes were made and move on. However, Bollen and Mao
have teamed up with a hedge fund to
‘capitalize’ on this market timing model.8 Thus unsuspecting real investors can lose real money
if the advertised forecast accuracy fails to exist in the real world. Moreover, given the fee
structure of hedge funds, investors in said fund are probably signing up for ‘random walk
minus costs’, which seems like a bad deal.
It would be too simple to fault the media’s fawning reaction to this paper. After all,
the whole story is stuffed full of new-technology-catnip, and there has not apparently
been an accessible critical debate of its merits. In my opinion, the peer-review process
has failed miserably here, and journalists can choose only to either re-report the finding as
gospel fact or ignore it entirely.
Disclosure: the author has no holdings in Twitter, holds broad market ETFs which intersect with the
DJIA, has never made money in market-timing, and would short the Twitter hedge fund if shorting it
were possible.