Over a year ago, I wrote a letter to the editor of the Journal of Computational Sciences, urging the retraction of Bollen, Mao, and Zeng’s paper, “Twitter Mood Predicts the Stock Market.” Since JoCS is an Elsevier journal, one does not simply email the editor. Rather, one has to register with the Elsevier author system and submit LaTeX source code of a letter, along with supporting documents, an author bio, etc. The rules of this journal limit letters to the editor to four pages, so I distilled the main arguments into two: first, that the Granger causality tests presented in BMZ’s paper are consistent with datamining, and present no evidence for a connection between Twitter and the Dow Jones Index; and second, that the quoted predictive accuracy of the forecast model is so high it would clearly violate the Efficient Market Hypothesis, invalidating the experience of thousands of researchers and practitioners over the last 60 years, and so is likely to be erroneously reported. I included references to BMZ’s failed attempts to commercialize their patented techniques with Derwent.
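The datamining concern can be illustrated with a minimal sketch. Everything below is synthetic and assumed: made-up sample sizes, pure-noise "mood" series, and a plain correlation test standing in for the full Granger regression. The multiple-comparisons arithmetic is the same either way: run enough unrelated predictors against returns, and roughly 5% of them will clear the conventional p < 0.05 bar by chance alone.

```python
import math
import random

random.seed(0)

n_days = 250       # about one trading year of daily observations (assumption)
n_series = 1000    # candidate "mood" predictors, all pure noise (assumption)

# Simulated daily returns: random, so no series can genuinely predict them.
returns = [random.gauss(0.0, 1.0) for _ in range(n_days)]

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Two-sided test at the 5% level via the normal approximation:
# under the null, r is roughly N(0, 1/n), so reject when |r| > 1.96/sqrt(n).
threshold = 1.96 / math.sqrt(n_days)

hits = sum(
    1
    for _ in range(n_series)
    if abs(pearson_r([random.gauss(0.0, 1.0) for _ in range(n_days)],
                     returns)) > threshold
)

print(f"{hits} of {n_series} noise series are 'significant' at p < 0.05")
```

A handful of "significant" p-values is guaranteed whenever many tests are run; without accounting for the number of tests performed, those hits carry no evidence of a real connection.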
Following the strictest protocol, the editor of JoCS duly sent this letter out for review. After roughly seven months, the editor sent an email informing me that the reviewers had suggested publication of my letter once I revised my manuscript. Included in the email were the reviews of Reviewer #2 and Reviewer #3. Apparently, deciding whether to publish a letter to the editor asking to retract a paper was too much for Reviewer #1 to stomach.
The reviewers’ comments were more than fair. If my arguments were unclear, I was more than happy to reword them and provide additional evidence to get my point across. So I edited my letter to the editor, and re-sent it. I had a glimmer of hope that the editor would read my letter to the editor.
The editor then sent my letter to Reviewer #4, and, within two months or so (the equivalent of overnight in journal-time), the editor sent me a rejection notice with their review, quoted below. This review—this review is sensational. As one afflicted with Hamlet Syndrome, I admire Reviewer #4’s conviction. As someone too often in search of the right phrase to dismiss a crap idea, I take delight in Reviewer #4’s acid pen: I have never seen a reviewer so viciously shit-can a paper before. Reviewer #4 tore my letter to pieces, then burned the pieces. Then poured lye on the ashes. Then salted the earth where the lye sizzled. Then burnt down the surrounding forest, etc.
My admiration is conflicted, though, and for more than the obvious reason, for I have come to suspect that Reviewer #4 is none other than—drumroll, please—Johan Bollen! Consider the following: the editor of JoCS receives an official letter suggesting the Journal retract its most widely cited article. Before undertaking such a decision, the editor might decide to contact the original authors to see if they had made a mistake. The original authors are incensed at the idea and ask to write a rebuttal. From the editor’s point of view, problem solved: let the two parties write nastygrams to each other, and err on the side of conservatism. Call it a day!
Moreover, Reviewer #4
- has taken an awful lot of time to write a review of a letter to the editor, much more than would any disinterested party, and has much more intimate knowledge of the BMZ paper than some random reviewer. Would some random reviewer know, to three decimal places, the number of citations of a paper? Would they have followed the history of the pre-print of the paper?
- has no uncertainty regarding their decision. It would be hard to find so biased a reviewer among disinterested parties.
- makes no mention of markets or quantitative finance. No mention is made of the EMH, or how the efficiency faithful can have gotten it so wrong all these years.
- has only a shaky understanding of statistics beyond what they found in a Google search for ‘Bonferroni’, and is disingenuously dismissive of datamining bias—even citing the silly papers by Preis et al. as positive evidence that regressing the kitchen sink against the Dow Jones is a productive activity.
- writes fawningly of the original reviewers and the editor.
Taken as a whole, this sounds an awful lot like Johan Bollen, struggling to defend his paper from retraction, and himself ignominy. And this is a shame, because I have never thought much of his work, but have to admit he can sink a paper like a pro.
So that is where the story ends.
In the spirit of pseudo-anonymity, I have changed all mentions of my name to ‘Lowly Worm’, which Reviewer #4 manages to misspell:
The author Lowly Werm, hereafter referred to as LW, critiques the “Twitter mood predicts the stock market” paper by Bollen, Mao and Zeng, hereafter referred to respectively as the BMZ paper and BMZ, on the basis of what seem to be 3 arguments:
- LW is of the opinion that corrections should have been made for multiple hypothesis testing, and claims that, had they been made, the results would not have achieved statistical significance.
- LW claims that some of the BMZ results lead to “impossible” or implausible (hypothesized) outcomes, and therefore they must be wrong.
- LW claims that attempts to commercialize the “technology” or “system” failed, hence the BMZ results are invalid.
On this narrow basis LW (1) demands that the paper be retracted, (2) questions the judgement of the BMZ authors, the original expert reviewers, the JoCS editor, and all those who have positively cited the BMZ paper, and (3) makes near libelous accusations of junk science.
Below we show that the letter’s arguments are either specious or irrelevant to the actual BMZ paper. The demands for retraction are entirely exorbitant. The letter is not suitable for publication in a peer-reviewed journal such as JoCS.
1 “Multiple hypothesis testing”
First, we respectfully ask the reviewers and editor to imagine an alternate reality where the BMZ authors had reported the same p-values, but spread out across multiple papers. The criticism of multiple hypothesis testing and LW’s main argument against the BMZ paper would have been moot. However, the published p-values would have been identical. How can this be a reasonable argument against the BMZ paper?
Second, a correction for multiple hypothesis testing is not warranted by the actual BMZ methodology and results. Although all p-values are shown together in Table 2 for the reader’s convenience, BMZ did not attempt to reject the general, and frankly not very informative, null-hypothesis that “all Twitter mood data is independent of future movements of DJIA” as LW claims. BMZ tested the effects of 6 mood dimensions (columns of Table 2) that are part of an established psychological model of human mood states. These mood dimensions were selected in advance because they are known to have distinct effects on human performance and decision-making. Furthermore, all 4 p-values < 0.05 occur in the same column of Table 2, the one corresponding to the “Calm” mood dimension. The pattern is perfect and highly unlikely to occur by chance. BMZ outline this methodological approach from the very start of the paper, and draw careful, certainly not exaggerated, conclusions from the results, which they verify in subsequent sections of the paper.
Third, we can refer the reviewers and editor to a number of highly cited articles published in respected journals that oppose the use of Bonferroni corrections [5,6]. Perneger (1998) argues “… that Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference.” Some of the counter-examples in these papers nearly exactly match the situation of the BMZ paper (please see Section 7, “Reading Highlights”). Indeed, many experts strongly object to the use of corrections for multiple hypothesis testing, such as the Bonferroni correction, because of (1) the logical inconsistencies that they entail, (2) the increased odds of a type-II error, which, like the type-I error that LW presupposes, is an equally serious error, and finally (3) their damaging effects on the ability of researchers to publish precisely those results that are the most informative and detailed.
Indeed it is truly ironic that this particular criticism is possible only because BMZ acted in good faith and published complete results, in keeping with the best practices of the computational science domain:
- BMZ conducted a thorough investigation of a small number of well-chosen and highly relevant variables (6 psychological mood dimensions) across a small range of lead times (7 days), certainly not a fishing expedition!
- BMZ report all exact p-values.
- BMZ draw careful, not exaggerated, conclusions from these results and conduct additional analysis to confirm whether an effect is actually present or not. (Please read section 2.5 and the conclusion.)
2 “Impossible outcomes”
The BMZ paper is not proposing, promoting, or analyzing a trading system or technology in any shape or form. LW confabulates a hypothetical trading strategy that he erroneously claims is “implicit” in the BMZ paper. From this he argues that the simulated behavior and outcomes of this hypothetical strategy are “impossible” relative to common assumptions and expectations in finance. This argument has no bearing on the validity of the actual BMZ methodology and results. It is a “reductio ad absurdum” combined with a “red herring”, common logical fallacies that have no place in scientific discourse.
LW’s argument is akin to demanding the retraction of a paper which shows indications that certain substances could kill 86.7% of certain cancer cells in vitro, merely from making the observation that an imagined cancer treatment device that he chooses to hypothesize based on 1 of the paper’s most salient results and various news reports thrown together would be too successful to be plausible.
3 Presumed “application of technology” failed
LW assumes - citing press reports! - that the failure of a start-up hedge fund in London, which the authors were reported to collaborate with well after publication of the BMZ paper, retroactively has any bearing on the validity of the results described in this paper. This type of argument from a supposed business application, or “real-world experiment”, is neither appropriate nor relevant in a discussion about the validity of a scientific result.
Should all papers investigating the relative effectiveness of tubular solar panels be retracted because Solyndra went bankrupt?
Furthermore, LW simply can not know which technology may have been used, how it may have been used, by whom, and under what conditions. This point is, at any rate, entirely moot because the BMZ paper does not describe a trading technology or system to begin with.
4 “Extravagant and implausible claims”
LW’s critique is to some degree based on the perceived “unreasonableness”, “inexplicability”, or “implausibility” of the BMZ results. From this he draws the conclusion that somehow an error must have been made, without identifying what that error might be. This is a well-known logical fallacy, the “Argument from incredulity” (http://rationalwiki.org/wiki/Argument_from_incredulity).
First, we do not think that scientists should limit themselves to only publishing results that LW deems “plausible”. BMZ published their paper precisely because it applied a novel computational science approach to measuring various social mood dimensions from social media data (section 2.2 and 2.3) and because some of these measurements exhibited interesting correlations with the financial markets.
Second, to anyone familiar with the computational science literature of the past 5 years, the BMZ result is not nearly as implausible or extraordinary as LW suggests. The BMZ paper largely follows the same methodological framework as its predecessors, e.g. Gilbert (2010), who in fact reports very similar results. Since 2011 there have been numerous publications that show similar results, some from BMZ themselves. In the past year alone, Nature Scientific Reports published several papers on this very topic which all found significant predictive effects of various social media indicators [4,7]. The predictive value of social media sentiment or chatter with regard to other socio-economic indicators, such as box office receipts, elections, etc., has also been demonstrated in the computational science literature.
It is true that the BMZ paper is positioned in the context of a young domain that is still largely in an exploratory phase, and in which few formal, theoretical, or causative models have yet been proposed. Many of these observations may therefore be difficult to accept, explain, and apply from the viewpoint of traditional financial analysis.
The BMZ paper nevertheless introduces a new method of using social media to measure aspects of collective mood, and may have found an unexpected and intriguing connection to the financial markets. Many if not most of the 477 studies that presently cite the BMZ paper do so in a positive manner, because their authors (some absolute authorities) deem the BMZ paper a valid and significant contribution to this emerging field. So did the original reviewers, the many readers, and the many online commentators of the pre-print of this study which has now been publicly available for nearly 3 years and has stood up to considerable public and academic scrutiny.
We show above that the letter’s arguments against the BMZ are either confused, irrelevant, or specious with respect to the actual methodology and claims of the BMZ paper. The letter should not be accepted for publication.
[1] Eric Gilbert and Karrie Karahalios. Widespread worry and the stock market. In Fourth International AAAI Conference on Weblogs and Social Media, pages 58-65, Washington, DC, 2010.
[2] Huina Mao, Scott Counts, and Johan Bollen. Predicting Financial Markets: Comparing Survey, News, Twitter and Search Engine Data.
[3] D. McNair, M. Lorr, and L. Droppleman. Profile of Mood States, 1971.
[4] Helen Susannah Moat, Chester Curme, Adam Avakian, Dror Y. Kenett, H. Eugene Stanley, and Tobias Preis. Quantifying Wikipedia Usage Patterns Before Stock Market Moves. Sci. Rep., 3, May 2013.
[5] S. Nakagawa. A farewell to Bonferroni: the problems of low statistical power and publication bias. Behavioral Ecology, 15(6):1044-1045, 2004.
[6] T. V. Perneger. What’s wrong with Bonferroni adjustments. BMJ (Clinical research ed.), 316(7139):1236-8, April 1998.
[7] Tobias Preis, Helen Susannah Moat, and H. Eugene Stanley. Quantifying Trading Behavior in Financial Markets Using Google Trends. Sci. Rep., 3, April 2013.
7 Reading highlights
Thomas V. Perneger (1998) What’s wrong with Bonferroni adjustments.
BMJ. 1998 April 18; 316(7139): 1236-1238. PMCID: PMC1112991
"When more than one statistical test is performed in analysing the data from a clinical study, some statisticians and journal editors demand that a more stringent criterion be used for "statistical significance" than the conventional P< 0.05. Many well meaning researchers, eager for methodological rigour, comply without fully grasping what is at stake. Recently, adjustments for multiple tests (or Bonferroni adjustments) have found their way into introductory texts on medical statistics, which has increased their apparent legitimacy. This paper advances the view, widely held by epidemiologists, that Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference."
Shinichi Nakagawa (2004) A farewell to Bonferroni: the problems of low statistical power and publication bias.
Behavioral Ecology 15(6): 1044-1045: http://beheco.oxfordjournals.org/content/15/6/1044.full
"…Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had appropriately employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is scientifically more important than the former paper. For example, if one wants to conduct a meta-analysis to investigate an overall effect in a specific area of study, the latter paper is five times more informative than the former paper. In the long term, statistical significance of particular tests may be of trivial importance (if not always), although, in the short term, it makes papers publishable. Bonferroni procedures may, in part, be preventing the accumulation of knowledge in the field of behavioral ecology and animal behavior, thus hindering the progress of the field as science. … Therefore, the use of Bonferroni corrections and the practice of reviewers demanding Bonferroni procedures should be discouraged (and also, researchers should play their part in carefully selecting relevant variables in their study)."
"Even more worryingly, though, it doesn’t seem to make much sense to deem a result significant or not contingent on what other results you were examining. Consider two experimenters: one collects data on three variables of interest from the same group of subjects while a second researcher collects data on those three 6 variables of interest, but from three different groups. Both researchers are thus running three hypothesis tests, but they’re either running them together or sep- arately. If the two researchers were using a Bonferroni correction contingent on the number of tests they ran per experiment, the results might be significant in the latter case but not in the former, even the two researchers got identical sets of results. This lack of consistency in terms of which results get to be counted as real will only add to the confusion in the psychological literature."
Disclaimer: The information provided does not constitute investment advice.