## Oct 21

### Bollen: 1; Worm: 0.

Over a year ago, I wrote a letter to the editor of the Journal of Computational Science, urging the retraction of Bollen, Mao, and Zeng’s paper, “Twitter Mood Predicts the Stock Market.” Since JoCS is an Elsevier journal, one does not simply email the editor. Rather, one has to register with the Elsevier author system and submit LaTeX source of the letter, along with supporting documents, an author bio, and so on. The journal limits letters to the editor to four pages, so I distilled my case into two main arguments: first, that the Granger causality tests presented in BMZ’s paper are consistent with datamining, and present no evidence of a connection between Twitter and the Dow Jones Index; and second, that the quoted predictive accuracy of the forecast model is so high it would clearly violate the Efficient Market Hypothesis, invalidating the experience of thousands of researchers and practitioners over the last 60 years, and so is likely to have been erroneously reported. I included references to BMZ’s failed attempts to commercialize their patented techniques with Derwent.

Following the strictest protocol, the editor of JoCS duly sent this letter out for review. After roughly seven months, the editor sent an email informing me that the reviewers had suggested publication of my letter, pending revisions to my manuscript. Included in the email were the reviews of Reviewer #2 and Reviewer #3. Apparently deciding whether to publish a letter to the editor asking to retract a paper was too much for Reviewer #1 to stomach.

The reviewers’ comments were more than fair. If my arguments were unclear, I was more than happy to reword them and provide additional evidence to get my point across. So I edited my letter to the editor, and re-sent it. I had a glimmer of hope that the editor would read my letter to the editor.

The editor then sent my letter to Reviewer #4, and, within two months or so (the equivalent of overnight in journal-time), the editor sent me a rejection notice with their review, quoted below. This review—this review is sensational. As one afflicted with Hamlet Syndrome, I admire Reviewer #4’s conviction. As someone too often in search of the right phrase to dismiss a crap idea, I take delight in Reviewer #4’s acid pen: I have never seen a reviewer so viciously shit-can a paper before. Reviewer #4 tore my letter to pieces, then burned the pieces. Then poured lye on the ashes. Then salted the earth where the lye sizzled. Then burnt down the surrounding forest, etc.

My admiration is conflicted, though, and for more than the obvious reason, for I have come to suspect that Reviewer #4 is none other than—drumroll, please—Johan Bollen! Consider the following: the editor of JoCS receives an official letter suggesting the Journal retract its most widely cited article. Before undertaking such a decision, the editor might decide to contact the original authors to see if they had made a mistake. The original authors are incensed at the idea and ask to write a rebuttal. From the editor’s point of view, problem solved: let the two parties write nastygrams to each other, and err on the side of conservatism. Call it a day!

Moreover, Reviewer #4

1. has taken an awful lot of time to write a review of a letter to the editor, much more than any disinterested party would, and has much more intimate knowledge of the BMZ paper than some random reviewer. Would some random reviewer know, to three significant figures, the number of citations of a paper? Would they have followed the history of the pre-print of the paper?
2. has no uncertainty regarding their decision. It would be hard to find so biased a reviewer among disinterested parties.
3. makes no mention of markets or quantitative finance. No mention is made of the EMH, or how the efficiency faithful can have gotten it so wrong all these years.
4. has only a shaky understanding of statistics beyond what they found in a Google search for ‘Bonferroni’, and is disingenuously dismissive of datamining bias—even citing the silly papers by Preis et al. as positive evidence that regressing the kitchen sink against the Dow Jones is a productive activity.
5. writes fawningly of the original reviewers and the editor.

Taken as a whole, this sounds an awful lot like Johan Bollen, struggling to defend his paper from retraction, and himself ignominy. And this is a shame, because I have never thought much of his work, but have to admit he can sink a paper like a pro.

So that is where the story ends.

## The Review

In the spirit of pseudo-anonymity, I have changed all mentions of my name to ‘Lowly Worm’, which Reviewer #4 manages to misspell:

The author Lowly Werm, hereafter referred to as LW, critiques the “Twitter mood predicts the stock market” paper by Bollen, Mao and Zeng, hereafter respectively referred to as the BMZ paper and BMZ, on the basis of what seem to be 3 arguments:

1. LW is of the opinion that corrections should have been made for multiple hypothesis testing, and if they had been made, he claims the results would not have achieved statistical significance.

2. LW claims that some of the BMZ results lead to “impossible” or implausible (hypothesized) outcomes, and therefore they must be wrong.

3. LW claims that attempts to commercialize the “technology” or “system” failed, hence the BMZ results are invalid.

On this narrow basis LW (1) demands that the paper be retracted, (2) questions the judgement of the BMZ authors, the original expert reviewers, the JoCS editor, and all those who have positively cited the BMZ paper, and (3) makes near libelous accusations of junk science.

Below we show that the letter’s arguments are either specious or irrelevant to the actual BMZ paper. The demands for retraction are entirely exorbitant. The letter is not suitable for publication in a peer-reviewed journal such as JoCS.

## 1 “Multiple hypothesis testing”

First, we respectfully ask the reviewers and editor to imagine an alternate reality where the BMZ authors had reported the same p-values, but spread out across multiple papers. The criticism of multiple hypothesis testing and LW’s main argument against the BMZ paper would have been moot. However, the published p-values would have been identical. How can this be a reasonable argument against the BMZ paper?

Second, a correction for multiple hypothesis testing is not warranted by the actual BMZ methodology and results. Although all p-values are shown together in Table 2 for the reader’s convenience, BMZ did not attempt to reject the general, and frankly not very informative, null-hypothesis that “all Twitter mood data is independent of future movements of DJIA” as LW claims. BMZ tested the effects of 6 mood dimensions (columns of Table 2) that are part of an established psychological model of human mood states. These mood dimensions were selected in advance because they are known to have distinct effects on human performance and decision-making [3]. Furthermore, all 4 p-values < 0.05 occur in the same column of Table 2, the one corresponding to the “Calm” mood dimension. The pattern is perfect and highly unlikely to occur by chance. BMZ outline this methodological approach from the very start of the paper, and draw careful, certainly not exaggerated, conclusions from the results which they verify in subsequent sections of the paper.

Third, we can refer the reviewers and editor to a number of highly cited articles published in respected journals that oppose the use of Bonferroni corrections [5,6]. Perneger (1998) advocates “… that Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference.” Some of the counter-examples in these papers nearly exactly match the situation of the BMZ paper (please see Section 7 “Reading Highlights”). Indeed, many experts strongly object to the use of corrections for multiple hypothesis testing such as the Bonferroni correction, because of (1) the logical inconsistencies that they entail, (2) the increased odds of a type-II error, which, like the type-I error that LW presupposes, is an equally serious error, and finally (3) their damaging effects on the ability of researchers to publish precisely those results that are the most informative and detailed.

Indeed it is truly ironic that this particular criticism is possible only because BMZ acted in good faith and published complete results, in keeping with the best practices of the computational science domain:

1. BMZ conducted a thorough investigation of a small number of well-chosen and highly relevant variables (6 psychological mood dimensions) across a small range of lead times (7 days) - certainly not a fishing expedition!

2. BMZ report all exact p-values.

3. BMZ draw careful, not exaggerated, conclusions from these results and conduct additional analysis to confirm whether an effect is actually present or not. (Please read section 2.5 and conclusion.)

## 2 “Impossible outcomes”

The BMZ paper is not proposing, promoting, or analyzing a trading system or technology in any shape or form. LW confabulates a hypothetical trading strategy that he erroneously claims is “implicit” in the BMZ paper. From this he argues that the simulated behavior and outcomes of this hypothetical strategy are “impossible”, relative to common assumptions and expectations in finance. This argument has no bearing on the validity of the actual BMZ methodology and result. It is a “reductio ad absurdum” combined with a “red herring”, common logical fallacies that have no place in scientific discourse.

LW’s argument is akin to demanding the retraction of a paper which shows indications that certain substances could kill 86.7% of certain cancer cells in vitro, merely from making the observation that an imagined cancer treatment device that he chooses to hypothesize based on 1 of the paper’s most salient results and various news reports thrown together would be too successful to be plausible.

## 3 Presumed “application of technology” failed

LW assumes - citing press reports! - that the failure of a start-up hedge fund in London, which the authors were reported to collaborate with well after publication of the BMZ paper, retroactively has some bearing on the validity of the results described in this paper. This type of argument from a supposed business application, or “real-world experiment”, is neither appropriate nor relevant in a discussion about the validity of a scientific result.

Should all papers investigating the relative effectiveness of tubular solar panels be retracted because Solyndra went bankrupt?

Furthermore, LW simply cannot know which technology may have been used, how it may have been used, by whom, and under what conditions. This point is, at any rate, entirely moot because the BMZ paper does not describe a trading technology or system to begin with.

## 4 “Extravagant and implausible claims”

LW’s critique is to some degree based on the perceived “unreasonableness”, “inexplicability”, or “implausibility” of the BMZ results. From this he draws the conclusion that somehow an error must have been made, without identifying what that error might be. This is a well-known logical fallacy known as an “Argument from incredulity” (http://rationalwiki.org/wiki/Argument_from_incredulity).

First, we do not think that scientists should limit themselves to only publishing results that LW deems “plausible”. BMZ published their paper precisely because it applied a novel computational science approach to measuring various social mood dimensions from social media data (section 2.2 and 2.3) and because some of these measurements exhibited interesting correlations with the financial markets.

Second, to anyone familiar with the computational science literature of the past 5 years the BMZ result is not nearly as implausible or extraordinary as LW suggests. The BMZ paper largely follows the same methodological framework as its predecessors, e.g. Gilbert (2010) [1], which in fact reports very similar results. Since 2011 there have been numerous publications that show similar results, some from BMZ themselves [2]. In the past year alone, Nature Scientific Reports published several papers on this very topic which all found significant predictive effects of various social media indicators [4,7]. The predictive value of social media sentiment or chatter with regards to other socio-economic indicators, such as box office receipts, elections, etc. has also been demonstrated in the computational science literature.

It is true that the BMZ paper is positioned in the context of a young domain that is still largely in an exploratory phase, and in which few formal, theoretical, or causative models have yet been proposed. Many of these observations may therefore be difficult to accept, explain, and apply from the viewpoint of traditional financial analysis.

The BMZ paper nevertheless introduces a new method of using social media to measure aspects of collective mood, and may have found an unexpected and intriguing connection to the financial markets. Many if not most of the 477 studies that presently cite the BMZ paper do so in a positive manner, because their authors (some absolute authorities) deem the BMZ paper a valid and significant contribution to this emerging field. So did the original reviewers, the many readers, and the many online commentators of the pre-print of this study which has now been publicly available for nearly 3 years and has stood up to considerable public and academic scrutiny.

## 5 Conclusion

We show above that the letter’s arguments against the BMZ paper are either confused, irrelevant, or specious with respect to the actual methodology and claims of that paper. The letter should not be accepted for publication.

## References

[1] Eric Gilbert and Karrie Karahalios. Widespread worry and the stock market. In Fourth International AAAI Conference on Weblogs and Social Media, pages 58-65, Washington, DC, 2010.

[2] Huina Mao, Scott Counts, and Johan Bollen. Predicting Financial Markets: Comparing Survey, News, Twitter and Search Engine Data.

[3] D McNair, M Lorr, and L Droppleman. Profile of Mood States, 1971.

[4] Helen Susannah Moat, Chester Curme, Adam Avakian, Dror Y Kenett, H Eugene Stanley, and Tobias Preis. Quantifying Wikipedia Usage Patterns Before Stock Market Moves. Sci. Rep., 3, May 2013.

[5] S Nakagawa. A farewell to Bonferroni: the problems of low statistical power and publication bias. Behavioral Ecology, 15(6):1044-1045, 2004.

[6] T V Perneger. What’s wrong with Bonferroni adjustments. BMJ (Clinical research ed.), 316(7139):1236-8, April 1998.

[7] Tobias Preis, Helen Susannah Moat, and H Eugene Stanley. Quantifying Trading Behavior in Financial Markets Using Google Trends. Sci. Rep., 3, April 2013.

### Thomas V Perneger (1998) What’s wrong with Bonferroni adjustments.

BMJ. 1998 April 18; 316(7139): 1236-1238. PMCID: PMC1112991

"When more than one statistical test is performed in analysing the data from a clinical study, some statisticians and journal editors demand that a more stringent criterion be used for "statistical significance" than the conventional P< 0.05. Many well meaning researchers, eager for methodological rigour, comply without fully grasping what is at stake. Recently, adjustments for multiple tests (or Bonferroni adjustments) have found their way into introductory texts on medical statistics, which has increased their apparent legitimacy. This paper advances the view, widely held by epidemiologists, that Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference."

### Shinichi Nakagawa (2004) A farewell to Bonferroni: the problems of low statistical power and publication bias.

Behavioral Ecology 15 (6): 1044-1045: http://beheco.oxfordjournals.org/content/15/6/1044.full

"…Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had appropriately employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is scientifically more important than the former paper. For example, if one wants to conduct a meta-analysis to investigate an overall effect in a specific area of study, the latter paper is five times more informative than the former paper. In the long term, statistical significance of particular tests may be of trivial importance (if not always), although, in the short term, it makes papers publishable. Bonferroni procedures may, in part, be preventing the accumulation of knowledge in the field of behavioral ecology and animal behavior, thus hindering the progress of the field as science. … Therefore, the use of Bonferroni corrections and the practice of reviewers demanding Bonferroni procedures should be discouraged (and also, researchers should play their part in carefully selecting relevant variables in their study)."

"Even more worryingly, though, it doesn’t seem to make much sense to deem a result significant or not contingent on what other results you were examining. Consider two experimenters: one collects data on three variables of interest from the same group of subjects while a second researcher collects data on those three 6 variables of interest, but from three different groups. Both researchers are thus running three hypothesis tests, but they’re either running them together or sep- arately. If the two researchers were using a Bonferroni correction contingent on the number of tests they ran per experiment, the results might be significant in the latter case but not in the former, even the two researchers got identical sets of results. This lack of consistency in terms of which results get to be counted as real will only add to the confusion in the psychological literature."

Disclaimer The information provided does not constitute investment advice.

## Aug 29

### No Limits to Garbatrage

It is generally recognized that creating a profitable quantitative trading strategy is difficult. Most putative strategies are likely not profitable even without considering trading costs. Writing buggy code, conducting questionable simulations, or performing faulty statistical tests, on the other hand, are easily achieved. For this reason, I estimate that a majority of proposed quantitative trading strategies are type I errors.

I present here an incomplete catalogue of the various pitfalls of quantitative strategy development, which I collectively refer to as Garbatrage. Unlike arbitrage, garbatrage has no apparent limits. For a junior quant, there is no skill more important than spotting garbatrage.

1. Time Traveling This is the blanket term I use for out-and-out non-causality within trading simulation. Unless your trading plan includes actual time travel, your achieved performance will not match your simulations. An obvious example would be that an off-by-one error means your simulated strategy can view tomorrow’s prices today. Less obvious examples include:

• Classical selection bias, which encompasses backfill bias and survivorship bias. Backfill bias is where positive information from a later date makes it more likely that a stock is included in one’s trading universe in the past; survivorship bias is the opposite, where negative future information has made a company less likely to appear. For example, suppose your trading universe consists of all stocks in the S&P 500 universe as of today. In the distant past many of these stocks were small caps, and are likely to outperform other small caps which are not included in your universe. Often data providers will commit these errors for you, since they tend to backfill missing companies’ data into their databases upon customer request.
• Allowing your strategy to see backwards-adjusted prices. To deal with splits and dividends, retail price data often includes “adjusted close” prices, which match the close at the current point in time. Companies which are successful over time tend to have very small backwards-adjusted prices in the past, an effect one does not observe in real time.
• Trading on stale prices. In general, it is safe to delay, or embargo, data which will be used by your strategy. If you can observe the July closing price of butter in Bangladesh today, then you will still be able to use that information tomorrow. This is not kosher, however, for fill prices: if you have no price information for butter, and your simulation paves over the NA with previous values, you are trading on yesterday’s prices today.
• Thank you, may I have another? This is a pernicious error common in machine-learning strategy development, which I have seen independently invented by several quants, including myself. It works as follows: you train some kind of machine learning model with features known through today, aligned with leading price returns, including returns reflecting tomorrow’s prices. You then use this very fresh model to trade today. This error causes better apparent performance when training over a shorter time history (so the future information is a larger proportion of all data), and when one retrains the model more often (preferably every day). I suspect that Johan Bollen’s twitter predictor is largely built upon this error. (A minimal sketch of this error appears after this list.)

2. Funny Accounting This is a blanket term for errors in measuring the returns and costs of simulated trades. Examples include:

• Trading on stale prices, as discussed above.
• Step 1: Steal Shorts! Quants with no market experience quickly learn about shorting a stock, but often with no understanding of the mechanism involved. (On day 2 of the first quant fund I helped launch, our broker phoned me to ask for a “locate”. I replied, “I’m right here.”) This mechanism includes paying a (sometimes large) fee to borrow the stock. This is often not included in simulations because the data is not widely available.
• Gimme an epsilon! I first saw this error in a paper by Preis, Moat, and Stanley, but suspect it has been independently invented by others. This error consists of treating geometric returns of longs and shorts as symmetric; in reality, it is arithmetic returns which are symmetric. The tailwind this gives simulations is modest unless rebalancing less frequently than, say, semimonthly.
• You bought what? Not all time series can be bought or sold. Nevertheless, I have seen more than one putative strategy which trades the VIX index. The VIX has a long term mean reversion, and a short term negative autocorrelation, and one could easily construct a profitable strategy if one could buy and sell the index. Alas, this is not possible, since long baskets of options leak time value.
• Broken cost models, beyond misunderstanding your broker’s commissions and exchange fees. Simulating the effects of an additional market participant in historical data is difficult, more so at higher participation rates. Most impact models are inaccurate or even systematically biased. This problem is widely recognized, however, and “unforeseen impact” is the quant’s “not me” gremlin, blamed first when real trading underperforms simulations.
3. Are We There Yet? An error commonly seen in statistical work is that different models are tried sequentially on the same data until “significance” is found. This problem is especially pernicious in trading strategies since we collect new data at a rate of one day per day, and any expected trading edge may take years to appear profitable. To complicate matters, often quants debug their code and debug their ideas simultaneously; failure to trade some strategy is not an acceptable option for a quant fund; quants typically do not keep good records of their work, so the degree of overfitting cannot be estimated. While there are some methods for estimating the “datamining bias”, they are largely technical solutions to a social problem, and typically cannot be applied when the number of twiddled knobs is unknown.

4. A common tactic to prevent datamining bias is to split one’s data into “test” and “holdout”, or “in-sample” and “out-of-sample” sets, with all overfitting done on one set, and estimation of future performance done on the other. This is almost as effective a defense as the Maginot Line, since no quant fund will launch a strategy which looks questionable in the “out-of-sample” period. In reality, there is no “in-sample” and “out-of-sample”, there is only “in-sample” and “trading real money on it”.
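To make the ‘Thank you, may I have another?’ error concrete, here is a minimal sketch in R; the single lagged-return feature, the linear model, and the ten-day window are all made up for illustration, not taken from any paper discussed here. Both variants ‘forecast’ pure noise; the leaky one refits each day on a short window that wrongly includes the very return being predicted.

# a minimal sketch of the leakage: refit daily on a short window that
# (wrongly) includes the return we are about to 'predict'
set.seed(101)
n <- 2000
w <- 10                          # short training window: the leak is 1/w of the data
rets <- rnorm(n, sd = 0.01)      # i.i.d. returns: nothing here is predictable
feat <- c(NA, rets[-n])          # feature known on day t: the day t-1 return
bad.pred <- good.pred <- rep(NA_real_, n)
for (tt in (w + 2):(n - 1)) {
    test.x <- data.frame(x = feat[tt + 1])
    # WRONG: training labels run through rets[tt+1], the very return being 'predicted'
    bad.fit <- lm(y ~ x, data = data.frame(y = rets[(tt - w + 2):(tt + 1)], x = feat[(tt - w + 2):(tt + 1)]))
    bad.pred[tt + 1] <- predict(bad.fit, newdata = test.x)
    # RIGHT: train only on labels observable by the end of day tt
    good.fit <- lm(y ~ x, data = data.frame(y = rets[(tt - w + 1):tt], x = feat[(tt - w + 1):tt]))
    good.pred[tt + 1] <- predict(good.fit, newdata = test.x)
}
ok <- !is.na(bad.pred) & !is.na(good.pred)
c(leaky = cor(bad.pred[ok], rets[ok]), honest = cor(good.pred[ok], rets[ok]))

On pure noise, the ‘leaky’ correlation comes out positive while the honest one hovers around zero, and the gap widens as the window shrinks or the refits become more frequent, which is exactly the signature described above.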

Disclaimer The information provided does not constitute investment advice.

## Aug 02

### Moon Patrol, Bone Cancer, Shelby GT 500, Debt, and Garbatrage

A recently released paper by Challet and Ayed, Predicting financial markets with Google Trends and not so random keywords, examines the methodology of the 'Google Trends predicts the Market' paper by Preis, Moat and Stanley. I had previously found that the Preis study was suspect merely on grounds of datamining bias.

Challet and Ayed replicate the Preis study, but using three different entirely uninformative sets of Google Trends search terms to ‘predict’ the DJIA, viz the names of 200 common ailments, 100 classic cars, and 100 classic video games. While Preis et al found that changes in Google searches for debt were a leading indicator of market movement, Challet and Ayed find similar market timing ability of bone cancer, Shelby GT 500, and Moon Patrol. While it is easy to concoct a just-so story regarding Joe Sixpack Googling ‘debt’ and the direction of the American economy, it takes rather more imagination to weave a similar tale about Moon Patrol (and I say this as a huge fan of Moon Patrol).

This would appear to close the book on the Preis study. But then, somewhat improbably, and with no details given, Challet and Ayed claim that the ‘intuition of Preis et al.’ can be applied to ‘suitable assets’ to ‘yield robustly profitable strategies,’ referring to a yet-to-be-published internal report of Encelade Capital written by none other than Challet. That is, after essentially dragging Preis through the datamining mud, Challet and Ayed emerge unbesmirched to offer a completely sanitary trading strategy developed using the same methodology as Preis. Whether this confuses or excites you, you will have to wait until September for Challet’s follow-up study.

Disclaimer The information provided does not constitute investment advice.

## Jul 11

In a previous post, I voiced some skepticism regarding the paper, Quantifying Trading Behavior in Financial Markets Using Google Trends, by Preis, Moat and Stanley. The authors of that study recently made their data and code available, a victory for transparency which I applaud.

The objectives of this reanalysis are:

1. To correct for the suspicious accounting around short sales.
2. To assess whether the results are simply due to data mining.
3. To test drive some methods for assessing strategy performance.
4. To show off some R code.

Here I load the data from github, then separate the trends data and DJIA data into separate xts objects:

require(xts)
require(SharpeR)
# Read the data file. If you have a local copy, this suffices:
#   dat <- read.csv('PreisMoatStanley_ScientificReports_3_1684_2013.dat', sep = ' ')
# otherwise, a workaround for curl/ssl download from github
# (see http://stackoverflow.com/a/4127133/164611 ):
data.url <- "<raw github URL of PreisMoatStanley_ScientificReports_3_1684_2013.dat>"
temporaryFile <- tempfile()
download.file(data.url, destfile = temporaryFile, method = "curl")
dat <- read.csv(temporaryFile, sep = " ")

# peel off DJIA data, and make into an xts
DJIA.data <- dat[, names(dat) %in% c("DJIA.Date", "DJIA.Closing.Price")]
dat <- dat[, !(names(dat) %in% names(DJIA.data))]
# add +1 day to the date to round up to midnight.
DJIA.xts <- xts(DJIA.data[, "DJIA.Closing.Price"], order.by = as.POSIXct(DJIA.data[, "DJIA.Date"]) + 86400)

# peel off dates and make into an xts
# (assuming the date columns are named Google.Start.Date and Google.End.Date after read.csv)
google.dates <- dat[, names(dat) %in% c("Google.Start.Date", "Google.End.Date")]
dat <- dat[, !(names(dat) %in% names(google.dates))]
# the remaining columns are the search term frequencies
search.terms.xts <- xts(dat, order.by = as.POSIXct(google.dates[, "Google.End.Date"]) + 86400)

# clean up
rm(dat, DJIA.data, google.dates)

## Generate the signal

Now, for every search term used, subtract from each week’s search frequency the mean of the previous delta.t weeks’ values. This follows the methodology of the original authors, but is applied by a vectorized function. I spot check my work. After this, I break ties arbitrarily: any centered term less than 1e-5 in absolute value is moved to that value. Later we will use the sign of the signal to determine whether to take long or short positions, so this has the effect of taking a long position in the case of a tie. This is arguably a mild headwind given the long bias of the market, but actually has little effect. If you would like, you can change the TIE.BREAKER and TIE.LIMIT values to explore.

# this function takes a vector, and returns the difference between each
# value and the mean value over the previous 'lag' entries.
running.center <- function(x, lag = 10) {
    x.cum <- cumsum(x)
    x.dif <- c(x.cum[1:lag], diff(x.cum, lag))  # running sums over windows of up to 'lag' entries
    x.df <- pmin(1:length(x.cum), lag)          # the effective window sizes
    x.mu <- x.dif/x.df                          # running means
    x.ret <- c(NaN, x[2:length(x)] - x.mu[1:(length(x) - 1)])
    return(x.ret)
}
# follow the authors in using a 3 week window:
delta.t <- 3

# make the detrended 'signal'
signal.xts <- xts(apply(search.terms.xts, 2, running.center, lag = delta.t), order.by = time(search.terms.xts))
# at this point, do a spot check to make sure our function worked OK
my.err <- signal.xts[delta.t + 5, 10] - (search.terms.xts[delta.t + 5, 10] - mean(search.terms.xts[5:(delta.t + 4), 10]))
if (abs(my.err) > 1e-08) stop("fancy function miscomputes the running mean")
# chop off the first delta.t rows
signal.xts <- signal.xts[-c(1:delta.t)]
mkt.xts <- DJIA.xts[-c(1:delta.t)]  # and for the market

# trading signal; the original authors 'short' the trend:
trade.xts <- -signal.xts
# break ties arbitrarily. anything smaller than a certain absolute value gets moved to the tie-breaker.
TIE.BREAKER <- 1e-05
TIE.LIMIT <- abs(TIE.BREAKER)
trade.xts[abs(trade.xts) < TIE.LIMIT] <- TIE.BREAKER


## Backtest the signal

Now I ‘backtest’ the signal. This is a very simplistic backtesting function, and should not be used for real evaluation of strategies (insert standard legal disclosure here). For the purposes of evaluating a weekly-rebalancing, single instrument strategy where the cost to short is essentially zero, market impact is low, etc., this is a reasonable, if slightly optimistic, estimate of trading performance. It does assume you can trade on the index (instead of an ETF), and also assumes your positions are perfectly sized, and you pay no commissions. If a strategy looked profitable based on this backtest, you would want to go to the next finer level of backtest fidelity, although I doubt it will be warranted in this case. If you would like to test the signal as a magnitude, you can uncomment one line below.

# braindead 'backtest' function
dumb.bt <- function(sig.xts, mkt.xts) {
    if (dim(sig.xts)[1] != dim(mkt.xts)[1])
        stop("wrong row sizes")
    mkt.lret <- diff(log(mkt.xts), lag = 1)
    mkt.rret <- exp(mkt.lret) - 1
    mkt.rret <- as.matrix(mkt.rret[-1])  # chop the first
    sig.xts <- sig.xts[-dim(sig.xts)[1]]  # chop the last
    bt.rets <- xts(apply(sig.xts, 2, function(v) {
        v * mkt.rret
    }), order.by = time(sig.xts))
    return(bt.rets)
}
# backtest the sign:
bt.rets <- dumb.bt(sign(trade.xts), mkt.xts)
bt.lrets <- log(1 + bt.rets)  # compute log returns
bt.mtm <- exp(apply(bt.lrets, 2, cumsum))


## Evaluate performance

Now I take the log returns from the 98 tested search terms’ backtests, and perform t-tests on each of them. I am testing against a two-sided alternative. This seems reasonable, since the trading scheme is so oddly defined: take the sign of the centered search data, then short it. I suspect that the more obvious version of going long this signal was first tested, and found to be lacking. That is, we can assume that one would be happy to either trade on any of these strategies if they looked profitable, or short any of them if doing so also looked profitable. Thus a two-sided alternative. Just using the vanilla t-test ignores possible autocorrelation and heteroskedasticity, which tend to inflate the achieved type I rate. This is not of great concern, since I suspect we will not reject the null anyway. I then Q-Q plot the p-values from the 98 t-tests against a uniform law. Under the null hypothesis that the Google trends data is independent of future DJIA returns, (and ignoring the fact that the backtests are correlated with each other!), the Q-Q plot should fall along the $$y=x$$ line, which I plot in red here. To my eye, this just looks like data mining (the bad kind).

# first: apply a t-test to every column, get the p-values
ttest.pvals <- apply(bt.lrets, 2, function(x) {
    t.res <- t.test(x, alternative = "two.sided")
    p.v <- t.res$p.value
})
# function for Q-Q plot against uniformity
qqunif <- function(x, xlab = "Theoretical Quantiles under Uniformity", ylab = NULL, ...) {
    if (is.null(ylab))
        ylab = paste("Sample Quantiles (", deparse(substitute(x)), ")", sep = "")
    qqplot(qunif(ppoints(length(x))), x, xlab = xlab, ylab = ylab, ...)
    abline(0, 1, col = "red")
}
qqunif(ttest.pvals)


## Compared to random data

Here I spawn an equal number of totally random strategies, backtest them in the same way, perform a t-test, then Q-Q plot the p-values. I expect the results to look just like the above plot. Indeed they do. This suggests that the mild deviation from the $$y=x$$ line seen above is ‘normal’.

set.seed(12345)  # remind me to change my luggage combo ;)
rand.xts <- xts(matrix(rnorm(prod(dim(trade.xts))), nrow = dim(trade.xts)[1]), order.by = time(trade.xts))
# backtest the sign:
rand.rets <- dumb.bt(sign(rand.xts), mkt.xts)
rand.lrets <- log(1 + rand.rets)  # compute log returns
ttest.rand.pvals <- apply(rand.lrets, 2, function(x) {
    t.res <- t.test(x, alternative = "two.sided")
    p.v <- t.res$p.value
})
qqunif(ttest.rand.pvals)


## Evaluation via contingency tables

Here I perform another kind of ‘backtest’: For a given signal, I construct the $$2\times 2$$ contingency table based on the sign of the centered Google Trends signal, and the sign of the leading DJIA weekly return. I then perform an odds-ratio test, and compute the p-value. Again, ignoring correlation across search terms, under the null these p-values should fall near the $$y=x$$ line when Q-Q plotted versus uniformity. Which they do.

# perform oddsratio tests:
require(epitools)
dumb.odds.bt <- function(sig.xts, mkt.xts) {
    if (dim(sig.xts)[1] != dim(mkt.xts)[1])
        stop("wrong row sizes")
    mkt.lret <- diff(log(mkt.xts), lag = 1)
    mkt.rret <- exp(mkt.lret) - 1
    mkt.rret <- as.matrix(mkt.rret[-1])  # chop the first
    sig.xts <- sig.xts[-dim(sig.xts)[1]]  # chop the last
    bt.rets <- apply(sig.xts, 2, function(v) {
        or.tst <- oddsratio(x = factor(sign(v)), y = factor(sign(mkt.rret)))
        or.tst$p.value[2, "fisher.exact"]
    })
    return(bt.rets)
}
# odds-ratio backtest the sign:
bt.odds <- dumb.odds.bt(sign(trade.xts), mkt.xts)
qqunif(bt.odds)


## Evaluation via Hotelling’s test

To deal with possible correlation among the returns of the various search terms’ implied strategies, I use Hotelling’s test, which is the multivariate generalization of the t-test. Essentially I am testing whether the 98-vector of daily log returns is mean zero (as in the zero vector). If this were the case, then all linear combinations (i.e. portfolios) of the implied strategies would also be zero mean. There is a fascinating connection between Markowitz optimization, the Sharpe ratio, and Hotelling’s test, but I digress. In this case the sample optimal Markowitz portfolio has in-sample Sharpe of around 4.5 $$\mbox{yr}^{-1/2}$$, with a corresponding $$T^2$$ value of around 140. The corresponding p-value under the null of zero mean is around 0.35, meaning there is little evidence to suggest the returns are not zero mean. 95% confidence intervals on the population-maximal Sharpe (essentially inverting the non-central F distribution for the non-centrality parameter) contain zero. Using the Kubokawa-Robert-Saleh (‘KRS’) method to estimate the population optimal Sharpe yields a value of around 0.9 $$\mbox{yr}^{-1/2}$$. Note however that one cannot be certain to achieve this Sharpe because of mis-estimation of the Markowitz portfolio.

# under the latest github version of the package, this is legit, but
# bonks under the current CRAN version; it gives the same plot as the
# t-test plot above, so skip it:
# srs <- as.sr(bt.lrets)
# sharpe.test <- sr_test(srs, alternative = 'two.sided')
# qqunif(sharpe.test$p.value)

# Hotelling's test
big.sr <- as.sropt(bt.lrets)
print(big.sr)

##        SR/sqrt(yr) T^2 value Pr(>T^2)
## Sharpe         4.5       142     0.35

print(confint(big.sr))

##      2.5 % 97.5 %
## [1,]     0  2.567

print(inference(big.sr, type = "KRS"))

##        [,1]
## [1,] 0.8805


## Evaluation via simple cross-validation

One way to evaluate the (backtested) returns of a bunch of strategies is to split the historical data into an ‘in-sample’ and ‘out-of-sample’ period, and see how consistent the performance is across the divide. The rationale is that one would, at the cut time, observe the in-sample data, select a portfolio of the strategies to trade upon, and then experience the returns of the out-of-sample period. Rather than get fancy, here I split the data into two equal-sized epochs, and scatter the Sharpe ratios in the in- and out-of-sample periods. I suspect this test is really no different from the Hotelling $$T^2$$ test, but it is expressible in terms more easily understood by quant practitioners (or their bosses). In this case, the cross-validation scatter is a blob; the regression from in-sample to out-of-sample Sharpe does not have a significant slope. If we selected the strategy with the highest in-sample Sharpe (around 2.0 $$\mbox{yr}^{-1/2}$$), we would have been disappointed with its performance out of sample (a Sharpe of -0.5 $$\mbox{yr}^{-1/2}$$). I should also note that the best in-sample performance is associated with the search term ‘home’, while the worst is associated with the term ‘fond’. Perhaps someone with a better imagination than I have can spin a story around these; they certainly are not as suggestive as the term ‘debt’, which gives the best performance in the entire sample.

# split em.
n.row <- dim(bt.lrets)[1]
n.split <- floor(n.row/2)
srs.is <- SharpeR::as.sr(bt.lrets[1:n.split, ])
srs.oos <- SharpeR::as.sr(bt.lrets[(n.split + 1):n.row, ])
i.v.o <- lm(srs.oos$sr ~ srs.is$sr)
print(summary(i.v.o))

##
## Call:
## lm(formula = srs.oos$sr ~ srs.is$sr)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.3642 -0.3194  0.0019  0.3320  1.0863
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1527     0.0581    2.63     0.01 **
## srs.is$sr     0.1117     0.1116    1.00     0.32
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.498 on 96 degrees of freedom
## Multiple R-squared: 0.0103, Adjusted R-squared: 3.57e-05
## F-statistic: 1 on 1 and 96 DF, p-value: 0.319

plot(srs.is$sr, srs.oos$sr)
abline(i.v.o)

cat(sprintf("best in-sample strategy: '%s'\n", rownames(srs.is$sr)[which.max(srs.is$sr)]))

## best in-sample strategy: 'home'

cat(sprintf("worst in-sample strategy: '%s'\n", rownames(srs.is$sr)[which.min(srs.is$sr)]))

## worst in-sample strategy: 'fond'


## Evaluation via Sharpe equality tests

Here I use the Leung and Wong test and the Wright, Yam, Yung variant. These test the null that all 98 implied strategies have equal Sharpe ratios. Both of these tests reject the null.

# these do *not* have equal SR; but the test can reject for weird reasons...
all.eq <- sr_equality_test(as.matrix(bt.lrets), type = "F")
print(all.eq)

##
##  test for equality of Sharpe ratio, via F test
##
## data:  as.matrix(bt.lrets)
## T2 = 243.7, contrasts = 97, p-value = 5.141e-05
## alternative hypothesis: true sum squared contrasts of SNR is not equal to 0

all.eq <- sr_equality_test(as.matrix(bt.lrets), type = "chisq")
print(all.eq)

##
##  test for equality of Sharpe ratio, via chisq test
##
## data:  as.matrix(bt.lrets)
## T2 = 243.7, contrasts = 97, p-value = 1.317e-14
## alternative hypothesis: true sum squared contrasts of SNR is not equal to 0


I do not have a lot of experience with these tests (and may not have implemented them correctly!), but suspect they can reject for reasons which are not interesting. On the other hand, these tests might actually be very powerful, yet detect a difference so small that we are unlikely to capture it in the real world (again, due to error in selecting the optimal portfolio). As a sanity check, I feed in the returns due to 98 randomly generated signals, as constructed above, and find that, again, the null is rejected. This casts some suspicion on the test itself (or my implementation!).


# also compare the random returns!
rnd.eq <- sr_equality_test(as.matrix(rand.lrets), type = "F")
print(rnd.eq)

##
##  test for equality of Sharpe ratio, via F test
##
## data:  as.matrix(rand.lrets)
## T2 = 243.1, contrasts = 97, p-value = 5.452e-05
## alternative hypothesis: true sum squared contrasts of SNR is not equal to 0

rnd.eq <- sr_equality_test(as.matrix(rand.lrets), type = "chisq")
print(rnd.eq)

##
##  test for equality of Sharpe ratio, via chisq test
##
## data:  as.matrix(rand.lrets)
## T2 = 243.1, contrasts = 97, p-value = 1.554e-14
## alternative hypothesis: true sum squared contrasts of SNR is not equal to 0


## Conclusions

The tests conducted here suggest there is no detectable predictive ability of Google Trends search data on the future returns of the DJIA when processed in the form suggested by Preis et al. The results seen by those authors are entirely consistent with data-mining bias.

Disclaimer The information provided does not constitute investment advice.

## May 31

### Replicating Backtests

Over at Quantopian, there is an ongoing collaborative effort to replicate the Google Trends market timing paper by Preis, Moat, and Stanley. Even when using the fishy accounting for short sales that the original authors do, there has been no success to date in this endeavour.

One of my hopes is that some day Quantopian, or some other system like it—preferably in R—obviates the need for market macrostructure researchers to invent new, terrible, inauditable means of backtesting putative violations of the Efficient Market Hypothesis. Rather, they will use a standard open source backtester which prevents the most primitive forms of time-traveling, and make their code and data freely available. For those researchers primarily interested in the intellectual exercise, this kind of transparency should be a prerequisite for publication of their findings; for those interested in landing a deal with a hedge fund: don’t call us, we’ll call you.

Disclaimer The information provided does not constitute investment advice.

## May 26

### What Doesn’t Predict the DJIA?

My previous review of a market-timing paper by Preis, Moat, and Stanley 1 was a bit uncharitable. I am relieved to see that the authors were not discouraged, and have teamed up with Chester Curme, Adam Avakian and Dror Y. Kenett to draw similarly indefensible conclusions about Wikipedia usage and market timing in a new paper, Quantifying Wikipedia Usage Patterns Before Stock Market Moves, with Moat as principal author.

As with the previous paper on Google trends, this paper lacks specific quantitative claims about predictive accuracy, say, and wraps the qualitative conclusions in weasel words. As such it is not as repugnant (or falsifiable) as Bollen’s paper on market timing via Twitter mood; after all, who can argue with the conclusion of their abstract:

… online data may allow us to gain new insight into early information gathering stages of decision making.

As in the previous paper by Preis, Moat and Stanley, this paper uses incorrect accounting for short sales. In the authors’ fantasy stock market, short sales can result in unlimited gains and only limited losses. Of course, the opposite is actually the case: if you borrow a stock at 10 dollars, sell it, and the price goes to 20, you have lost all of your money; if it goes to 25 before you get a margin call, your log return (what the authors call ‘cumulative return’) is undefined! On the other hand, if the price drops to zero, you have ‘only’ made 100%, and can make no more.

As I noted previously, this erroneous accounting procedure results in a gain on the order of $$\left(1 - p(t+1)/p(t)\right)^2$$ when shorting. Given that the strategy trades weekly, this squared windfall can be somewhat large— looking at the past 116 years of DJI data, it is on the order of 7 bps a week. If the strategy is short approximately half the time, it gains around 1.5 to 2% a year for free from this ‘bank error in your favor.’
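As a toy numeric check of that squared windfall (made-up prices, not anything from the paper), consider a week in which the index drops 10%:

p0 <- 100; p1 <- 90          # hypothetical prices: a 10% down week
wrong <- log(p0/p1)          # the erroneous short 'cumulative return': 0.10536
right <- log(2 - p1/p0)      # the true log return of the short: log(1.1) = 0.09531
wrong - right                # 0.01005, about (1 - p1/p0)^2 = 0.01

That is roughly a hundred basis points of phantom profit from a single bad week.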

In this paper, as in the previous paper by Preis et al., the putative market timing strategy is rather oddly defined: if Wikipedia views increase week over week, one sells the stock. A rather ad-hoc explanation is made for this choice in the discussion section, even dragging behavioral finance into this mess, but I would guess that this definition converts the natural increase over time in Wikipedia views into fake profits via the bad accounting of short sales. 2

I should also point out that there are problems with the random strategy mechanism that the authors compare their strategies against. I agree that the ‘coinflip trader’ who decides whether to go long or short the market based on a fair coin is ‘uninformed’. However, I would argue that the coinflip trader is a poor benchmark because they will lose all their money asymptotically: while the arithmetic returns of the coinflip trader are mean zero, 3 the geometric, or log, returns are no greater than the equivalent relative returns, with equality only at zero. Applying monotonicity of the expectation operator, we find that the coinflip trader has negative mean log returns. Using daily DJI data 4 from 1896 until today, the expected daily log returns of the coinflip trader are equivalent to a drag of around 1.5% a year. Commissions and borrow costs on a Dow Jones ETF (like DIA) will furthermore cost the coinflip trader around 1% a year. This leaves some uncomfortable wiggle room in which a putative trading strategy could best the coinflip trader and yet still not be profitable.
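The size of that drag is easy to confirm by simulation. A minimal sketch, with made-up Gaussian daily returns of roughly index-like volatility rather than the actual DJI series:

set.seed(999)
n.day <- 250 * 400                               # a long run of hypothetical trading days
mkt.ret <- rnorm(n.day, mean = 0, sd = 0.01)     # zero-drift market, about 1% daily vol
flip <- sample(c(-1, 1), n.day, replace = TRUE)  # the coinflip trader's daily position
mean(flip * mkt.ret) * 250                       # annualized arithmetic drag: about zero
mean(log(1 + flip * mkt.ret)) * 250              # annualized log drag: about -250 * sd^2 / 2 = -1.25%

The Taylor expansion $$\log(1+x) \approx x - x^2/2$$ puts the annualized drag near $$250 \sigma^2 / 2$$, the same ballpark as the 1.5% figure quoted above.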

Given the poor quality of the previous paper on market timing with Google Trends, I did not dig too much further into this paper. Figure 3 indicates to me that much of the backtested returns of the Wikipedia views strategy are gained in 2008, when the Dow Jones lost some 30% of its value. This suggests a short bias in the strategy, caused perhaps by a general increase in Wikipedia page views over time, or by changes in sampling procedures which generate the pageview data, or even by a process not unrelated to the financial markets, but which is ‘uninteresting’. To be sure there could be some latent process that affects the financial markets and which also affects things like unemployment, market volatility, volume of trading, and Wikipedia views. It is not clear if, however, the quantity of Wikipedia views has any incremental value beyond these other macroeconomic and technical signals. In any case, given the small sample period, I doubt much can be said about that incremental value with any certainty.

Disclaimer The information provided does not constitute investment advice.

1. I dare not call it the ‘PMS paper’!

2. In any case, I imagine the ‘momentum’ version of the strategy was tried as well, and found lacking.

3. Incidentally, I find it odd that the authors attempted to verify this via simulation. Under their suspect rules around short accounting, the coinflip trader has exactly zero mean log return. Any experimental deviation from this is attributable to sampling variation (or a broken PRNG).

4. Why do academics try to predict the Dow Jones index? Have they not heard of the S&P 500?

## Apr 30

### Piled Higher and Deeper

The business press is reporting on a recently published paper, Quantifying Trading Behavior in Financial Markets using Google Trends, by Tobias Preis, Helen Susannah Moat, and H. Eugene Stanley. This paper has not had as much impact as that of Bollen et al., probably because it does not make such outlandish claims, but likely also because Google Trends is not as sexy as Twitter.

The Preis et al. paper has the dubious distinction of being the worst paper I’ve read in the last month. Here are the problems I found with this paper before giving up on it:

1. It is not entirely clear that Google Trends data is causal: the historical data you retrieve now may not represent what one would have (or even could have) observed at that point in time. Google’s help pages make some vague reference to data normalization, but neither confirm nor deny causality. If time trends are removed using all the data, the entire exercise is utterly pointless.
2. The authors do not understand how shorting works! They claim that the changes in ‘cumulative returns’ from a short position are $$\log(p(t)) - \log(p(t+1))$$. Under this formulation, a short position can experience unlimited gains but limited losses, when, in fact, the opposite is the case. The proper expression is $$\log(2 - p(t+1)/p(t))$$, which could be undefined if $$p(t+1)/p(t)$$ is two or larger. This bungled backtest accounting introduces a positive bias of order $$(1 - p(t+1)/p(t))^2$$, which can be large for the weekly hold periods considered in the paper. The upshot is that short-biased strategies get a tailwind which is pure ‘backtest arb’.
3. I am unable to replicate the backtest presented in Figure 2. Note the paper is ambiguous regarding how one should act if the change in trend data is exactly zero (this occurs around 5% of the time for the ‘debt’ data, using a three week normalization window), but breaking the tie in any of the three ways, and backtesting with either the ‘corrupt’ method for shorts or a correct method, never gives the 326% cumulative returns quoted in the paper. The ‘corrupt’ method does indeed boost total returns and Sharpe ratio. However, under none of the tested configurations, including the suspect ones, does the Sharpe ratio achieve 95% significance.
4. If I am to understand Figure 3 correctly, the ‘debt’ signal achieves returns which are 2.31 standard deviations above the returns of a ‘random strategy’. Presumably the random strategies do not have the shorting bias that the ‘debt’ signal does. However, given that approximately 100 different search terms are tested, a 2.3 sigma event is not statistically significant when a Bonferroni correction is applied (see the arithmetic after this list).
5. Since the authors (or the paper’s reviewers, if there indeed were any) are apparently aware of the pitfalls of multiple hypothesis testing, they do not draw much attention to the 2.3 sigma event. Rather, they compute the mean ‘Sharpe’ over the 98 strategies, then quote the t-statistic (a whopping 8.6) and p-value. Back in the eighties when professional statisticians bemoaned the coming availability of statistical software which would allow hoi polloi to misuse statistical techniques, this is what they were warning us about. Because the search term time series measure latent ‘interest’ with correlated errors, and because they are all backtested on the same Dow Jones time history, the errors in the 98 backtests’ returns are correlated. One cannot perform a t-test on the aggregate statistics without dealing with this correlation, otherwise one is rejecting the (composite) null for the wrong reason: i.e. because independence of errors is violated. (The Leung and Wong test for paired Sharpe seems more appropriate in this case.)
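The arithmetic behind point 4 takes two lines of R (using a normal approximation, and taking 98 as the number of search terms tested):

2 * pnorm(-2.31)     # two-sided p-value of a 2.31 sigma event: about 0.021
0.05/98              # Bonferroni-corrected threshold for 98 tests: about 0.00051

A p-value of 0.021 misses the corrected threshold by a factor of roughly forty.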

In all, this paper teaches me nothing about the world other than the low standards of the journal ‘Scientific Reports’, which, I am horrified to find, is somehow associated with the journal ‘Nature’.

Disclaimer The information provided does not constitute investment advice.

## Mar 11

### Did you ever think your tweets might predict the future?

Not wanting to be left behind on all this ‘social media’ stuff, Fox Business News trotted out Johan Bollen for an interview regarding his research. Bollen notes that his system is designed for hedge funds looking for a little extra alpha, not retail clients. This displays shrewd market positioning on his part, since Derwent’s experiment with bringing social media trading to the masses appears to have deflated—their recent ‘innovative’ self auction earned a non-binding bid of 120K GBP for the company, an ROI of perhaps negative 65 percent on the initial 350K invested.

I would like to believe that Bollen is giving me a shoutout at 2:11, when he notes:

It’s absolutely clear that there’s communities out there whose purpose is simply to spread misinformation or to … throw a wrench into … the gears of this algorithm.

Disclaimer The information provided does not constitute investment advice.

## Feb 07

### The Sentiment Trading Platform is for Sale

Derwent Capital, the former hedge fund turned retail broker, announced that they are auctioning themselves to the highest bidder. At the moment, the highest bid is 100K GBP, far lower than the 350K over/under number for profitability, according to Paul Hawtin, Derwent’s CEO. The ‘guidance figure’ (read: anchor) is 5M GBP, and as part of the deal you take ownership of the ‘Sentiball’ trademark.

As Hawtin notes:

The beauty of an auction is that you get a true valuation of the company.

And so I will be greatly amused for the next ten days.

Disclaimer The information provided does not constitute investment advice.

## Jul 25

### You had me at the third significant digit

I have, in the past, been rather harsh on Bollen, Mao and Zeng for their Twitter paper, which boggles the imagination with its naïveté. However, to their credit, theirs is not clearly the most ridiculous ‘quant’ paper I have ever seen. A recent contender for that distinction is Limited Attention, Salience, and Stock Returns, by A. Subrahmanyam, J. Wei, and H-Y. Yu, dated March 25, 2012. Here is the abstract:

We show that a long-short portfolio based on stocks that have just arrived to and left from extreme winner and loser deciles materially outperforms a conventional momentum portfolio. A 6-month-ranking and 6-month-holding portfolio based on the standard Jegadeesh and Titman (1993, 2001) momentum strategy commands an average monthly return of 1.20% and a Sharpe ratio of 0.262 over the past four decades; the corresponding numbers for our long-short portfolio are 10.30% and 1.035, respectively. For the 2001-2010 period, our monthly return is even higher at 16.38%, compounding to an annual return of 517.36%. The sheer size of these profits poses a further, significant challenge to the asset pricing literature and the market efficiency hypothesis. We propose that arrival to an extreme decile is a salient signal that attracts retail investor attention, and stimulates strong buying, boosting returns. Supporting this explanation, we show that there is significantly abnormal buying pressure in extreme decile arrivals that reverses in the longer run.

This paper was formerly posted at SSRN, but was mysteriously removed less than two weeks after the publication date (and after receiving some attention at CXO Advisory).

Some relevant facts about their purported “challenge” to the Efficient Markets Hypothesis which are omitted from the abstract: their strategy rebalances monthly; they delay their signal by a month; the quoted Sharpe ratio numbers are monthly; no mention is made of leverage.

So as a recap, the claim is that if one trades once a month, on a month-old signal, based on a 12 month moving average of publicly available price and volume data, on U.S. equities, one can capture a Sharpe around 3.5 $$\mbox{yr}^{-1/2}$$ and annualized returns over 500 percent. Moreover, the returns have been measured with no fewer than five significant digits.
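Their own numbers are easy to reproduce; this is my arithmetic, not the authors’:

1.1638^12 - 1      # a 16.38% monthly return compounds to 5.1736, i.e. the quoted 517.36% annual return
1.035 * sqrt(12)   # a monthly Sharpe of 1.035 annualizes to about 3.59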

If your error-checking sense is not properly calibrated, you should have goosebumps right now. If so, I am going to marsh your mellow by revealing that these results are, indeed, too good to be true. There is no conceivable way such a large effect could have lurked, unnoticed, within the landscape of technical strategies for five years, much less for four decades. Moreover, to suggest that the returns of U.S. equities could be predicted with such certainty based on a month-old highly autocorrelated signal is ludicrous.

Luckily for the world, someone must have notified the authors of their mistake, and the paper went down the memory hole. The alternative explanation is that Derwent inked a deal with Subrahmanyam, Wei, and Yu to license their technology, and they went into stealth mode.