How To Predict Market Direction On Any Timeframe Using QDA Model?

In the last post, I disclosed why I have decided to become a quant trader. In this post, I discuss in detail whether we can predict market direction with any degree of accuracy on any timeframe. We start on the intraday timeframe and try to predict the market using a logistic regression model, a linear discriminant model and a quadratic discriminant model. We also calculate the predictive accuracy of our forecasts to see whether we have an edge as quant traders. Our language of choice is R. I will start posting Python code soon, but for now we only use R. You should have R and RStudio installed. So let’s start!

Data Preprocessing

First we need to prepare the data. This is the most difficult and tedious step, also known as data preprocessing. If we prepare the data incorrectly, our results will also be incorrect, as simple as that. We export the GBPUSD M30 csv file from the MT4 history. If you don’t know how to do it, follow these simple steps: on MT4, click Tools > History Center > GBPUSD > M30 > Export. This saves the GBPUSD M30 csv file to your hard drive.

> # Import the csv file
> quotes <- read.csv("E:/MarketData/GBPUSD30.csv", header=FALSE)
> 
> 
> 
> # load quantmod package
> library(quantmod)
> 
> # convert the data frame into a ts (time series) object
> quotes <- as.ts(quotes)
> 
> 
> # convert the time series into a zoo object (quotes1 is not used below)
> quotes1 <- as.zoo(quotes)
> 
> #calculate simple returns
> sr <- diff(quotes) / lag(quotes, k = -1, na.pad=TRUE) 
> 
> #calculate log returns
> 
> lr <- diff(log(quotes))
> 
> #number of rows in the dataframe
> x <-nrow(sr)
> 
> # lag the data
> x1 <- lag(sr, k=-1, na.pad=TRUE)
> x2 <- lag(sr, k=-2, na.pad=TRUE)
> x3 <- lag(sr, k=-3, na.pad=TRUE)
> x4 <- lag(sr, k=-4, na.pad=TRUE)
> x5 <- lag(sr, k=-5, na.pad=TRUE)
> 
> # combine the lagged close-price returns (column 6) into one matrix
> CQuotes <- cbind (x1[ ,6], x2[ ,6], x3[ ,6], x4[ ,6], x5[ ,6],
+                   
+                   sr[,6])
> 
> 
> Direction <- ifelse(CQuotes[, 6] >=0, 1, 0)
> 
> CQuotes <- cbind(CQuotes, Direction)
> 
> # name the columns: the five lags, the current return and the direction
> colnames(CQuotes) <-c('Lag1','Lag2','Lag3','Lag4','Lag5', 'Today', 'Direction')

In the above series of commands, we first told R to read the GBPUSD30.csv file. After that we converted the data frame into a time series object so that we can calculate returns. We can calculate simple returns as well as log returns. For small price changes, log returns are almost the same as simple returns, and they are supposed to have better statistical properties. But for now we use the simple returns sr. After calculating sr, we lag it and combine the lags of the close-price return into a matrix with 5 columns: Lag1, Lag2, Lag3, Lag4 and Lag5. The sixth column, Today, is the unlagged simple return. The next command adds a seventh column, Direction, which is 1 if the return is zero or positive, meaning the price closed at or above the last candle’s close and the candlestick was bullish. If the candlestick is bearish, Direction is 0. We have done this so that we can predict the direction of a candle, with 1 meaning bullish and 0 meaning bearish. Below is a summary of the data.
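To make the steps above concrete, here is a minimal sketch in base R on a handful of made-up close prices (not the GBPUSD file). The helper lag_by is a hypothetical name introduced only for this illustration; it is not part of quantmod or zoo.

```r
# Made-up close prices, for illustration only
close <- c(1.5000, 1.5010, 1.5005, 1.5020, 1.5015, 1.5030, 1.5025, 1.5040)

# Simple return: (P_t - P_{t-1}) / P_{t-1}
sr <- diff(close) / head(close, -1)

# Log return: log(P_t) - log(P_{t-1}); nearly identical for small moves
lr <- diff(log(close))

# Hypothetical helper: shift a vector back by k steps, padding with NA
lag_by <- function(v, k) c(rep(NA, k), head(v, -k))

features <- data.frame(
  Lag1  = lag_by(sr, 1),
  Lag2  = lag_by(sr, 2),
  Today = sr
)

# 1 = candle closed at or above the previous close, 0 = below
features$Direction <- ifelse(features$Today >= 0, 1, 0)

print(features)
print(max(abs(sr - lr)))  # largest gap between simple and log returns
```

The first rows contain NA in the lag columns because there is no earlier return to look back to, which is why the summary of the real data also reports NA values.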

> summary(CQuotes)
      Lag1                Lag2                Lag3                Lag4          
 Min.   :-0.006589   Min.   :-0.006589   Min.   :-0.006589   Min.   :-0.006589  
 1st Qu.:-0.000315   1st Qu.:-0.000315   1st Qu.:-0.000315   1st Qu.:-0.000315  
 Median : 0.000006   Median : 0.000006   Median : 0.000006   Median : 0.000006  
 Mean   :-0.000007   Mean   :-0.000007   Mean   :-0.000007   Mean   :-0.000007  
 3rd Qu.: 0.000314   3rd Qu.: 0.000314   3rd Qu.: 0.000314   3rd Qu.: 0.000314  
 Max.   : 0.006732   Max.   : 0.006732   Max.   : 0.006732   Max.   : 0.006732  
 NA's   :5           NA's   :5           NA's   :5           NA's   :5          
      Lag5               Today             Direction    
 Min.   :-0.006589   Min.   :-0.006589   Min.   :0.000  
 1st Qu.:-0.000315   1st Qu.:-0.000315   1st Qu.:0.000  
 Median : 0.000006   Median : 0.000006   Median :1.000  
 Mean   :-0.000007   Mean   :-0.000007   Mean   :0.507  
 3rd Qu.: 0.000314   3rd Qu.: 0.000314   3rd Qu.:1.000  
 Max.   : 0.006732   Max.   : 0.006732   Max.   :1.000  
 NA's   :5           NA's   :5           NA's   :5      
> table(CQuotes[,7])

   0    1 
4622 4754

You can see that 4622 candles are down and 4754 candles are up in this data.

Logistic Regression Model of GBPUSDM30 data

> #divide the data into train and test
> 
> CQuotes=as.data.frame(CQuotes)
> 
> train=as.data.frame(CQuotes[10:(x-1000), 1:7])
> test= as.data.frame(CQuotes[(x-1000):x, 1:7])
> 
> 
> 
> #fit a logistic regression function
> fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5,
+           data=train ,family=binomial)
> 
> 
> summary(fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, family = binomial, 
    data = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.555  -1.183   1.036   1.167   1.473  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.01919    0.02190   0.876    0.381    
Lag1        -129.00378   31.17855  -4.138 3.51e-05 ***
Lag2          -0.98273   30.91010  -0.032    0.975    
Lag3         -75.37446   30.97746  -2.433    0.015 *  
Lag4           0.03483   30.87976   0.001    0.999    
Lag5          -6.87901   30.85312  -0.223    0.824    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11598  on 8366  degrees of freedom
Residual deviance: 11575  on 8361  degrees of freedom
AIC: 11587

Number of Fisher Scoring iterations: 3

> 
> coef(fit)
  (Intercept)          Lag1          Lag2          Lag3          Lag4          Lag5 
   0.01918816 -129.00377665   -0.98272571  -75.37446211    0.03482652   -6.87900911 
> 
> summary(fit)$coef
                 Estimate  Std. Error      z value     Pr(>|z|)
(Intercept)    0.01918816  0.02189976  0.876181016 3.809316e-01
Lag1        -129.00377665 31.17855133 -4.137580841 3.509868e-05
Lag2          -0.98272571 30.91010441 -0.031793025 9.746371e-01
Lag3         -75.37446211 30.97745859 -2.433203547 1.496589e-02
Lag4           0.03482652 30.87975563  0.001127811 9.991001e-01
Lag5          -6.87900911 30.85312071 -0.222959913 8.235667e-01
> 
> #predict the market direction
> probs =predict (fit, test, type ="response")
> 
> pred=rep (0 ,1000)
> pred[probs > 0.5]=1
> table(pred , test[ ,7])
    
pred   0   1
   0 196 209
   1 282 313

R was able to do all the above calculations in less than 30 seconds, which is good if we want to use this model on intraday timeframes such as 30 minutes and 15 minutes. Of the 595 candles that the logistic regression model predicted as up, 313 actually closed up, a precision of 313/(313+282) = 52.6%. The overall accuracy, counting correct predictions in both directions, is (196+313)/1000 = 50.9%. Either way, the logistic regression model in our case is no better than random guessing.
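The figures in the logistic regression table can be checked with a few lines of R. This is a small sketch that re-enters the four counts from the confusion table above by hand; the variable names are my own.

```r
# Confusion matrix copied from the logistic regression output:
# rows = predicted direction, columns = actual direction
cm <- matrix(c(196, 282, 209, 313), nrow = 2,
             dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

# Accuracy: correct predictions in both directions over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Precision for "up": of the candles predicted up, how many closed up
precision_up <- cm["1", "1"] / sum(cm["1", ])

# Baseline: share of test candles that actually closed up,
# i.e. the accuracy of always predicting "up"
baseline_up <- sum(cm[, "1"]) / sum(cm)

print(round(c(accuracy = accuracy,
              precision_up = precision_up,
              baseline_up = baseline_up), 3))
```

Comparing the accuracy with the always-up baseline is a quick way to see whether a model adds anything over the class imbalance in the test set.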

Linear Discriminant Analysis of GBPUSDM30 data

Let’s try Linear Discriminant Analysis, a.k.a. the LDA model. LDA assumes that the predictors within each class follow a Gaussian distribution, which may or may not be correct. It then makes the simplifying assumption that all classes share the same covariance matrix.

> #let's perform a Linear Discriminant Analysis now
> 
> library (MASS)
> lda.fit=lda(Direction~Lag1+Lag2+Lag3+Lag4+Lag5 ,data=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data = train)

Prior probabilities of groups:
        0         1 
0.4949205 0.5050795 

Group means:
           Lag1          Lag2          Lag3          Lag4          Lag5
0  2.704735e-05 -6.872896e-06  1.376536e-05 -6.709646e-06 -3.265967e-06
1 -3.776003e-05 -4.546141e-06 -2.453348e-05 -4.477171e-06 -7.537555e-06

Coefficients of linear discriminants:
               LD1
Lag1 -1210.4217560
Lag2    -8.6427758
Lag3  -709.3823866
Lag4     0.6858706
Lag5   -65.0400596
> 
> #make LDA predictions
> lda.pred=predict (lda.fit , test)
> 
> lda.class =lda.pred$class
> table(lda.class ,test[ ,7])
         
lda.class   0   1
        0 196 210
        1 282 313

You can see that the LDA model and the logistic regression model give virtually the same results. Both models are no better than random guessing.

Quadratic Discriminant Analysis

Now let’s use Quadratic Discriminant Analysis, a.k.a. the QDA model. This model also assumes that the predictors within each class follow a Gaussian distribution. But unlike LDA, it estimates a separate covariance matrix for each class instead of assuming they are all equal. As a result, the QDA classifier uses a quadratic function of the predictors rather than a linear one. You don’t need to go too deep into the mathematical details of the model. We just need to know how to apply it correctly.
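To see the difference between the two models in isolation, here is a small sketch on simulated data (an assumption for illustration, unrelated to the GBPUSD file). Both classes have the same mean but very different spread, which is the situation where a separate covariance per class matters.

```r
library(MASS)  # provides lda() and qda()

set.seed(1)
n <- 500

# Simulated data: both classes are centred at zero, but class 1
# has a much larger spread than class 0
sim <- data.frame(
  x1 = c(rnorm(n, sd = 1), rnorm(n, sd = 3)),
  x2 = c(rnorm(n, sd = 1), rnorm(n, sd = 3)),
  y  = factor(rep(c(0, 1), each = n))
)

lda.sim <- lda(y ~ x1 + x2, data = sim)
qda.sim <- qda(y ~ x1 + x2, data = sim)

# In-sample hit rate: share of rows where the predicted class
# equals the true class
lda.acc <- mean(predict(lda.sim, sim)$class == sim$y)
qda.acc <- mean(predict(qda.sim, sim)$class == sim$y)

print(c(lda = lda.acc, qda = qda.acc))
```

On data like this a linear boundary has little to work with, while the quadratic boundary can separate the tight class from the wide one. Whether real return data behaves this way is a separate question.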

> qda.fit=qda(Direction~Lag1+Lag2+Lag3+Lag4+Lag5 ,data=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data = train)

Prior probabilities of groups:
        0         1 
0.4949205 0.5050795 

Group means:
           Lag1          Lag2          Lag3          Lag4          Lag5
0  2.704735e-05 -6.872896e-06  1.376536e-05 -6.709646e-06 -3.265967e-06
1 -3.776003e-05 -4.546141e-06 -2.453348e-05 -4.477171e-06 -7.537555e-06
> qda.class =predict(qda.fit ,test) $class
> table(qda.class )
qda.class
  0   1 
425 576 
> # share of test candles predicted as up
> 576/(576+425)
[1] 0.5754246

In the above model, 576 of the 1001 test candles, or 57.5%, are classified as up. Strictly speaking, table(qda.class) only counts the predictions, so this ratio is the share of up predictions rather than a hit rate. The hit rate comes from cross-tabulating qda.class against test[ ,7], as we did for the other two models. Taking the 57% figure as our yardstick, it is better than the 52% of the last 2 models, but still not that good. The important question is: can we raise it above 70%? So let’s change the inputs. Instead of 5 lags, let’s try 2 lags and see how that changes the result.
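The ratio computed from table(qda.class) counts how many candles were predicted as up. A hit rate also needs the actual directions. A minimal sketch with made-up vectors (not the GBPUSD results) shows how far the two numbers can differ:

```r
# Made-up vectors for illustration only
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1), levels = c(0, 1))

# Share of "up" predictions: says how often the model predicts 1,
# but nothing about whether those predictions are right
share_up <- mean(predicted == 1)

# Hit rate: share of predictions that match the actual direction
hit_rate <- mean(predicted == actual)

# Cross-tabulation of predicted vs. actual direction
print(table(predicted, actual))
print(c(share_up = share_up, hit_rate = hit_rate))
```

Applied to the models in this post, the same pattern would be table(qda.class, test[ ,7]) followed by mean(qda.class == test[ ,7]).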

> qda.fit=qda(Direction~Lag1+Lag2, data=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = train)

Prior probabilities of groups:
        0         1 
0.4949205 0.5050795 

Group means:
           Lag1          Lag2
0  2.704735e-05 -6.872896e-06
1 -3.776003e-05 -4.546141e-06
> #make predictions
> qda.pred=predict(qda.fit, test)
> qda.class =qda.pred$class
> table(qda.class )
qda.class
  0   1 
328 673 
> 673/(328+673)
[1] 0.6723277

Now the figure is 67%, well above the 57% we got last time. Like that 57%, it is the share of the 1001 test candles classified as up (673), so confirming it as a hit rate requires table(qda.class, test[ ,7]). We are close to 70%. With that check in place, we can use this model for trading forex binary options on the M15 and M30 timeframes. We can also use it to find the direction of the market on the daily and weekly timeframes. There are a few steps that can raise the figure to 77%, which is very close to 80%, but I keep those steps to myself as they give me the edge to trade the markets. Let’s do some quick calculations for the daily timeframe and check how the QDA model does there.

> # Import the csv file
> quotes <- read.csv("E:/MarketData/GBPUSD1440.csv", header=FALSE)
> 
> 
> 
> # load quantmod package
> library(quantmod)
> 
> # convert the data frame into a ts (time series) object
> quotes <- as.ts(quotes)
> 
> 
> # convert the time series into a zoo object (quotes1 is not used below)
> quotes1 <- as.zoo(quotes)
> 
> #calculate simple returns
> sr <- diff(quotes) / lag(quotes, k = -1, na.pad=TRUE) 
> 
> #calculate log returns
> 
> lr <- diff(log(quotes))
> 
> #number of rows in the dataframe
> x <-nrow(sr)
> 
> # lag the data
> x1 <- lag(sr, k=-1, na.pad=TRUE)
> x2 <- lag(sr, k=-2, na.pad=TRUE)
> x3 <- lag(sr, k=-3, na.pad=TRUE)
> x4 <- lag(sr, k=-4, na.pad=TRUE)
> x5 <- lag(sr, k=-5, na.pad=TRUE)
> 
> # combine the lagged close-price returns (column 6) into one matrix
> CQuotes <- cbind (x1[ ,6], x2[ ,6], x3[ ,6], x4[ ,6], x5[ ,6],
+                   
+                   sr[,6])
> 
> 
> Direction <- ifelse(CQuotes[, 6] >=0, 1, 0)
> 
> CQuotes <- cbind(CQuotes, Direction)
> 
> # name the columns: the five lags, the current return and the direction
> colnames(CQuotes) <-c('Lag1','Lag2','Lag3','Lag4','Lag5', 'Today', 'Direction')
> 
> summary(CQuotes)
      Lag1                Lag2                Lag3                Lag4          
 Min.   :-0.034064   Min.   :-0.034064   Min.   :-0.034064   Min.   :-0.034064  
 1st Qu.:-0.003365   1st Qu.:-0.003365   1st Qu.:-0.003365   1st Qu.:-0.003365  
 Median :-0.000168   Median :-0.000168   Median :-0.000168   Median :-0.000168  
 Mean   :-0.000129   Mean   :-0.000129   Mean   :-0.000129   Mean   :-0.000129  
 3rd Qu.: 0.003103   3rd Qu.: 0.003103   3rd Qu.: 0.003103   3rd Qu.: 0.003103  
 Max.   : 0.029461   Max.   : 0.029461   Max.   : 0.029461   Max.   : 0.029461  
 NA's   :5           NA's   :5           NA's   :5           NA's   :5          
      Lag5               Today             Direction     
 Min.   :-0.034064   Min.   :-0.034064   Min.   :0.0000  
 1st Qu.:-0.003365   1st Qu.:-0.003365   1st Qu.:0.0000  
 Median :-0.000168   Median :-0.000168   Median :0.0000  
 Mean   :-0.000129   Mean   :-0.000129   Mean   :0.4868  
 3rd Qu.: 0.003103   3rd Qu.: 0.003103   3rd Qu.:1.0000  
 Max.   : 0.029461   Max.   : 0.029461   Max.   :1.0000  
 NA's   :5           NA's   :5           NA's   :5       
> 
> table(CQuotes[,7])

   0    1 
1105 1048 
> 
> 
> 
> #divide the data into train and test
> 
> CQuotes=as.data.frame(CQuotes)
> 
> train=as.data.frame(CQuotes[10:(x-1000), 1:7])
> test= as.data.frame(CQuotes[(x-1000):x, 1:7])
> #let's now do the Quadratic Discriminant Analysis
> qda.fit=qda(Direction~Lag1+Lag2, data=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = train)

Prior probabilities of groups:
        0         1 
0.5113636 0.4886364 

Group means:
           Lag1          Lag2
0 -0.0004861019 -0.0002165024
1  0.0001232127 -0.0001817700
> 
> 
> #make predictions
> qda.pred=predict(qda.fit, test)
> qda.class =qda.pred$class
> table(qda.class )
qda.class
  0   1 
643 358 
> 358/(358+643)
[1] 0.3576424
> #let's now do the Quadratic Discriminant Analysis
> qda.fit=qda(Direction~Lag1+Lag2+Lag3+Lag4+Lag5 ,data=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data = train)

Prior probabilities of groups:
        0         1 
0.5113636 0.4886364 

Group means:
           Lag1          Lag2          Lag3          Lag4          Lag5
0 -0.0004861019 -0.0002165024 -3.075892e-04 -0.0001606360 -5.043089e-05
1  0.0001232127 -0.0001817700 -9.913062e-05 -0.0002445455 -3.560125e-04
> 
> 
> #make predictions
> qda.pred=predict(qda.fit, test)
> qda.class =qda.pred$class
> table(qda.class )
qda.class
  0   1 
272 729 
> 729/(272+729)
[1] 0.7282717

In the above run we first used 2 lags, which gave a very poor figure of 36%. So we reverted to our 5-lag model, which gave a much stronger 73%. Both figures are the share of the 1001 test candles classified as up (358 and 729 respectively). Cross-tabulating qda.class against test[ ,7] is what turns them into a hit rate. As said above, an even better result is possible with a few more inputs that I keep to myself so that I have the edge. R takes less than 30 seconds to do all the calculations, which makes this model quite good for trading binary options on short timeframes like M15 and M30.