Indicator Improvement- Part 1- A supervised classification learning

2022, Oct 14    

As a trader, I do not need to predict the final price to make a profit! As long as I can identify a trend in the price, I can enter a trade and profit from it. This is the mindset of most scalpers and day traders. Over the years, many methods and trading strategies have been developed for this purpose. In the trading world, people look to indicators for trading signals. For example, when the 20-bar simple moving average (SMA) crosses above the 50-bar SMA, that is read as an up trend in the price, so people open a long trade. Unfortunately, these indicators do not work well on their own, and the market can be much more complicated. Here, I want to see if a supervised ML algorithm can help us improve this indicator!
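
To make the crossover idea concrete, here is a minimal sketch on made-up prices. The windows of 2 and 4 bars are stand-ins for the 20-bar and 50-bar SMAs so that the toy series stays short; the names `fast`, `slow`, and `cross_up` are only for illustration.

```python
import pandas as pd

# Toy closing prices (made up for illustration, not real market data).
close = pd.Series([1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.9, 1.1, 1.3, 1.5])

fast = close.rolling(window=2).mean()  # stands in for the 20-bar SMA
slow = close.rolling(window=4).mean()  # stands in for the 50-bar SMA

# Bullish crossover: the fast SMA was at or below the slow SMA on the
# previous bar and is above it on the current bar.
cross_up = (fast.shift(1) <= slow.shift(1)) & (fast > slow)
cross_up_bars = list(cross_up[cross_up].index)
print(cross_up_bars)
```

In this toy series the only bullish crossover happens at bar 7, right where the price starts to recover.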

What I am going to do here is not regression, but classification! I need to know whether the indicator actually shows the up or down trend it signals. Therefore, I need to reshape the time series data frame for this purpose. Let’s start by importing the libraries and the data.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    TimeSeriesSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

plt.rcParams["font.size"] = 16
from datetime import datetime
from google.colab import drive
drive.mount('/content/gdrive')
fx_df = pd.read_csv('/content/gdrive/MyDrive/Python-Colab-Projects/Forex_Data/EURGBP_Candlestick_5_M_BID_14.10.2019-14.10.2022.csv')
fx_df.head()
Gmt time Open High Low Close Volume
0 15.10.2019 00:00:00.000 0.87438 0.87461 0.87429 0.87430 434.60
1 15.10.2019 00:05:00.000 0.87430 0.87440 0.87418 0.87440 402.75
2 15.10.2019 00:10:00.000 0.87441 0.87450 0.87439 0.87448 421.10
3 15.10.2019 00:15:00.000 0.87448 0.87451 0.87436 0.87441 342.43
4 15.10.2019 00:20:00.000 0.87441 0.87444 0.87435 0.87443 195.58

When the market is closed, the data simply repeats the last price of the active market, so those bars carry no information. I can find those times by looking for bars with zero trade volume and drop them.

fx_df.loc[fx_df['Volume'] == 0].head()
Gmt time Open High Low Close Volume
1116 18.10.2019 21:00:00.000 0.86047 0.86047 0.86047 0.86047 0.0
1117 18.10.2019 21:05:00.000 0.86047 0.86047 0.86047 0.86047 0.0
1118 18.10.2019 21:10:00.000 0.86047 0.86047 0.86047 0.86047 0.0
1119 18.10.2019 21:15:00.000 0.86047 0.86047 0.86047 0.86047 0.0
1120 18.10.2019 21:20:00.000 0.86047 0.86047 0.86047 0.86047 0.0
fx_df_active = fx_df.loc[fx_df['Volume'] != 0]
fx_df_active.loc[fx_df_active['Volume'] == 0].any()
Gmt time    False
Open        False
High        False
Low         False
Close       False
Volume      False
dtype: bool

Checking for missing data:

fx_df_active.isnull().any()
Gmt time    False
Open        False
High        False
Low         False
Close       False
Volume      False
dtype: bool

Now, I convert the date column to date-time and use it as the index. Also, I want to remove the volume column and work only with time and price.

fx_df_active = fx_df_active.copy()  # explicit copy, so the assignments below do not raise SettingWithCopyWarning
fx_df_active.columns = ['date', 'open', 'high', 'low', 'close', 'volume']
fx_df_active['date'] = pd.to_datetime(fx_df_active['date'], format='%d.%m.%Y %H:%M:%S.%f')
fx_df_active = fx_df_active.set_index('date')
fx_df_5m = fx_df_active[['open', 'high', 'low', 'close', 'volume']]
fx_df_5m = fx_df_5m.drop_duplicates(keep=False)  # keep=False drops every copy of a row whose values are duplicated
fx_df_5m_prices = fx_df_5m.drop(['volume'], axis=1) #dropping volume column
fx_df_5m_prices.head()
open high low close
date
2019-10-15 00:00:00 0.87438 0.87461 0.87429 0.87430
2019-10-15 00:05:00 0.87430 0.87440 0.87418 0.87440
2019-10-15 00:10:00 0.87441 0.87450 0.87439 0.87448
2019-10-15 00:15:00 0.87448 0.87451 0.87436 0.87441
2019-10-15 00:20:00 0.87441 0.87444 0.87435 0.87443

Similar to the previous work, I am adding the lagged data and the simple moving averages to the data frame.

fx_lag_prices_df = fx_df_5m_prices.copy()
for i in range(1,7):
    forex_data_lag = fx_df_5m_prices.shift(i)
    lag_columns = list( s+"_lag%d" % i for s in fx_df_5m_prices.columns)
    forex_data_lag.columns = lag_columns
    fx_lag_prices_df = pd.concat([fx_lag_prices_df, forex_data_lag], axis =1)
def simple_moving_average(data, moving_average_list, criteria='close'):
    df = pd.DataFrame()
    for ma in moving_average_list:
        sma_column = f"sma{ma}"
        sma = data[criteria].rolling(center=False, window = ma).mean().to_frame(name = sma_column)
        df = pd.concat((df, sma),  axis=1)

    return(df )

sma_df = simple_moving_average(fx_df_5m_prices, [20, 50, 100])
fx_lag_prices_sma_df = pd.concat([fx_lag_prices_df, sma_df], axis =1)
fx_lag_prices_sma_df.head()
open high low close open_lag1 high_lag1 low_lag1 close_lag1 open_lag2 high_lag2 ... high_lag5 low_lag5 close_lag5 open_lag6 high_lag6 low_lag6 close_lag6 sma20 sma50 sma100
date
2019-10-15 00:00:00 0.87438 0.87461 0.87429 0.87430 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-10-15 00:05:00 0.87430 0.87440 0.87418 0.87440 0.87438 0.87461 0.87429 0.87430 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-10-15 00:10:00 0.87441 0.87450 0.87439 0.87448 0.87430 0.87440 0.87418 0.87440 0.87438 0.87461 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-10-15 00:15:00 0.87448 0.87451 0.87436 0.87441 0.87441 0.87450 0.87439 0.87448 0.87430 0.87440 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-10-15 00:20:00 0.87441 0.87444 0.87435 0.87443 0.87448 0.87451 0.87436 0.87441 0.87441 0.87450 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 31 columns

The forex market is indeed a 24-hour market, but that does not mean it is always active! Its active hours follow the active hours of the other markets in Europe and the USA. So, it is a good idea to add the trading hour as a feature to our data frame.

fx_df_5m.index.hour.unique().values
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])
fx_lag_prices_sma_hour_df = fx_lag_prices_sma_df.copy()
fx_lag_prices_sma_hour_df['trade_hour'] = fx_lag_prices_sma_hour_df.index.hour

Another feature I am going to add to the data frame is the distance between the current closing price and the maximum/minimum closing price over the past data points. Each data point (a bar, as it is called) in our 5-minute chart shows the opening and closing prices as well as the maximum and minimum of the price within a 5-minute interval.

fx_lag_prices_sma_hour_minmax_df = fx_lag_prices_sma_hour_df.copy()
n_minmax = 12  # number of the past data points
clmn_min = f'dis_to_min_{n_minmax}bar'
clmn_max = f'dis_to_max_{n_minmax}bar'

# Rolling window over the current bar plus the n_minmax bars before it.
# min_periods=1 lets the first bars use whatever history is available.
# This avoids chained assignment (df[col].iloc[i] = ...), which raises
# SettingWithCopyWarning and may silently fail to write the value.
close_px = fx_lag_prices_sma_hour_minmax_df['close']
close_window = close_px.rolling(window=n_minmax + 1, min_periods=1)
fx_lag_prices_sma_hour_minmax_df[clmn_max] = close_window.max() - close_px
fx_lag_prices_sma_hour_minmax_df[clmn_min] = close_px - close_window.min()
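
As a sanity check on made-up prices, the distance-to-max/min feature can be computed with a pandas rolling window and compared against an explicit loop over the bars. This is only a sketch: the toy series and the look-back `n_back` (playing the role of `n_minmax`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy closing prices (illustrative values only).
toy_close = pd.Series([1.00, 1.02, 1.01, 1.05, 1.03, 1.04, 1.00, 0.99])
n_back = 3  # hypothetical look-back, playing the role of n_minmax

# Window = n_back past bars plus the current bar; min_periods=1 lets the
# first bars use the shorter history that is available.
roll = toy_close.rolling(window=n_back + 1, min_periods=1)
dis_to_max = roll.max() - toy_close
dis_to_min = toy_close - roll.min()

# The same quantities with an explicit loop, for comparison.
loop_max = [
    toy_close.iloc[max(0, i - n_back):i + 1].max() - toy_close.iloc[i]
    for i in range(len(toy_close))
]
loop_min = [
    toy_close.iloc[i] - toy_close.iloc[max(0, i - n_back):i + 1].min()
    for i in range(len(toy_close))
]

print(np.allclose(dis_to_max, loop_max), np.allclose(dis_to_min, loop_min))
```

Both distances are zero or positive by construction, because the current close is always part of its own window.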

I usually copy the last feature-engineered data frame to a new one. This way, when I change the features, it is easier to update the rest of the program and the models.

exd_df_final = fx_lag_prices_sma_hour_minmax_df.dropna()
exd_df_final.head()
open high low close open_lag1 high_lag1 low_lag1 close_lag1 open_lag2 high_lag2 ... open_lag6 high_lag6 low_lag6 close_lag6 sma20 sma50 sma100 trade_hour dis_to_max_12bar dis_to_min_12bar
date
2019-10-15 08:15:00 0.87099 0.87114 0.87055 0.87104 0.87096 0.87122 0.87073 0.87098 0.87068 0.87165 ... 0.87022 0.87082 0.87017 0.87048 0.871091 0.872396 0.873362 8 0.00052 0.00099
2019-10-15 08:20:00 0.87107 0.87187 0.87092 0.87179 0.87099 0.87114 0.87055 0.87104 0.87096 0.87122 ... 0.87048 0.87127 0.87027 0.87091 0.871109 0.872345 0.873337 8 0.00000 0.00174
2019-10-15 08:25:00 0.87179 0.87179 0.87044 0.87080 0.87107 0.87187 0.87092 0.87179 0.87099 0.87114 ... 0.87091 0.87156 0.87080 0.87156 0.871094 0.872275 0.873301 8 0.00099 0.00075
2019-10-15 08:30:00 0.87080 0.87116 0.87013 0.87045 0.87179 0.87179 0.87044 0.87080 0.87107 0.87187 ... 0.87155 0.87161 0.87056 0.87069 0.871031 0.872200 0.873260 8 0.00134 0.00024
2019-10-15 08:35:00 0.87045 0.87062 0.87008 0.87062 0.87080 0.87116 0.87013 0.87045 0.87179 0.87179 ... 0.87068 0.87165 0.87054 0.87097 0.870946 0.872130 0.873223 8 0.00117 0.00041

5 rows × 34 columns

Now, the most important part is: what is the target?

As I explained at the beginning, I expect an up trend if the 20-bar SMA crosses above the 50-bar SMA, and a down trend if it crosses below. But reality might differ from my expectation. So, let’s look at the behavior of the close price during the 10 data points after the 20-bar SMA crosses the 50-bar SMA.

So, the new data frame only includes the data at the crossing points. The target is the behavior of the close price compared with our expectation. For example, if we expect an up trend, but within the 10-bar look-ahead window the close price falls by more than 0.0010 (10 pips) without rising by that much, I classify this data point as “UpTrend-PriceDown”. Here, the first part (UpTrend) shows what I expect from the indicator, and the second part (PriceDown) shows the actual behavior of the price over the look-ahead window. The other classes are described in the code comments.
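
Before the full labeling code, here is a small worked example of the rule on made-up closes. The helper `classify_move` and its `threshold` argument are hypothetical names for illustration; the first element of the window is taken to be the close at the crossing bar.

```python
def classify_move(closes, threshold=0.0010):
    """Classify the price move over a window whose first element is the crossing bar."""
    start = closes[0]
    rise = max(closes) - start  # largest move above the starting close
    drop = start - min(closes)  # largest move below the starting close
    if rise > threshold and drop > threshold:
        return "Volatile"
    if rise > threshold:
        return "PriceUp"
    if drop > threshold:
        return "PriceDown"
    return "NoTrend"


# Made-up closes: the price falls by 0.0015 and never rises above the start.
window = [0.8700, 0.8695, 0.8690, 0.8685, 0.8688]
outcome = classify_move(window)
print(f"UpTrend-{outcome}")
```

If this window followed a bullish crossover, the data point would be labeled “UpTrend-PriceDown”: the indicator promised an up trend, but the price dropped by more than 10 pips.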

pred_windo = 10   # look-ahead window in bars; the slice below starts at the crossing bar itself
pip_val = 0.0010  # 10 pips
df_size = len(exd_df_final)

# Plain arrays give explicit positional access (integer keys on a date-indexed Series are ambiguous)
close_arr = exd_df_final['close'].to_numpy()
sma20_arr = exd_df_final['sma20'].to_numpy()
sma50_arr = exd_df_final['sma50'].to_numpy()


def label_crossing(i, expected):
    """Label crossing bar i by how the close moves inside the look-ahead window."""
    window = close_arr[i:i + pred_windo]
    price_up = window.max() - close_arr[i] > pip_val    # rose by more than 10 pips
    price_down = close_arr[i] - window.min() > pip_val  # fell by more than 10 pips
    if price_up and price_down:
        outcome = 'Volatile'   # the price moves in both ways
    elif price_up:
        outcome = 'PriceUp'    # the price goes up
    elif price_down:
        outcome = 'PriceDown'  # the price goes down
    else:
        outcome = 'NoTrend'    # the price does not change by more than 10 pips
    if (expected, outcome) == ('UpTrend', 'PriceUp'):
        return 'UpTrend_PriceUp'  # keep the original spelling of this label
    return f'{expected}-{outcome}'


crossing_rows = []
crossing_labels = []
for i in range(1, df_size - pred_windo):
    if sma20_arr[i-1] <= sma50_arr[i-1] and sma20_arr[i] > sma50_arr[i]:    # Expecting a Bull market
        crossing_rows.append(i)
        crossing_labels.append(label_crossing(i, 'UpTrend'))
    elif sma20_arr[i-1] >= sma50_arr[i-1] and sma20_arr[i] < sma50_arr[i]:  # Expecting a Bear market
        crossing_rows.append(i)
        crossing_labels.append(label_crossing(i, 'DownTrend'))

# Build the data frame once instead of appending row by row (DataFrame.append was removed in pandas 2.0)
ctgorized_df = exd_df_final.iloc[crossing_rows].reset_index(drop=True)
ctgorized_df['market-type'] = crossing_labels

ctgorized_df.head()
open high low close open_lag1 high_lag1 low_lag1 close_lag1 open_lag2 high_lag2 ... high_lag6 low_lag6 close_lag6 sma20 sma50 sma100 trade_hour dis_to_max_12bar dis_to_min_12bar market-type
0 0.87138 0.87138 0.87064 0.87103 0.87093 0.87156 0.87056 0.87139 0.87088 0.87134 ... 0.87217 0.87121 0.87161 0.871588 0.871522 0.872727 10.0 0.00291 0.00015 UpTrend-PriceDown
1 0.86942 0.87013 0.86900 0.87012 0.86939 0.86976 0.86913 0.86943 0.86988 0.87028 ... 0.87070 0.87012 0.87050 0.871093 0.871101 0.872387 10.0 0.00127 0.00076 DownTrend-PriceUp
2 0.86411 0.86417 0.86403 0.86409 0.86409 0.86416 0.86407 0.86409 0.86395 0.86412 ... 0.86400 0.86353 0.86389 0.863412 0.863401 0.864325 22.0 0.00042 0.00121 UpTrend-NoTrend
3 0.86455 0.86470 0.86453 0.86469 0.86441 0.86456 0.86422 0.86456 0.86467 0.86468 ... 0.86483 0.86466 0.86471 0.864797 0.864814 0.864159 2.0 0.00019 0.00028 DownTrend-NoTrend
4 0.86504 0.86504 0.86430 0.86468 0.86527 0.86536 0.86500 0.86502 0.86504 0.86530 ... 0.86484 0.86461 0.86479 0.864798 0.864798 0.864729 6.0 0.00065 0.00015 UpTrend_PriceUp

5 rows × 35 columns

ctgorized_df['market-type'].value_counts()
DownTrend-NoTrend      2255
UpTrend-NoTrend        2237
UpTrend_PriceUp         238
DownTrend-PriceUp       228
UpTrend-PriceDown       218
DownTrend-PriceDown     208
DownTrend-Volatile       12
UpTrend-Volatile          8
Name: market-type, dtype: int64

As can be seen, the defined classes are very imbalanced. The last two classes rarely occur, so I’ll start by dropping them. Later, I’ll deal with the imbalance in a separate blog post, but for now, let’s see how well a classification ML algorithm can handle this problem.
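
To put numbers on the imbalance, here is a quick calculation using the class counts printed above. The dictionary is just those counts typed in by hand.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "DownTrend-NoTrend": 2255,
    "UpTrend-NoTrend": 2237,
    "UpTrend_PriceUp": 238,
    "DownTrend-PriceUp": 228,
    "UpTrend-PriceDown": 218,
    "DownTrend-PriceDown": 208,
    "DownTrend-Volatile": 12,
    "UpTrend-Volatile": 8,
}

total = sum(counts.values())
no_trend = counts["DownTrend-NoTrend"] + counts["UpTrend-NoTrend"]
volatile = counts["DownTrend-Volatile"] + counts["UpTrend-Volatile"]

print(f"Total crossings: {total}")
print(f"NoTrend share:   {no_trend / total:.1%}")
print(f"Volatile share:  {volatile / total:.1%}")
```

The two NoTrend classes together make up roughly 83% of all crossings, while the two Volatile classes account for less than half a percent.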

ctgorized_df.drop(ctgorized_df[ctgorized_df['market-type']=='DownTrend-Volatile'].index, inplace=True)
ctgorized_df.drop(ctgorized_df[ctgorized_df['market-type']=='UpTrend-Volatile'].index, inplace=True)

Besides the imbalanced classes, we can see that the indicator we are looking at is not a good indicator of the trend. Let me plot the price changes for the cases where we expect an up trend!

df_size = len(exd_df_final)

# Plain arrays give explicit positional access to the values
close_vals = exd_df_final['close'].to_numpy()
sma20_vals = exd_df_final['sma20'].to_numpy()
sma50_vals = exd_df_final['sma50'].to_numpy()

up_up_series = []    # largest rise of the close inside the look-ahead window
up_down_series = []  # largest drop of the close inside the window (as a negative number)

for i in range(1, df_size - pred_windo):
    if sma20_vals[i-1] <= sma50_vals[i-1] and sma20_vals[i] > sma50_vals[i]:  #Bull market
        window = close_vals[i:i+pred_windo]
        bull_max_dif = window.max() - close_vals[i]
        up_up_series.append(bull_max_dif)
        bull_min_dif = close_vals[i] - window.min()
        up_down_series.append(-1*bull_min_dif)


# Pass figsize here: changing rcParams after plt.figure() does not resize the figure already created
plt.figure(figsize=(15, 7))
plt.scatter(range(len(up_up_series)),up_up_series, marker='o', c='g')
plt.axhline(y=0.0010, color='b', linestyle='-')
plt.scatter(range(len(up_down_series)),up_down_series, marker='o', c='r')
plt.axhline(y=-0.0010, color='b', linestyle='-')
plt.show()

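[Figure: maximum rise (green) and maximum drop (red) of the close price in the look-ahead window after each bullish crossover, with blue lines at ±0.0010]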

It seems that our indicator cannot identify a trend! The price can move in either direction, and most of the time it does not move as much as we expected. You might get better results if you start with indicators that were developed specifically to identify trends.

Train-Test split

Now, it is time to prepare our data for a classification learning method. Since I turned the problem into a classification and no longer deal with a time series, I can use the train_test_split function from scikit-learn.

train_df, test_df = train_test_split(ctgorized_df, test_size=0.2, random_state=42)
train_df['market-type'].value_counts()
UpTrend-NoTrend        1806
DownTrend-NoTrend      1797
DownTrend-PriceUp       189
UpTrend_PriceUp         182
UpTrend-PriceDown       179
DownTrend-PriceDown     154
Name: market-type, dtype: int64
test_df['market-type'].value_counts()
DownTrend-NoTrend      458
UpTrend-NoTrend        431
UpTrend_PriceUp         56
DownTrend-PriceDown     54
DownTrend-PriceUp       39
UpTrend-PriceDown       39
Name: market-type, dtype: int64
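
The split above is purely random, so the rare classes are not spread evenly between the two parts: DownTrend-PriceDown, for example, ends up with 154 rows in the training set and 54 in the test set. One option worth considering (I did not use it for the results in this post) is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on a made-up data frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data set (made-up values): 80 'NoTrend' rows and 20 'PriceUp' rows.
toy_df = pd.DataFrame({
    "feature": range(100),
    "market-type": ["NoTrend"] * 80 + ["PriceUp"] * 20,
})

# stratify=... asks scikit-learn to keep the class proportions equal in both parts.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["market-type"]
)

train_counts = toy_train["market-type"].value_counts()
test_counts = toy_test["market-type"].value_counts()
print(train_counts)
print(test_counts)
```

With stratification, both parts keep the 80/20 class ratio of the toy data.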

Preprocessing the data

train_df.columns
Index(['open', 'high', 'low', 'close', 'open_lag1', 'high_lag1', 'low_lag1',
       'close_lag1', 'open_lag2', 'high_lag2', 'low_lag2', 'close_lag2',
       'open_lag3', 'high_lag3', 'low_lag3', 'close_lag3', 'open_lag4',
       'high_lag4', 'low_lag4', 'close_lag4', 'open_lag5', 'high_lag5',
       'low_lag5', 'close_lag5', 'open_lag6', 'high_lag6', 'low_lag6',
       'close_lag6', 'sma20', 'sma50', 'sma100', 'trade_hour',
       'dis_to_max_12bar', 'dis_to_min_12bar', 'market-type'],
      dtype='object')
numeric_features = ['open', 'high', 'low', 'close', 'open_lag1', 'high_lag1', 'low_lag1',
       'close_lag1', 'open_lag2', 'high_lag2', 'low_lag2', 'close_lag2',
       'open_lag3', 'high_lag3', 'low_lag3', 'close_lag3', 'open_lag4',
       'high_lag4', 'low_lag4', 'close_lag4', 'open_lag5', 'high_lag5',
       'low_lag5', 'close_lag5', 'open_lag6', 'high_lag6', 'low_lag6',
       'close_lag6', 'sma20', 'sma50', 'sma100',
       'dis_to_max_12bar', 'dis_to_min_12bar', ]

ordinal_features = ['trade_hour']

drop_features = ['market-type']
numeric_transform = make_pipeline(StandardScaler())

preprocessor = make_column_transformer(
    (numeric_transform, numeric_features),
    ('passthrough', ordinal_features),
    ('drop', drop_features)
)

preprocessor.fit(train_df)

new_clomuns = numeric_features + ordinal_features
X_train  = pd.DataFrame(preprocessor.transform(train_df), index=train_df.index, columns=new_clomuns)
X_test  = pd.DataFrame(preprocessor.transform(test_df), index=test_df.index, columns=new_clomuns)

y_train = train_df['market-type']
y_test = test_df['market-type']

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(4307, 34) (1077, 34) (4307,) (1077,)

Classification

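Due to the class imbalance, the cross-validated accuracy alone is not a good criterion for validating our model: a model that only ever predicts the majority classes still gets a high score. To get a good understanding of how each class is classified, I use the confusion matrix.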
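
A tiny made-up example shows why accuracy is misleading here. The labels below are invented for illustration: a model that always predicts the majority class reaches a high accuracy while missing every sample of the minority class, which the confusion matrix makes obvious. The balanced accuracy (the mean of the per-class recalls) is included for comparison.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up labels: 8 'NoTrend' samples and 2 'PriceUp' samples.
y_true = ["NoTrend"] * 8 + ["PriceUp"] * 2
# A model that always predicts the majority class.
y_pred = ["NoTrend"] * 10

labels = ["NoTrend", "PriceUp"]
# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(cm)
print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}")
```

The accuracy is 0.80, yet the second row of the matrix shows that both 'PriceUp' samples were predicted as 'NoTrend', and the balanced accuracy drops to 0.50.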

from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion_matrix_classifier(clf):
    plt.rc('font', size=12)
    disp = ConfusionMatrixDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        display_labels=clf.classes_,  # labels of the classifier being plotted, not of the global `dummy`
        values_format="d",
        cmap=plt.cm.Blues,
        colorbar=False,
    )

    plt.xticks(rotation = 90)
    fig = disp.ax_.get_figure() 
    fig.set_figwidth(8)
    fig.set_figheight(8)

Dummy Classifier

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier()
dummy.fit(X_train, y_train)
pd.DataFrame(cross_validate(dummy, X_train, y_train, return_train_score=True)).mean()
fit_time       0.003735
score_time     0.001221
test_score     0.419318
train_score    0.419317
dtype: float64
plot_confusion_matrix_classifier(dummy)   #plotting the confusion matrix for X_test and y_test 

[Figure: confusion matrix of the dummy classifier on the test set]

RandomForest

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(max_depth=10, random_state=0)
rfc.fit(X_train, y_train)
pd.DataFrame(cross_validate(rfc, X_train, y_train, return_train_score=True)).mean()
fit_time       1.382421
score_time     0.031423
test_score     0.621547
train_score    0.740597
dtype: float64
plot_confusion_matrix_classifier(rfc)   #plotting the confusion matrix for X_test and y_test

[Figure: confusion matrix of the random forest classifier on the test set]

AdaBoost

from sklearn.ensemble import AdaBoostClassifier

adaboost = AdaBoostClassifier(n_estimators=100)
adaboost.fit(X_train, y_train)
pd.DataFrame(cross_validate(adaboost, X_train, y_train, return_train_score=True)).mean()
fit_time       2.602302
score_time     0.058126
test_score     0.573941
train_score    0.594499
dtype: float64
plot_confusion_matrix_classifier(adaboost)   #plotting the confusion matrix for X_test and y_test

[Figure: confusion matrix of the AdaBoost classifier on the test set]

Gradient Boosted Decision Trees

from sklearn.utils import class_weight
from sklearn.ensemble import HistGradientBoostingClassifier

hgbc = HistGradientBoostingClassifier(max_iter=100)
hgbc.fit(X_train, y_train)
pd.DataFrame(cross_validate(hgbc, X_train, y_train, return_train_score=True)).mean()
fit_time       4.151003
score_time     0.069076
test_score     0.601578
train_score    0.973763
dtype: float64
plot_confusion_matrix_classifier(hgbc)   #plotting the confusion matrix for X_test and y_test

[Figure: confusion matrix of the histogram-based gradient boosting classifier on the test set]

One-vs-All

from sklearn.multiclass import OneVsRestClassifier

ova_adaboost = OneVsRestClassifier(AdaBoostClassifier(n_estimators=100))
ova_adaboost.fit(X_train, y_train)
pd.DataFrame(cross_validate(ova_adaboost, X_train, y_train, return_train_score=True)).mean()

fit_time       11.210396
score_time      0.216500
test_score      0.624567
train_score     0.665022
dtype: float64
plot_confusion_matrix_classifier(ova_adaboost)   #plotting the confusion matrix for X_test and y_test

[Figure: confusion matrix of the one-vs-all AdaBoost classifier on the test set]

One-vs-One

from sklearn.multiclass import OneVsOneClassifier

ovo_adaboost = OneVsOneClassifier(AdaBoostClassifier(n_estimators=100))
ovo_adaboost.fit(X_train, y_train)
pd.DataFrame(cross_validate(ovo_adaboost, X_train, y_train, return_train_score=True)).mean()
fit_time       10.252223
score_time      1.013752
test_score      0.616438
train_score     0.676921
dtype: float64
plot_confusion_matrix_classifier(ovo_adaboost)   #plotting the confusion matrix for X_test and y_test

[Figure: confusion matrix of the one-vs-one AdaBoost classifier on the test set]

Support Vector Machine

from sklearn.svm import SVC

svc = SVC(kernel='rbf', probability=True)
svc.fit(X_train, y_train)
pd.DataFrame(cross_validate(svc, X_train, y_train, return_train_score=True)).mean()
fit_time       4.656176
score_time     0.231267
test_score     0.620849
train_score    0.625493
dtype: float64
plot_confusion_matrix_classifier(svc)   #plotting the confusion matrix for X_test and y_test

[Figure: confusion matrix of the support vector classifier on the test set]

Due to the imbalanced classes, we cannot properly validate the performance of our models from the scores alone, but by looking at the confusion matrices we can definitely say that all of the classification methods failed here! For sure, the imbalanced classes affected the results, but the nature of the financial market also plays a significant role, especially in short time frames (5 minutes here), where the price fluctuates a lot.

In the next blog post, I’ll explain how we can deal with the imbalanced classes.