
Indicator Improvement - Part 2 - Dealing with Imbalanced Classes
In the previous post, I tried to improve the outcome of an indicator that distinguishes trends in price changes. I showed that we can reshape the time series data and define a classification problem. However, the classification models I used failed because of the class imbalance. In this post, I am going to talk about some possible ways to deal with imbalanced classes.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    TimeSeriesSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
plt.rcParams["font.size"] = 16
from datetime import datetime
The final data frame that I used for supervised classification methods:
ctgorized_df.head()
| | open | high | low | close | open_lag1 | high_lag1 | low_lag1 | close_lag1 | open_lag2 | high_lag2 | ... | high_lag6 | low_lag6 | close_lag6 | sma20 | sma50 | sma100 | trade_hour | dis_to_max_12bar | dis_to_min_12bar | market-type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.87138 | 0.87138 | 0.87064 | 0.87103 | 0.87093 | 0.87156 | 0.87056 | 0.87139 | 0.87088 | 0.87134 | ... | 0.87217 | 0.87121 | 0.87161 | 0.871588 | 0.871522 | 0.872727 | 10.0 | 0.00291 | 0.00015 | UpTrend-PriceDown |
1 | 0.86942 | 0.87013 | 0.86900 | 0.87012 | 0.86939 | 0.86976 | 0.86913 | 0.86943 | 0.86988 | 0.87028 | ... | 0.87070 | 0.87012 | 0.87050 | 0.871093 | 0.871101 | 0.872387 | 10.0 | 0.00127 | 0.00076 | DownTrend-PriceUp |
2 | 0.86411 | 0.86417 | 0.86403 | 0.86409 | 0.86409 | 0.86416 | 0.86407 | 0.86409 | 0.86395 | 0.86412 | ... | 0.86400 | 0.86353 | 0.86389 | 0.863412 | 0.863401 | 0.864325 | 22.0 | 0.00042 | 0.00121 | UpTrend-NoTrend |
3 | 0.86455 | 0.86470 | 0.86453 | 0.86469 | 0.86441 | 0.86456 | 0.86422 | 0.86456 | 0.86467 | 0.86468 | ... | 0.86483 | 0.86466 | 0.86471 | 0.864797 | 0.864814 | 0.864159 | 2.0 | 0.00019 | 0.00028 | DownTrend-NoTrend |
4 | 0.86504 | 0.86504 | 0.86430 | 0.86468 | 0.86527 | 0.86536 | 0.86500 | 0.86502 | 0.86504 | 0.86530 | ... | 0.86484 | 0.86461 | 0.86479 | 0.864798 | 0.864798 | 0.864729 | 6.0 | 0.00065 | 0.00015 | UpTrend_PriceUp |
5 rows × 35 columns
ctgorized_df['market-type'].value_counts()
DownTrend-NoTrend 2255
UpTrend-NoTrend 2237
UpTrend_PriceUp 238
DownTrend-PriceUp 228
UpTrend-PriceDown 218
DownTrend-PriceDown 208
Name: market-type, dtype: int64
There are several methods for dealing with imbalanced classes in machine learning classification. Some define cost functions weighted by the class populations, which is called cost-sensitive learning. Another family of methods changes the way we sample: in simple words, take more samples from the low-population classes (oversampling) or fewer samples from the highly populated classes (undersampling). Of course, the way these methods actually do this is more sophisticated than that. I am going to use one of these sampling methods here and compare its outcome with what I found in my previous blog post. If you are interested in learning more about the sampling methods, take a look at this blog post.
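As a side note, the cost-sensitive route is often just one argument away in sklearn: many classifiers accept class_weight='balanced', which weights each class inversely to its frequency. A minimal sketch of that alternative (not what I use in the rest of this post; the weighted_rfc and weighted_lr names are only for illustration):
# cost-sensitive alternative: re-weight the classes instead of resampling the data
# 'balanced' sets each class weight to n_samples / (n_classes * class_count)
weighted_rfc = RandomForestClassifier(max_depth=10, class_weight='balanced', random_state=0)
weighted_lr = LogisticRegression(max_iter=2000, class_weight='balanced')
# both would then be fit on the (still imbalanced) training data, e.g. weighted_rfc.fit(X_train, y_train)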
Using sampling methods to balance the classes has drawbacks: you either throw away information or encourage the model to overfit. However, sometimes the nature of your data can help you reduce the class-imbalance problem. In the previous post, I mentioned that although the forex market is open 24 hours a day, that does not mean it is always active. Let me show you, for each hour of the day, the standard deviation of the distances between the close price and its maximum/minimum over 12 data points (one hour).
# standard deviation of the distance from the close to its 12-bar maximum/minimum, grouped by hour of day
diff_to_max = ctgorized_df.groupby('trade_hour')['dis_to_max_12bar'].std()
diff_to_min = ctgorized_df.groupby('trade_hour')['dis_to_min_12bar'].std()
volatility = diff_to_max + diff_to_min
plt.figure()
plt.rcParams["figure.figsize"] = (10, 7)
plt.bar(volatility.index, volatility)
plt.axhline(y=0.0010, color='r', linestyle='-')  # 10-pip reference line
plt.show()
As the plot shows, during some hours of the day the sum of the two standard deviations is below 10 pips (0.0010). So the probability of a price change of more than 10 pips during those hours is low.
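If you want to see exactly which hours stay below that threshold, a quick check on the volatility series (using the 10-pip cutoff from the plot; low_vol_hours is just an illustrative name) looks like this:
# hours of the day whose combined standard deviation stays below 10 pips (0.0010)
low_vol_hours = volatility[volatility < 0.0010].index.tolist()
print(low_vol_hours)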
One thing I should emphasize here: I am not doing classification for its own sake, I am looking for profitable trades! If the market is not volatile enough during certain hours, I simply do not have to trade then. Let's remove those hours from the data and look at the class counts again.
ctgorized_sub_df = ctgorized_df[(ctgorized_df['trade_hour']>= 7) & (ctgorized_df['trade_hour']<= 19)]
ctgorized_sub_df['market-type'].value_counts()
DownTrend-NoTrend 993
UpTrend-NoTrend 974
UpTrend_PriceUp 200
DownTrend-PriceUp 173
UpTrend-PriceDown 166
DownTrend-PriceDown 163
Name: market-type, dtype: int64
That is much better! Now let's see how the classification methods perform. As always, we split the data into train and test data frames and fit our models.
Train-test split
train_df, test_df = train_test_split(ctgorized_sub_df, test_size=0.2, random_state=42)
train_df['market-type'].value_counts()
DownTrend-NoTrend 813
UpTrend-NoTrend 766
UpTrend_PriceUp 159
UpTrend-PriceDown 138
DownTrend-PriceDown 130
DownTrend-PriceUp 129
Name: market-type, dtype: int64
test_df['market-type'].value_counts()
UpTrend-NoTrend 208
DownTrend-NoTrend 180
DownTrend-PriceUp 44
UpTrend_PriceUp 41
DownTrend-PriceDown 33
UpTrend-PriceDown 28
Name: market-type, dtype: int64
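A side note: since the rows come from a time series, you may prefer to preserve the temporal order rather than splitting randomly, for example by passing shuffle=False to train_test_split, or by using the TimeSeriesSplit imported at the top for cross-validation. A rough sketch of the latter, not used below:
# chronological cross-validation folds instead of a single random split (illustrative only)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in tscv.split(ctgorized_sub_df):
    fold_train = ctgorized_sub_df.iloc[train_idx]
    fold_valid = ctgorized_sub_df.iloc[valid_idx]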
Preprocessing the data
train_df.columns
Index(['open', 'high', 'low', 'close', 'open_lag1', 'high_lag1', 'low_lag1',
'close_lag1', 'open_lag2', 'high_lag2', 'low_lag2', 'close_lag2',
'open_lag3', 'high_lag3', 'low_lag3', 'close_lag3', 'open_lag4',
'high_lag4', 'low_lag4', 'close_lag4', 'open_lag5', 'high_lag5',
'low_lag5', 'close_lag5', 'open_lag6', 'high_lag6', 'low_lag6',
'close_lag6', 'sma20', 'sma50', 'sma100', 'trade_hour',
'dis_to_max_12bar', 'dis_to_min_12bar', 'market-type'],
dtype='object')
numeric_features = ['open', 'high', 'low', 'close', 'open_lag1', 'high_lag1', 'low_lag1',
                    'close_lag1', 'open_lag2', 'high_lag2', 'low_lag2', 'close_lag2',
                    'open_lag3', 'high_lag3', 'low_lag3', 'close_lag3', 'open_lag4',
                    'high_lag4', 'low_lag4', 'close_lag4', 'open_lag5', 'high_lag5',
                    'low_lag5', 'close_lag5', 'open_lag6', 'high_lag6', 'low_lag6',
                    'close_lag6', 'sma20', 'sma50', 'sma100',
                    'dis_to_max_12bar', 'dis_to_min_12bar']
ordinal_features = ['trade_hour']
drop_features = ['market-type']
numeric_transform = make_pipeline(StandardScaler())
preprocessor = make_column_transformer(
    (numeric_transform, numeric_features),
    ('passthrough', ordinal_features),
    ('drop', drop_features),
)
preprocessor.fit(train_df)
new_columns = numeric_features + ordinal_features
X_train = pd.DataFrame(preprocessor.transform(train_df), index=train_df.index, columns=new_columns)
X_test = pd.DataFrame(preprocessor.transform(test_df), index=test_df.index, columns=new_columns)
y_train = train_df['market-type']
y_test = test_df['market-type']
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(2135, 34) (534, 34) (2135,) (534,)
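As an aside, if your scikit-learn version is 1.0 or newer, the fitted ColumnTransformer can generate the output column names for you instead of assembling new_columns by hand (X_train_alt below is just an illustrative name); roughly:
# alternative: let the fitted transformer name the output columns (scikit-learn >= 1.0)
# note: names come back prefixed with the transformer name, e.g. 'pipeline__open',
# unless you pass verbose_feature_names_out=False to make_column_transformer
feature_names = preprocessor.get_feature_names_out()
X_train_alt = pd.DataFrame(preprocessor.transform(train_df), index=train_df.index, columns=feature_names)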
Classification
I only use the random oversampling method to deal with the imbalanced classes. After producing a balanced training data frame with this method (via the imbalanced-learn library), I do the classification with the Random Forest and Support Vector Machine models from the sklearn library.
# confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion_matrix_classifier(clf):
    # plot the confusion matrix of a fitted classifier on the test set
    plt.rc('font', size=12)
    disp = ConfusionMatrixDisplay.from_estimator(
        clf,
        X_test,
        y_test,
        display_labels=clf.classes_,
        values_format="d",
        cmap=plt.cm.Blues,
        colorbar=False,
    )
    plt.xticks(rotation=90)
    fig = disp.ax_.get_figure()
    fig.set_figwidth(8)
    fig.set_figheight(8)
Random over sampling
# import library
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=123)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
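By default, RandomOverSampler duplicates minority-class rows until every class matches the majority-class count. A quick sanity check (output not shown):
# sanity check: each class should now have the same count as the majority class
classes, counts = np.unique(y_train_ros, return_counts=True)
print(dict(zip(classes, counts)))
print(X_train_ros.shape)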
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=10, random_state=0)
rfc.fit(X_train_ros, y_train_ros)
plot_confusion_matrix_classifier(rfc) #plotting the confusion matrix for X_test and y_test
from sklearn.svm import SVC
svc = SVC(kernel='rbf', probability=True)
svc.fit(X_train_ros, y_train_ros)
plot_confusion_matrix_classifier(svc) #plotting the confusion matrix for X_test and y_test
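The confusion matrices are the main diagnostic here, but if you also want per-class numbers, scikit-learn's classification_report can complement them (output not shown):
# per-class precision, recall and F1 on the held-out test set
from sklearn.metrics import classification_report
print(classification_report(y_test, rfc.predict(X_test)))
print(classification_report(y_test, svc.predict(X_test)))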
Compared with the previous outcomes of our supervised learning models, balancing the classes gives a significant improvement. Classification-wise, however, the results are still not acceptable. One big reason goes back to the nature of market fluctuations, which can suddenly change direction; this effect is less significant on larger time frames. If you are interested, you can also try this methodology with other trend indicators, which might give better outcomes.