Supervised Regression Model to predict forex price

2022, Oct 06    

In this post, I’m going to build a simple supervised regression model to predict tomorrow’s close price of the EUR-USD currency pair, based on past close prices and moving averages. In the end, I will discuss the error and show you how the model scores can be misleading.

Let me start with the yfinance package to get the forex data; it is one of the most convenient ways to get financial market data. First, install the yfinance package.

pip install yfinance  # use !pip in a notebook

Now, let’s import the libraries we need. Here, I use scikit-learn (sklearn), one of the most popular machine learning libraries in Python.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

plt.rcParams["font.size"] = 16
import yfinance as yf

To get more familiar with how to work with the yfinance package and download financial data, see here and here.

forex_data = yf.download('EURUSD=X', start='2018-01-01', end='2022-09-01') 
forex_data.index = pd.to_datetime(forex_data.index)
forex_data.head()
Open High Low Close Adj Close Volume
Date
2018-01-01 00:00:00+00:00 1.200495 1.201504 1.199904 1.200495 1.200495 0
2018-01-02 00:00:00+00:00 1.201086 1.208094 1.200855 1.201158 1.201158 0
2018-01-03 00:00:00+00:00 1.206200 1.206709 1.200495 1.206345 1.206345 0
2018-01-04 00:00:00+00:00 1.201129 1.209190 1.200495 1.201043 1.201043 0
2018-01-05 00:00:00+00:00 1.206622 1.208459 1.202154 1.206884 1.206884 0
forex_data.isnull().values.any()
False

In the forex data, the adjusted close price is identical to the close price, so I will drop that column. Also, yfinance reports the trading volume for forex pairs as zero, so let me drop that column as well.

forex_data = forex_data.drop(['Volume', 'Adj Close'], axis=1)
forex_data.columns
Index(['Open', 'High', 'Low', 'Close'], dtype='object')

Feature engineering

I want to predict the closing price of the EUR-USD pair on the next day (the dependent variable, or target) from features such as the prices on the previous days (the independent variables). However, each row of our data frame only contains the prices of a single day. What I need to do is prepare the data frame so that each row holds all the variables I need; that is called feature engineering. To do that, I shift the close price column and concatenate the results into a data frame.

close_prices_df = forex_data['Close'].to_frame()
for i in range(1, 7):
    # shift the close price by i days to create the lag-i feature
    _closed_lag = forex_data['Close'].to_frame().shift(i)
    _closed_lag = _closed_lag.rename(columns={'Close': f"Close_Lag{i}"})
    close_prices_df = pd.concat([close_prices_df, _closed_lag], axis=1)

close_prices_df
Close Close_Lag1 Close_Lag2 Close_Lag3 Close_Lag4 Close_Lag5 Close_Lag6
Date
2018-01-01 00:00:00+00:00 1.200495 NaN NaN NaN NaN NaN NaN
2018-01-02 00:00:00+00:00 1.201158 1.200495 NaN NaN NaN NaN NaN
2018-01-03 00:00:00+00:00 1.206345 1.201158 1.200495 NaN NaN NaN NaN
2018-01-04 00:00:00+00:00 1.201043 1.206345 1.201158 1.200495 NaN NaN NaN
2018-01-05 00:00:00+00:00 1.206884 1.201043 1.206345 1.201158 1.200495 NaN NaN
... ... ... ... ... ... ... ...
2022-08-25 00:00:00+01:00 0.996910 0.996691 0.993947 1.003522 1.008990 1.018019 1.017066
2022-08-26 00:00:00+01:00 0.997128 0.996910 0.996691 0.993947 1.003522 1.008990 1.018019
2022-08-29 00:00:00+01:00 0.993868 0.997128 0.996910 0.996691 0.993947 1.003522 1.008990
2022-08-30 00:00:00+01:00 1.001402 0.993868 0.997128 0.996910 0.996691 0.993947 1.003522
2022-08-31 00:00:00+01:00 1.002506 1.001402 0.993868 0.997128 0.996910 0.996691 0.993947

1217 rows × 7 columns

By shifting the data frame, we introduce null values that will need to be dropped later. Here, I also add the moving averages of the close price to the features.

def simple_moving_average(data, moving_average_list):
    df = pd.DataFrame()
    for ma in moving_average_list:
        # rolling mean of the close price over the last `ma` trading days
        sma = data['Close'].rolling(window=ma).mean().to_frame(name=f"sma{ma}")
        df = pd.concat([df, sma], axis=1)
    return df
sma_df = simple_moving_average(forex_data, [20, 50])
close_sma_prices_df = pd.concat([close_prices_df, sma_df], axis =1)
close_sma_prices_df
Close Close_Lag1 Close_Lag2 Close_Lag3 Close_Lag4 Close_Lag5 Close_Lag6 sma20 sma50
Date
2018-01-01 00:00:00+00:00 1.200495 NaN NaN NaN NaN NaN NaN NaN NaN
2018-01-02 00:00:00+00:00 1.201158 1.200495 NaN NaN NaN NaN NaN NaN NaN
2018-01-03 00:00:00+00:00 1.206345 1.201158 1.200495 NaN NaN NaN NaN NaN NaN
2018-01-04 00:00:00+00:00 1.201043 1.206345 1.201158 1.200495 NaN NaN NaN NaN NaN
2018-01-05 00:00:00+00:00 1.206884 1.201043 1.206345 1.201158 1.200495 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
2022-08-25 00:00:00+01:00 0.996910 0.996691 0.993947 1.003522 1.008990 1.018019 1.017066 1.015935 1.024769
2022-08-26 00:00:00+01:00 0.997128 0.996910 0.996691 0.993947 1.003522 1.008990 1.018019 1.014830 1.023618
2022-08-29 00:00:00+01:00 0.993868 0.997128 0.996910 0.996691 0.993947 1.003522 1.008990 1.013482 1.022513
2022-08-30 00:00:00+01:00 1.001402 0.993868 0.997128 0.996910 0.996691 0.993947 1.003522 1.012245 1.021499
2022-08-31 00:00:00+01:00 1.002506 1.001402 0.993868 0.997128 0.996910 0.996691 0.993947 1.011592 1.020484

1217 rows × 9 columns

Also, we need to add the next day’s close price to our data frame. This column is our target: the values we want to predict. To add it, we shift the Close column of our data frame backward by one day.

prices_df_final = close_sma_prices_df.assign(Tomorrow_Close = close_sma_prices_df['Close'].shift(-1))
prices_df_final[['Close', 'Tomorrow_Close']].tail()
Close Tomorrow_Close
Date
2022-08-25 00:00:00+01:00 0.996910 0.997128
2022-08-26 00:00:00+01:00 0.997128 0.993868
2022-08-29 00:00:00+01:00 0.993868 1.001402
2022-08-30 00:00:00+01:00 1.001402 1.002506
2022-08-31 00:00:00+01:00 1.002506 NaN

Now, let’s drop all the null values.

prices_df_final = prices_df_final.dropna()
prices_df_final
Close Close_Lag1 Close_Lag2 Close_Lag3 Close_Lag4 Close_Lag5 Close_Lag6 sma20 sma50 Tomorrow_Close
Date
2018-03-09 00:00:00+00:00 1.230663 1.241465 1.241665 1.233654 1.231542 1.227084 1.219126 1.233603 1.226892 1.230875
2018-03-12 00:00:00+00:00 1.230875 1.230663 1.241465 1.241665 1.233654 1.231542 1.227084 1.233882 1.227499 1.233958
2018-03-13 00:00:00+00:00 1.233958 1.230875 1.230663 1.241465 1.241665 1.233654 1.231542 1.234062 1.228155 1.239234
2018-03-14 00:00:00+00:00 1.239234 1.233958 1.230875 1.230663 1.241465 1.241665 1.233654 1.234255 1.228813 1.237562
2018-03-15 00:00:00+00:00 1.237562 1.239234 1.233958 1.230875 1.230663 1.241465 1.241665 1.233796 1.229543 1.230921
... ... ... ... ... ... ... ... ... ... ...
2022-08-24 00:00:00+01:00 0.996691 0.993947 1.003522 1.008990 1.018019 1.017066 1.016198 1.017136 1.025744 0.996910
2022-08-25 00:00:00+01:00 0.996910 0.996691 0.993947 1.003522 1.008990 1.018019 1.017066 1.015935 1.024769 0.997128
2022-08-26 00:00:00+01:00 0.997128 0.996910 0.996691 0.993947 1.003522 1.008990 1.018019 1.014830 1.023618 0.993868
2022-08-29 00:00:00+01:00 0.993868 0.997128 0.996910 0.996691 0.993947 1.003522 1.008990 1.013482 1.022513 1.001402
2022-08-30 00:00:00+01:00 1.001402 0.993868 0.997128 0.996910 0.996691 0.993947 1.003522 1.012245 1.021499 1.002506

1167 rows × 10 columns

Here, I only used numerical features derived from the prices. With time series data, it is very common to also include time and date components among the features; most time series show some periodic behavior, and leaving out the time and date means those patterns cannot be captured. However, I think that specifically for forex data this is not a good idea! I will explain my reasons in a separate blog post.

Train-test split

The common way of evaluating the performance of a machine learning model before using it is to test it on a portion of the data. What matters here is that the portion we hold out for testing must not be used to train the model. To make sure this does not happen by accident, sklearn has a dedicated function called train_test_split, which splits our data into train and test sets. However, time series are a different story! We have a series of events that happen over time, and if we want to predict something, we are not allowed to know anything from the future. train_test_split samples randomly from the data, and we do not want to train our model on data that came after our test data! The simplest approach here is to split the data at a specific date: everything before that date is our training data, and everything after it is our test data.

prices_df_final.isnull().any().values
array([False, False, False, False, False, False, False, False, False,
       False])
n_train = int(0.8 * len(prices_df_final))

train_df = prices_df_final.iloc[:n_train, :]
test_df = prices_df_final.iloc[n_train:, :]
print(train_df.shape, test_df.shape)
(933, 10) (234, 10)

I split the data based on its length: about 80% of the data is used for training our model, and 20% for testing it.

print(train_df.index.max(), test_df.index.min())
2021-10-06 00:00:00+01:00 2021-10-07 00:00:00+01:00

Preprocessing

Here, I’m going to define the preprocessing steps that prepare the data for fitting the models.

prices_df_final.columns
Index(['Close', 'Close_Lag1', 'Close_Lag2', 'Close_Lag3', 'Close_Lag4',
       'Close_Lag5', 'Close_Lag6', 'sma20', 'sma50', 'Tomorrow_Close'],
      dtype='object')
numerical_features = [
    'Close', 'Close_Lag1', 'Close_Lag2', 'Close_Lag3', 'Close_Lag4',
       'Close_Lag5', 'Close_Lag6', 'sma20', 'sma50',
]

drop_features = [
    'Tomorrow_Close',
]
def preprocess_features(
    train_df,
    test_df,
    numerical_features,
    drop_features,
):
    numeric_transfer = StandardScaler()

    # scale the numerical features and drop the target column
    preprocessor = make_column_transformer(
        (numeric_transfer, numerical_features),
        ("drop", drop_features),
    )

    # fit the scaler on the training data only, to avoid leaking test-set statistics
    preprocessor.fit(train_df)
    new_columns = numerical_features

    X_train = pd.DataFrame(
        preprocessor.transform(train_df), index=train_df.index, columns=new_columns
    )
    X_test = pd.DataFrame(
        preprocessor.transform(test_df), index=test_df.index, columns=new_columns
    )

    y_train = train_df['Tomorrow_Close']
    y_test = test_df['Tomorrow_Close']

    return X_train, X_test, y_train, y_test, preprocessor
X_train, X_test, y_train, y_test, preprocessor = preprocess_features(
    train_df,
    test_df,
    numerical_features,
    drop_features,
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(933, 9) (234, 9) (933,) (234,)

The model will be trained on X_train and y_train, and evaluated on X_test and y_test. The y_test series is our target, and we want our predictions to be as close to it as possible.

Supervised Learning

def model_evaluation(X_train, X_test, y_train, y_test, regressor_model):
    regressor_model.fit(X_train, y_train)
    print("Train-set R^2: {:.5f}".format(regressor_model.score(X_train, y_train)))
    print("Test-set R^2: {:.5f}".format(regressor_model.score(X_test, y_test)))

    y_pred = regressor_model.predict(X_test)
    y_pred_train = regressor_model.predict(X_train)
    plt.figure(figsize=(10, 3))

    # plot only the last `start_day` trading days of the training set
    start_day = 200
    plt.plot(range(len(y_train) - start_day, n_train), y_train.iloc[len(y_train) - start_day:], label="train")
    plt.plot(range(len(y_train) - start_day, n_train), y_pred_train[len(y_train) - start_day:], "--", label="prediction train")

    # the test set continues where the training set ends
    plt.plot(range(n_train, len(y_test) + n_train), y_test, "-", label="test")
    plt.plot(
        range(n_train, len(y_test) + n_train), y_pred, "--", label="prediction test"
    )
    plt.legend(loc=(1.01, 0))
    plt.xlabel("Trading day")
    plt.ylabel("Close Price")

DummyRegressor:

Let’s start with a dummy regression model, built into sklearn. By default, DummyRegressor always predicts the mean of the training targets, so its train-set R^2 is exactly zero, and a negative test-set score means it does worse than predicting the test-set mean. It is good practice to compare our model against a dummy model, to make sure it produces meaningful results (i.e., not similar to a dummy model!).

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor()
model_evaluation(X_train, X_test, y_train, y_test, dummy)
Train-set R^2: 0.00000
Test-set R^2: -1.57690

[Figure: dummy model predictions vs. actual close prices on the train and test sets]

Random Forest Regressor:

Do not use tree-based models with time series data, especially when time and dates are among the features of your model! Tree-based models cannot extrapolate: they can never predict target values outside the range seen during training, which is a serious limitation when prices trend to new levels. Here, I did not include the date-time in my data, so let’s see how the random forest regressor performs.

from sklearn.ensemble import RandomForestRegressor

RFRegressor = RandomForestRegressor(n_estimators=100, random_state=0)
model_evaluation(X_train, X_test, y_train, y_test, RFRegressor)
Train-set R^2: 0.99802
Test-set R^2: 0.50661

[Figure: random forest predictions vs. actual close prices on the train and test sets]

This is a typical result you might get when using a tree-based model with time series data: a near-perfect score on the training set and a much weaker one on the test set!

Support vector machine with RBF kernel:

Let me try another well-known regression model.

from sklearn.svm import SVR
svm = SVR(kernel="rbf", C=1, gamma=0.002, epsilon=0.001)
model_evaluation(X_train, X_test, y_train, y_test, svm)
Train-set R^2: 0.98759
Test-set R^2: 0.95122

[Figure: SVR predictions vs. actual close prices on the train and test sets]

I might get better results by tuning the hyperparameters of the model, but I prefer to leave it as it is and check another model!

Ridge:

Ridge is a linear regression model with L2 regularization, available in the sklearn package.

from sklearn.linear_model import Ridge

lr_ridge = Ridge(alpha=1)
model_evaluation(X_train, X_test, y_train, y_test, lr_ridge)
Train-set R^2: 0.98762
Test-set R^2: 0.98675

[Figure: Ridge predictions vs. actual close prices on the train and test sets]

The model’s R-squared score is above 0.98 on our test data! Isn’t that promising?! Well, actually not! Interpreting R-squared here is a bit tricky. Let me calculate the Mean Absolute Error; that will give us better insight.

def mae(true, pred):
    # mean absolute error
    return np.mean(np.abs(pred - true))
y_pred = lr_ridge.predict(X_test)

ridge_scores = {"R^2": [lr_ridge.score(X_test, y_test)],
                "MAE": [mae(y_test, y_pred)]
                }

pd.DataFrame.from_dict(ridge_scores)
R^2 MAE
0 0.986752 0.004311

The Mean Absolute Error is 0.0043! It means that, on average, our prediction is off by 0.0043. At first glance, that seems very low! But to understand what this number really shows, you need to know a little bit about forex trading. Forex traders calculate their gains and losses in a unit called a pip. One pip is a change in the value of the currency pair by 0.0001 (it may be different for some currency pairs). So, 0.0043 means a 43-pip error! If you trade one “standard lot” (another trading term, 100,000 units of the base currency), a 43-pip error means about $430. That is a huge error on every single trade!

td_start = 0
td_end = 20
plt.figure(figsize=(10, 3))
plt.plot(range(td_start, td_end), y_test[td_start:td_end], 'b', label='tomorrow Close')
plt.plot(range(td_start, td_end), y_pred[td_start:td_end], 'r', label='prediction Close')
plt.plot(range(td_start, td_end), test_df['Close'][td_start:td_end], 'g', label='today Close')
plt.xticks(rotation="vertical")
plt.legend();

[Figure: tomorrow’s close, predicted close, and today’s close for the first 20 test days]

As can be seen, the model essentially predicts tomorrow’s close price to be today’s close price! The R-squared score suggests a tremendous outcome for our model only because these two consecutive prices are almost always very close to each other.