Reinforcement Learning for trading

I’ve been learning about machine learning for a while now and I was wondering whether reinforcement learning can be applied to trading. Trading is a unique challenge because price movements are often described as “mostly” a random walk. Unless somebody trades with a LARGE sum of money, it’s people trading against algorithms. So can you trade with a reinforcement learning model?

I’m not very familiar with reinforcement learning algorithms, so I asked ChatGPT about it, and through some trial and error, plus some of my own knowledge, I got something functional. Reinforcement learning is what it sounds like. The model is trained in a Pavlovian way: it’s given rewards and punishments and learns what to output based on them. You’ll shortly see all of these things in code. The model runs but appears to get stuck at a local maximum or minimum. So if you want to stop reading here, feel free to do so. If you can help me improve this model or are interested in learning how to get started, continue reading and leave comments below.

In general, here are the things you need to do to get started. I’m not an expert by any means so if you know of better ways to accomplish this feel free to comment.

  1. Gather data. In this case, I used SPY, the ETF that tracks the S&P 500. The larger your dataset the better.
  2. Load the data and split it into a training and testing set.
  3. Determine your input space and output space.
    1. Input space: What are all of your input parameters, for example open, high, low, close, balance, profits, etc? 
    2. Output space: What are your actual actions? buy, sell, hold, etc
  4. Define your reward.
    1. Are you rewarding profitable trades, or what about not repeating the same trades again and again?
  5. Define your state.
    1. What does the state of your application look like in each iteration?
    2. This will be added to the input layer and will change as a result of your output layer
  6. Then determine your model architecture
    1. What layers are you going to implement and what are their activation functions?
    2. Make sure that everything converges to the output space
    3. What loss function are you going to use?
  7. Create your learning loop
    1. Iterate through your data
    2. Get your current state
    3. Have the model predict the next action
    4. Apply the actions to the state
    5. Calculate your reward
    6. Adjust the model with the reward
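
The loop in step 7 can be sketched roughly as follows. This is only a skeleton under stated assumptions: `choose_action` is a hypothetical stand-in that picks a random action where a trained model would go, the state is reduced to two prices, and the reward is the change in portfolio value by the next step, which is not the reward used later in this post.

```python
import random

ACTIONS = ["buy", "sell", "hold"]  # hypothetical labels for the output space

def get_state(prices, t):
    # State here is just (current price, previous price); an assumption for brevity
    prev = prices[t - 1] if t > 0 else prices[t]
    return (prices[t], prev)

def choose_action(state):
    # Stand-in for a model prediction: picks a random action
    return random.choice(ACTIONS)

def run_episode(prices, starting_balance=10_000.0):
    balance, shares = starting_balance, 0.0
    history = []
    for t in range(len(prices) - 1):          # 7.1 iterate through the data
        state = get_state(prices, t)          # 7.2 get the current state
        action = choose_action(state)         # 7.3 the model picks an action
        price = prices[t]
        if action == "buy" and balance > 0:   # 7.4 apply the action to the state
            shares, balance = balance / price, 0.0
        elif action == "sell" and shares > 0:
            balance, shares = shares * price, 0.0
        # 7.5 reward: change in portfolio value by the next step
        value_now = balance + shares * price
        value_next = balance + shares * prices[t + 1]
        reward = value_next - value_now
        history.append((action, reward))      # 7.6 a real agent would update the model here
    return history

history = run_episode([100.0, 101.0, 99.0, 102.0])
print(history)
```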

Before we get started, note that there are many ways to accomplish this. I kept it simple and stayed with TensorFlow. But from my understanding, OpenAI’s Gym is the gold standard for reinforcement learning. See:

Okay, so let’s get started.

Gathering data. There are many APIs available for retrieving stock and financial data. Since we are dealing with historical data, I decided to use Alpaca’s free API. Remember to set api_key and api_secret, pick your symbol, and rename the output CSV file accordingly.

from alpaca.data.timeframe import TimeFrame
from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest
import pandas as pd
from datetime import datetime
# Set up Alpaca API
# Make sure to set these values
api_key = '' 
api_secret = ''
base_url = '' # use '' for live trading
api = StockHistoricalDataClient(api_key=api_key, secret_key=api_secret)

# Define the symbol and time period of the data
symbol = 'SPY'
start_date = '2022-01-01'
end_date = '2023-05-01'
start_date = datetime.strptime(start_date, "%Y-%m-%d").isoformat()
end_date = datetime.strptime(end_date, "%Y-%m-%d").isoformat()

# Get the historical data from Alpaca API
barset = api.get_stock_bars(StockBarsRequest(symbol_or_symbols=symbol, timeframe=TimeFrame.Hour, start=start_date, end=end_date))
bars = barset[symbol]

# Convert the data to a pandas DataFrame
# (DataFrame.append was removed in pandas 2.0, so collect the rows in a list first)
rows = []
for bar in bars:
    date = bar.timestamp.strftime("%Y-%m-%d")
    rows.append({'date': date, 'open': bar.open, 'high': bar.high, 'low': bar.low, 'close': bar.close, 'volume': bar.volume})
df = pd.DataFrame(rows, columns=['date', 'open', 'high', 'low', 'close', 'volume'])

# Save the data to a CSV file
df.to_csv('SPY.csv', index=False)

Then you can split the data like so.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import random

# Load the data
data = pd.read_csv('SPY.csv')

# Train-test split
train_data, test_data = train_test_split(data, test_size=0.2, shuffle=False)
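
One detail that may matter later: with shuffle=False the split is chronological, but the test set keeps its original row labels, so a lookup like test_data.loc[0, 'open'] would fail. Here is a small sketch of the same 80/20 split done by hand with the index reset. The prices are made up for illustration and the plain iloc slicing is my substitute for train_test_split, not what the code above uses.

```python
import pandas as pd

# Toy stand-in for SPY.csv (values are made up for illustration)
data = pd.DataFrame({
    "date": pd.date_range("2022-01-03", periods=10, freq="D").strftime("%Y-%m-%d"),
    "close": [100, 101, 102, 101, 103, 104, 103, 105, 106, 107],
})

split_at = int(len(data) * 0.8)   # first 80% of rows for training
train_data = data.iloc[:split_at]
# Restart the index at 0 so label-based lookups like .loc[0, ...] work on the test set
test_data = data.iloc[split_at:].reset_index(drop=True)

# The last training date should come before the first test date
print(train_data["date"].iloc[-1], "<", test_data["date"].iloc[0])
```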

Determine your input space.

Your input space is what you want to feed the model for it to take action. In the stocks world, this could be things like stock pricing data, indicators, trends, or even portfolio balance. In other verticals, it could be pixels of an image (for games), player location, or mapped vector space (for things like self-driving). In my case, I kept things simple and used the current price, previous price, current volume, and previous volume.

Determine your output space.

Your output space is what your model will output to take action. Applying this to the examples above, this could be keyboard input, mouse input, or buy/sell signals, or in the real world it could even be controller outputs to motors that turn the steering wheel or accelerate/decelerate. A bit of warning: I used ChatGPT for this part and it produced an odd-looking array. The output space has three entries, one per action, and each entry is a three-element one-hot vector with a 1 marking the active action. My actions are buy, sell, and hold. I also chose to keep short selling out of the picture to keep things simple.

action_space = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # buy, sell, hold
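
To make the one-hot layout a bit more concrete, here is a minimal sketch of how a model output gets mapped to one of these actions. The q_values array and the action_names list are made up for illustration.

```python
import numpy as np

action_space = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # buy, sell, hold
action_names = ["buy", "sell", "hold"]            # hypothetical labels, same order

q_values = np.array([0.2, 0.7, 0.1])              # made-up model output
action_index = int(np.argmax(q_values))           # position of the largest value
action = action_space[action_index]               # matching one-hot vector

print(action_index, action, action_names[action_index])
```

Since the position of the 1 is the only information in each vector, carrying the integer index around would work just as well.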

Reward function

This is probably the most important part of this post. Your reward function will determine everything. It has a large influence on which maximum or minimum the model settles into and is likely where most of my mistakes are. If you apply this to the previous examples, the reward could be finishing a level, or not crashing into anything. When I first worked on this section, my model got stuck repeating the same action again and again. It kept trying to short-sell, which I prevented manually by not letting the state change if that happens. So I had to modify the reward function to punish the model if the action was the same. I also had to add some randomization to get the model out of the local maximum or minimum. It’s sometimes profitable and sometimes sits at a loss. In the end, this is what I ended up with.

# Define the reward function
def get_reward(profit, rebuy_punishment, action, q_values):
    reward = 0
    exploration_rate = .2
    if action != action_space[np.argmax(q_values)] and random.random() < exploration_rate:
        reward = 0.1
    # profit as a fraction of the 10,000 starting balance, scaled by 1000, plus the rebuy punishment
    return reward + (profit / 10000) * 1000 + rebuy_punishment
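
To see how the pieces of this reward compare in size, here is a small sketch that does the same arithmetic but returns each term separately. The reward_terms helper and the input numbers are made up for illustration. Keep in mind that in the learning loop further down, both profit and the rebuy punishment accumulate over time, so these terms can grow well beyond the 0.1 exploration bonus.

```python
def reward_terms(profit, rebuy_punishment, exploration_bonus=0.0):
    # Same arithmetic as get_reward, split into parts so each can be inspected
    profit_term = (profit / 10000) * 1000
    return {
        "profit_term": profit_term,
        "rebuy_punishment": rebuy_punishment,
        "exploration_bonus": exploration_bonus,
        "total": profit_term + rebuy_punishment + exploration_bonus,
    }

# Made-up inputs: 50 in profit, three failed rebuys, exploration bonus granted
terms = reward_terms(profit=50, rebuy_punishment=-3, exploration_bonus=0.1)
print(terms)
```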

Define your state

The state function is what’s going into your input space so I defined it as a function that returns the current price, previous price, current volume, and previous volume as a function of time. The result of this function is what gets sent into the input space each iteration.

def get_state(data, time_step):
    if time_step == 0:
        # No previous bar exists yet, so reuse the current values
        current_price = data.loc[time_step, 'open']
        previous_price = current_price
        current_volume = data.loc[time_step, 'volume']
        previous_volume = current_volume
    else:
        current_price = data.loc[time_step, 'open']
        previous_price = data.loc[time_step-1, 'open']
        current_volume = data.loc[time_step, 'volume']
        previous_volume = data.loc[time_step-1, 'volume']
    state = np.array([current_price, previous_price, current_volume, previous_volume])
    # state = np.reshape(state, (1, -1))  # Reshape the state to have shape (1, input_dim)
    return state
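
Here is a quick usage sketch on a tiny made-up DataFrame, to show what the state vector looks like at the first step and at a later step. The get_state below is a compact equivalent written for this example, and the bars are invented.

```python
import numpy as np
import pandas as pd

def get_state(data, time_step):
    # At the first step there is no previous bar, so reuse the current one
    prev_step = max(time_step - 1, 0)
    return np.array([
        data.loc[time_step, 'open'],    # current price
        data.loc[prev_step, 'open'],    # previous price
        data.loc[time_step, 'volume'],  # current volume
        data.loc[prev_step, 'volume'],  # previous volume
    ])

# Made-up bars for illustration
bars = pd.DataFrame({'open': [400.0, 402.5, 401.0], 'volume': [1000, 1500, 1200]})

print(get_state(bars, 0))        # first step: previous values equal current values
print(get_state(bars, 2))        # later step: current and previous bars differ
print(get_state(bars, 2).shape)  # four features, matching input_shape=(4,)
```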

Model Architecture

This is a tricky area to discuss because there is so much variability. You can choose what works best for your application. From my understanding, the number of layers and the number of neurons in each layer affect how much the model can change each iteration and how much training is needed, and the right choice depends on your input and output space. Remember that each connection between neurons carries a weight, and those weights are adjusted during training until the loss settles into a minimum (global or local) and the outputs match the data you trained on. Each neuron applies an activation function to its weighted inputs and passes the result to the next layer, and the process repeats until you have an output. This output then triggers an action, which affects the reward, and the reward is used to build the target that the weights are adjusted toward. This is different from a traditional supervised model because the weights shift based on a reward instead of labels taken directly from a training dataset. To learn what is best for your application, this is a good read:

It covers what its title says: how to choose a loss function when training deep-learning neural networks.

In my case, I chose categorical_crossentropy because I want explicit actions as a result and each action is essentially a class. 

As far as the number of layers goes, I went with what ChatGPT suggested, but if you want to learn more about how to choose what works for you, see this website:

Finally, you need to choose your activation functions.

In my case, I chose relu (Rectified Linear Unit) activation functions because I’m trying to fit something like a regression on pricing and volume data and then modify it based on reward. If you want to learn more go here:

In general, it has to do with what type of neural net you are building. The link suggests sigmoid and softmax activation functions for classification and linear activation functions for regression.

Here is the model that was generated by ChatGPT:

# Define the model architecture
model = Sequential([
    Dense(64, input_shape=(4,), activation='relu'),  # 4 inputs: the state vector
    Dense(64, activation='relu'),
    Dense(3, activation='softmax')  # 3 outputs: buy, sell, hold
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Logically it makes sense because we are initially trying to regress to the stock data and then classify the 3 actions, and then adjust the weights until the 3 actions are profitable.
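
One thing that may be worth checking here, and possibly part of why the model gets stuck: a softmax output is always between 0 and 1 and sums to 1, while a Q-learning target (reward plus discounted next value) can be any number. The NumPy sketch below uses made-up values to show the mismatch. A common alternative in Q-learning setups is activation='linear' on the last layer with loss='mse', though I have not tested that on this model.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax, the same function applied by the final Dense layer
    e = np.exp(x - np.max(x))
    return e / e.sum()

raw_outputs = np.array([5.2, -1.0, 0.3])   # made-up pre-activation values
probs = softmax(raw_outputs)

print(probs, probs.sum())        # each entry is between 0 and 1; the sum is 1

# A Q-style target built from made-up numbers: reward + discount * next value
q_target = 0.5 + 0.2 * 5.2
print(q_target, q_target > probs.max())  # the target is larger than any softmax output can be
```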

Learning Loop

This is where the learning happens and I presume there are better ways to implement this. For this section, I iterated through the training data, and for each timestep, I ran the model prediction, and then based on the action, I made adjustments to the balance and profits and calculated the reward. After this, a new target was calculated from the reward and the predicted Q-values for the next state, and the model was fitted toward it. If you want to learn more about Q-Learning algorithms, this article gives a great explanation:

I will admit that I’m not sure if I’ve correctly implemented it as Python is not something I know very well. 
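
For reference, the Q-learning update that the loop is aiming for can be sketched on its own like this. The q_target_vector helper and the numbers are made up for illustration; the 0.2 discount matches the value used in the loop.

```python
import numpy as np

def q_target_vector(q_values, next_q_values, action_index, reward, discount=0.2):
    # Copy the current predictions so only the chosen action's entry changes
    target_vec = np.array(q_values, dtype=float).copy()
    # Bellman-style target: immediate reward plus discounted best next value
    target_vec[action_index] = reward + discount * np.max(next_q_values)
    return target_vec

q_values = [0.1, 0.4, 0.2]        # made-up predictions for the current state
next_q_values = [0.3, 0.0, 0.5]   # made-up predictions for the next state
target = q_target_vector(q_values, next_q_values, action_index=1, reward=1.0)
print(target)
```

The model is then fitted so that its prediction for the current state moves toward this vector, which only differs from the original prediction at the action that was taken.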

Note: I added some outputs to see how well it’s learning by outputting profit levels and what action it took. This is how I found out that my initial model doesn’t work at all, after which I started tweaking the reward function to get it to try different actions. This worked to an extent and I know my issue is related to the reward function and specifically the weighting of the profit to the other rewards.

# Starting portfolio. These initial values are assumptions: they are not shown
# in the post, but the profit line below implies a 10,000 starting balance.
balance = 10000
owned = 0
quantity = 0
profit = 0
negative_rebuy_reward = 0

for t in range(len(train_data)-1):
    state = get_state(train_data, t)
    q_values = model.predict(np.array([state]))[0]  # predict Q-values for current state
    action_index = np.argmax(q_values)
    action = action_space[action_index]  # select action from action space

    next_state = get_state(train_data, t+1)
    next_q_values = model.predict(np.array([next_state]))[0]  # predict Q-values for next state
    next_action_index = np.argmax(next_q_values)  # best action in the next state

    current_price = train_data.loc[t, 'open']
    next_price = train_data.loc[t+1, 'open']
    if t == 0:
        action_index = 0
        action = action_space[action_index]
        print("Buying because first timestep")
    if action == [1, 0, 0]:  # buy
        if quantity == 0 and balance > 0:
            quantity = balance / current_price  # buy as much as possible with available balance
            balance -= quantity * current_price
            owned += quantity * current_price
        else:
            print("Can't buy because no capital")
            negative_rebuy_reward += -1  # punish trying to buy again while already holding
    elif action == [0, 1, 0] and quantity > 0:  # sell
        balance += quantity * current_price  # sell previously bought quantity at current price
        owned -= quantity * current_price
        profit += balance - 10000  # update profit
        quantity = 0  # reset quantity
    else:  # hold
        pass  # do nothing

    reward = get_reward(profit, negative_rebuy_reward, action, q_values)
    # target = reward + 0.95 * model.predict(next_state)[0][next_action]

    # Q-learning target: reward plus the discounted value of the best next action
    target = reward + 0.2 * next_q_values[next_action_index]

    # Only the entry for the action taken is moved toward the target
    target_vec = q_values.copy()
    target_vec[action_index] = target
    model.fit(np.array([state]), target_vec.reshape(-1, 3), epochs=1, verbose=0)
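
On the randomization: in the code above it lives inside the reward function, but the more common place for it is action selection, usually called epsilon-greedy. Here is a minimal sketch of that idea. The epsilon_greedy helper and the q_values are made up for illustration, and this is a swapped-in technique rather than what the loop above does.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.2, rng=random):
    # With probability epsilon pick a random action index, otherwise the best one
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))

rng = random.Random(0)                       # seeded so runs are repeatable
q_values = np.array([0.1, 0.8, 0.1])         # made-up predictions
picks = [epsilon_greedy(q_values, epsilon=0.2, rng=rng) for _ in range(1000)]

# Count how often each action index was chosen
print({i: picks.count(i) for i in range(3)})
```

Most picks land on the best action, while the rest are spread across all three, which gives the model a chance to see what happens when it tries something else.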

I’m still working on the evaluation of this but I stopped because it wasn’t working very well. If anybody knows why, please leave a comment. If I can solve this, I will follow up with an implementation of the performance evaluation. Once that’s implemented, the next step is to trigger API calls to your favorite brokerage API to perform the buy/sell/hold actions and let the script run on a secure server.
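
Until then, here is a rough sketch of one way the evaluation could look: replay the model’s actions on test prices and compare the final portfolio value against simply buying and holding. The simulate helper, the prices, and the action lists are all made up for illustration, and the all-in/all-out trading rule is an assumption that mirrors the training loop.

```python
def simulate(prices, actions, starting_balance=10_000.0):
    # Replays a list of 'buy'/'sell'/'hold' actions; all-in on buy, all-out on sell
    balance, shares = starting_balance, 0.0
    for price, action in zip(prices, actions):
        if action == "buy" and balance > 0:
            shares, balance = balance / price, 0.0
        elif action == "sell" and shares > 0:
            balance, shares = shares * price, 0.0
    # Value any remaining shares at the last price
    return balance + shares * prices[-1]

prices = [100.0, 110.0, 105.0, 120.0]              # made-up test prices
model_actions = ["buy", "sell", "hold", "hold"]    # stand-in for model output
buy_and_hold = ["buy", "hold", "hold", "hold"]     # baseline to beat

print(simulate(prices, model_actions))
print(simulate(prices, buy_and_hold))
```

If the model cannot beat the buy-and-hold number on data it has never seen, that is a sign the strategy is not adding anything yet.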

What I like about this implementation is that it’s learning like a person would. It tries different things and, depending on the results, it adjusts its approach. This approach isn’t new by any means: there are many tutorials and videos out there about it, so I’m curious if anybody besides the hedge funds has seen any good long-term results. If you have, please comment.

Some people have done all of the hard work of implementing different reinforcement learning algorithms specifically to finance and trading and made it open source! You can find that library here:

I may implement this and see if it produces any good results.

Here are some videos that use reinforcement learning and apply it to trading.

Something closer to what I’ve implemented here:

This also covers similar concepts but shows how they work with formulas:





