
Reinforcement Learning for Trading Bots: Concepts and Pitfalls



Framing Trading as a Markov Decision Process

Reinforcement learning requires you to define your problem as a Markov Decision Process: states, actions, rewards, and transitions. For a forex trading agent, the state might include a window of recent OHLCV bars, technical indicators, and the agent's current position (long, short, or flat). Actions can be discrete (buy, sell, or hold) or continuous, encoding position size as a fraction of the portfolio. The reward is where things get subtle: naive implementations use raw PnL per step, which produces agents that take massive risks for small expected gains, because per-step PnL is noisy and dominated by market randomness rather than by the quality of the agent's decisions.

A better reward design incorporates risk-adjusted returns. Shaping the reward as the Sharpe ratio over a rolling window encourages the agent to balance returns against volatility, producing strategies that are actually deployable. Some practitioners add a penalty term for excessive trading frequency to discourage churning that bleeds returns away in transaction costs. Getting the reward function right is the single most important design decision in an RL trading system, more important than the choice of algorithm or network architecture.
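One way to sketch this shaping, a rolling-window Sharpe ratio minus a per-trade penalty, is shown below. The class name, window length, and penalty value are illustrative assumptions, not tuned constants from any production system:

```python
import numpy as np
from collections import deque

class SharpeReward:
    """Rolling-window Sharpe-ratio reward with a trade-frequency penalty.

    window and trade_penalty are illustrative values; in practice both
    would be tuned against the instrument and bar frequency.
    """

    def __init__(self, window=100, trade_penalty=0.05):
        self.returns = deque(maxlen=window)  # drops oldest return automatically
        self.trade_penalty = trade_penalty

    def __call__(self, step_return, traded):
        self.returns.append(step_return)
        r = np.asarray(self.returns)
        # Sharpe over the trailing window; epsilon guards the flat-returns case
        sharpe = r.mean() / (r.std() + 1e-8)
        # Charging a fixed penalty per trade discourages churning
        return sharpe - (self.trade_penalty if traded else 0.0)
```

Because the Sharpe term is computed over a window rather than a single step, the agent is rewarded for consistency, not for one lucky bar.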

Building the Simulation Environment

The environment is your backtesting engine, and it must be realistic. Slippage, spread, and transaction costs are not optional details — omitting them is the fastest path to a strategy that looks brilliant in simulation and loses money immediately in live trading. Implement a custom gym.Env that replays historical tick or minute data, applies realistic fill assumptions, and tracks portfolio value with compounding. Use multiple currency pairs simultaneously to prevent the agent from overfitting to the idiosyncrasies of a single instrument.

import gymnasium as gym
import numpy as np

class ForexEnv(gym.Env):
    def __init__(self, df, window=50, spread_pips=1.5, lot_size=10_000):
        super().__init__()
        self.df     = df.reset_index(drop=True)
        self.window = window
        self.spread = spread_pips * 1e-4  # pip value for 4-decimal pairs
        self.lot    = lot_size
        # gymnasium requires these exact attribute names
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(window, 5), dtype=np.float32)
        self.action_space      = gym.spaces.Discrete(3)  # 0=hold, 1=buy, 2=sell

    def _get_obs(self):
        # z-score the trailing window so observations are scale-free
        frame = self.df.iloc[self.t - self.window: self.t][["open", "high", "low", "close", "volume"]]
        return ((frame - frame.mean()) / (frame.std() + 1e-8)).values.astype(np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random for reproducibility
        self.t        = self.window  # named t, not step, to avoid shadowing gym.Env.step()
        self.position = 0
        self.equity   = 10_000.0
        return self._get_obs(), {}
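The environment above omits its step() method. The core of that transition, applying a discrete action to the current position and computing spread-adjusted PnL, can be sketched as a standalone function. The name step_reward and its simplification that any position change pays the spread exactly once are assumptions for illustration:

```python
def step_reward(position, action, prev_close, close, spread, lot):
    """One transition of a discrete-action forex agent.

    position: -1 (short), 0 (flat), or +1 (long)
    action:   0=hold, 1=buy, 2=sell
    Simplification: any change of position pays the spread once.
    Returns (new_position, reward).
    """
    target = {0: position, 1: 1, 2: -1}[action]
    cost = spread * lot if target != position else 0.0
    # Mark-to-market PnL on the position held during this bar
    pnl = position * (close - prev_close) * lot
    return target, pnl - cost
```

Inside step(), the return value would feed the reward, self.position would be updated to the new target, and self.t advanced by one bar.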

Why Most RL Trading Projects Fail

The most common failure mode is overfitting to a specific market regime. An agent trained exclusively on the 2020–2021 bull market learns to buy every dip and never encounters a prolonged downtrend; it will be destroyed in a bear market. Walk-forward validation across multiple market regimes (trending, ranging, high-volatility, low-volatility) is essential.

The second failure mode is ignoring non-stationarity: financial time series distributions shift over time as market microstructure evolves, new participants enter, and macroeconomic regimes change. An RL agent needs periodic retraining or an online learning component to remain relevant. Treat any backtested Sharpe ratio above 2.0 with deep suspicion until you have validated it on truly out-of-sample data.
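Walk-forward validation differs from ordinary cross-validation in that the test window always follows the training window in time, so no future data leaks into training. A minimal generator for such splits (the function name and default stride are illustrative):

```python
def walk_forward_splits(n, train_size, test_size, stride=None):
    """Yield (train_idx, test_idx) windows that roll forward through time.

    n is the total number of bars; stride defaults to test_size so
    consecutive test windows do not overlap.
    """
    stride = stride or test_size
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += stride
```

Training on each train window and evaluating only on the following test window gives one out-of-sample score per regime slice; aggregating those scores is a far more honest estimate than a single full-history backtest.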