Reinforcement Learning

Agent learns by interacting with an environment, receiving rewards/penalties, and optimizing a policy to maximize cumulative reward. Unlike supervised learning, there are no labeled examples - the agent discovers optimal behavior through trial and error.

Core Concepts

  • Agent: the learner/decision-maker
  • Environment: everything outside the agent
  • State (s): current situation representation
  • Action (a): what agent can do
  • Reward (r): scalar feedback signal
  • Policy (pi): mapping from states to actions
  • Value function V(s): expected cumulative reward from state s
  • Q-function Q(s,a): expected cumulative reward from taking action a in state s
  • Discount factor (gamma): 0-1, how much to weight future vs immediate rewards

Exploration vs exploitation trade-off: agent must balance trying new actions (explore) with using known good actions (exploit). Epsilon-greedy is the simplest strategy - random action with probability epsilon, best known action otherwise.

Markov Decision Process (MDP)

Formal framework: (S, A, P, R, gamma) where P is transition probability and R is reward function.

Markov property: future depends only on current state, not history. If violated, need to augment state (frame stacking, recurrent networks).
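Frame stacking can be sketched with a deque — a hypothetical wrapper that concatenates the last k observations into one augmented state:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Augment the state with the last k observations to restore the Markov property."""
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Fill the stack with the first observation
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames)

    def step(self, obs):
        # Newest observation pushes out the oldest
        self.frames.append(obs)
        return np.concatenate(self.frames)
```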

The Bellman optimality equation relates the value of a state to the values of its successor states: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')]
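Applying this backup repeatedly is value iteration. A sketch on a tiny hypothetical 2-state, 2-action MDP (P and R are made-up illustrative values):

```python
import numpy as np

# P[s, a, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman backup: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    V = np.max(R + gamma * (P @ V), axis=1)
```

Because the backup is a contraction (for gamma < 1), the loop converges to the unique optimal value function regardless of initialization.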

Tabular Methods

Q-Learning (Off-Policy)

Learns Q-values directly. Off-policy: can learn from data generated by any policy.

import numpy as np

# Q-table: states x actions (for a discrete env,
# n_states/n_actions come from env.observation_space.n / env.action_space.n)
Q = np.zeros((n_states, n_actions))
alpha = 0.1   # learning rate
gamma = 0.99  # discount
epsilon = 0.1 # exploration

for episode in range(n_episodes):
    state, _ = env.reset()   # Gymnasium-style reset returns (obs, info)
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

        # Gymnasium-style step returns a 5-tuple
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update (uses max over next actions;
        # don't bootstrap past a terminal state)
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) * (not terminated)
            - Q[state, action]
        )
        state = next_state

SARSA (On-Policy)

Update uses the action actually taken, not the greedy max:

# SARSA update: pick next_action with the same epsilon-greedy rule,
# then bootstrap from the action actually taken
if np.random.random() < epsilon:
    next_action = env.action_space.sample()
else:
    next_action = np.argmax(Q[next_state])
Q[state, action] += alpha * (
    reward + gamma * Q[next_state, next_action] - Q[state, action]
)

SARSA is more conservative - because it bootstraps from the action actually taken, it accounts for its own exploration, so it learns safer policies near hazards (the classic cliff-walking example).

Deep RL

When state space is too large for tables (images, continuous), use neural networks to approximate value/policy functions.

Deep Q-Network (DQN)

Neural network approximates Q(s,a). Key tricks for stability:

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

# Two networks: online + target (updated periodically)
online_net = DQN(state_dim, action_dim)
target_net = DQN(state_dim, action_dim)
target_net.load_state_dict(online_net.state_dict())

# Experience replay buffer (stores transitions: state, action, reward, next_state, done)
import random
from collections import deque
replay_buffer = deque(maxlen=100000)

# Training step: sample a random mini-batch and batch it into tensors
import numpy as np
batch = random.sample(replay_buffer, batch_size)
states, actions, rewards, next_states, dones = zip(*batch)
states = torch.as_tensor(np.array(states), dtype=torch.float32)
actions = torch.as_tensor(actions, dtype=torch.int64)
rewards = torch.as_tensor(rewards, dtype=torch.float32)
next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
dones = torch.as_tensor(dones, dtype=torch.float32)

current_q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    next_q = target_net(next_states).max(1)[0]
    target_q = rewards + gamma * next_q * (1 - dones)

loss = nn.MSELoss()(current_q, target_q)

Experience replay: store transitions, sample random mini-batches. Breaks correlation between consecutive samples.

Target network: separate network for computing targets, updated every N steps. Prevents moving target problem.
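Target updates come in two flavors: periodic hard copies, or soft (Polyak) averaging on every step as in DDPG-style methods. A sketch with small stand-in networks (sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Stand-in networks (hypothetical sizes)
online = nn.Linear(4, 2)
target = nn.Linear(4, 2)

# Hard update: copy the online weights wholesale every N steps
target.load_state_dict(online.state_dict())

# Soft (Polyak) update: blend a small fraction tau of the online
# weights into the target on every step
tau = 0.005
with torch.no_grad():
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(1 - tau).add_(tau * p_o)
```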

Policy Gradient Methods

Directly optimize the policy without learning value function.

REINFORCE: simplest policy gradient. High variance, slow convergence.

Actor-Critic: actor (policy network) + critic (value network). Critic reduces variance of policy gradient.

PPO (Proximal Policy Optimization): most popular modern algorithm. Clips policy updates to prevent too-large changes.

# PPO clipped objective (pseudocode)
ratio = new_policy(action|state) / old_policy(action|state)
advantage = rewards_to_go - value_estimate

clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
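In practice the ratio is computed from stored log-probabilities for numerical stability. A runnable version of the clipped objective, with hypothetical batch values standing in for real rollout data:

```python
import torch

# Hypothetical batch: log-probs under old/new policy, advantage estimates
old_log_probs = torch.tensor([-1.2, -0.7, -2.0])
new_log_probs = torch.tensor([-1.0, -0.9, -1.5])
advantages = torch.tensor([0.5, -0.3, 1.0])
clip_eps = 0.2

ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
```

Taking the element-wise min makes the objective pessimistic: the clip only removes the incentive to move the policy further than epsilon in the direction the advantage favors.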

RL for Finance

  • Portfolio optimization: actions = asset allocations, reward = risk-adjusted returns
  • Order execution: minimize market impact when executing large orders
  • Market making: set bid/ask spreads, reward = profit minus inventory risk
  • Challenge: non-stationary environment, partial observability, low signal-to-noise
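As a sketch of reward design in this setting, here is a hypothetical per-step reward that trades the period return against recent volatility (the function name and risk_aversion parameter are illustrative, not a standard formulation):

```python
import numpy as np

def step_reward(portfolio_return, recent_returns, risk_aversion=0.1):
    """Hypothetical risk-adjusted reward: period return minus a volatility penalty."""
    vol = np.std(recent_returns) if len(recent_returns) > 1 else 0.0
    return portfolio_return - risk_aversion * vol
```

The same return is rewarded less when it arrives amid high volatility, pushing the agent toward risk-adjusted rather than raw returns.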

Reward Shaping

Designing reward functions is the hardest part of RL.

  • Sparse rewards: only at episode end (win/lose). Hard to learn - need exploration bonuses
  • Dense rewards: continuous feedback. Easier to learn but can cause reward hacking
  • Reward shaping: add intermediate rewards without changing optimal policy. Must satisfy potential-based condition
  • Curiosity-driven exploration: intrinsic reward for novel states (prediction error as reward)
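Potential-based shaping follows directly from its definition F(s, s') = gamma * phi(s') - phi(s); the potential function below is a hypothetical distance-to-goal for illustration:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: add F(s, s') = gamma*phi(s') - phi(s) to the reward.
    This form provably leaves the optimal policy unchanged (Ng et al., 1999)."""
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: negative distance to a goal at state 10
def phi(s):
    return -abs(s - 10)
```

Moving toward the goal earns a positive shaping bonus, moving away a negative one, while the telescoping sum of F over any trajectory depends only on its endpoints.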

Gotchas

  • Reward hacking: agent finds unintended shortcuts. If reward is "don't crash" the agent may learn to not move at all. Always validate that the learned behavior matches intent, not just reward signal
  • Catastrophic forgetting in DQN: without experience replay, network overfits to recent experience and forgets earlier lessons. Buffer size matters - too small loses diversity, too large wastes memory
  • Hyperparameter sensitivity: RL is notoriously sensitive to learning rate, epsilon schedule, gamma, network architecture. Small changes can cause total failure. Always do grid search on simple environments first
  • Sample inefficiency: deep RL requires millions of environment interactions. Sim-to-real transfer (train in simulator, deploy in real world) is standard practice but domain gap is a real problem

See Also