Reinforcement Learning

Agent learns by interacting with an environment, receiving rewards/penalties, and optimizing a policy to maximize cumulative reward. Unlike supervised learning, there are no labeled examples - the agent discovers optimal behavior through trial and error.

Core Concepts

  • Agent: the learner/decision-maker
  • Environment: everything outside the agent
  • State (s): current situation representation
  • Action (a): what agent can do
  • Reward (r): scalar feedback signal
  • Policy (pi): mapping from states to actions
  • Value function V(s): expected cumulative reward from state s
  • Q-function Q(s,a): expected cumulative reward from taking action a in state s
  • Discount factor (gamma): 0-1, how much to weight future vs immediate rewards

Exploration vs exploitation trade-off: agent must balance trying new actions (explore) with using known good actions (exploit). Epsilon-greedy is the simplest strategy - random action with probability epsilon, best known action otherwise.

Markov Decision Process (MDP)

Formal framework: (S, A, P, R, gamma) where P is transition probability and R is reward function.

Markov property: future depends only on current state, not history. If violated, need to augment state (frame stacking, recurrent networks).
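Frame stacking can be sketched with a deque — a hypothetical wrapper that concatenates the last k observations into one augmented state:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Augment the state with the last k observations to restore the Markov property."""
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Fill the stack with the first observation
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames)

    def step(self, obs):
        # Newest observation pushes out the oldest
        self.frames.append(obs)
        return np.concatenate(self.frames)
```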

The Bellman optimality equation relates the value of a state to the values of its successor states: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')]
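Applying this backup repeatedly is value iteration. A sketch on a tiny hypothetical 2-state, 2-action MDP (P and R are made-up illustrative values):

```python
import numpy as np

# P[s, a, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman backup: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    V = np.max(R + gamma * (P @ V), axis=1)
```

Because the backup is a contraction (for gamma < 1), the loop converges to the unique optimal value function regardless of initialization.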

Tabular Methods

Q-Learning (Off-Policy)

Learns Q-values directly. Off-policy: can learn from data generated by any policy.

import numpy as np

# Q-table: states x actions (for a discrete env,
# n_states/n_actions come from env.observation_space.n / env.action_space.n)
Q = np.zeros((n_states, n_actions))
alpha = 0.1   # learning rate
gamma = 0.99  # discount
epsilon = 0.1 # exploration

for episode in range(n_episodes):
    state, _ = env.reset()   # Gymnasium-style reset returns (obs, info)
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

        # Gymnasium-style step returns a 5-tuple
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update (uses max over next actions;
        # don't bootstrap past a terminal state)
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) * (not terminated)
            - Q[state, action]
        )
        state = next_state

SARSA (On-Policy)

Update uses the action actually taken, not the greedy max:

# SARSA update: pick next_action with the same epsilon-greedy rule,
# then bootstrap from the action actually taken
if np.random.random() < epsilon:
    next_action = env.action_space.sample()
else:
    next_action = np.argmax(Q[next_state])
Q[state, action] += alpha * (
    reward + gamma * Q[next_state, next_action] - Q[state, action]
)

SARSA is more conservative - because it bootstraps from the action actually taken, it accounts for its own exploration, so it learns safer policies near hazards (the classic cliff-walking example).

Deep RL

When state space is too large for tables (images, continuous), use neural networks to approximate value/policy functions.

Deep Q-Network (DQN)

Neural network approximates Q(s,a). Key tricks for stability:

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

# Two networks: online + target (updated periodically)
online_net = DQN(state_dim, action_dim)
target_net = DQN(state_dim, action_dim)
target_net.load_state_dict(online_net.state_dict())

# Experience replay buffer (stores transitions: state, action, reward, next_state, done)
import random
from collections import deque
replay_buffer = deque(maxlen=100000)

# Training step: sample a random mini-batch and batch it into tensors
import numpy as np
batch = random.sample(replay_buffer, batch_size)
states, actions, rewards, next_states, dones = zip(*batch)
states = torch.as_tensor(np.array(states), dtype=torch.float32)
actions = torch.as_tensor(actions, dtype=torch.int64)
rewards = torch.as_tensor(rewards, dtype=torch.float32)
next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
dones = torch.as_tensor(dones, dtype=torch.float32)

current_q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    next_q = target_net(next_states).max(1)[0]
    target_q = rewards + gamma * next_q * (1 - dones)

loss = nn.MSELoss()(current_q, target_q)

Experience replay: store transitions, sample random mini-batches. Breaks correlation between consecutive samples.

Target network: separate network for computing targets, updated every N steps. Prevents moving target problem.
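Target updates come in two flavors: periodic hard copies, or soft (Polyak) averaging on every step as in DDPG-style methods. A sketch with small stand-in networks (sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Stand-in networks (hypothetical sizes)
online = nn.Linear(4, 2)
target = nn.Linear(4, 2)

# Hard update: copy the online weights wholesale every N steps
target.load_state_dict(online.state_dict())

# Soft (Polyak) update: blend a small fraction tau of the online
# weights into the target on every step
tau = 0.005
with torch.no_grad():
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(1 - tau).add_(tau * p_o)
```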

Policy Gradient Methods

Directly optimize the policy without learning value function.

REINFORCE: simplest policy gradient. High variance, slow convergence.

Actor-Critic: actor (policy network) + critic (value network). Critic reduces variance of policy gradient.

PPO (Proximal Policy Optimization): most popular modern algorithm. Clips policy updates to prevent too-large changes.

# PPO clipped objective (pseudocode)
ratio = new_policy(action|state) / old_policy(action|state)
advantage = rewards_to_go - value_estimate

clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
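In practice the ratio is computed from stored log-probabilities for numerical stability. A runnable version of the clipped objective, with hypothetical batch values standing in for real rollout data:

```python
import torch

# Hypothetical batch: log-probs under old/new policy, advantage estimates
old_log_probs = torch.tensor([-1.2, -0.7, -2.0])
new_log_probs = torch.tensor([-1.0, -0.9, -1.5])
advantages = torch.tensor([0.5, -0.3, 1.0])
clip_eps = 0.2

ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
```

Taking the element-wise min makes the objective pessimistic: the clip only removes the incentive to move the policy further than epsilon in the direction the advantage favors.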

RL for Finance

  • Portfolio optimization: actions = asset allocations, reward = risk-adjusted returns
  • Order execution: minimize market impact when executing large orders
  • Market making: set bid/ask spreads, reward = profit minus inventory risk
  • Challenge: non-stationary environment, partial observability, low signal-to-noise
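As a sketch of reward design in this setting, here is a hypothetical per-step reward that trades the period return against recent volatility (the function name and risk_aversion parameter are illustrative, not a standard formulation):

```python
import numpy as np

def step_reward(portfolio_return, recent_returns, risk_aversion=0.1):
    """Hypothetical risk-adjusted reward: period return minus a volatility penalty."""
    vol = np.std(recent_returns) if len(recent_returns) > 1 else 0.0
    return portfolio_return - risk_aversion * vol
```

The same return is rewarded less when it arrives amid high volatility, pushing the agent toward risk-adjusted rather than raw returns.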

Reward Shaping

Designing reward functions is the hardest part of RL.

  • Sparse rewards: only at episode end (win/lose). Hard to learn - need exploration bonuses
  • Dense rewards: continuous feedback. Easier to learn but can cause reward hacking
  • Reward shaping: add intermediate rewards without changing optimal policy. Must satisfy potential-based condition
  • Curiosity-driven exploration: intrinsic reward for novel states (prediction error as reward)
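Potential-based shaping follows directly from its definition F(s, s') = gamma * phi(s') - phi(s); the potential function below is a hypothetical distance-to-goal for illustration:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: add F(s, s') = gamma*phi(s') - phi(s) to the reward.
    This form provably leaves the optimal policy unchanged (Ng et al., 1999)."""
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: negative distance to a goal at state 10
def phi(s):
    return -abs(s - 10)
```

Moving toward the goal earns a positive shaping bonus, moving away a negative one, while the telescoping sum of F over any trajectory depends only on its endpoints.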

Gotchas

  • Reward hacking: agent finds unintended shortcuts. If reward is "don't crash" the agent may learn to not move at all. Always validate that the learned behavior matches intent, not just reward signal
  • Catastrophic forgetting in DQN: without experience replay, network overfits to recent experience and forgets earlier lessons. Buffer size matters - too small loses diversity, too large wastes memory
  • Hyperparameter sensitivity: RL is notoriously sensitive to learning rate, epsilon schedule, gamma, network architecture. Small changes can cause total failure. Always do grid search on simple environments first
  • Sample inefficiency: deep RL requires millions of environment interactions. Sim-to-real transfer (train in simulator, deploy in real world) is standard practice but domain gap is a real problem

See Also