Reinforcement Learning¶
Agent learns by interacting with an environment, receiving rewards/penalties, and optimizing a policy to maximize cumulative reward. Unlike supervised learning, there are no labeled examples - the agent discovers optimal behavior through trial and error.
Core Concepts¶
- Agent: the learner/decision-maker
- Environment: everything outside the agent
- State (s): current situation representation
- Action (a): what agent can do
- Reward (r): scalar feedback signal
- Policy (pi): mapping from states to actions
- Value function V(s): expected cumulative reward from state s
- Q-function Q(s,a): expected cumulative reward from taking action a in state s
- Discount factor (gamma): 0-1, how much to weight future vs immediate rewards
Exploration vs exploitation trade-off: agent must balance trying new actions (explore) with using known good actions (exploit). Epsilon-greedy is the simplest strategy - random action with probability epsilon, best known action otherwise.
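As a standalone helper (the `epsilon_greedy` name and signature are ours, not a library API), the strategy is a few lines:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon take a uniformly random action, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

Typically epsilon is annealed from ~1.0 toward a small floor over training, so the agent explores early and exploits late.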
Markov Decision Process (MDP)¶
Formal framework: (S, A, P, R, gamma) where P is transition probability and R is reward function.
Markov property: future depends only on current state, not history. If violated, need to augment state (frame stacking, recurrent networks).
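A minimal frame-stacking sketch, assuming observations are NumPy arrays (this `FrameStack` class is illustrative, not Gym's built-in wrapper):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Restore (approximate) Markovness by feeding the last k observations as the state."""
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Fill the stack with the initial observation
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs):
        # Oldest frame falls off the deque automatically
        self.frames.append(obs)
        return np.stack(self.frames)
```

This is how Atari agents recover velocity information from still images: a single frame shows position, a stack of recent frames shows motion.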
Bellman equation relates value of a state to values of successor states: V(s) = max_a [R(s,a) + gamma * sum P(s'|s,a) * V(s')]
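For a small MDP where P and R are known, the Bellman backup can be iterated directly (value iteration). A sketch assuming dense NumPy arrays for P and R (the `value_iteration` name is ours):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Repeatedly apply the Bellman optimality backup until V stops changing."""
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Convergence is guaranteed for gamma < 1 because the backup is a contraction; the greedy policy with respect to the converged V is optimal.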
Tabular Methods¶
Q-Learning (Off-Policy)¶
Learns Q-values directly. Off-policy: can learn from data generated by any policy.
```python
import numpy as np

# Q-table: states x actions (n_states, n_actions, n_episodes and env assumed defined)
Q = np.zeros((n_states, n_actions))
alpha = 0.1    # learning rate
gamma = 0.99   # discount
epsilon = 0.1  # exploration

for episode in range(n_episodes):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        # Classic Gym API; gymnasium returns (obs, reward, terminated, truncated, info)
        next_state, reward, done, _ = env.step(action)
        # Q-learning update (uses max over next actions, regardless of which action is taken next)
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
```
SARSA (On-Policy)¶
Update uses the action actually taken, not the greedy max:
```python
# SARSA update (epsilon_greedy is an assumed helper returning an action index)
next_action = epsilon_greedy(Q[next_state], epsilon)
Q[state, action] += alpha * (
    reward + gamma * Q[next_state, next_action] - Q[state, action]
)
state, action = next_state, next_action  # carry the chosen action into the next step
```
SARSA is more conservative: its targets include the exploration the policy actually performs, so it learns safer behavior near costly states (the classic cliff-walking example, where SARSA takes the path farther from the cliff).
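The on-policy update is easiest to see end-to-end on a toy problem. A self-contained SARSA run on a hypothetical 4-state chain (environment, hyperparameters, and helper names here are all ours):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = np.zeros((4, 2))  # 4 states x 2 actions

# Toy deterministic chain: actions {0: left, 1: right};
# reaching state 3 pays reward 1 and ends the episode.
def step(state, action):
    next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    return next_state, float(next_state == 3), next_state == 3

def choose(q_row):
    # epsilon-greedy over the 2 actions
    return int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(q_row))

for _ in range(500):
    state, done = 0, False
    action = choose(Q[state])
    while not done:
        next_state, reward, done = step(state, action)
        next_action = choose(Q[next_state])
        # SARSA backup: bootstrap from the action the policy will actually take
        Q[state, action] += alpha * (
            reward + gamma * Q[next_state, next_action] - Q[state, action]
        )
        state, action = next_state, next_action
```

After training, the greedy policy moves right from every non-terminal state, and values decay geometrically with distance from the goal.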
Deep RL¶
When state space is too large for tables (images, continuous), use neural networks to approximate value/policy functions.
Deep Q-Network (DQN)¶
Neural network approximates Q(s,a). Key tricks for stability:
```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, x):
        return self.net(x)

# Two networks: online + target (updated periodically)
online_net = DQN(state_dim, action_dim)
target_net = DQN(state_dim, action_dim)
target_net.load_state_dict(online_net.state_dict())
```
```python
import random
import numpy as np
from collections import deque

# Experience replay buffer of (state, action, reward, next_state, done) tuples
replay_buffer = deque(maxlen=100_000)

# Training step: sample a random mini-batch and regress online Q toward the TD target
batch = random.sample(replay_buffer, batch_size)
states, actions, rewards, next_states, dones = (
    torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch)
)
current_q = online_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
with torch.no_grad():
    next_q = target_net(next_states).max(dim=1).values
    target_q = rewards + gamma * next_q * (1 - dones)  # no bootstrap past terminal states
loss = nn.MSELoss()(current_q, target_q)
```
Experience replay: store transitions, sample random mini-batches. Breaks correlation between consecutive samples.
Target network: separate network for computing targets, updated every N steps. Prevents moving target problem.
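The periodic sync can be a hard copy every N steps, or a Polyak ("soft") average applied every step as in DDPG-style methods. Both are sketched below (the function names are ours):

```python
import torch
import torch.nn as nn

def hard_update(target_net, online_net):
    """Copy online weights into the target wholesale (do this every N steps)."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target (do this every step)."""
    with torch.no_grad():
        for t, o in zip(target_net.parameters(), online_net.parameters()):
            t.mul_(1 - tau).add_(tau * o)
```

Hard updates keep targets frozen between syncs; soft updates change them slowly but continuously. Either way the point is the same: the target does not chase the online network within a single gradient step.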
Policy Gradient Methods¶
Directly optimize the policy, rather than deriving it from a learned value function.
REINFORCE: simplest policy gradient. High variance, slow convergence.
Actor-Critic: actor (policy network) + critic (value network). Critic reduces variance of policy gradient.
PPO (Proximal Policy Optimization): most popular modern algorithm. Clips policy updates to prevent too-large changes.
```python
# PPO clipped objective (pseudocode)
ratio = new_policy(action | state) / old_policy(action | state)
advantage = rewards_to_go - value_estimate
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
```
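The pseudocode above becomes runnable once the policy ratio is expressed through log-probabilities of the taken actions (the `ppo_clip_loss` name is ours; real implementations add entropy and value-loss terms on top):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss from log-probabilities of the actions taken."""
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # min() makes the objective pessimistic: no credit for moving past the clip range
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

When the new and old policies agree, the ratio is 1 and the clip is inactive; the clip only bites when an update tries to move a probability by more than the epsilon band.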
RL for Finance¶
- Portfolio optimization: actions = asset allocations, reward = risk-adjusted returns
- Order execution: minimize market impact when executing large orders
- Market making: set bid/ask spreads, reward = profit minus inventory risk
- Challenge: non-stationary environment, partial observability, low signal-to-noise
Reward Shaping¶
Designing reward functions is the hardest part of RL.
- Sparse rewards: only at episode end (win/lose). Hard to learn - need exploration bonuses
- Dense rewards: continuous feedback. Easier to learn but can cause reward hacking
- Reward shaping: add intermediate rewards without changing optimal policy. Must satisfy potential-based condition
- Curiosity-driven exploration: intrinsic reward for novel states (prediction error as reward)
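Potential-based shaping in code: for a potential function phi over states, the shaping term F(s, s') = gamma * phi(s') - phi(s) provably leaves the optimal policy unchanged (Ng et al., 1999). A sketch with a hypothetical distance-based potential:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Add the potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: states closer to a goal at s = 10 get higher potential
phi = lambda s: -abs(s - 10)
```

Because F telescopes along any trajectory, the shaped and unshaped returns differ only by a policy-independent constant, which is why the optimal policy survives.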
Gotchas¶
- Reward hacking: agent finds unintended shortcuts. If reward is "don't crash" the agent may learn to not move at all. Always validate that the learned behavior matches intent, not just reward signal
- Catastrophic forgetting in DQN: without experience replay, network overfits to recent experience and forgets earlier lessons. Buffer size matters - too small loses diversity, too large wastes memory
- Hyperparameter sensitivity: RL is notoriously sensitive to learning rate, epsilon schedule, gamma, network architecture. Small changes can cause total failure. Always do grid search on simple environments first
- Sample inefficiency: deep RL requires millions of environment interactions. Sim-to-real transfer (train in simulator, deploy in real world) is standard practice but domain gap is a real problem