Knowledge Tracing

Intermediate

Knowledge tracing (KT) models predict the probability that a learner will answer a question correctly, given their interaction history. They are the core component of adaptive learning systems that decide what to teach next and when to review.

Key Facts

  • KT models take sequences of (question, correct/incorrect) pairs and predict P(correct) for the next question
  • Evolution: BKT (1995) -> DKT (2015) -> SAKT (2019) -> AKT (2020) -> simpleKT (2023) -> frontier models (2025)
  • simpleKT (ICLR 2023) is a simplified transformer baseline that beats most complex models - hard to outperform
  • pyKT is the standard library: 10+ models, 7+ datasets, actively maintained
  • FSRS-7 is the SOTA spaced repetition scheduler - integrated into Anki 23.10+, implementations in JS/Go/Rust/Python
  • The 4-layer adaptive learning architecture (domain model, student model, tutoring model, interface) is the standard framework

Model Evolution

BKT (Bayesian Knowledge Tracing)

Hidden Markov model with per-skill binary mastery state. Four parameters per skill: P(init), P(learn), P(guess), P(slip).

State: UNLEARNED -> LEARNED (transition with P(learn))

Observation model:
  If LEARNED:   P(correct) = 1 - P(slip)
  If UNLEARNED: P(correct) = P(guess)

After each observation, update P(LEARNED) via Bayes' rule.
Mastery threshold: typically P(LEARNED) > 0.95

Limitations: binary mastery (no partial knowledge), independent skills (no transfer), no forgetting.
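The Bayes update and learning transition above can be sketched in a few lines. This is a minimal illustration with made-up parameter values, not a tuned model:

```python
def bkt_update(p_learned, correct, p_learn=0.1, p_guess=0.2, p_slip=0.1):
    """One BKT step: Bayes update on the observation, then the learning transition."""
    if correct:
        # P(LEARNED | correct) via Bayes' rule
        num = p_learned * (1 - p_slip)
        denom = num + (1 - p_learned) * p_guess
    else:
        # P(LEARNED | incorrect)
        num = p_learned * p_slip
        denom = num + (1 - p_learned) * (1 - p_guess)
    posterior = num / denom
    # Learning transition: unlearned probability mass transitions with P(learn)
    return posterior + (1 - posterior) * p_learn

# Starting from P(init) = 0.3, a streak of correct answers drives P(LEARNED)
# past the typical 0.95 mastery threshold
p = 0.3
for _ in range(5):
    p = bkt_update(p, correct=True)
```

Note how a single incorrect answer pulls the estimate back down, but never below what `P(guess)` and `P(slip)` allow the model to explain away.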

DKT (Deep Knowledge Tracing, 2015)

RNN/LSTM over the full interaction sequence. Learns latent knowledge state implicitly.

# DKT input encoding
# Each timestep: one-hot of (question_id, correctness)
# For Q questions: input_dim = 2 * Q
# x_t = one_hot(q_t + Q * a_t) where a_t in {0, 1}

import torch
import torch.nn as nn

class DKT(nn.Module):
    def __init__(self, n_questions, hidden_dim=100):
        super().__init__()
        self.input_dim = 2 * n_questions
        self.rnn = nn.LSTM(self.input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_questions)

    def forward(self, x):
        # x: (batch, seq_len, 2*n_questions)
        h, _ = self.rnn(x)
        # h: (batch, seq_len, hidden_dim)
        return torch.sigmoid(self.fc(h))
        # output: (batch, seq_len, n_questions) = P(correct) per question

Improvement over BKT: captures complex skill dependencies, partial knowledge, and temporal patterns. The trade-off: it acts as a black box with no interpretable knowledge state.

SAKT (Self-Attentive KT, 2019)

Replaces RNN with self-attention. Selectively attends to relevant past interactions rather than compressing everything into a hidden state.

Key idea:
  Query: current question embedding
  Keys/Values: past interaction embeddings (question + response)

  Attention scores = which past interactions are relevant
  to predicting the current question

  Advantage: handles long sequences better than LSTM
  (no vanishing gradient over hundreds of interactions)
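The attention mechanism described above is standard causal scaled dot-product attention. A minimal sketch (function and variable names are illustrative, not SAKT's actual implementation):

```python
import torch
import torch.nn.functional as F

def causal_attention(query, keys, values):
    """Scaled dot-product attention where position t can only attend
    to interactions at positions <= t (no peeking at future answers)."""
    d_k = query.size(-1)
    scores = query @ keys.transpose(-2, -1) / d_k ** 0.5   # (seq, seq)
    # Causal mask: upper triangle (future positions) set to -inf
    seq_len = scores.size(-1)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # rows sum to 1
    return weights @ values, weights

# 10 past interactions, embedding dim 64
q, k, v = (torch.randn(10, 64) for _ in range(3))
out, w = causal_attention(q, k, v)
```

Unlike an LSTM hidden state, the attention weights `w` directly show which past interactions the model considered relevant for each prediction.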

AKT (Attentive KT, 2020)

Adds contextual attention + exponential decay for forgetting. Considers question similarity explicitly.

Innovations over SAKT:
  1. Monotonic attention - context-aware, distinguishes 
     "mastered then forgot" from "never learned"
  2. Rasch model integration - question difficulty as explicit parameter
  3. Exponential forgetting factor - recent interactions weighted higher
  4. Question similarity - attention weighted by how similar 
     past questions are to the current one

simpleKT (ICLR 2023)

Simplified transformer baseline with Rasch model-inspired question-specific difficulty. Hard to beat despite its simplicity.

Architecture:
  1. Question embedding + difficulty parameter (Rasch-like)
  2. Interaction embedding = question_emb + correctness_emb
  3. Standard transformer encoder over interaction sequence
  4. Linear prediction head -> P(correct)

Why it wins: proper question difficulty modeling + clean architecture
eliminates noise that complex models add. 
Essential baseline for any KT research.
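The Rasch-style embedding in step 1 can be sketched as follows: a question is represented as its concept's embedding shifted along a learned variation direction by a scalar difficulty. Class and attribute names here are illustrative, not pyKT's actual API:

```python
import torch
import torch.nn as nn

class RaschQuestionEmbedding(nn.Module):
    """Sketch of Rasch-model-inspired question embedding:
    x_q = concept_emb + mu_q * variation_emb, where mu_q is a
    scalar difficulty learned per question."""
    def __init__(self, n_questions, n_concepts, dim=64):
        super().__init__()
        self.concept_emb = nn.Embedding(n_concepts, dim)
        self.variation_emb = nn.Embedding(n_concepts, dim)
        self.difficulty = nn.Embedding(n_questions, 1)  # scalar mu_q

    def forward(self, q_ids, c_ids):
        # Questions on the same concept differ only by difficulty scaling
        return self.concept_emb(c_ids) + self.difficulty(q_ids) * self.variation_emb(c_ids)

emb = RaschQuestionEmbedding(n_questions=500, n_concepts=50)
x = emb(torch.tensor([3, 7]), torch.tensor([1, 1]))  # two questions, same concept
```

Because difficulty is a single scalar per question rather than a full embedding, the parameterization stays compact and avoids overfitting on questions with few observations.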

2025 Frontier Models

  • DKT2 - xLSTM + IRT output; interpretable deep KT
  • HCGKT - hierarchical graph + contrastive learning; graph convolutions
  • LefoKT - decouples forgetting from relevance; relative forgetting attention
  • UKT - knowledge as probability distributions; Wasserstein attention
  • ReKT - 3D knowledge state (topic/point/domain); FRU (Forget-Response-Update)
  • DyGKT - dynamic prerequisite graphs; GNNs for prerequisite structure

Adaptive Learning Architecture

Standard 4-layer architecture for building adaptive education systems:

Layer 1: Domain Model
  Knowledge graph / concept map / prerequisite structure
  Nodes: concepts/skills
  Edges: prerequisites, similarity, co-occurrence

Layer 2: Student Model
  Per-learner state: proficiency per concept, learning style,
  cognitive load, forgetting curve, behavioral patterns
  Updated after every interaction via KT model

Layer 3: Tutoring Model
  Pedagogical decisions: what next? how to present? when to review?
  Inputs: student model state + domain model structure
  Uses: FSRS for review scheduling, i+1 for difficulty selection

Layer 4: Interface Layer
  Interaction design, exercise rendering, feedback presentation

Spaced Repetition with FSRS-7

FSRS (Free Spaced Repetition Scheduler) is the SOTA algorithm for scheduling reviews. Based on predicted forgetting probability rather than fixed intervals.

FSRS-7 key properties:
  - Predicts P(recall) at any point in time
  - Supports fractional interval lengths
  - 4 core parameters per item: difficulty, stability, 
    elapsed days, scheduled days
  - Backed by academic research, integrated into Anki 23.10+

Implementations: JS, Go, Rust, Python
Benchmark: github.com/open-spaced-repetition/srs-benchmark
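The "predicts P(recall) at any point in time" property comes from a power-law forgetting curve over the item's stability. A sketch using the FSRS-4.5 curve shape (newer FSRS versions fit the decay per user, so treat the constants as illustrative):

```python
def fsrs_retrievability(t_days, stability):
    """Power forgetting curve used by recent FSRS versions:
    R(t, S) = (1 + FACTOR * t / S) ** DECAY.
    FACTOR is chosen so that R(S, S) = 0.9, i.e. stability is the
    interval after which predicted recall drops to 90%.
    Constants here match FSRS-4.5; later versions fit DECAY from review logs."""
    DECAY = -0.5
    FACTOR = 19 / 81
    return (1 + FACTOR * t_days / stability) ** DECAY

# With stability = 10 days, predicted recall after 10 days is exactly 0.9
r = fsrs_retrievability(10, 10)
```

The scheduler inverts this curve: given a desired retention (say 0.9), it solves for the interval `t` at which `R(t, S)` hits that target, which is why fractional interval lengths come for free.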

Student Context Compression

Efficient encoding of student state for LLM-based tutoring systems:

Tier 1 - Always in context (~200-500 tokens):
  Proficiency vector (concept -> score, last 10-20 concepts)
  Active lesson context
  Last 3-5 interaction summaries
  Active misconceptions

Tier 2 - Summarized (~500-1000 tokens):
  Session summaries (last 5 sessions)
  Error pattern clusters
  Learning pace indicators

Tier 3 - Retrieved on demand (RAG):
  Full interaction history
  Detailed error logs
  Historical proficiency curves

Placement in LLM context ("Lost in the Middle" mitigation):
  START: student proficiency data (critical)
  MIDDLE: background summaries (least critical)
  END: recent interactions (high attention)
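Putting the tiers and the placement strategy together, a context assembler might look like this. All field and section names are hypothetical, a sketch of the layout rather than a production prompt builder:

```python
def build_tutor_context(proficiency, summaries, recent_turns):
    """Assemble student context for an LLM prompt, placing critical data
    at the start and end of the context window ('Lost in the Middle'
    mitigation) and background summaries in the middle."""
    start = "## Student proficiency\n" + "\n".join(
        f"- {concept}: {score:.2f}" for concept, score in proficiency.items())
    middle = "## Background\n" + "\n".join(summaries)            # least critical
    end = "## Recent interactions\n" + "\n".join(recent_turns)   # high attention
    return "\n\n".join([start, middle, end])

ctx = build_tutor_context(
    proficiency={"fractions": 0.82, "decimals": 0.41},
    summaries=["Session 12: struggled with decimal division"],
    recent_turns=["Q: 0.6 / 0.2 = ? A: 3 (correct)"],
)
```

Tier 3 content would be fetched by a retrieval step and spliced into the middle section only when the tutoring model requests it.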

pyKT Library

The standard Python library for knowledge tracing research and deployment.

# pyKT: 3-step model training
# Step 1: Prepare dataset
# Supports: ASSISTments, EdNet, Junyi, Statics, NIPS34, etc.

# Step 2: Train model
# python train.py --dataset_name assist2015 --model_name simpleKT \
#   --emb_size 64 --learning_rate 0.001 --seed 42

# Step 3: Evaluate
# Metrics: AUC, accuracy, RMSE
# Cross-validation with temporal split (no future leakage)

Available models: DKT, DKVMN, SAKT, AKT, simpleKT, CL4KT, LPKT, HawkesKT, and more.

Open-Source Platforms

  • adaptive-knowledge-graph - Knowledge Graphs + Local LLMs + Bayesian skill tracking, privacy-first RAG with KG-enhanced retrieval, IRT/BKT mastery modeling
  • OATutor - React + Firebase, BKT for skill mastery, field-tested in classrooms
  • pyKT - standard KT library, 10+ models, 7+ standardized datasets

Gotchas

  • simpleKT is deceptively hard to beat - many papers claim improvements over DKT/SAKT but fail to outperform simpleKT when properly tuned. Always include it as a baseline. The Rasch-inspired question difficulty parameter carries most of the performance
  • Temporal data leakage in evaluation - standard random train/test splits leak future information (student interactions from later sessions appear in training). Always use temporal splits: train on interactions before time T, test on interactions after T
  • Knowledge graph quality bottlenecks everything - the domain model (prerequisite structure) is typically hand-authored and expensive. Errors in prerequisites propagate through the tutoring model. Automated KG construction from exercise data is an active research area but not yet reliable
  • Forgetting is asymmetric - skills learned through effortful recall are forgotten slower than those learned through re-reading or hints. KT models that treat all correct responses equally miss this. AKT and LefoKT partially address this
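The temporal-split gotcha above is easy to get right with a global time cutoff. A minimal sketch (a per-student variant would cut each student's sequence at its own point instead):

```python
import pandas as pd

def temporal_split(df, time_col="timestamp", frac=0.8):
    """Split interactions at a global time cutoff T: train on everything
    at or before T, test on everything after - so no future interactions
    leak into training."""
    cutoff = df[time_col].quantile(frac)
    train = df[df[time_col] <= cutoff]
    test = df[df[time_col] > cutoff]
    return train, test

df = pd.DataFrame({"student":   [1, 1, 2, 2, 2],
                   "timestamp": [10, 20, 5, 30, 40],
                   "correct":   [1, 0, 1, 1, 0]})
train, test = temporal_split(df)
```

Contrast with a random row split, which would happily train on student 2's interaction at t=40 while testing on their interaction at t=5.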

See Also