Skip to content

Gradio for LLM Interfaces

Gradio provides rapid prototyping of web UIs for LLM applications. From simple chat interfaces to multi-model comparison dashboards, it handles streaming, markdown rendering, and model switching with minimal code.

Minimal Chat Interface

import gradio as gr
from openai import OpenAI

client = OpenAI()

def chat(message, history):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content

demo = gr.ChatInterface(fn=chat)
demo.launch()

Streaming Responses

Gradio detects generator functions and automatically renders streaming typewriter-style output.

def stream_gpt(message, history):
    messages = [{"role": "user", "content": message}]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True
    )

    result = ""
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            result += content
            yield result  # MUST yield cumulative result, not individual chunks

demo = gr.ChatInterface(fn=stream_gpt)
demo.launch()

Critical: yield the cumulative string, not individual chunks. If you yield only the current chunk, Gradio replaces the previous text instead of appending.

Markdown Rendering

Replace gr.Textbox output with gr.Markdown for formatted responses:

def ask_llm(question):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Respond in well-formatted Markdown."},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

demo = gr.Interface(fn=ask_llm, inputs="text", outputs=gr.Markdown())
demo.launch()

Multi-Model Comparison

import anthropic

anthropic_client = anthropic.Anthropic()

def stream_claude(message, history):
    result = ""
    with anthropic_client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": message}]
    ) as stream:
        for text in stream.text_stream:
            result += text
            yield result

def chat_with_model(message, history, model_choice):
    if model_choice == "GPT-4o":
        yield from stream_gpt(message, history)
    else:
        yield from stream_claude(message, history)

demo = gr.ChatInterface(
    fn=chat_with_model,
    additional_inputs=[
        gr.Dropdown(["GPT-4o", "Claude"], label="Model", value="GPT-4o")
    ]
)
demo.launch()

Key Differences: OpenAI vs Anthropic Streaming

Feature OpenAI Anthropic
Stream parameter stream=True in create() Use .stream() instead of .create()
Max tokens Optional (has default) Required parameter
System message In messages list Separate system= parameter
Chunk access chunk.choices[0].delta.content Context manager + .text_stream

Advanced: Log Viewer and Plots

import gradio as gr

with gr.Blocks() as demo:
    with gr.Row():
        chatbot = gr.Chatbot()
        log_output = gr.Textbox(label="Agent Logs", lines=10)

    with gr.Row():
        msg = gr.Textbox(label="Message")
        send = gr.Button("Send")

    # Optional: add plots, tables, images
    plot = gr.Plot(label="Vector Space")
    table = gr.Dataframe(label="Results")

    send.click(fn=process, inputs=[msg], outputs=[chatbot, log_output, table])

Gotchas

  • Streaming must yield cumulative text. Yielding individual chunks causes flickering - each yield replaces the entire output, so you must yield the full text so far.
  • Gradio auto-detects generators. If your function uses yield, Gradio treats it as streaming. If it uses return, it waits for the full response. No configuration needed.
  • Long-running operations blank interactive components. While a Gradio callback is running, other components may not update. For agent workflows with multiple stages, use background threads or async to keep the UI responsive.
  • Anthropic max_tokens is required. Unlike OpenAI which defaults to a reasonable max, Anthropic raises an error if max_tokens is not explicitly set.

Cross-References