LEMAS: Multilingual TTS and Speech Editing

Intermediate

Built on 150K+ hours of multilingual audio training data across 10 languages, the LEMAS project provides two complementary models: zero-shot TTS (LEMAS-TTS) and word-level speech editing (LEMAS-Edit). License: CC-BY-4.0.

Models

| Model | Size | Task | Architecture |
| --- | --- | --- | --- |
| LEMAS-TTS | 0.3B | Zero-shot TTS | Flow matching (F5-TTS based) |
| LEMAS-Edit | 0.3B | Speech editing | Flow matching + AR codec (VoiceCraft based) |

Languages

Chinese, English, Spanish, Russian, French, German, Italian, Portuguese, Indonesian, Vietnamese

LEMAS-TTS

Zero-shot voice cloning: given a 5-10 second reference clip, generate speech in any of the 10 supported languages.
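The 5-10 second guidance can be checked before a clip is handed to the model. The following is a minimal sketch using only the Python standard library; it assumes the reference is an uncompressed PCM WAV file, and the function names and the hard rejection outside the range are illustrative choices, not part of LEMAS-TTS.

```python
import wave

# Bounds taken from the 5-10 second guidance above.
MIN_REF_SECONDS = 5.0
MAX_REF_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str) -> float:
    """Return the clip duration, or raise ValueError if it is out of range."""
    duration = wav_duration_seconds(path)
    if not MIN_REF_SECONDS <= duration <= MAX_REF_SECONDS:
        raise ValueError(
            f"reference clip is {duration:.1f}s; expected "
            f"{MIN_REF_SECONDS:.0f}-{MAX_REF_SECONDS:.0f}s"
        )
    return duration
```

Other formats (mp3, flac) would need to be converted first, for example with ffmpeg, since the `wave` module only reads WAV.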

Parameters:

  • nfe_steps - number of function evaluations (speed/quality trade-off)
  • cfg_strength - classifier-free guidance strength
  • speed - output speech rate multiplier
  • UVR5 denoising - optional; improves quality on noisy references
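To make the roles of nfe_steps and cfg_strength concrete, here is a toy sketch of how such parameters typically enter a flow-matching sampler: a fixed-step Euler integration of a velocity field, with classifier-free guidance combining a conditional and an unconditional prediction. The 1-D velocity fields below are stand-ins chosen for illustration; this is not the LEMAS-TTS sampler, and the guidance formula shown is one common form, assumed here rather than taken from the project.

```python
import math
from typing import Callable

Velocity = Callable[[float, float], float]  # (x, t) -> dx/dt


def guided_velocity(cond: Velocity, uncond: Velocity, cfg_strength: float) -> Velocity:
    """Combine conditional and unconditional predictions (classifier-free guidance)."""
    def v(x: float, t: float) -> float:
        c, u = cond(x, t), uncond(x, t)
        return c + cfg_strength * (c - u)
    return v


def euler_sample(velocity: Velocity, x0: float, nfe_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    if nfe_steps < 1:
        raise ValueError("nfe_steps must be >= 1")
    dt = 1.0 / nfe_steps
    x = x0
    for i in range(nfe_steps):
        x += dt * velocity(x, i * dt)  # one velocity evaluation per step
    return x


# Toy fields: the conditional field pulls x toward 1.0, the unconditional toward 0.0.
def cond(x: float, t: float) -> float:
    return 1.0 - x


def uncond(x: float, t: float) -> float:
    return 0.0 - x


if __name__ == "__main__":
    exact = 1.0 - math.exp(-1.0)  # closed-form x(1) for dx/dt = 1 - x, x(0) = 0
    for steps in (4, 32):
        approx = euler_sample(cond, 0.0, steps)
        print(steps, "steps, error:", abs(approx - exact))
```

More steps track the underlying ODE more closely at proportionally higher cost, which is the speed/quality trade-off; a larger cfg_strength pushes the trajectory further toward the conditional prediction.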

# Install
conda create -n lemas python=3.10 && conda activate lemas
sudo apt-get install -y ffmpeg  # or conda install -c conda-forge ffmpeg
git clone https://github.com/LEMAS-Project/LEMAS-TTS.git
cd LEMAS-TTS && pip install -r requirements.txt
# Download model weights from HuggingFace to pretrained_models/

# Launch Gradio UI
export PYTHONPATH="$PWD:${PYTHONPATH}"
python lemas_tts/scripts/inference_gradio.py --host 0.0.0.0 --port 7860 --share
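Before launching the UI, it can help to confirm that the prerequisites from the install steps are in place. The sketch below is a hypothetical preflight helper, not a script shipped with the repository; it only checks for ffmpeg on PATH, a non-empty pretrained_models/ directory, and the launch script path used above.

```python
import shutil
from pathlib import Path


def preflight(repo_root: str = ".") -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(repo_root)
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    weights = root / "pretrained_models"
    if not weights.is_dir() or not any(weights.iterdir()):
        problems.append(f"no model weights under {weights}")
    script = root / "lemas_tts" / "scripts" / "inference_gradio.py"
    if not script.is_file():
        problems.append(f"missing launch script {script}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Run it from the LEMAS-TTS checkout; no output means all three checks passed.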

HuggingFace Space: LEMAS-Project/LEMAS-TTS

LEMAS-Edit

Word-level speech replacement without regenerating the full utterance. Preserves surrounding speech characteristics.

Two backends:

| Backend | Languages | Method |
| --- | --- | --- |
| lemas_tts (flow matching) | 10 languages | Same as TTS |
| lemas_edit (AR codec) | 7 languages | WhisperX + MMS alignment |

The lemas_edit backend provides better naturalness for mid-sentence edits; lemas_tts has broader language coverage.
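The trade-off between the two backends can be written down as a small selection rule. This is a sketch under stated assumptions: `choose_edit_backend` is a hypothetical helper, languages are identified by their English names as listed under Languages, and the seven languages supported by lemas_edit are not listed here, so the caller must supply them.

```python
# The 10 languages listed in the Languages section.
TTS_LANGUAGES = frozenset({
    "Chinese", "English", "Spanish", "Russian", "French",
    "German", "Italian", "Portuguese", "Indonesian", "Vietnamese",
})


def choose_edit_backend(language: str, edit_languages, low_latency: bool = False) -> str:
    """Pick an editing backend name for the given language.

    edit_languages: the languages the lemas_edit backend supports
    (supplied by the caller; not enumerated in this document).
    """
    if language not in TTS_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    # Flow matching is the fallback for coverage and the choice for speed.
    if low_latency or language not in edit_languages:
        return "lemas_tts"
    # Otherwise prefer the AR codec backend for mid-sentence naturalness.
    return "lemas_edit"
```

The rule prefers lemas_edit whenever it covers the language and latency is not a concern, and otherwise falls back to lemas_tts.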

Evaluation: MUSHRA and ABX testing included in ./eval/.

HuggingFace Space: LEMAS-Project/LEMAS-Edit

Upstream Dependencies

F5-TTS, VoiceCraft, Vocos, UVR5, DeepFilterNet, Seamless-Expressive

Open TTS Landscape (April 2026)

| Model | Org | Strengths |
| --- | --- | --- |
| LEMAS-TTS | LEMAS Project | 10 languages, zero-shot, CC-BY-4.0 |
| F5-TTS | - | Base architecture LEMAS builds on |
| CosyVoice 2 | Alibaba | Strong Chinese, multi-speaker |
| Fish Speech | - | Fast inference, good EN/ZH |
| Kokoro TTS | - | High quality EN |
| Orpheus TTS | Canopy AI | Expressive, emotional range |
| Dia | Nari Labs | Dialogue-optimized |

Voice Cloning Pipeline

For continuous audio generation (e.g., learning systems, voice assistants):

# Architecture pattern for background audio generation
# 1. Claude API → lesson text (~2 min of speech)
# 2. LEMAS-TTS → synthesize to mp3 with cloned voice
# 3. Schedule + notification (plyer / win10toast)
# 4. Auto-play in background

# Container: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
# GPU: single modern GPU sufficient for 0.3B model
# Latency: real-time factor < 0.5 on T4 GPU
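The real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so an RTF below 0.5 means a 2-minute lesson should synthesize in under a minute. The sketch below works through that arithmetic and shows one way the pipeline steps might be wired together. The three callables (`generate_text`, `synthesize`, `notify`) are hypothetical stand-ins for the Claude API call, the LEMAS-TTS call, and the notification library; none of them is a real API from those projects.

```python
import time
from typing import Callable


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def max_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Upper bound on synthesis time for a given audio length and RTF."""
    return audio_seconds * rtf


def run_lesson_job(
    generate_text: Callable[[], str],          # step 1: produce lesson text
    synthesize: Callable[[str, str], float],   # step 2: write audio, return its duration (s)
    notify: Callable[[str], object],           # step 3: tell the user it is ready
    out_path: str,
) -> float:
    """Run one text -> audio -> notification cycle and return the measured RTF."""
    text = generate_text()
    start = time.perf_counter()
    audio_seconds = synthesize(text, out_path)
    elapsed = time.perf_counter() - start
    notify(f"Lesson ready: {out_path}")
    return real_time_factor(elapsed, audio_seconds)


if __name__ == "__main__":
    # A 2-minute lesson at RTF 0.5 has a 60-second synthesis budget.
    print(f"{max_synthesis_seconds(2 * 60, 0.5):.0f} s")  # prints "60 s"
```

Returning the measured RTF from each job makes it easy to log whether a given GPU actually stays inside the budget.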

Gotchas

  • NFE steps are fixed per run - the step count is chosen before sampling starts, and LEMAS-TTS doesn't support changing it mid-generation. Set nfe_steps once per run.
  • Reference audio quality directly affects output - noisy or reverberant references produce noisy clones. Run UVR5 denoising on references before use.
  • Language must be set explicitly - the model doesn't auto-detect the language. Specify the target language for every request; mismatched language/text combinations produce degraded output.
  • The lemas_edit backend is slower - autoregressive codec generation takes longer than flow matching. For low-latency applications, use the lemas_tts backend even for editing.

See Also