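This page archives technical talks and slide decks accumulated over time, covering speech synthesis, speech recognition, audio generation, and related areas.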


2026
21
VTP: Towards Scalable Pre-training of Visual Tokenizers for Generation
2026-04-11
Visual Tokenizer, Latent Diffusion
Key works: VTP, Visual Tokenizer, RAE
Slides
20
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
2026-03-29
Speech Synthesis, Diffusion/CFM
Key works: LongCat-AudioDiT, Wav-VAE, APG
Slides
2025
19
Recent Advances in Autoregressive Diffusion Models for Speech Generation
2025-11-03
Speech Synthesis, Diffusion + Autoregressive
Key works: MELA-TTS, VoxCPM, VibeVoice, Ming-UniAudio
Slides
18
[Draft] DiffRO: Differentiable Reward Optimization for LLM based TTS system
2025-08-21
Speech Synthesis, Reinforcement Learning
Key works: CosyVoice, CosyVoice2
Slides
17
[Draft] Rethinking Tortoise TTS Paradigm
2025-06-26
Speech Synthesis, Large TTS Models
Key works: Tortoise-TTS, XTTS, Index-TTS
Slides
16
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
2025-04-14
Speech Synthesis, Diffusion Models
Key works: MegaTTS 3, InspireMusic, RALL-E, VALL-E R, VoiceLDM, PeRFlow
Slides
2024
15
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
2024-09-25
Speech Synthesis, Diffusion Models
Key works: FireRedTTS, Semantic-Aware Speech Tokenizer
Slides
14
Moshi: A Speech-Text Foundation Model for Real-Time Dialogue
2024-09-23
Spoken Dialogue, Full-Duplex Large Models
Key works: Helium, Mimi, Moshi
Slides
13
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
2024-07-09
Speech Synthesis, Large Models
Key works: CosyVoice
Slides
12
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
2024-04-01
Speech Synthesis, Discrete Diffusion
Key works: FACodec, NaturalSpeech 3, MaskGIT, Discrete Diffusion
Slides
11
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
2024-03-06
On-Device Speech Synthesis, Efficient Inference
Key works: MobileSpeech, SoundStorm, Pheme
Slides
2023
10
Recent Advances on Unified Model for Voice and Audio Generation (Part 2)
2023-11-20
Speech and Audio Generation, Diffusion Models
Key works: AudioLDM 2, Stable Diffusion
Slides
9
Recent Advances on Unified Model for Voice and Audio Generation (Part 1)
2023-11-13
Speech and Audio Generation, LLM-Based Models
Key works: SpeechX, Make-A-Voice / MVoice, UniAudio
Slides
8
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
2023-06-01
Speech Synthesis
Key works: w2v-BERT, SoundStream, Spear-TTS
Slides
7
Text-Based Speech Editing From a Perspective of Mask-Based Generative Models
2023-01-12
Speech Editing, Speech Synthesis
Key works: SpeechPainter, A3T, MUSE (text-to-image)
Slides
2022
6
Voice Cloning C: Advanced Methods
2022-11-13
Voice Cloning, Perceiver Adapter
Key works: RetrieverTTS, TTS Adapters
Slides
5
Voice Cloning B: Overview on Transfer Learning, Representation Learning and Meta-Learning
2022-09-15
Voice Cloning, Transfer Learning, Meta-Learning
Key works: ProsodySpeech, StyleSpeech, GST, Meta-TTS
Slides
4
Voice Cloning A: Recent Advances in Adaptive Text-to-Speech
2022-05-26
Voice Cloning, Adaptive TTS, Few-Shot
Key works: AdaSpeech, AdaSpeech2, AdaSpeech3, AdaSpeech4
Slides
2021
3
Lattice: Concepts, Methods and Applications
2021-07-01
Lattice, Language Models, WFST
Key works: WFST, Lattice, Lattice-Transformer
Slides
2020
2
An Overview of Hybrid ASR: Acoustic Models, Language Models, and Post-Processing
2020-09-17
Speech Recognition, Acoustic Models, Language Models
Key works: LF-MMI, TDNN-F, Lattice Rescoring
Slides
1
Personal Work on Speech Recognition (Bachelor/Master)
2020-07-20
Speech Recognition, Speaker Adaptation, Transfer Learning
Key works: Speaker Adaptation, Transfer Learning
Slides