Predicting emotion from text for TTS

The emotion label is specified explicitly during synthesis

The emotion is predicted from the text by a language model (no human emotion supervision at the synthesis stage)
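
The difference between the two settings above can be sketched in a few lines. This is only an illustration, not a method from any of the cited papers: the names `EMOTIONS`, `predict_emotion`, and `resolve_emotion` are hypothetical, and the keyword lookup is a toy stand-in for a real language-model classifier.

```python
from typing import Callable, Optional

# Hypothetical emotion inventory; a real TTS system defines its own label set.
EMOTIONS = ("neutral", "happy", "sad", "angry")

# Toy keyword lexicon standing in for a language-model-based classifier.
_KEYWORDS = {
    "happy": ("glad", "great", "wonderful"),
    "sad": ("sorry", "miss", "lonely"),
    "angry": ("hate", "furious", "unacceptable"),
}


def predict_emotion(text: str) -> str:
    """Hypothetical predictor: return the emotion with the most keyword hits."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    scores = {emo: sum(w in kws for w in words) for emo, kws in _KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to neutral when no keyword matched at all.
    return best if scores[best] > 0 else "neutral"


def resolve_emotion(
    text: str,
    label: Optional[str] = None,
    predictor: Callable[[str], str] = predict_emotion,
) -> str:
    """Return the emotion used to condition synthesis.

    Setting 1: the caller supplies `label`, which is used as-is.
    Setting 2: no label is given, so the emotion is predicted from the text.
    """
    if label is not None:
        if label not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {label!r}")
        return label
    return predictor(text)


if __name__ == "__main__":
    sentence = "I am so glad you came!"
    print(resolve_emotion(sentence, label="sad"))  # label specified by the user
    print(resolve_emotion(sentence))               # label predicted from text
```

In the second setting the predicted label (or an embedding derived from it) would be passed to the synthesizer in place of a human-provided one; how that conditioning is done depends on the TTS model and is not shown here.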


Sunghee's research blog