Deep Voice 3

hash2430/dv3_world (github.com)

r9y9's Deep Voice 3 implementation, modified to support the WORLD vocoder.

hash2430/Neural-voice-cloning (https://github.com/hash2430/Neural-voice-cloning)

This repo started from my dv3_world to implement the 'speaker-encoder' approach of 'Neural voice cloning using a few samples'.

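Features:
- A converter follows the encoder-attention-decoder stack. The converter module is implemented so that different vocoders can be supported by switching converters.
- RNN removal: to speed up training, RNNs are replaced with CNNs plus positional encoding.
  - Training: the CNN is trained on the ground-truth mel spectrogram in parallel across time steps.
  - Inference: the CNN is applied locally. A buffer stores the previous decoder outputs, the network computes the current decoder output from them, and that output is written back to the buffer, so inference runs autoregressively.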
- Stacked convolutional block for encoder
  - Conv blocks that use gated linear units as the activation are stacked repeatedly.
- Hierarchical (scale-out) decoder 구조
  - Multiple prenet blocks and decoder blocks are used repeatedly.
- Mixed representation of phoneme and grapheme
  - Even for words that the lexicon can convert to phonemes, a fixed fraction is deliberately kept as graphemes.
  - As a result, the model is trained to be robust enough to read out-of-lexicon words at synthesis time.
- Repeated conditioning on the speaker embedding across multiple layers
  - Unlike other speaker-aware TTS models that append the speaker embedding to the text embedding at every time step, this model conditions on the speaker embedding repeatedly, in several layers of the encoder, decoder prenet, and decoder.
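
To make the training/inference split from the RNN-removal item concrete, here is a minimal sketch, not the actual Deep Voice 3 decoder. It assumes scalar "frames" and a single linear causal convolution (no gated linear units, attention, or positional encoding), and the names `causal_conv_parallel`, `IncrementalConv`, and `generate` are made up for illustration. It shows that a buffered step-by-step pass gives the same outputs as the parallel pass over known inputs, and how the same buffer can then be fed its own outputs.

```python
from collections import deque


def causal_conv_parallel(inputs, weights):
    """Training-style pass: all inputs are known, so every step is computed at once.

    Output t only sees inputs[t-k+1 .. t]; earlier positions are zero-padded.
    """
    k = len(weights)
    padded = [0.0] * (k - 1) + list(inputs)
    return [
        sum(w * x for w, x in zip(weights, padded[t:t + k]))
        for t in range(len(inputs))
    ]


class IncrementalConv:
    """Inference-style pass: a fixed-size buffer holds the last k inputs."""

    def __init__(self, weights):
        self.weights = list(weights)
        # Start with zeros so the first steps match the zero padding above.
        self.buffer = deque([0.0] * len(weights), maxlen=len(weights))

    def step(self, x):
        self.buffer.append(x)  # the oldest entry is dropped automatically
        return sum(w * v for w, v in zip(self.weights, self.buffer))


def generate(weights, first_input, n_steps):
    """Autoregressive loop: each output is fed back as the next input."""
    conv = IncrementalConv(weights)
    x, outputs = first_input, []
    for _ in range(n_steps):
        y = conv.step(x)
        outputs.append(y)
        x = y
    return outputs


if __name__ == "__main__":
    toy_weights = [0.5, 0.25, 0.25]       # hypothetical kernel of width 3
    toy_frames = [1.0, 2.0, 3.0, 4.0]     # hypothetical ground-truth inputs
    conv = IncrementalConv(toy_weights)
    stepwise = [conv.step(x) for x in toy_frames]
    print(stepwise == causal_conv_parallel(toy_frames, toy_weights))
    print(generate(toy_weights, 1.0, 3))
```

The buffer only ever holds as many entries as the kernel is wide, which is why step-by-step inference does not need to recompute the whole sequence.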
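
The mixed phoneme/grapheme representation from the list above can be sketched roughly as follows. This is only an illustration under assumptions: the toy lexicon, the function name `mixed_representation`, and the per-word random choice controlled by `grapheme_prob` are all hypothetical, and the actual implementation may sample differently.

```python
import random

# Hypothetical toy lexicon (ARPAbet-style phonemes), for illustration only.
TOY_LEXICON = {
    "deep": ["D", "IY1", "P"],
    "voice": ["V", "OY1", "S"],
}


def mixed_representation(words, lexicon, grapheme_prob, rng):
    """Return one token list per word: phonemes or characters.

    A word found in the lexicon is still kept as graphemes with probability
    grapheme_prob, so the model also sees character input during training.
    Out-of-lexicon words always fall back to graphemes.
    """
    tokens = []
    for word in words:
        phonemes = lexicon.get(word.lower())
        if phonemes is not None and rng.random() >= grapheme_prob:
            tokens.append(list(phonemes))
        else:
            tokens.append(list(word))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    sentence = ["deep", "voice", "tistory"]
    print(mixed_representation(sentence, TOY_LEXICON, 0.3, rng))
```

With `grapheme_prob` set to 0.0 every in-lexicon word becomes phonemes, and with 1.0 everything stays as characters; values in between give the mix described above.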
Performance (naturalness MOS)
- (I was quite surprised that no similarity MOS is reported)
- Single speaker with WORLD vocoder: 3.63
- Multi speaker with WORLD vocoder: 3.44


Sunghee's research blog
