Tacotron2

정 성 희 2020. 1. 17. 14:14

What stays the same as Tacotron: Tacotron2 is still built from an encoder prenet, encoder, attention, decoder prenet, and decoder, so the overall architecture is unchanged. That overall architecture is as follows.
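- Decoder prenet: two fully connected layers (with ReLU) that act as a projection. They map the decoder output, which lives in acoustic space, into a common space where attention can compare it with the encoder output, which lives in text space.
- Encoder prenet: a 3-layer convolutional network over the character input. Each layer summarizes neighboring characters within its kernel width, so it extracts features comparable to the n-grams used in traditional TTS and ASR.
- Encoder: converts the character input into a text embedding.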
- Decoder: alignment against the encoder states determines which character's speech is synthesized at each decoder time step. The decoder has two outputs, the mel spectrogram and the stop token, obtained by linearly projecting the decoder output to 80 dimensions and 1 dimension respectively.
  - Mel spectrogram: the mel spectrogram of the current time step passes through the decoder prenet and becomes the decoder input at the next time step. Using the previous step's output as the current step's input is what 'auto-regressive' means. Tacotron can be configured to predict several time steps at once: when it predicts r frames in one step, only the last of those frames is used as the autoregressive input.
  - Stop token: treated as a 0/1 boolean; when stop token = 1 is predicted, synthesis stops. The model actually outputs a probability through a sigmoid, which is then thresholded (at 0.5 in the Tacotron2 paper).
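The autoregressive loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `step` is a hypothetical stand-in for the whole prenet + attention + LSTM + projection stack, the all-zero first input frame and the `max_steps` guard are choices made for this sketch, and `make_toy_step` is a dummy model that exists only so the loop can run.

```python
from typing import Callable, List, Tuple

Frame = List[float]  # one mel frame (80 values in the post; any length works here)


def autoregressive_decode(
    step: Callable[[Frame], Tuple[List[Frame], float]],
    n_mels: int = 80,
    max_steps: int = 1000,
    stop_threshold: float = 0.5,
) -> List[Frame]:
    """Call `step` repeatedly until it signals stop.

    `step` (hypothetical) takes the previous frame and returns
    (r predicted frames, stop probability).
    """
    prev = [0.0] * n_mels            # assumed all-zero frame for the first step
    frames: List[Frame] = []
    for _ in range(max_steps):       # safety guard in case stop is never predicted
        new_frames, stop_prob = step(prev)
        frames.extend(new_frames)    # all r frames are kept as output
        prev = new_frames[-1]        # only the last of the r frames is fed back
        if stop_prob > stop_threshold:
            break
    return frames


def make_toy_step(r: int, total_steps: int, n_mels: int):
    """Dummy model: each frame is the previous value + 1; stops after `total_steps` calls."""
    calls = {"n": 0}

    def step(prev: Frame) -> Tuple[List[Frame], float]:
        calls["n"] += 1
        base = prev[0]
        out = [[base + i + 1.0] * n_mels for i in range(r)]
        stop_prob = 1.0 if calls["n"] >= total_steps else 0.0
        return out, stop_prob

    return step


mel = autoregressive_decode(make_toy_step(r=2, total_steps=3, n_mels=4), n_mels=4)
print(len(mel), [frame[0] for frame in mel])
```

With r = 2 and three decoder steps, the loop emits six frames, and each step continues from the last frame of the previous step rather than from both frames.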
What changed from Tacotron
- Network types used in the encoder and decoder:
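  - Tacotron's convolution bank + highway network + GRU (CBHG) module was replaced by a comparatively simple LSTM. The convolution bank comes from character-level neural machine translation work co-authored by Kyunghyun Cho (조경현), who also co-authored the paper that introduced seq2seq attention alignment. Character input avoids the OOV problem of word-level input, but it is not obvious how to cut a character sequence into meaningful units. That work therefore applies convolution filters of different widths, covering 1 to 8 characters, and aggregates the results into a text embedding. This is the 'CB' in CBHG. Phonemes, however, can hardly be said to form such meaningful multi-symbol units, so a convolution filter bank is unnecessary. I think this is why Tacotron2 replaced CBHG with LSTM.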
    - Encoder: a bidirectional LSTM, so that the text embedding reflects both the preceding and the following context non-causally.
    - Decoder: a unidirectional 2-layer LSTM, because decoding is auto-regressive.
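As a rough illustration of the convolution bank idea, the sketch below slides windows of width 1 to K over a sequence and stacks the results per position. It is a toy under stated assumptions: each character is reduced to a single float, and uniform averaging kernels stand in for the learned convolution filters; the function name `conv_bank` is made up for this example.

```python
from typing import List, Sequence


def conv_bank(features: Sequence[float], max_width: int) -> List[List[float]]:
    """Return, for every position, one averaged feature per filter width 1..max_width."""
    n = len(features)
    bank: List[List[float]] = [[] for _ in range(n)]
    for width in range(1, max_width + 1):
        left = (width - 1) // 2                      # how far the window reaches to the left
        for t in range(n):
            window = [
                features[i] if 0 <= i < n else 0.0   # zero padding outside the sequence
                for i in range(t - left, t - left + width)
            ]
            bank[t].append(sum(window) / width)      # averaging stands in for a learned filter
    return bank


chars = [1.0, 2.0, 3.0, 4.0]          # stand-in for per-character features
stacked = conv_bank(chars, max_width=3)
print(stacked[0])                      # features for the first character at widths 1, 2, 3
```

Each position ends up with one feature per window width, which is the "try several n-gram sizes and aggregate" behavior described above.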
- How text embeddings and acoustic embeddings are aligned
  - Tacotron uses content-based (additive, tanh) attention, computing the attention alignment from the encoder states at all time steps and the previous decoder state. Tacotron2 uses 'location-sensitive attention', which additionally takes the attention alignment from previous time steps into account when computing the current alignment.
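  - Why I am not fond of this: location-sensitive attention was designed for speech recognition. When the encoder sequence is overwhelmingly longer than the decoder sequence, the model can lose track ('wait, how far did I attend last time?') while building the current decoder hidden state, so location-sensitive attention keeps a pointer to the encoder hidden states it mainly attended to before. A synthesizer has the opposite input-output relationship to a recognizer: its input is the shorter sequence, so location-sensitive attention is not really needed. I suspect the local-p attention from Luong's machine translation paper, which uses a Gaussian window around a predicted position to softly encourage a local, near-monotonic alignment, would work better.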
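The two mechanisms compared above can be sketched side by side. This is a scalar toy, not either paper's implementation: the query and encoder states are single floats, so the learned matrices collapse to the hypothetical scalar weights `w_q`, `w_k`, `w_f`, and `v`, and the convolution kernel and `sigma` are hand-picked for illustration.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1-D correlation; the output has the same length as `signal`."""
    half = len(kernel) // 2
    out = []
    for j in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = j + k - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def location_sensitive_weights(query, keys, prev_align, kernel,
                               w_q=1.0, w_k=1.0, w_f=1.0, v=1.0):
    """Additive (tanh) attention with an extra term computed from the previous alignment."""
    loc = conv1d_same(prev_align, kernel)   # location features
    energies = [v * math.tanh(w_q * query + w_k * key + w_f * f)
                for key, f in zip(keys, loc)]
    return softmax(energies)


def gaussian_window(align, center, sigma):
    """Luong-style local-p reweighting: damp weights far from the predicted `center`."""
    return [a * math.exp(-((s - center) ** 2) / (2.0 * sigma ** 2))
            for s, a in enumerate(align)]


# All encoder states look identical, so content alone cannot pick a position.
prev = [0.0, 0.0, 1.0, 0.0, 0.0]            # previous step attended to index 2
weights = location_sensitive_weights(0.0, [0.0] * 5, prev,
                                     kernel=[1.0, 0.0, 0.0], v=5.0)
windowed = gaussian_window([0.2] * 5, center=2.0, sigma=1.0)
print(weights.index(max(weights)), windowed.index(max(windowed)))
```

In this toy the kernel simply shifts the previous alignment one position forward, so the location term moves the peak from index 2 to index 3 even though the content scores are flat. The Gaussian window instead concentrates a flat alignment around a given center, which in local-p attention would be predicted from the decoder state.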
- Use of a neural vocoder and the postnet output
  - Tacotron1 uses Griffin-Lim, a non-neural vocoder, so the role of its postnet is to convert the mel spectrogram predicted by the decoder into a linear spectrogram.
  - Tacotron2 uses WaveNet, a neural vocoder, so the role of its postnet is to refine the mel spectrogram predicted by the decoder into a more detailed mel spectrogram.
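  - Why does Griffin-Lim take a linear spectrogram as input?
    - A mel filter bank places many overlapping filters along the frequency axis and maps the linear-frequency bins onto far fewer mel bands. Many of the filters taught in a DSP course are invertible, but this overlapping, many-to-one mapping has no exact linear inverse; a pseudo-inverse gives only an approximation. A neural network is therefore the practical way to predict a good linear spectrogram from a mel spectrogram. Griffin-Lim, on the other hand, is not a neural model, which is why Tacotron converts to a linear spectrogram before handing the result over.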
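The non-invertibility argument can be checked on a tiny example. The filter bank below is a toy assumed for illustration (3 overlapping triangular filters over 5 evenly spaced bins), not a real mel filter bank, which is mel-spaced and much larger.

```python
from typing import List, Sequence

# Toy filter bank: 3 overlapping triangular filters over 5 linear-frequency bins.
FILTERS = [
    [0.5, 1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5],
]


def to_mel(linear: Sequence[float]) -> List[float]:
    """Apply the filter bank: each band is a weighted sum of linear bins."""
    return [sum(w * x for w, x in zip(row, linear)) for row in FILTERS]


# Two different linear spectra ...
spec_a = [3.0, 1.0, 1.0, 2.0, 1.0]
spec_b = [1.0, 2.0, 1.0, 1.0, 3.0]

# ... that collapse to the same band energies.
print(to_mel(spec_a))  # [3.0, 2.5, 3.0]
print(to_mel(spec_b))  # [3.0, 2.5, 3.0]
```

Because two distinct inputs give identical outputs, no function of the mel values alone can recover the original linear spectrum exactly; any inversion has to pick one plausible candidate.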
Audio quality (naturalness MOS, as reported in the Tacotron2 paper)
- Tacotron (Griffin-Lim): 4.00
- Tacotron2: 4.53
- Natural speech: 4.58