References: [link], [link 2]
Abstract
We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis.
LPCNet achieves significantly higher quality than WaveRNN for the same network size, and high-quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS.
Concept
- The goal is a model that is lightweight yet still performs well
- Linear prediction takes the burden of spectral envelope modeling away from the neural network
- A spectral envelope is a curve in the frequency-amplitude plane, derived from a Fourier magnitude spectrum. It describes one point in time (one window, to be precise).
Prior Knowledge
PCM - Pulse-code modulation (PCM) is a digital representation of an analog signal: the signal amplitude is sampled at uniform intervals and then quantized into digital (binary) codes.
Mu-law companding algorithms reduce the dynamic range of an audio signal. In analog systems, this can increase the signal-to-noise ratio (SNR) achieved during transmission; in the digital domain, it can reduce the quantization error (hence increasing signal to quantization noise ratio). These SNR increases can be traded instead for reduced bandwidth for equivalent SNR.
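The compression curve itself is the standard µ-law function; a minimal NumPy sketch with µ = 255, the 8-bit setting used in telephony (and for LPCNet's output quantization):

```python
import numpy as np

def mulaw_compress(x, mu=255):
    """Map a signal in [-1, 1] through the mu-law compression curve."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_expand(y, mu=255):
    """Invert mu-law compression (expanding step of companding)."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```

Compress/expand are exact inverses; quantization error only enters once the compressed value is rounded to one of the 256 discrete levels.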

Excitation Signal
https://www.quora.com/What-is-excitation-of-audio-signal
Features
Cepstrum
- Audio → FFT → log magnitude spectrum → iFFT → Cepstrum
- serves as a tool to investigate periodic structures within frequency spectra
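The pipeline above as a NumPy sketch (the log step is what makes the result a cepstrum; `real_cepstrum` is my own helper name):

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    """Real cepstrum of one analysis window:
    FFT -> log magnitude spectrum -> inverse FFT."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + eps)  # eps avoids log(0)
    return np.fft.irfft(log_mag)
```

Peaks at nonzero quefrency in the result reveal periodic structure (e.g. pitch harmonics) in the spectrum.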

Bark Scale
- The Bark scale is a psychoacoustical scale proposed by Eberhard Zwicker in 1961. It is named after Heinrich Barkhausen who proposed the first subjective measurements of loudness.[1] One definition of the term is "...a frequency scale on which equal distances correspond with perceptually equal distances. Above about 500 Hz this scale is more or less equal to a logarithmic frequency axis. Below 500 Hz the Bark scale becomes more and more linear."[2]
- It is related to, but somewhat less popular than the mel scale, a perceptual scale of pitches judged by listeners to be equal in distance from one another.
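LPCNet's frame features include Bark-frequency cepstral coefficients, so the Hz-to-Bark mapping matters. A sketch using Zwicker's closed-form approximation of the scale (one of several published approximations):

```python
import math

def hz_to_bark(f):
    """Zwicker's closed-form approximation of the Bark scale:
    roughly linear below ~500 Hz, roughly logarithmic above."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)
```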

Contributions

pre-emphasis and quantization
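The paper applies a first-order pre-emphasis filter E(z) = 1 − αz⁻¹ (α = 0.85 in the paper) before 8-bit µ-law quantization, and the inverse de-emphasis filter at the synthesis output, which pushes the quantization noise away from the perceptually important low frequencies. A NumPy sketch of the filter pair:

```python
import numpy as np

def preemphasis(x, alpha=0.85):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y

def deemphasis(y, alpha=0.85):
    """Inverse filter 1 / (1 - alpha z^-1), applied at the output."""
    x = np.copy(y)
    for n in range(1, len(x)):
        x[n] += alpha * x[n - 1]
    return x
```

Without quantization in between, the two filters cancel exactly; with µ-law quantization in between, the de-emphasis shapes the noise spectrum.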

dual fc

we have found that the DualFC layer slightly improves quality compared to a regular fully-connected layer at equivalent complexity. The intuition behind the DualFC layer is that determining whether a value falls within a certain range (a µ-law quantization interval in this case) requires two comparisons, with each fully-connected tanh layer implementing the equivalent of one comparison.
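As described, the layer is out = a₁ ⊙ tanh(W₁x) + a₂ ⊙ tanh(W₂x) followed by a softmax over the 256 µ-law levels. A minimal NumPy sketch (weights here are random placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

class DualFC:
    """Two tanh branches combined with element-wise weights a1, a2,
    followed by softmax over the output bins."""
    def __init__(self, in_dim, out_dim):
        self.W1 = rng.standard_normal((out_dim, in_dim)) * 0.1
        self.W2 = rng.standard_normal((out_dim, in_dim)) * 0.1
        self.a1 = rng.standard_normal(out_dim)
        self.a2 = rng.standard_normal(out_dim)

    def __call__(self, x):
        z = self.a1 * np.tanh(self.W1 @ x) + self.a2 * np.tanh(self.W2 @ x)
        e = np.exp(z - z.max())  # numerically stable softmax
        return e / e.sum()
```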
Sparse matrices
- inspired by wavernn
- The GRU weights are progressively pruned as training proceeds (implemented as a training callback)
- lpcnet.Sparsify(2000, 40000, 400, (0.05, 0.05, 0.2)) → (t_start, t_end, interval, density)
- Sparsify every 400 steps, from step 2000 to step 40000
- All matrices start dense; the lowest-magnitude blocks are then zeroed out, repeating until the target sparseness is reached
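One round of the block-magnitude pruning described above can be sketched as follows (the paper uses 16×1 blocks and also always keeps the diagonal terms; the schedule and density-ramping logic of the actual `Sparsify` callback are omitted here):

```python
import numpy as np

def block_sparsify(W, density, block=(16, 1)):
    """Zero out the lowest-magnitude blocks of W so that only a
    `density` fraction of blocks survives."""
    bh, bw = block
    h, w = W.shape
    # mean absolute weight per (bh x bw) block
    mags = np.abs(W).reshape(h // bh, bh, w // bw, bw).mean(axis=(1, 3))
    k = int(mags.size * (1.0 - density))           # blocks to drop
    weakest = np.argsort(mags, axis=None)[:k]
    mask = np.ones_like(mags)
    mask.flat[weakest] = 0.0
    # expand the block mask back to element resolution
    return W * np.repeat(np.repeat(mask, bh, 0), bw, 1)
```

Repeating this every 400 steps with a gradually decreasing `density` yields the progressive schedule the callback arguments encode.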
Embeddings and algebraic simplification
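The simplification, as I understand the paper: because the sample inputs are discrete µ-law values fed through an embedding, the product of the embedding matrix with the GRU's non-recurrent input weights can be precomputed into a lookup table, eliminating those matrix multiplies at synthesis time. A sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, E_DIM, G = 256, 128, 384          # mu-law levels, embedding dim, GRU units (illustrative)

E = rng.standard_normal((Q, E_DIM))  # embedding matrix
U = rng.standard_normal((E_DIM, G))  # GRU non-recurrent weights for one gate

def naive(sample):
    """Look up the embedding, then multiply by the GRU input weights."""
    return E[sample] @ U

V = E @ U                            # precomputed once, offline

def fused(sample):
    """Same result as naive(), but a single table lookup at runtime."""
    return V[sample]
```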

Sampling from probability distribution
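Sampling directly from the raw output distribution adds audible noise; the paper sharpens the distribution (more aggressively in voiced frames, scaled by the pitch correlation) and subtracts a small constant from the probabilities so that negligible bins can never be drawn. A sketch in which `sharpen` and `floor` are illustrative placeholder values, not the paper's exact schedule:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_excitation(logits, sharpen=1.5, floor=0.002):
    """Sample one mu-law excitation value with sharpening and
    low-probability flooring applied to the distribution."""
    z = logits * sharpen                 # sharpen (lower temperature)
    p = np.exp(z - z.max())
    p /= p.sum()
    p = np.maximum(p - floor, 0.0)       # kill negligible bins
    p /= p.sum()                         # renormalize
    return rng.choice(len(p), p=p)
```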
