author     drowe67 <[email protected]>     2023-11-29 07:35:19 +1030
committer  David Rowe <[email protected]>  2023-11-29 07:35:19 +1030
commit     fbbea0946111c90e3c9ab90ac641cb2d5b8b4bc0 (patch)
tree       e442b463317aaeb45631e884a4fa59076d74d728
parent     ba7321c6f0b2fbd9d08eeda7b03b7a31c0aa878c (diff)
phase synthesis edits
-rw-r--r--  doc/codec2.pdf  bin 236214 -> 236412 bytes
-rw-r--r--  doc/codec2.tex  35
2 files changed, 19 insertions, 16 deletions
diff --git a/doc/codec2.pdf b/doc/codec2.pdf
index 7f81e46..dbb7d25 100644
--- a/doc/codec2.pdf
+++ b/doc/codec2.pdf
Binary files differ
diff --git a/doc/codec2.tex b/doc/codec2.tex
index f8d0a5d..bc2096f 100644
--- a/doc/codec2.tex
+++ b/doc/codec2.tex
@@ -512,7 +512,9 @@ The voicing decision is post processed by several experimentally derived rules t
\subsection{Phase Synthesis}
-In Codec 2 the harmonic phases $\{\theta_m\}$ are not transmitted to the decoder, instead they are synthesised at the decoder using a rules based algorithm and information from the remaining model parameters, $\{A_m\}$, $\omega_0$, and $v$. Consider the source-filter model of speech production:
+In Codec 2 the harmonic phases $\{\theta_m\}$ are not transmitted; instead they are synthesised at the decoder from the remaining model parameters $\{A_m\}$, $\omega_0$, and $v$. The phase model described in this section is referred to as ``zero order'' or \emph{phase0} in the source code, as it requires zero model parameters to be transmitted over the channel.
+
+Consider the source-filter model of speech production:
\begin{equation}
\hat{S}(z)=E(z)H(z)
\end{equation}
@@ -520,24 +522,22 @@ where $E(z)$ is an excitation signal with a relatively flat spectrum, and $H(z)$
\begin{equation}
\begin{split}
arg \left[ \hat{S}(e^{j \omega_0 m}) \right] &= arg \left[ E(e^{j \omega_0 m}) H(e^{j \omega_0 m}) \right] \\
-\hat{theta}_m &= arg \left[ E(e^{j \omega_0 m}) \right] + arg \left[ H(e^{j \omega_0 m}) \right] \\
+\hat{\theta}_m &= arg \left[ E(e^{j \omega_0 m}) \right] + arg \left[ H(e^{j \omega_0 m}) \right] \\
&= \phi_m + arg \left[ H(e^{j \omega_0 m}) \right]
\end{split}
\end{equation}
-For voiced speech $E(z)$ is an impulse train (period $P$ in the time domain and $\omega_0$ in the frequency domain). We can construct a time domain excitation pulse train using a sum of sinusoids:
+For voiced speech $E(z)$ is an impulse train (in both the time and frequency domains). We can construct a time domain excitation pulse train using a sum of sinusoids:
\begin{equation}
-e(n) = \sum_{m-1}^L e^{j m \omega_0 (n - n_0)}
+e(n) = \sum_{m=1}^L \cos( m \omega_0 (n - n_0))
\end{equation}
where $n_0$ is a time shift that represents the pulse position relative to the centre of the synthesis frame $n=0$. By finding the DTCF transform of $e(n)$ we can determine the phase of each excitation harmonic:
\begin{equation}
\phi_m = - m \omega_0 n_0
\end{equation}
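As a quick numerical check of this result, the following sketch (an editor's illustration, not Codec 2 source; the values of $\omega_0$, $L$, and $n_0$ are arbitrary) constructs the pulse train $e(n)$ and samples its DTCF transform at each harmonic. With these values $N\omega_0$ is a multiple of $2\pi$, so the finite sum over one frame recovers the phase exactly:

```c
/* Editor's sketch (not Codec 2 source): build the excitation pulse train
 * e(n) for a known pulse position n0, sample its DTCF transform at each
 * harmonic m*w0, and check arg[E] against phi_m = -m*w0*n0. */
#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 80                    /* synthesis frame length (10 ms at Fs = 8 kHz) */

static double wrap(double x) { return atan2(sin(x), cos(x)); }

int main(void) {
    double w0 = M_PI / 20.0;    /* 200 Hz fundamental at Fs = 8 kHz */
    int    L  = 10;             /* number of harmonics (illustrative) */
    int    n0 = 3;              /* pulse position, arbitrary for the test */
    double e[N];

    /* e(n) = sum_{m=1}^{L} cos(m*w0*(n - n0)) */
    for (int n = 0; n < N; n++) {
        e[n] = 0.0;
        for (int m = 1; m <= L; m++)
            e[n] += cos(m * w0 * (n - n0));
    }

    /* phase of each harmonic: should equal -m*w0*n0 (mod 2*pi) */
    for (int m = 1; m <= L; m++) {
        double complex E = 0.0;
        for (int n = 0; n < N; n++)
            E += e[n] * cexp(-I * m * w0 * n);
        printf("m=%2d  arg[E]=% .3f  -m*w0*n0=% .3f\n",
               m, carg(E), wrap(-m * w0 * n0));
    }
    return 0;
}
```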
-As we don't transmit any phase information the pulse position $n_0$ is unknown. Fortunately the ears is insensitive to the absolute position of pitch pulses in voiced speech, as long as they evolve smoothly over time (discontinuities in phase are a characteristic of unvoiced speech).
-
-The excitation pulses occur at a rate of $\omega_0$ (one for each pitch period). The phase of the first harmonic advances by $N \phi_0$ radians over a synthesis frame of $N$ samples. For example if $\omega_0 = \pi /20$ (200 Hz), then over a (10ms $N=80$) sample frame, the phase of the first harmonic would advance $(\pi/20)*80 = 4 \pi$ radians or two complete cycles.
+As we don't transmit any phase information the pulse position $n_0$ is unknown at the decoder. Fortunately the ear is insensitive to the absolute position of pitch pulses in voiced speech, as long as they evolve smoothly over time (discontinuities in phase are a characteristic of unvoiced speech).
-We therefore derive $n_0$ from the excitation phase of the fundamental, which we treat as a timing reference. Each frame we advance the phase of the fundamental:
+The excitation pulses occur at a rate of $\omega_0$ (one for each pitch period). The phase of the first harmonic advances by $N \omega_0$ radians over a synthesis frame of $N$ samples. For example, if $\omega_0 = \pi /20$ (200 Hz), then over a 10ms ($N=80$ sample) frame the phase of the first harmonic would advance $(\pi/20) \times 80 = 4 \pi$ radians, or two complete cycles. We therefore derive $n_0$ from the excitation phase of the fundamental, which we treat as a timing reference. Each frame we advance the phase of the fundamental:
\begin{equation}
\phi_1^l = \phi_1^{l-1} + N\omega_0
\end{equation}
@@ -546,27 +546,27 @@ Given $\phi_1$ we can compute $n_0$ and the excitation phase of the other harmon
\begin{split}
n_0 &= -\phi_1 / \omega_0 \\
\phi_m &= - m \omega_0 n_0 \\
- &= m \phi_1, \quad m=2,...,L
+ &= m \phi_1 \quad \quad m=2,...,L
\end{split}
\end{equation}
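Putting the voiced case together, here is a minimal sketch of the per-frame recursion (an editor's illustration; the real implementation lives in the Codec 2 sources, and the function and array names here are invented):

```c
/* Minimal sketch of voiced "phase0" synthesis: advance the fundamental's
 * phase by N*w0 each frame, derive the other excitation phases from it,
 * and add the phase of H(z) sampled at each harmonic.  argH[] is assumed
 * to come from the minimum phase analysis of {A_m}; arrays are indexed
 * 1..L.  Names are illustrative, not Codec 2's actual API. */
#include <math.h>

#define N 80                            /* samples per synthesis frame */

void phase0_voiced(double w0, int L, const double argH[],
                   double *phi1, double theta[]) {
    /* phi_1^l = phi_1^{l-1} + N*w0, kept wrapped for numerical safety */
    *phi1 = fmod(*phi1 + N * w0, 2.0 * M_PI);

    for (int m = 1; m <= L; m++) {
        double phi_m = m * (*phi1);     /* phi_m = m*phi_1 = -m*w0*n0 */
        theta[m] = phi_m + argH[m];     /* theta_m = phi_m + arg[H(e^{j w0 m})] */
    }
}
```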
-For unvoiced speech $E(z)$ is a white noise signal. At each frame we sample a random number generator on the interval $-/pi ... /pi$ to obtain the excitation phase of each harmonic. We set $\omega_0 = F0_min$ to use a large number of harmonics to synthesise to approximate a noise signal.
+For unvoiced speech $E(z)$ is a white noise signal. At each frame we sample a random number generator on the interval $-\pi ... \pi$ to obtain the excitation phase of each harmonic. We set $F_0 = 50$ Hz so that a large number of harmonics ($L=4000/50=80$) are used in synthesis, to best approximate a noise signal.
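A corresponding sketch for the unvoiced case (again illustrative; the standard library rand() stands in for whatever generator the real decoder uses):

```c
#include <stdlib.h>
#include <math.h>

/* Unvoiced excitation: uniform random phase in [-pi, pi] per harmonic.
 * With F0 = 50 Hz and a 4 kHz bandwidth, L = 4000/50 = 80 harmonics. */
void phase0_unvoiced(int L, double phi[]) {
    for (int m = 1; m <= L; m++)
        phi[m] = 2.0 * M_PI * ((double)rand() / RAND_MAX) - M_PI;
}
```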
-An additional phase component is provided by sampling $H(z)$ at the harmonic centres. The phase spectra of $H(z)$ is derived from the filter magnitude response described by $\{A_m\}$ available at the decoder using minimum phase techniques. The method for deriving the phase differs between Codec 2 modes and is described below in Sections \ref{sect:mode_lpc_lsp} and \ref{sect:mode_newamp1}. This component of the phase tends to disperse the pitch pulse energy in time, especially around spectral peaks (formants) where ``ringing" occurs.
+An additional phase component is provided by sampling $H(z)$ at the harmonic centres. The phase spectrum of $H(z)$ is derived from the filter magnitude response using minimum phase techniques. The method for deriving the phase spectrum of $H(z)$ differs between Codec 2 modes and is described below in Sections \ref{sect:mode_lpc_lsp} and \ref{sect:mode_newamp1}. This component of the phase tends to disperse the pitch pulse energy in time, especially around spectral peaks (formants).
-TODO: phase postfilter
+The zero phase model tends to make speech with background noise sound ``clicky''. With high levels of background noise the low level inter-formant parts of the spectrum will contain noise rather than speech harmonics, so modelling them as voiced (i.e. a continuous, non-random phase track) is inaccurate. Some codecs (like MBE) have a mixed voicing model that breaks the spectrum into voiced and unvoiced regions. However, 5-12 bits/frame are required to transmit the frequency selective voicing information. Mixed excitation also requires accurate voicing estimation (parameter estimators always break occasionally under exceptional conditions).
+
+In our case we use a post processing approach which requires no additional bits to be transmitted. The decoder measures the average level of the background noise during unvoiced frames. If a harmonic is beneath this level it is made unvoiced by randomising its phase.
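A sketch of that post filter logic (an editor's illustration; the smoothing constant and the use of the mean harmonic amplitude as the noise estimate are assumptions, not taken from the Codec 2 sources):

```c
#include <stdlib.h>
#include <math.h>

static double noise_est = 0.0;      /* running background noise level */

/* During unvoiced frames, update the background noise estimate from the
 * mean harmonic amplitude; during voiced frames, randomise the phase of
 * any harmonic whose amplitude falls below that estimate. */
void phase_postfilter(int voiced, int L, const double A[], double theta[]) {
    if (!voiced) {
        double mean = 0.0;
        for (int m = 1; m <= L; m++)
            mean += A[m];
        noise_est = 0.9 * noise_est + 0.1 * (mean / L); /* assumed smoothing */
        return;
    }
    for (int m = 1; m <= L; m++)
        if (A[m] < noise_est)       /* below noise floor: treat as unvoiced */
            theta[m] = 2.0 * M_PI * ((double)rand() / RAND_MAX) - M_PI;
}
```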
Compared to speech synthesised using original phases $\{\theta_m\}$, the following observations have been made:
\begin{enumerate}
\item Through headphones speech synthesised with this model drops in quality. Through a small loudspeaker it is very close to original phases.
\item If there are voicing errors, the speech can sound clicky or staticky. If voiced speech is mistakenly declared unvoiced, this model tends to synthesise annoying impulses or clicks, as for voiced speech $H(z)$ is relatively flat (broad, high frequency formants), so there is very little dispersion of the excitation impulses through $H(z)$.
\item When combined with amplitude modelling or quantisation, such that $H(z)$ is derived from $\{\hat{A}_m\}$ there is an additional drop in quality.
-\item This synthesis model is effectively the same as a simple LPC-10 vocoders, and yet (especially when $H(z)$ is derived from $\{A_m\}$) sounds much better. Conventional wisdom (AMBE, MELP) says mixed voicing is required for high quality speech.
+\item This synthesis model (i.e. a pulse train exciting an LPC filter) is effectively the same as a simple LPC-10 vocoder, and yet (especially when $H(z)$ is derived from unquantised $\{A_m\}$) sounds much better. Conventional wisdom (AMBE, MELP) says mixed voicing is required for high quality speech.
\item If $H(z)$ is changing rapidly between frames, its phase contribution may also change rapidly. This approach could cause some discontinuities in the phase at the edge of synthesis frames, as no attempt is made to ensure that the phase tracks are continuous (the excitation phases are continuous, but not the final phases after filtering by $H(z)$).
\end{enumerate}
-TODO: Energy distribution theory. Need to V model, neural vocoders, non-linear function. Figures and simulation plots would be useful. Figure of phase synthesis.
-
\subsection{LPC/LSP based modes}
\label{sect:mode_lpc_lsp}
@@ -578,10 +578,11 @@ Block diagram of LPC/LSP mode encoder and decoder. Walk through operation. Dec
\section{Further Work}
\begin{enumerate}
-\item Some worked examples aimed at the experimenter - e.g. using c2sim to extract and plot model parameters
+\item Some worked examples aimed at the experimenter - e.g. using c2sim to extract and plot model parameters. Listen to the effect of the various stages of quantisation.
\item How to use Octave tools to single step through codec operation
\item Table summarising source files with one line description
\item Add doc license (Creative Commons?)
+\item Energy distribution theory. Need for V model, neural vocoders, non-linear function. Figures and simulation plots would be useful. Figure of phase synthesis.
\end{enumerate}
\section{Glossary}
@@ -595,6 +596,7 @@ Block diagram of LPC/LSP mode encoder and decoder. Walk through operation. Dec
Acronym & Description \\
\hline
DFT & Discrete Fourier Transform \\
+DTCF & Discrete Time Continuous Frequency Fourier Transform \\
IDFT & Inverse Discrete Fourier Transform \\
MBE & Multi-Band Excitation \\
NLP & Non Linear Pitch (algorithm) \\
@@ -623,6 +625,7 @@ $r$ & Maps a harmonic number $m$ to a DFT index \\
$s(n)$ & Input speech \\
$s_w(n)$ & Time domain windowed input speech \\
$S_w(k)$ & Frequency domain windowed input speech \\
+$\phi_m$ & Phase of excitation harmonic & radians \\
$\omega_0$ & Fundamental frequency (pitch) & radians/sample \\
$v$ & Voicing decision for the current frame \\
\hline