-rw-r--r--  doc/codec2.pdf  bin  168242 -> 159985 bytes
-rw-r--r--  doc/codec2.tex  10
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/doc/codec2.pdf b/doc/codec2.pdf
index 35aace6..4ca8f32 100644
--- a/doc/codec2.pdf
+++ b/doc/codec2.pdf
Binary files differ
diff --git a/doc/codec2.tex b/doc/codec2.tex
index bfd8698..22fa749 100644
--- a/doc/codec2.tex
+++ b/doc/codec2.tex
@@ -75,7 +75,7 @@ This production of this document was kindly supported by an ARDC grant \cite{ard
 
 \subsection{Model Based Speech Coding}
 
-A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
+A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of ``what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
 
 As such low bit rates we use a speech production ``model". The input speech is anlaysed, and we extract model parameters, which are then sent over the channel. An example of a model based parameter is the pitch of the person speaking. We estimate the pitch of the speaker, quantise it to a 7 bit number, and send that over the channel every 20ms.
 
@@ -106,7 +106,7 @@ Note that each harmonic has it's own amplitude, that varies across frequency. T
 A sinewave will cause a spike or spectral line on a spectrum plot, so we can see each spike as a small sine wave generator. Each sine wave generator has it's own frequency that are all multiples of the fundamental pitch frequency (e.g. $230, 460, 690,...$ Hz). They will also have their own amplitude and phase. If we add all the sine waves together (Figure \ref{fig:sinusoidal_model}) we can produce reasonable quality synthesised speech. This is called sinusoidal speech coding and is the speech production ``model" at the heart of Codec 2.
 
 \begin{figure}[h]
-\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4kHz.}
+\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4 kHz.}
 \label{fig:sinusoidal_model}
 \begin{center}
 \begin{tikzpicture}[>=triangle 45,x=1.0cm,y=1.0cm]
@@ -132,7 +132,7 @@ A sinewave will cause a spike or spectral line on a spectrum plot, so we can see
 \end{center}
 \end{figure}
 
-The model parameters evolve over time, but can generally be considered constant for short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
+The model parameters evolve over time, but can generally be considered constant for a short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
 
 As the model parameters change over time, we need to keep updating them. This is known as the \emph{frame rate} of the codec, which can be expressed in terms of frequency (Hz) or time (ms). For sampling model parameters Codec 2 uses a frame rate of 10ms. For transmission over the channel we reduce this to 20-40ms, in order to lower the bit rate. The trade off with a lower frame rate is reduced speech quality.
 
@@ -186,7 +186,7 @@ Up until this point the processing happens at a 10ms frame rate. However in the
 
 Once we have the desired frame rate, we ``quantise"" each model parameter. This means we use a fixed number of bits to represent it, so we can send the bits over the channel. Parameters like pitch and voicing are fairly easy, but quite a bit of DSP goes into quantising the spectral amplitudes. For the higher bit rate Codec 2 modes, we design a filter that matches the spectral amplitudes, then send a quantised version of the filter over the channel. Using the example in Figure \ref{fig:hts2a_time} - the filter would have a band pass peaks at 500 and 2300 Hz. It's frequency response would follow the red line. The filter is time varying - we redesign it for every frame.
 
-You'll notice the term "estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
+You'll notice the term ``estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
 
 Often the errors interact, for example the fine pitch error shown above will mean the amplitude estimates are a little bit off as well. Fortunately these errors tend to be temporary, and are sometimes not even noticeable to the listener - remember this codec is often used for HF/VHF radio where channel noise is part of the normal experience.
 
@@ -224,7 +224,7 @@ Frames of speech are synthesised using an inverse FFT. We take a blank array of
 
 Table \ref{tab:bit_allocation} presents the bit allocation for two popular Codec 2 modes. One additional parameter is the frame energy, this is the average level of the spectral amplitudes, or ``AF gain" of the speech frame.
 
-At very low bit rates such as 700C, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
+At very low bit rates such as 700 bits/s, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
 
 Vector Quantisation can only represent what is present in the tables, so if it sees anything unusual (for example a different microphone frequency response or background noise), the quantisation can become very rough and speech quality poor. We train the tables at design time using a database of speech samples and a training algorithm - an early form of machine learning.
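
The paragraphs touched around lines 106-132 describe the sinusoidal model at the heart of Codec 2: synthesised speech is a sum of L sinewaves at multiples of the fundamental F0, each with its own amplitude and phase. A minimal sketch of that sum in C follows; the function and variable names are illustrative only, and this is not the Codec 2 implementation (which, as the line 224 hunk notes, synthesises frames via an inverse FFT).

    #include <math.h>

    /* Sum of L harmonic sinewaves: s[n] = sum over m of A[m]*cos(m*Wo*n + phi[m]).
       Wo is the fundamental in radians per sample (2*pi*F0/Fs), and L is roughly
       how many harmonics fit below half the sample rate (4 kHz at Fs = 8 kHz).
       A[] and phi[] are indexed 1..L to match the A_1..A_L notation. */
    static void synth_frame(float s[], int n_samples, float Wo, int L,
                            const float A[], const float phi[]) {
        for (int n = 0; n < n_samples; n++) {
            s[n] = 0.0f;
            for (int m = 1; m <= L; m++)
                s[n] += A[m] * cosf(m * Wo * n + phi[m]);
        }
    }

For example, with F0 = 230 Hz and Fs = 8 kHz, Wo = 2*pi*230/8000 and L = floor(pi/Wo) = 17 harmonics fit below 4 kHz.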
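
The hunk at line 75 mentions estimating the speaker's pitch, quantising it to a 7 bit number, and sending it every 20 ms. A hedged sketch of a uniform 7 bit scalar quantiser in that spirit is below; the 50-400 Hz range, the uniform spacing, and the names are assumptions for illustration, not the scheme used in the Codec 2 sources.

    /* Map a pitch estimate f0 (Hz) to one of 128 levels (7 bits) and back.
       Range and uniform spacing are illustrative assumptions. */
    #define PITCH_MIN_HZ  50.0f
    #define PITCH_MAX_HZ 400.0f
    #define PITCH_LEVELS 128            /* 2^7 */

    static int pitch_encode(float f0) {
        float step = (PITCH_MAX_HZ - PITCH_MIN_HZ) / (PITCH_LEVELS - 1);
        int index = (int)((f0 - PITCH_MIN_HZ) / step + 0.5f);
        if (index < 0) index = 0;
        if (index > PITCH_LEVELS - 1) index = PITCH_LEVELS - 1;
        return index;                   /* 7 bit value sent over the channel */
    }

    static float pitch_decode(int index) {
        float step = (PITCH_MAX_HZ - PITCH_MIN_HZ) / (PITCH_LEVELS - 1);
        return PITCH_MIN_HZ + index * step;
    }

At a 20 ms update rate this one parameter costs 7/0.02 = 350 bits/s of the budget.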
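
The final hunk changes the paragraph on Vector Quantisation: choose the codebook row that best matches the frame's spectral amplitudes and transmit its index (9 bits for a 512 row table). A minimal nearest-neighbour search in that spirit follows; the mean squared error metric and the names are assumptions, not code from the Codec 2 sources.

    #include <float.h>

    /* Return the index of the codebook row closest to the target vector.
       cb holds rows*k floats, row-major; for a 512 row table the returned
       index fits in 9 bits. */
    static int vq_search(const float *target, const float *cb, int rows, int k) {
        int best_index = 0;
        float best_err = FLT_MAX;
        for (int r = 0; r < rows; r++) {
            float err = 0.0f;
            for (int i = 0; i < k; i++) {
                float diff = target[i] - cb[r * k + i];
                err += diff * diff;
            }
            if (err < best_err) {
                best_err = err;
                best_index = r;
            }
        }
        return best_index;
    }

A second 512 entry stage, as described for 700C, would search another table against the residual (the target minus the row chosen in the first stage), giving the 18 bits mentioned in the changed paragraph.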
