Published on March 20, 2014
Basic Audio Compression Techniques

ADPCM in Speech Coding
Adaptive differential pulse-code modulation (ADPCM) is a core technique in speech-signal compression. ADPCM is a variant of DPCM; the main difference is in the quantization step. ADPCM varies the quantizer step size to achieve better performance, at the cost of added complexity.

ADPCM forms the heart of the ITU speech compression standards G.721, G.723, G.726, and G.727. These standards differ in bit-rate (from 3 to 5 bits per sample) and in some algorithmic details. The input is μ-law coded PCM 16-bit samples.

G.726 ADPCM
G.726 works by adapting a fixed quantizer in a simple way. The different codeword sizes used amount to bit-rates of 16 kbps, 24 kbps, 32 kbps, or 40 kbps at an 8 kHz sampling rate. The standard defines a multiplier constant α that changes for every difference value e_n, depending on the current signal, and a difference signal g_n defined in terms of it.

Backward Adaptive Quantizer
A backward adaptive quantizer works, in principle, by noticing either of two cases:
- If too many values are quantized to values far from zero, the quantization steps are too small, so the quantizer increases the step size.
- If too many values fall close to zero, the quantization steps are too large, so the quantizer decreases the step size.

Vocoders
Vocoders are developed for speech signals; they are not useful for other signals, such as images or music. They are classified into two types:
- Frequency-domain vocoders divide the signal into frequency components and model each part, e.g. the channel vocoder and the formant vocoder.
- Time-domain vocoders use a model of the speech waveform in the time domain, e.g. LPC (Linear Predictive Coding).

Phase Insensitivity
A complete reconstruction of the speech waveform is, perceptually, unnecessary: all that is needed is for the amount of energy at any time to be about right, and the signal will sound about right. Phase is a shift in the time argument inside a function of time. In the figure, the solid line is a superposition of two cosines with a phase shift, and the dashed line is the same superposition with no phase shift: the waveforms differ, yet the sounds are perceptually the same.

Channel Vocoder
Vocoders can operate at low bit-rates, 1–2 kbps. A channel vocoder uses a filter bank to separate out the different frequency components. It also analyzes the signal to determine the general pitch of the speech (low for bass, high for tenor) and its excitation. The vocoder applies a vocal-tract transfer model to generate a vector of excitation parameters that describe a model of the sound, and it also guesses whether the sound is voiced or unvoiced.

Formant Vocoder
Formants are the salient frequency components present in a sample of speech. A formant vocoder encodes only the most important frequencies.

Linear Predictive Coding (LPC)
LPC vocoders extract salient features of speech directly from the waveform, rather than transforming the signal to the frequency domain. LPC:
- uses a time-varying model of vocal-tract sound generated from a given excitation;
- transmits only a set of parameters modeling the shape and excitation of the vocal tract, not actual signals or differences.
LPC starts by deciding whether the current segment is voiced or unvoiced:
- For unvoiced segments, a wide-band noise generator creates sample values f(n) that act as input to the vocal-tract simulator.
- For voiced segments, a pulse-train generator creates the values f(n).
If f(n) are the input values and s(n) the output values, the output s(n) depends on the p previous output samples:

    s(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p) + G f(n)

The LP coefficients a_i can be calculated by minimizing the expected squared prediction error; taking the derivative with respect to each a_i and setting it to zero yields a set of p equations, one per coefficient. An often-used method to calculate the LP coefficients is the autocorrelation method. The gain G can be calculated from the residual error energy, and the pitch P of the current speech frame can be extracted by a correlation method, by finding the lag at which the correlation peaks.

Code Excited Linear Prediction (CELP)
One of the most important factors in generating natural-sounding speech is the excitation signal. The human ear is sensitive to pitch errors, so an LPC filter driven by a single periodic pulse excitation is not good enough. For each segment, the encoder finds the excitation vector that generates synthesized speech best matching the speech segment: an entire set (a codebook) of excitation vectors is matched against the actual speech, and the index of the best match is sent to the receiver. The quality achieved this way is sufficient for audio conferencing.

CELP coders contain two kinds of prediction:
- LTP (long-time prediction) tries to reduce redundancy in speech signals by finding the basic periodicity, or pitch.
- STP (short-time prediction) tries to eliminate redundancy in speech signals by attempting to predict the next sample from several previous ones.
STP finds an LPC filter; LTP finds an excitation sequence from a codebook. The excitation sequence from LTP is passed through the filter from STP to synthesize the speech.
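The LPC analysis steps described above (autocorrelation, solving for the coefficients, gain, and pitch extraction) can be sketched in pure Python. The frame length (200 samples), predictor order (2), and the test sinusoid below are illustrative assumptions, not values taken from any standard; the Levinson-Durbin recursion is the classic way to solve the normal equations that the autocorrelation method produces.

```python
import math

def autocorrelation(frame, max_lag):
    """R[k] = sum_n frame[n] * frame[n + k], for k = 0..max_lag."""
    N = len(frame)
    return [sum(frame[n] * frame[n + k] for n in range(N - k))
            for k in range(max_lag + 1)]

def levinson_durbin(R, p):
    """Solve the LP normal equations for coefficients a_1..a_p.

    Returns (a, E): a[i-1] multiplies s(n - i), and E is the residual
    prediction-error energy, so the gain is G = sqrt(E).
    """
    a = [0.0] * (p + 1)          # a[0] is unused
    E = R[0]
    for i in range(1, p + 1):
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / E
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):    # update earlier coefficients
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        E *= 1.0 - k * k         # shrink the error energy
    return a[1:], E

# Toy frame: a sinusoid with a period of 10 samples, which an
# order-2 predictor models almost exactly.
omega = 2 * math.pi / 10
frame = [math.sin(omega * n) for n in range(200)]

R = autocorrelation(frame, 30)
a, E = levinson_durbin(R, 2)
G = math.sqrt(E)
# Pitch: lag of the autocorrelation peak over a plausible lag range.
pitch = max(range(5, 30), key=lambda k: R[k])

print(a)      # close to [2*cos(omega), -1], the exact sinusoid recursion
print(pitch)  # 10
```

A sinusoid satisfies s(n) = 2cos(ω)s(n-1) - s(n-2) exactly, so the recovered coefficients land near those values and the residual energy E is small, which is the sanity check used here.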
Adaptive Codebook Searching
A codeword is a shifted speech-residue segment, indexed by the lag τ corresponding to the current speech frame or subframe. The coder looks in a codebook of waveforms to find one that matches the current subframe. There are two types of searching: open loop and closed loop.

Open-Loop Codeword Searching
Open-loop searching tries to minimize the long-term prediction error. Setting the partial derivative with respect to the gain g_0 to zero yields the minimum summed-error value.

Closed-Loop Codeword Searching
Closed-loop search is more often used in CELP coders; it is also called analysis-by-synthesis (A-B-S). Speech is reconstructed with the perceptual error minimized via an adaptive codebook search, rather than by considering sums of squares alone. The best candidate in the adaptive codebook is selected to minimize the distortion of the locally reconstructed speech; the parameters are found by minimizing the difference between the original and the reconstructed speech.

Hybrid Excitation Vocoders
Hybrid excitation vocoders differ from CELP in that they use model-based methods to introduce multi-model excitation. They include two major types:
- MBE (Multi-Band Excitation)
- MELP (Multiband Excitation Linear Predictive)

MBE Vocoder
MBE utilizes the A-B-S scheme in parameter estimation: parameters such as the fundamental frequency, spectrum envelope, and sub-band decisions are all obtained by closed-loop searching. The closed-loop optimization is based on minimizing the weighted reconstructed-speech error, which can be represented in the frequency domain.

MELP Vocoder
MELP is also based on LPC analysis, but uses a multiband soft-decision model for the excitation signal. The LP residue is band-passed and a voicing strength parameter is estimated for each band.
Speech can then be reconstructed by passing the excitation through the LPC synthesis filter. Differently from MBE, MELP divides the excitation into fixed bands of 0–500, 500–1000, 1000–2000, 2000–3000, and 3000–4000 Hz. A voicing degree parameter is estimated in each band, based on the normalized correlation function of the speech signal.

MPEG Audio Compression

Psychoacoustics
Human hearing and voice: the range of human hearing is about 20 Hz to 20 kHz, with the greatest sensitivity at 2 to 4 kHz. The normal voice range is about 500 Hz to 2 kHz; low frequencies carry vowels and bass, while high frequencies carry consonants.

Frequency Masking
Lossy audio compression methods, such as MPEG/Audio encoding, remove some sounds that are masked anyway, thus reducing the total amount of information. The general situation with regard to masking is as follows:
- A lower tone can effectively mask (make us unable to hear) a higher tone.
- The reverse is not true: a higher tone does not mask a lower tone well.
- As a consequence, if two tones are widely separated in frequency, little masking occurs.

Threshold of Hearing
Human hearing has a threshold, which varies with frequency. At any given frequency, a sound whose level (in dB) is below the corresponding threshold is inaudible; raise the sound above the threshold and it becomes audible. There is an approximate formula for the threshold curve.

Temporal Masking
Any loud tone causes the hearing receptors in the inner ear to become saturated and require time to recover. A sound occurring right after a loud sound is therefore difficult to hear; the masking can last about 0.1 second even after the louder sound has vanished. Masking occurs when there is a certain time delay between the two sounds.
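The slides mention an approximate formula for the threshold-of-hearing curve without reproducing it. The version most often quoted in MPEG psychoacoustics discussions is Terhardt's threshold-in-quiet approximation, sketched below; assuming that is the formula intended, it reproduces the behavior described above, with the lowest (most sensitive) threshold in the 2-4 kHz region.

```python
import math

def threshold_of_hearing_db(f_hz):
    """Approximate absolute threshold of hearing (dB SPL) at frequency
    f_hz, using Terhardt's widely cited curve (valid ~20 Hz - 20 kHz)."""
    f = f_hz / 1000.0                      # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The threshold dips near 3-4 kHz (most sensitive) and rises steeply
# at both ends of the audible range.
for f in (100, 1000, 3300, 10000, 16000):
    print(f, "Hz ->", round(threshold_of_hearing_db(f), 1), "dB")
```

In an MPEG-style encoder, any spectral component falling below this curve (or below the masking curves raised by nearby louder tones) can be discarded or coarsely quantized without audible effect.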