Internet Draft S. V. Andersen Document: [-draft-ietf-avt-ilbc-codec-00.txt-] {+draft-ietf-avt-ilbc-codec-01.txt+} H. [-Astrom-] {+Åström+} Category: Experimental A. Duric [-September 20th 2002-] {+March 3rd, 2003+} F. [-Galschiodt-] {+Galschiödt+} Expires: [-March 20th-] {+September 3rd,+} 2003 R. Hagen W. B. Kleijn J. Linden M. N. Murthi J. Skoglund J. Spittka Global IP Sound

Internet Low Bit Rate Codec

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document specifies a speech codec suitable for robust voice communication over IP. The codec is developed by Global IP Sound (GIPS). It is designed for narrow band speech and results in a payload bit rate of 13.33 [-kbit/s, with an encoding frame length of-] {+kbit/s for+} 30 [-ms.-] {+ms frames and 15.20 kbit/s for 20 ms frames.+} The codec enables graceful speech quality degradation in the case of lost frames, which occurs in connection with lost or delayed IP packets.

Andersen et. al. 1 Internet Low Bit Rate Codec [-September 2002-] {+March 2003+}

Table of Contents Status of this Memo................................................1 Abstract...........................................................1 Table of Contents..................................................2 1. 
INTRODUCTION....................................................5 2. OUTLINE OF THE CODEC............................................5 2.1 Encoder........................................................6 2.2 Decoder........................................................7 3. ENCODER [-PRINCIPLES..............................................7-] {+PRINCIPLES..............................................8+} 3.2 LPC Analysis and [-Quantization..................................8-] {+Quantization..................................9+} 3.2.1 Computation of Autocorrelation Coefficients..................9 3.2.2 Computation of LPC [-Coefficients.............................10-] {+Coefficients.............................11+} 3.2.3 Computation of LSF Coefficients from LPC Coefficients.......11 3.2.4 Quantization of LSF Coefficients............................11 3.2.5 Stability Check of LSF Coefficients.........................12 3.2.6 Interpolation of LSF [-Coefficients...........................12-] {+Coefficients...........................13 3.2.7 LPC Analysis and Quantization for 20 ms frames..............13+} 3.3 Calculation of the [-Residual...................................13-] {+Residual...................................14+} 3.4 Perceptual Weighting [-Filter...................................13-] {+Filter...................................14+} 3.5 Start State [-Encoder...........................................13-] {+Encoder...........................................15+} 3.5.1 Start State [-Estimation......................................14-] {+Estimation......................................15+} 3.5.2 All-Pass Filtering and Scale [-Quantization...................15-] {+Quantization...................16 3.5.3 Scalar Quantization.........................................17+} 3.6 Encoding the remaining [-samples................................16-] {+samples................................17+} 3.6.1 Codebook [-Memory.............................................17-] 
{+Memory.............................................18+} 3.6.2 Perceptual Weighting of Codebook Memory and [-Target..........19-] {+Target..........20+} 3.6.3 Codebook [-Creation...........................................19-] {+Creation...........................................20+} 3.6.3.1 Creation of a Base [-Codebook...............................20-] {+Codebook...............................21+} 3.6.3.2 Codebook [-Expansion........................................20-] {+Expansion........................................21+} 3.6.3.3 Codebook [-Augmentation.....................................21-] {+Augmentation.....................................22+} 3.6.4 Codebook [-Search.............................................22-] {+Search.............................................23+} 3.6.4.1 The Codebook Search at Each [-Stage.........................22-] {+Stage.........................24+} 3.6.3.2 The Gain Quantization at Each [-Stage.......................23-] {+Stage.......................24+} 3.6.3.3 Preparation of Target for Next [-Stage......................24-] {+Stage......................25+} 3.7 Gain Correction [-Encoding......................................24-] {+Encoding......................................25+} 3.8 Bitstream [-Definition..........................................25-] {+Definition..........................................26+} 4. 
DECODER [-PRINCIPLES.............................................27-] {+PRINCIPLES.............................................29+} 4.1 LPC Filter [-Reconstruction.....................................28-] {+Reconstruction.....................................29+} 4.2 Start State [-Reconstruction....................................28-] {+Reconstruction....................................30+} 4.3 Excitation Decoding [-Loop......................................29-] {+Loop......................................30+} 4.4 Multistage Adaptive Codebook [-Decoding.........................29-] {+Decoding.........................31+} 4.4.1 Construction of the Decoded Excitation [-Signal...............29-] {+Signal...............31+} 4.5 Packet Loss [-Concealment.......................................30-] {+Concealment.......................................32+} 4.5.1 Block Received Correctly and Previous Block also [-Received...30-] {+Received...32+} 4.5.2 Block Not [-Received..........................................30 4.5.3 Block Received Correctly When Previous Block Not Received...31 4.6 Enhancement...................................................31-] {+Received..........................................32+} Andersen et. al. 
Experimental - Expires [-March 20th,-] {+September 3rd,+} 2003 2 Internet Low Bit Rate Codec [-September 2002-] {+March 2003 4.5.3 Block Received Correctly When Previous Block Not Received...33 4.6 Enhancement...................................................33+} 4.6.1 Estimating the [-pitch........................................33-] {+pitch........................................35+} 4.6.2 Determination of the Pitch-Synchronous [-Sequences............33-] {+Sequences............35+} 4.6.3 Calculation of the smoothed [-excitation......................34-] {+excitation......................37+} 4.6.4 Enhancer [-criterion..........................................35-] {+criterion..........................................37+} 4.6.5 Enhancing the [-excitation....................................35-] {+excitation....................................37+} 4.7 Synthesis [-Filtering...........................................36-] {+Filtering...........................................38+} 4.8 Post [-Filtering................................................36-] {+Filtering................................................38+} 5. SECURITY [-CONSIDERATIONS........................................36-] {+CONSIDERATIONS........................................38+} 6. [-REFERENCES.....................................................36-] {+REFERENCES.....................................................39+} 7. [-ACKNOWLEDGEMENTS...............................................36-] {+ACKNOWLEDGEMENTS...............................................39+} 8. 
AUTHOR'S [-ADDRESSES.............................................37-] {+ADDRESSES.............................................40+} APPENDIX A REFERENCE [-IMPLEMENTATION...............................39-] {+IMPLEMENTATION...............................42+} A.1 [-iLBC_test.c...................................................40-] {+iLBC_test.c...................................................43+} A.2 [-iLBC_encode.h.................................................45-] {+iLBC_encode.h.................................................48+} A.3 [-iLBC_encode.c.................................................46-] {+iLBC_encode.c.................................................49+} A.4 [-iLBC_decode.h.................................................54-] {+iLBC_decode.h.................................................58+} A.5 [-iLBC_decode.c.................................................55-] {+iLBC_decode.c.................................................59+} A.6 [-iLBC_define.h.................................................65-] {+iLBC_define.h.................................................70+} A.7 [-constants.h...................................................68-] {+constants.h...................................................74+} A.8 [-constants.c...................................................69-] {+constants.c...................................................75+} A.9 [-anaFilter.h...................................................82-] {+anaFilter.h...................................................88+} A.10 [-anaFilter.c..................................................83-] {+anaFilter.c..................................................89+} A.11 [-createCB.h...................................................84-] {+createCB.h...................................................90+} A.12 [-createCB.c...................................................85-] {+createCB.c...................................................91+} A.13 
[-doCPLC.h.....................................................89-] {+doCPLC.h.....................................................95+} A.14 [-doCPLC.c.....................................................90-] {+doCPLC.c.....................................................96+} A.15 [-enhancer.h...................................................95-] {+enhancer.h..................................................100+} A.16 [-enhancer.c...................................................96-] {+enhancer.c..................................................101+} A.17 [-filter.h....................................................107-] {+filter.h....................................................113+} A.18 [-filter.c....................................................109-] {+filter.c....................................................114+} A.19 [-FrameClassify.h.............................................112-] {+FrameClassify.h.............................................117+} A.20 [-FrameClassify.c.............................................112-] {+FrameClassify.c.............................................118+} A.21 [-gainquant.h.................................................114-] {+gainquant.h.................................................120+} A.22 [-gainquant.c.................................................115-] {+gainquant.c.................................................120+} A.23 [-getCBvec.h..................................................117-] {+getCBvec.h..................................................122+} A.24 [-getCBvec.c..................................................117-] {+getCBvec.c..................................................123+} A.25 [-helpfun.h...................................................120-] {+helpfun.h...................................................126+} A.26 [-helpfun.c...................................................122-] {+helpfun.c...................................................128+} A.27 
[-hpInput.h...................................................128-] {+hpInput.h...................................................134+} A.28 [-hpInput.c...................................................128-] {+hpInput.c...................................................134+} A.29 [-hpOutput.h..................................................129-] {+hpOutput.h..................................................135+} A.30 [-hpOutput.c..................................................130-] {+hpOutput.c..................................................136+} A.31 [-iCBConstruct.h..............................................131-] {+iCBConstruct.h..............................................137+} A.32 [-iCBConstruct.c..............................................132 A.33 iCBSearch.h.................................................134 A.34 iCBSearch.c.................................................134-] {+iCBConstruct.c..............................................137+} Andersen et. al. Experimental - Expires [-March 20th,-] {+September 3rd,+} 2003 3 Internet Low Bit Rate Codec [-September 2002-] {+March 2003 A.33 iCBSearch.h.................................................139 A.34 iCBSearch.c.................................................140+} A.35 [-LPCdecode.h.................................................143-] {+LPCdecode.h.................................................149+} A.36 [-LPCdecode.c.................................................144-] {+LPCdecode.c.................................................150+} A.37 [-LPCencode.h.................................................146-] {+LPCencode.h.................................................152+} A.38 [-LPCencode.c.................................................147-] {+LPCencode.c.................................................153+} A.39 [-lsf.h.......................................................150-] {+lsf.h.......................................................157+} A.40 
[-lsf.c.......................................................151-] {+lsf.c.......................................................158+} A.41 [-packing.h...................................................156-] {+packing.h...................................................163+} A.42 [-packing.c...................................................157-] {+packing.c...................................................164+} A.43 [-StateConstructW.h...........................................160-] {+StateConstructW.h...........................................167+} A.44 [-StateConstructW.c...........................................161-] {+StateConstructW.c...........................................168+} A.45 [-StateSearchW.h..............................................162-] {+StateSearchW.h..............................................169+} A.46 [-StateSearchW.c..............................................163-] {+StateSearchW.c..............................................170+} A.47 [-syntFilter.h................................................166-] {+syntFilter.h................................................173+} A.48 [-syntFilter.c................................................167-] {+syntFilter.c................................................174+}

1. INTRODUCTION

This document contains the description of an algorithm for the coding of speech signals sampled at 8 kHz. The algorithm, called iLBC, [-has a bit rate of 13.33 kbit/s using-] {+uses+} a block-independent linear-predictive coding (LPC) [-algorithm. The-] {+algorithm and has support for two basic frame lengths: 20 ms at 15.2 kbit/s and 30 ms at 13.33 kbit/s. When the+} codec operates at block lengths of {+20 ms, it produces 303 bits per block which SHOULD be packetized in 38 bytes. 
Similarly, for block lengths of+} 30 ms [-and-] {+it+} produces 399 bits per block which SHOULD be packetized in 50 bytes. The {+two modes for the different frame sizes operate in a very similar way. When they differ it is explicitly said in the text, usually with the notation x/y, where x refers to the 20 ms mode and y refers to the 30 ms mode.

The+} described algorithm results in a speech coding system with a controlled response to packet losses similar to what is known from pulse code modulation (PCM) with packet loss concealment (PLC), such as the ITU-T G.711 standard [3] which operates at a fixed bit rate of 64 kbit/s. At the same time, the described algorithm enables fixed bit rate coding with a quality-versus-bit rate tradeoff close to state-of-the-art.

A suitable RTP payload format for this codec is specified in [1].

Some of the applications for which this coder is suitable are: real time communications such as telephony and videoconferencing, streaming audio, archival, and messaging.

This document is organized as follows. In Section 2 a brief outline of the codec is given. The specific encoder and decoder algorithms are explained in Sections 3 and 4, respectively. A C code reference implementation is provided in Appendix A.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [2].

2. OUTLINE OF THE CODEC

The codec consists of an encoder and a decoder described in Sections 2.1 and 2.2, respectively.

The essence of the codec is LPC and block based coding of the LPC residual signal. For each [-240-] {+160/240 (20ms/30ms)+} sample block, the following major steps are done: A set of LPC filters is computed and the speech signal is filtered through them to produce the residual signal. The codec uses scalar quantization of the dominant part, in terms of energy, of the residual signal for the block. 
The dominant state is of length [-58-] {+57/58 (20ms/30ms)+} samples and forms a start state for dynamic codebooks constructed from the already coded parts of the residual signal. These dynamic codebooks are used to code the remaining parts of the residual signal. By this method, coding independence between blocks is achieved, resulting in elimination of propagation of perceptual degradations due to packet loss. The method facilitates high-quality packet loss concealment (PLC).

2.1 Encoder

The input to the encoder should be 16 bit uniform PCM sampled at 8 kHz. It should be partitioned into blocks of [-BLOCKL=240 samples.-] {+BLOCKL=160/240 samples for the 20/30 ms frame size.+} Each block is divided into [-NSUB=6-] {+NSUB=4/6+} consecutive sub-blocks of SUBL=40 samples each. For [-each input block,-] {+30 ms frame size,+} the encoder performs two LPC_FILTERORDER=10 linear-predictive coding (LPC) analyses. The first analysis applies a smooth window centered over the 2nd sub-block and extending to the middle of the 5th sub-block. The second LPC analysis applies a smooth window centered over the 5th sub-block and extending to the end of the 6th sub-block. For [-both-] {+20 ms frame size one LPC_FILTERORDER=10 linear-predictive coding (LPC) analysis is performed with a smooth window centered over the 3rd and 4th subframe. For each of the+} LPC [-analyses, sets-] {+analyses, a set+} of line-spectral frequencies (LSFs) is obtained, quantized and interpolated to obtain LSF coefficients for each sub-block. Subsequently, the LPC residual is computed using the quantized and interpolated LPC analysis filters.

The two consecutive sub-blocks of residual exhibiting the maximal weighted energy are identified. 
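The search for the two consecutive sub-blocks of maximal energy can be sketched as follows. This is only an illustration: the helper names subblock_energy and find_start_pair are hypothetical, and the sketch uses plain (unweighted) energy because the weighting is not defined in this part of the text. The normative procedure is the one in Section 3.5.1 and Appendix A.

```c
#include <stddef.h>

#define SUBL 40  /* samples per sub-block, as given in the text */

/* Plain (unweighted) energy of one SUBL-sample sub-block.
   Assumption: the codec itself uses a weighted energy measure. */
static float subblock_energy(const float *x)
{
    float e = 0.0f;
    for (size_t n = 0; n < SUBL; n++)
        e += x[n] * x[n];
    return e;
}

/* Hypothetical helper: returns the 0-based index of the first of the
   two consecutive sub-blocks with the largest combined energy.
   nsub is 4 for 20 ms frames and 6 for 30 ms frames. */
static int find_start_pair(const float *residual, int nsub)
{
    int best = 0;
    float best_e = -1.0f;
    for (int i = 0; i + 1 < nsub; i++) {
        float e = subblock_energy(residual + (size_t)i * SUBL)
                + subblock_energy(residual + (size_t)(i + 1) * SUBL);
        if (e > best_e) {   /* ties keep the earliest pair */
            best_e = e;
            best = i;
        }
    }
    return best;
}
```

The returned index identifies the 80-sample region from which the start state is then chosen.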
Within these 2 sub-blocks, the start state (segment) is selected from two choices: the first [-58-] {+57/58+} samples or the last [-58-] {+57/58+} samples of the 2 consecutive sub-blocks. The selected segment is the one of higher energy. The start state is encoded with scalar quantization.

A dynamic codebook encoding procedure is used to encode 1) the [-22-] {+23/22 (20ms/30ms)+} remaining samples in the 2 sub-blocks containing the start state; 2) encoding of the sub-blocks after the start state in time; 3) encoding of the sub-blocks before the start state in time. Thus, the encoding target can be either the [-22-] {+23/22+} samples remaining of the 2 sub-blocks containing the start state or a 40 sample sub-block. This target can consist of samples that are indexed forwards in time or backwards in time depending on the location of the start state.

The coding is based on an adaptive codebook that is built from a codebook memory which contains decoded LPC excitation samples from the already encoded part of the block. These samples are indexed in the same time direction as the target vector and ending at the sample instant prior to the first sample instant represented in the target vector. The codebook is used in CB_NSTAGES=3 stages in a successive refinement approach and the resulting 3 code vector gains are encoded with 5, 4, and 3 bit scalar quantization, respectively.

The codebook search method employs noise shaping derived from the LPC filters and the main decision criterion is minimizing the squared error between the target vector and the code vectors. Each code vector in this codebook comes from one of CB_EXPAND=2 codebook sections. The first section is filled with delayed, already encoded residual vectors. 
The code vectors of the second codebook section are constructed by predefined linear combinations of vectors in the first section of the codebook.

Since codebook encoding with squared-error matching is known to produce a coded signal of less power than the scalar quantized start state signal, a gain re-scaling method is implemented by a refined search for a better set of codebook gains in terms of power matching after encoding. This is done by searching for a higher value of the gain factor for the first stage codebook since the subsequent stage codebook gains are scaled by the first stage gain.

2.2 Decoder

For packet communications, typically a jitter buffer placed at the receiving end decides whether the packet containing an encoded signal block has been received or lost. This logic is not part of the codec described here. For each received encoded signal block the decoder performs a decoding. For each lost signal block the decoder performs a PLC operation.

The decoding for each block starts by decoding and interpolating the LPC coefficients. Subsequently the start state is decoded.

For codebook encoded segments, each segment is decoded by constructing the 3 code vectors given by the received codebook indices in the same way as the code vectors were constructed in the encoder. The 3 gain factors are also decoded and the resulting decoded signal is given by the sum of the 3 codebook vectors, each scaled by its respective gain.

An enhancement algorithm is applied on the reconstructed excitation signal. This enhancement augments the periodicity of voiced speech regions. 
The enhancement is optimized under the constraint that the modification signal (defined as the difference between the enhanced excitation and the excitation signal prior to enhancement) has a short-time energy that does not exceed a preset fraction of the short-time energy of the excitation signal prior to enhancement.

A packet loss concealment (PLC) operation is easily embedded in the decoder. The PLC operation can, e.g., be based on repetition of LPC filters and obtaining the LPC residual signal using a long term prediction estimate from previous residual blocks.

3. ENCODER PRINCIPLES

The following block diagram is an overview of all the components of the iLBC encoding procedure. The description of the blocks contains references to the section where that particular procedure is described further.

             +-----------+    +---------+    +---------+
   speech -> | 1. Pre P  | -> | 2. LPC  | -> | 3. Ana  | ->
             +-----------+    +---------+    +---------+

             +---------------+   +--------------+
          -> | 4. Start Sel  | ->| 5. Scalar Qu | ->
             +---------------+   +--------------+

             +--------------+    +---------------+
          -> |6. CB Search  | -> | 7. Packetize  | -> payload
          |  +--------------+ |  +---------------+
          ----<---------<------
       sub frame [-0..4-] {+0..2/4 (20ms/30ms)+}

   Figure 3.1. Flow chart of the iLBC encoder

1. Pre process speech with a HP filter if needed (section 3.1)
2. Compute LPC parameters, quantize and interpolate (section 3.2)
3. Use analysis filters on speech to compute residual (section 3.3)
4. Select position of [-58-] {+57/58+} sample start state (section 3.5)
5. Quantize the [-58-] {+57/58+} sample start state with scalar quantization (section 3.5)
6. Search the codebook for each subframe. 
   Start with [-22-] {+23/22+} sample block, then encode sub blocks forward in time and then encode sub blocks backward in time. For each block the steps in figure 3.3 are performed (section 3.6)
7. Packetize the bits into the payload specified in table 3.2.

The input to the encoder should be 16 bit uniform PCM sampled at 8 kHz. Also it should be partitioned into blocks of [-BLOCKL=240-] {+BLOCKL=160/240+} samples. Each block input to the encoder is divided into [-NSUB=6-] {+NSUB=4/6+} consecutive sub-blocks of SUBL=40 samples each.

{+   0         39        79        119       159
   +---------------------------------------+
   |    1    |    2    |    3    |    4    |
   +---------------------------------------+
                  20 ms frame+}

   0         39        79        119       159       199       239
   +-----------------------------------------------------------+
   |    1    |    2    |    3    |    4    |    5    |    6    |
   +-----------------------------------------------------------+
                            {+30 ms frame+}

   Figure 3.2. One input block to the encoder {+for 20 ms (with 4 subframes)+} and [-its-] {+30 ms (with+} 6 [-sub blocks-] {+subframes).+}

3.1 Pre-processing

In some applications the recorded speech signal contains DC level and/or 50/60 Hz noise. If these components have not been removed prior to the encoder call, they should be removed by a high-pass filter. A reference implementation of this, using a filter with cutoff frequency 90 Hz, can be found in Appendix A.28.

3.2 LPC Analysis and Quantization

The input to the LPC analysis module is a possibly high-pass filtered speech buffer, speech_hp, that contains [-300-] {+220/300+} (LPC_LOOKBACK + BLOCKL = 60 + [-240-] {+160/240+} = [-300)-] {+220/300)+} speech samples, where samples 0 through 59 are from the previous block and samples 60 through [-299-] {+219/299+} are from the current block. No look-ahead into the next block is used. 
For the very first block processed, the look-back samples are assumed to be zeros.

For each input block, the LPC analysis calculates [-two sets-] {+one/two set(s)+} of LPC_FILTERORDER=10 LPC filter coefficients using the autocorrelation method and the Levinson-Durbin recursion. These coefficients are converted to the Line Spectrum Frequency representation. [-The-] {+In the 20 ms case the set, lsf, represents the spectral characteristics as measured at the center of the third subblock. For 30 ms frames the+} first set, lsf1, represents the spectral properties of the input signal at the center of the second subblock while the other set, lsf2, represents the spectral characteristics as measured at the center of the fifth subblock. The details of the computation {+for 30 ms frames+} are described in 3.2.1 through 3.2.6. {+Section 3.2.7 explains how the LPC Analysis and Quantization differs for 20 ms frames.+}

3.2.1 Computation of Autocorrelation Coefficients

The first step in the LPC analysis procedure is to calculate autocorrelation coefficients using windowed speech samples. This windowing is the only difference in the LPC analysis procedure for the two sets of coefficients. For the first set, a 240 sample long standard symmetric Hanning window is applied to samples 0 through 239 of the input data. 
The first window, lpc_winTbl, is defined as:

   lpc_winTbl[i] = 0.5 * (1.0 - cos((2*PI*(i+1))/(BLOCKL+1)));
      i=0,...,119
   lpc_winTbl[i] = lpc_winTbl[BLOCKL - i - 1]; i=120,...,239

The windowed speech speech_hp_win1 is then obtained by multiplying the 240 first samples of the input speech buffer with the window coefficients:

   speech_hp_win1[i] = speech_hp[i] * lpc_winTbl[i];
      i=0,...,BLOCKL-1

From these 240 windowed speech samples, 11 (LPC_FILTERORDER + 1) autocorrelation coefficients, acf1, are calculated:

   acf1[lag] += speech_hp_win1[n] * speech_hp_win1[n + lag];
      lag=0,...,LPC_FILTERORDER; n=0,...,BLOCKL-lag-1

In order to make the analysis more robust against numerical precision problems, a spectral smoothing procedure is applied by windowing the autocorrelation coefficients with a window before the LPC coefficients are computed. Also, a white noise floor is added to the autocorrelation function by multiplying coefficient zero by 1.0001 (40dB below the energy of the windowed speech signal). These two steps are implemented by multiplying the autocorrelation coefficients with the following window:

   lpc_lagwinTbl[0] = 1.0001;
   lpc_lagwinTbl[i] = exp(-0.5 * ((2 * PI * 60.0 * i) /FS)^2);
      i=1,...,LPC_FILTERORDER

where FS=8000 is the sampling frequency.

Then, the windowed acf function acf1_win is obtained by:

   acf1_win[i] = acf1[i] * lpc_lagwinTbl[i];
      i=0,...,LPC_FILTERORDER

The second set of autocorrelation coefficients, acf2_win, is obtained in a similar manner. The window, lpc_asymwinTbl, is applied to samples 60 through 299, i.e., the entire current block. The window consists of two segments: the first (samples 0 to 219) being half a Hanning window with length 440 and the second being a quarter of a cycle of a cosine wave. By using this asymmetric window, an LPC analysis centered in the fifth subblock is obtained without the need for any look-ahead, which would have added delay. 
The asymmetric window is defined as:

   lpc_asymwinTbl[i] = (sin(PI * (i + 1) / 441))^2; i=0,...,219
   lpc_asymwinTbl[i] = cos((i - 220) * PI / 40); i=220,...,239

and the windowed speech is computed by:

   speech_hp_win2[i] = speech_hp[i + LPC_LOOKBACK] * lpc_asymwinTbl[i];
      i=0,...,BLOCKL-1

The windowed autocorrelation coefficients are then obtained in exactly the same way as for the first analysis instance.

The generation of the windows lpc_winTbl, lpc_asymwinTbl, and lpc_lagwinTbl is typically done in advance and the arrays are stored in ROM rather than repeating the calculation for every block.

3.2.2 Computation of LPC Coefficients

From the 11 smoothed autocorrelation coefficients, acf1_win and acf2_win, the 2 x 11 LPC coefficients, lp1 and lp2, are calculated in the same way for both analysis locations using the well known Levinson-Durbin recursion. The first LPC coefficient is always 1.0, resulting in 10 unique coefficients.

After determining the LPC coefficients, a bandwidth expansion procedure is applied in order to smooth the spectral peaks in the short-term spectrum. The bandwidth addition is obtained by the following modification of the LPC coefficients:

   lp1_bw[i] = lp1[i] * chirp^i; i=0,...,LPC_FILTERORDER
   lp2_bw[i] = lp2[i] * chirp^i; i=0,...,LPC_FILTERORDER

where "chirp" is a real number between 0 and 1. It is RECOMMENDED to use a value of 0.9.

3.2.3 Computation of LSF Coefficients from LPC Coefficients

Thus far, two sets of LPC coefficients that represent the short-term spectral characteristics of the speech signal for two different time locations within the current block have been determined. These coefficients should be quantized and interpolated. 
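For reference, the two steps of Section 3.2.2 can be sketched as follows. This is a textbook Levinson-Durbin recursion followed by the chirp scaling, not the code of Appendix A. The names levinson_durbin and bw_expand are hypothetical, and the sign convention (A(z) = 1 + a[1]*z^-1 + ... + a[order]*z^-order) is an assumption of this sketch.

```c
/* Levinson-Durbin recursion: autocorrelation r[0..order] to predictor
   coefficients a[0..order] with a[0] = 1.0. Returns the final
   prediction error energy. Assumed convention:
   A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order. */
static double levinson_durbin(const double *r, double *a, int order)
{
    double err = r[0];
    a[0] = 1.0;
    for (int i = 1; i <= order; i++)
        a[i] = 0.0;
    for (int m = 1; m <= order; m++) {
        if (err <= 0.0)
            break;                      /* degenerate input, stop early */
        double acc = r[m];
        for (int j = 1; j < m; j++)
            acc += a[j] * r[m - j];
        double k = -acc / err;          /* reflection coefficient */
        /* update a[1..m-1] in place, working from both ends */
        for (int j = 1; j <= m / 2; j++) {
            double tmp = a[j] + k * a[m - j];
            a[m - j] += k * a[j];
            a[j] = tmp;
        }
        a[m] = k;
        err *= (1.0 - k * k);
    }
    return err;
}

/* Bandwidth expansion: lp_bw[i] = lp[i] * chirp^i, i = 0..order. */
static void bw_expand(const double *lp, double *lp_bw,
                      double chirp, int order)
{
    double c = 1.0;
    for (int i = 0; i <= order; i++) {
        lp_bw[i] = lp[i] * c;
        c *= chirp;
    }
}
```

With the recommended chirp of 0.9, the first coefficient stays at 1.0 and the tenth is scaled by 0.9^10, which is roughly 0.35.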
Before doing so, it is advantageous to convert the LPC parameters into another type of representation called Line Spectral Frequencies (LSF). The LSF parameters are used because they are better suited for quantization and interpolation than the regular LPC coefficients. Many computationally efficient methods for calculating the LSFs from the LPC coefficients have been proposed in the literature. The detailed implementation of one applicable method can be found in Appendix A.40. The two arrays of LSF coefficients obtained, lsf1 and lsf2, are of dimension 10 (LPC_FILTERORDER).

3.2.4 Quantization of LSF Coefficients

Since the LPC filters defined by the two sets of LSFs are also needed in the decoder, the LSF parameters need to be quantized and transmitted as side information. The total number of bits required to represent the quantization of the two LSF representations for one block of speech is 40, with 20 bits used for each of lsf1 and lsf2.

For computational and storage reasons, the LSF vectors are quantized using 3-split vector quantization (VQ). That is, the LSF vectors are split into three subvectors, each of which is quantized with a regular VQ. The quantized versions of lsf1 and lsf2, qlsf1 and qlsf2, are obtained by using the same memoryless split VQ. The length of each of these two LSF vectors is 10, and they are split into 3 sub-vectors containing 3, 3 and 4 values respectively.

For each of the sub-vectors, a separate codebook of quantized values has been designed using a standard VQ training method for a large database containing speech from a large number of speakers recorded under various conditions. The size of each of the three codebooks associated with the split definitions above is:

   int size_lsfCbTbl[LSF_NSPLIT] = {64,128,128};

The actual values of the vector quantization codebook that must be used can be found in the reference code of Appendix A.
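The 3-split VQ described above amounts to three independent nearest-neighbour searches under the squared error criterion. The following C sketch illustrates this under stated assumptions: the function name split_vq, the array dim_lsfCbTbl and the row-major codebook layout are hypothetical, and the codebook contents are supplied by the caller rather than taken from Appendix A.

```c
#define LPC_FILTERORDER 10
#define LSF_NSPLIT      3

/* Sub-vector lengths (assumed name) and codebook sizes from the text. */
static const int dim_lsfCbTbl[LSF_NSPLIT]  = {3, 3, 4};
static const int size_lsfCbTbl[LSF_NSPLIT] = {64, 128, 128};

/*
 * Quantize one LSF vector with a memoryless split VQ.
 *   cb[s] points to size_lsfCbTbl[s] rows of dim_lsfCbTbl[s] floats.
 *   index[s] receives the chosen row, qlsf the quantized vector.
 */
void split_vq(const float lsf[LPC_FILTERORDER],
              const float *const cb[LSF_NSPLIT],
              int index[LSF_NSPLIT],
              float qlsf[LPC_FILTERORDER])
{
    int pos = 0;                       /* first coefficient of the split */

    for (int s = 0; s < LSF_NSPLIT; s++) {
        int dim = dim_lsfCbTbl[s];
        int best = 0;
        float best_dist = -1.0f;

        for (int j = 0; j < size_lsfCbTbl[s]; j++) {
            float dist = 0.0f;
            for (int k = 0; k < dim; k++) {
                float d = lsf[pos + k] - cb[s][j * dim + k];
                dist += d * d;         /* squared error in the LSF domain */
            }
            if (best_dist < 0.0f || dist < best_dist) {
                best_dist = dist;
                best = j;
            }
        }

        index[s] = best;
        for (int k = 0; k < dim; k++)
            qlsf[pos + k] = cb[s][best * dim + k];
        pos += dim;
    }
}
```

Because the codebook sizes are 64, 128 and 128, the three indices fit in 6, 7 and 7 bits, which gives the 20 bits per LSF vector stated above.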
Both sets of LSF coefficients, lsf1 and lsf2, are quantized with a standard memoryless split vector quantization (VQ) structure using the squared error criterion in the LSF domain. The split VQ quantization consists of the following steps:

   1) Quantize the first 3 LSF coefficients (1 - 3) with a VQ codebook of size 64.
   2) Quantize the LSF coefficients 4, 5, and 6 with a VQ codebook of size 128.
   3) Quantize the last 4 LSF coefficients (7 - 10) with a VQ codebook of size 128.

This procedure, repeated for lsf1 and lsf2, gives 6 quantization indices and the quantized sets of LSF coefficients qlsf1 and qlsf2. Each set of three indices is encoded with 6 + 7 + 7 = 20 bits. The total number of bits used for LSF quantization in a block is thus 40 bits.

3.2.5 Stability Check of LSF Coefficients

The LSF representation of the LPC filter has the useful property that the coefficients are ordered by increasing value, i.e., lsf(n-1) < lsf(n), 0 < n < 10, if the corresponding synthesis filter is stable. Since a split VQ scheme is employed, it is possible that the LSF coefficients are not ordered correctly at the split boundaries, and hence that the corresponding LP filter is unstable. To ensure that the filter used is stable, a stability check is performed for the quantized LSF vectors. If it turns out that the coefficients are not ordered appropriately (with a safety margin of 50 Hz to ensure that formant peaks are not too narrow), they will be moved apart. The detailed method for this can be found in Appendix A.40. The same procedure is performed in the decoder. This ensures that exactly the same LSF representations are used in both encoder and decoder.

3.2.6 Interpolation of LSF Coefficients

From the two sets of LSF coefficients that are computed for each block of speech, different LSFs are obtained for each subblock by means of interpolation. This procedure is performed for the original LSFs, lsf1 and lsf2, as well as the quantized versions qlsf1 and qlsf2, since both versions are used in the encoder. Here follows a brief summary of the interpolation scheme; the details are found in the C code of Appendix A. In the first sub-block, the average of the second LSF vector from the previous block and the first LSF vector in the current block is used. For sub-blocks two through five, the LSFs used are obtained by linear interpolation from lsf1 (and qlsf1) to lsf2 (and qlsf2), with lsf1 used in subblock two and lsf2 in subblock five. In the last subblock, lsf2 is used. For the very first block it is assumed that the last LSF vector of the previous block is equal to a predefined vector, lsfmeanTbl, that was obtained by calculating the mean LSF vector of the LSF design database.

   lsfmeanTbl[LPC_FILTERORDER] = {0.281738, 0.445801, 0.663330,
                  0.962524, 1.251831, 1.533081, 1.850586, 2.137817,
                  2.481445, 2.777344}

The interpolation method is standard linear interpolation in the LSF domain. The interpolated LSF values are converted to LPC coefficients for each sub-block. The unquantized and quantized LPC coefficients form two sets of filters respectively.

   The unquantized analysis filter for subblock k:

                 ___
                 \
      Ak(z)= 1 + > aki*z^(-i)
                 /__
             i=1...LPC_FILTERORDER

   Quantized analysis filter for subblock k:

                 ___
                 \
      Ãk(z)= 1 + > ãki*z^(-i)
                 /__
             i=1...LPC_FILTERORDER

A reference implementation of the lsf encoding is given in Appendix A.38.
A reference implementation of the corresponding decoding can be found in Appendix A.36.

{+3.2.7 LPC Analysis and Quantization for 20 ms frames

As stated before, the codec only calculates one set of LPC coefficients for the 20 ms frame size as opposed to two sets for 30 ms frames. A single set of autocorrelation coefficients is calculated on the LPC_LOOKBACK + BLOCKL = 60 + 160 = 240 samples. These samples are windowed with the asymmetric window lpc_asymwinTbl, centered over the third subframe, to form speech_hp_win. Autocorrelation coefficients, acf, are calculated on the 240 samples in speech_hp_win and then windowed exactly as in 3.2.1 (resulting in acf_win).

This single set of windowed autocorrelation coefficients is used to calculate LPC coefficients, LSF coefficients and quantized LSF coefficients in exactly the same manner as in 3.2.2 to 3.2.4. As for the 30 ms frame size, the 10 LSF coefficients are divided into three sub-vectors of size 3, 3, 4 and quantized using the same scheme and codebook as in 3.2.4 to finally get 3 quantization indices. The quantized LSF coefficients are stabilized with the algorithm described in 3.2.5.

From the set of LSF coefficients that was computed for this block together with the LSF coefficients from the previous block, different LSFs are obtained for each subblock by means of interpolation. The interpolation is done linearly in the LSF domain over the 4 sub-blocks, so that the n'th subframe uses the weight (4-n)/4 for the LSF from the previous frame and the weight n/4 for the LSF from the current frame. For the very first block the mean LSF, lsfmeanTbl, is used as the LSF from the previous block.
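The 20 ms interpolation rule stated above can be sketched in C as follows. This is only an illustration of the weights (4-n)/4 and n/4; the function name lsf_interp_20ms and the constant NSUB_20MS are hypothetical and do not come from the reference code.

```c
#define LPC_FILTERORDER 10
#define NSUB_20MS       4    /* sub-blocks in a 20 ms frame (assumed name) */

/*
 * Linear interpolation in the LSF domain for 20 ms frames.
 *   lsf_old: LSF vector of the previous block (lsfmeanTbl for the first)
 *   lsf_cur: LSF vector of the current block
 *   lsf_sub: one interpolated LSF vector per sub-block, n = 1..4
 */
void lsf_interp_20ms(const float lsf_old[LPC_FILTERORDER],
                     const float lsf_cur[LPC_FILTERORDER],
                     float lsf_sub[NSUB_20MS][LPC_FILTERORDER])
{
    for (int n = 1; n <= NSUB_20MS; n++) {
        float w_cur = (float)n / NSUB_20MS;   /* n/4     */
        float w_old = 1.0f - w_cur;           /* (4-n)/4 */

        for (int i = 0; i < LPC_FILTERORDER; i++)
            lsf_sub[n - 1][i] = w_old * lsf_old[i] + w_cur * lsf_cur[i];
    }
}
```

The same routine would be called once with the unquantized LSFs and once with the quantized LSFs, since both filter sets are needed in the encoder.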
Similar to 3.2.6, both unquantized, A(z), and quantized, Ã(z), analysis filters are calculated for each of the four sub-blocks.+}

3.3 Calculation of the Residual

The block of speech samples is filtered by the quantized and interpolated LPC analysis filters to yield the residual signal. In particular, the corresponding LPC analysis filter for each 40 sample subblock is used to filter the speech samples for the same subblock. The filter memory at the end of each subblock is carried over to the LPC filter of the next subblock. The signal at the output of each LP analysis filter constitutes the residual signal for the corresponding subblock.

A reference implementation of the LPC analysis filters is found in Appendix A.10.

3.4 Perceptual Weighting Filter

In principle any good design of perceptual weighting filter can be applied in the encoder without compromising this codec definition. It is however RECOMMENDED to use the perceptual weighting filter specified below:

   Weighting filter for subblock k:

      Wk(z)=1/Ak(z/LPC_CHIRP_WEIGHTDENUM), where
                             LPC_CHIRP_WEIGHTDENUM = 0.4222

This is a simple design with low complexity that is applied in the LPC residual domain. Here Ak(z) is the filter obtained from unquantized but interpolated LSF coefficients.

3.5 Start State Encoder

The start state containing {+STATE_SHORT_LEN=57 for 20 ms frames and+} STATE_SHORT_LEN=58 maximum energy residual samples is quantized using a common 6-bit scale quantizer for the block and a 3-bit scalar quantizer operating on the scaled samples in the weighted speech domain. The state encoding is described in greater detail below.

3.5.1 Start State Estimation

The two sub-blocks containing the start state are determined by finding the two consecutive sub-blocks in the block having the highest power.
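As a simplified illustration of this selection, the C sketch below picks the pair of consecutive sub-blocks with the highest residual power. It deliberately omits any weighting of the samples or of the sub-block position, so it is an assumption-laden approximation rather than the measure used by the codec; the function name find_start_subblocks is hypothetical.

```c
#define SUBL 40   /* samples per sub-block */

/*
 * Return nsub in 1..NSUB-1 such that sub-blocks nsub and nsub+1
 * (1-based) together hold the highest residual power.
 *   residual: NSUB*SUBL residual samples
 *   num_sub:  NSUB, 4 for 20 ms frames and 6 for 30 ms frames
 */
int find_start_subblocks(const float *residual, int num_sub)
{
    int best = 1;
    double best_energy = -1.0;

    for (int nsub = 1; nsub < num_sub; nsub++) {
        double energy = 0.0;

        /* Sum the squares over the two consecutive sub-blocks. */
        for (int i = (nsub - 1) * SUBL; i < (nsub + 1) * SUBL; i++)
            energy += (double)residual[i] * residual[i];

        if (energy > best_energy) {   /* first maximum wins on ties */
            best_energy = energy;
            best = nsub;
        }
    }
    return best;
}
```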
Advantageously, down-weighting is used in the beginning and end of the sub-frames. I.e., the following measure is [-computed:-] {+computed (NSUB=4/6 for 20/30 ms frame size):+} nsub=1,...,NSUB-1 ssqn[nsub] = 0.0; for (i=(nsub-1)*SUBL; i<(nsub-1)*SUBL+5; i++) ssqn[nsub] += sampEn_win[i-(nsub-1)*SUBL]* residual[i]*residual[i]; for (i=(nsub-1)*SUBL+5; i [-éki*z^(i-(LPC_FILTERORDER-1))-] {+‚ki*z^(i-(LPC_FILTERORDER-1))+} /__ i=0...(LPC_FILTERORDER-1) and [-Èk(z)-] {+Ãk(z)+} is taken from the block where the start state begins in -> Pk(z) -> filtered out(k) = filtered(k) + filtered(k+STATE_SHORT_LEN), k=0..(STATE_SHORT_LEN-1) The all pass filtered block is searched for its largest magnitude sample. The 10-logarithm of this magnitude is quantized with a 6-bit quantizer, state_frgqTbl, by finding the nearest representation. This results in an index, idxForMax, corresponding to a quantized value, qmax. The all-pass filtered residual samples in the block are then multiplied with a scaling factor scal=4.5/(10^qmax) to yield normalized samples. {+Andersen et. al. Experimental - Expires September 3rd, 2003 16 Internet Low Bit Rate Codec March 2003+} state_frgqTbl[64] = {1.000085, 1.071695, 1.140395, 1.206868, 1.277188, 1.351503, 1.429380, 1.500727, 1.569049, 1.639599, 1.707071, 1.781531, 1.840799, 1.901550, 1.956695, 2.006750, 2.055474, 2.102787, 2.142819, 2.183592, 2.217962, 2.257177, 2.295739, 2.332967, 2.369248, 2.402792, 2.435080, 2.468598, 2.503394, 2.539284, 2.572944, 2.605036, 2.636331, 2.668939, 2.698780, 2.729101, 2.759786, 2.789834, 2.818679, 2.848074, 2.877470, 2.906899, 2.936655, 2.967804, 3.000115, 3.033367, 3.066355, 3.104231, 3.141499, 3.183012, 3.222952, 3.265433, 3.308441, 3.350823, 3.395275, 3.442793, 3.490801, 3.542514, 3.604064, 3.666050, 3.740994, 3.830749, 3.938770, 4.101764} 3.5.3 Scalar Quantization The normalized samples are quantized in the perceptually weighted [-Andersen et. al. 
Experimental - Expires March 20th, 2003 15 Internet Low Bit Rate Codec September 2002-] speech domain by a sample-by-sample scalar quantization. Each sample in[n] in the block is filtered by a weighting filter to form a weighted speech sample weightin[n]. The target sample target[n] is formed by subtracting a zero-input response sample of the weighting filter from weightin[n]. The coded state sample out[n] is obtained by quantizing target[n] with a 3-bit quantizer with quantization table state_sq3Tbl. state_sq3Tbl[8] = {-3.719849, -2.177490, -1.130005, -0.309692, 0.444214, 1.329712, 2.436279, 3.983887} The state of the weighting filter is then updated by filtering coded sample out[n]. The quantized samples are transformed back to the residual domain by 1) scaling with 1/scal 2) time-reversing the scaled samples 3) filtering the time-reversed samples by the same all-pass filter as in section 3.5.2, using circular convolution 4) time-reversing the filtered samples. (More detailed in section 4.2) A reference implementation of the start state encoding can be found in Appendix A.46. 3.6 Encoding the remaining samples A dynamic codebook is used to encode 1) the [-22-] {+23/22+} remaining samples in the 2 sub-blocks containing the start state; 2) encoding of the [-sub- blocks-] {+sub-blocks+} after the start state in time; 3) encoding of the [-sub-blocks-] {+sub- blocks+} before the start state in time. Thus, the encoding target can be either the [-22-] {+23/22+} samples remaining of the 2 sub-blocks containing the start state or a 40 sample sub-block. This target can consist of samples that are indexed forwards in time or backwards in time depending on the location of the start state. The length of the target is denoted by lTarget. The coding is based on an adaptive codebook that is built from a codebook memory which contains decoded LPC excitation samples from the already encoded part of the block. 
These samples are indexed in the same time direction as the target vector and end at the sample instant prior to the first sample instant represented in the target vector. The codebook memory has length lMem which is equal to CB_MEML=147 for the [-four-] {+two/four+} 40 sample sub-blocks and 85 for the [-22-] {+23/22+} sample sub-block.

The following figure shows an overview of the encoding procedure.

      +------------+    +---------------+    +-------------+
   -> | 1. Decode  | -> | 2. Mem setup  | -> | 3. Perc. W. | ->
      +------------+    +---------------+    +-------------+

      +------------+    +-----------------+
   -> | 4. Search  | -> | 5. Upd. Target  | ------------------>
    | +------------+    +------------------ |
    ----<-------------<-----------<----------
                      stage=0..2

      +----------------+
   -> | 6. Recalc G[0] | ---------------> gains and CB indices
      +----------------+

   Figure 3.3. Flow chart of the codebook search in the iLBC encoder

   1. Decode the part of the residual that has been encoded so far, using the codebook without perceptual weighting.

   2. Set up the memory by taking data from the decoded residual. This memory is used to construct the codebooks. For blocks preceding the start state, both the decoded residual and the target are time reversed (section 3.6.1).

   3. Filter the memory + target with the perceptual weighting filter (section 3.6.2).

   4. Search for the best match between the target and the codebook vector. Compute the optimal gain for this match and quantize that gain (section 3.6.4).

   5. Update the perceptually weighted target by subtracting the contribution from the selected codebook vector from the perceptually weighted memory (quantized gain times selected vector). Repeat 4. and 5. for the 2 additional stages.

   6. Calculate the energy loss due to encoding of the residual. If needed, compensate for this loss by an upscaling and requantization of the gain for the first stage (section 3.7).

The following sections provide an in-depth description of the different blocks of figure 3.3.

3.6.1 Codebook Memory

The codebook memory is based on the already encoded sub-blocks, so the available data for encoding increases for each new sub-block that has been encoded. Until enough sub-blocks have been encoded to fill the codebook memory with data, it is padded with zeros. The following figure shows an example of the order in which the sub-blocks are encoded {+for the 30 ms frame size+} if the start state is located in the last 58 samples of sub-blocks 2 and 3.

   +-----------------------------------------------------+
   |  5     | 1  |///|////////|    2   |    3   |    4   |
   +-----------------------------------------------------+

   Figure 3.4. The order from 1 to 5 in which the sub-blocks are encoded. The slashed area is the start state.

The first target sub-block to be encoded is number 1, and the corresponding codebook memory is shown in the following figure. Since the target vector is before the start state in time, the codebook memory and target vector are time reversed. By reversing them in time, the search algorithm can be reused. Since only the start state has been encoded so far, the last samples of the codebook memory are padded with zeros.

   +-------------------------
   |zeros|\\\\\\\\|\\\\|  1 |
   +-------------------------

   Figure 3.5. The codebook memory, length lMem=85 samples, and the target vector 1, length 22 samples.

The next step is to encode sub-block 2 using the memory, which has now increased since sub-block 1 has been encoded.
The following figure shows the codebook memory for encoding of sub-block 2.

   +-----------------------------------
   | zeros | 1  |///|////////|    2   |
   +-----------------------------------

   Figure 3.6. The codebook memory, length lMem=147 samples, and the target vector 2, length 40 samples.

The next step is to encode sub-block 3 using the memory, which has increased yet again since sub-blocks 1 and 2 have been encoded, but which still has to be padded with a few zeros. The following figure shows the codebook memory for encoding of sub-block 3.

   +------------------------------------------
   |zeros| 1  |///|////////|    2   |   3    |
   +------------------------------------------

   Figure 3.7. The codebook memory, length lMem=147 samples, and the target vector 3, length 40 samples.

The next step is to encode sub-block 4 using the memory, which has increased yet again since sub-blocks 1, 2 and 3 have been encoded. This time the memory does not have to be padded with zeros. The following figure shows the codebook memory for encoding of sub-block 4.

   +------------------------------------------
   |1|///|////////|    2   |   3    |    4   |
   +------------------------------------------

   Figure 3.8. The codebook memory, length lMem=147 samples, and the target vector 4, length 40 samples.

The final target sub-block to be encoded is number 5, and the corresponding codebook memory is shown in the following figure. Since the target vector is before the start state in time, the codebook memory and target vector are time reversed.

   +-------------------------------------------
   |  3  |   2    |\\\\\\\\|\\\\|  1 |   5    |
   +-------------------------------------------

   Figure 3.9. The codebook memory, length lMem=147 samples, and the target vector 5, length 40 samples.

{+For the case of 20 ms frames the encoding procedure looks almost exactly the same.
The only difference is that the size of the start state is 57 samples and that there are only 3 sub-blocks to be encoded. The encoding order is the same as above, starting with the 23 sample target and then encoding the two remaining 40 sample sub-blocks, first going forward in time and then going backwards in time relative to the start state.+}

3.6.2 Perceptual Weighting of Codebook Memory and Target

To provide a perceptual weighting of the coding error, a concatenation of the codebook memory and the target to be coded is all-pole filtered with the perceptual weighting filter specified in section 3.4. The filter state of the weighting filter is set to zero.

   in(0..(lMem-1))            = unweighted codebook memory
   in(lMem..(lMem+lTarget-1)) = unweighted target signal

   in -> Wk(z) -> filtered,
       where Wk(z) is taken from the subblock of the target

   weighted codebook memory = filtered(0..(lMem-1))
   weighted target signal   = filtered(lMem..(lMem+lTarget-1))

The codebook search is done using the weighted codebook memory and the weighted target, while the decoding and the codebook memory update use the unweighted codebook memory.

3.6.3 Codebook Creation

The codebook for the search is created from the perceptually weighted codebook memory. It consists of two sections, where the first is referred to as the base codebook and the second as the expanded codebook, since it is created by linear combinations of the first. Each of these two sections also has a subsection referred to as the augmented codebook. The augmented codebook is only created and used for the coding of the 40 sample sub-blocks and not for the [-22-] {+23/22+} sample sub-block case. The codebook sizes used for the different sub-blocks and different stages are summarized in the table below. The table also shows, in parentheses, how the number of codebook vectors is divided within the two sections between base/expanded codebook and augmented codebook.

                                 Stage
                        1                  2 & 3
           --------------------------------------------
                22      128  (64+0)*2      128 (64+0)*2
   Sub-    1:st 40      256  (108+20)*2    128 (44+20)*2
   Blocks  2:nd 40      256  (108+20)*2    256 (108+20)*2
           3:rd 40      256  (108+20)*2    256 (108+20)*2
           4:th 40      256  (108+20)*2    256 (108+20)*2

   Table 3.1. The table shows the codebook size for the different sub-blocks and [-stages.-] {+stages for 30 ms frames.+} Inside the parentheses it shows how the number of codebook vectors is distributed, within the two sections, between the base/expanded codebook and the augmented base/expanded codebook. It should be interpreted in the following way: (base/expanded cb + augmented base/expanded cb).

The total number of codebook vectors for a specific sub-block and stage is given by the following formula:

   Tot. cb vectors = base cb + aug. base cb + exp. cb + aug. exp. cb

{+The corresponding values to table 3.1 for 20 ms frames are only slightly modified. The short sub-block is 23 instead of 22 samples and the 3:rd and 4:th sub-frames are not present.+}

3.6.3.1 Creation of a Base Codebook

The base codebook is given by the perceptually weighted codebook memory that is described in section 3.6.2. The different codebook vectors are given by sliding a window of length [-22-] {+23/22+} or 40, given by variable lTarget, over the lMem long perceptually weighted codebook memory. The indices are ordered so that the codebook vector containing sample (lMem-lTarget-n) to (lMem-n) of the codebook memory vector has index n. Thus the total number of base codebook vectors is lMem-lTarget+1, and the indices are ordered from sample delay lTarget [-(22-] {+(23/22+} or 40) to lMem+1 (86 or 148).

3.6.3.2 Codebook Expansion

The base codebook is expanded a factor 2 by creating an additional section in the codebook.
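The base codebook indexing of section 3.6.3.1 can be sketched in C as follows. The sketch assumes that the range (lMem-lTarget-n) to (lMem-n) is half-open, so that each vector holds exactly lTarget samples; the function names base_cb_size and base_cb_vector are hypothetical.

```c
#include <string.h>

/* Number of base codebook vectors for a given memory and target length. */
int base_cb_size(int lMem, int lTarget)
{
    return lMem - lTarget + 1;
}

/*
 * Copy base codebook vector n out of the weighted codebook memory.
 * Index 0 is the most recent lTarget samples; each higher index slides
 * the window one sample further back in the memory.
 * Returns 0 on success, -1 if n is outside the base codebook.
 */
int base_cb_vector(const float *mem, int lMem, int lTarget,
                   int n, float *cbvec)
{
    if (n < 0 || n >= base_cb_size(lMem, lTarget))
        return -1;

    memcpy(cbvec, mem + (lMem - lTarget - n), (size_t)lTarget * sizeof(float));
    return 0;
}
```

For lMem=147 and lTarget=40 this gives 108 vectors, and for lMem=85 and lTarget=22 it gives 64, which agrees with the base codebook entries of table 3.1.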
This new section is obtained by filtering the buffer buf above with a FIR filter with filter length CB_FILTERLEN=8.

   cbfiltersTbl[CB_FILTERLEN]={-0.033691, 0.083740, -0.144043,
                  0.713379, 0.806152, -0.184326,
                  0.108887, -0.034180};

Before filtering, the codebook buffer is padded with three zeros at the beginning to compensate for the filter delay. The buffer is also padded with four zeros at the end to achieve a filtered output of the same size as the input. The individual codebook vectors of the new filtered codebook and their indices are obtained in the same fashion as described above for the base codebook.

3.6.3.3 Codebook Augmentation

For the cases when encoding entire sub-blocks, i.e. cbveclen=40, the base and expanded codebooks are augmented to increase codebook richness. The codebooks are augmented by vectors produced by interpolation of segments. The base and expanded codebooks, constructed above, consist of vectors corresponding to sample delays in the range from cbveclen to lMem. The codebook augmentation attempts to augment these codebooks with vectors corresponding to sample delays from 20 to 39. However, not all of these samples are present in the base codebook and expanded codebook respectively. Therefore, the augmentation vectors are constructed as linear combinations between samples corresponding to sample delays in the range 20 to 39. The general idea of this procedure is presented in the following figures and text. The procedure is performed for both the base codebook and the expanded codebook.
       - - ------------------------|
    codebook memory                |
       - - ------------------------|
                  |-5-|---15---|-5-|
                  pi  pp       po

                                   |        |                    Codebook vector
                                   |---15---|-5-|-----20-----| <- corresponding to
                                       i     ii      iii          sample delay 20

   Figure 1

The figure shows the codebook memory with pointers pi, pp and po where pi points to sample 25, pp to sample 20 and po to sample 5. Below the codebook memory, the augmented codebook vector corresponding to sample delay 20 is drawn. Segment i consists of 15 samples from pointer pp and forward in time. Segment ii consists of 5 interpolated samples from pi and forward and from po and forward. The samples are linearly interpolated with weights [0.0, 0.2, 0.4, 0.6, 0.8] for pi and weights [1.0, 0.8, 0.6, 0.4, 0.2] for po. Segment iii consists of 20 samples from pp and forward. The augmented codebook vector corresponding to sample delay 21 is produced by moving pointers pp and pi one sample backwards in time. That gives us the following figure.

       - - ------------------------|
    codebook memory                |
       - - ------------------------|
                 |-5-|---16---|-5-|
                 pi  pp       po

                                   |        |                    Codebook vector
                                   |---16---|-5-|-----19-----| <- corresponding to
                                       i     ii      iii          sample delay 21

   Figure 3.10.

The figure shows the codebook memory with pointers pi, pp and po where pi points to sample 26, pp to sample 21 and po to sample 5. Below the codebook memory, the augmented codebook vector corresponding to sample delay 21 is drawn. Segment i now consists of 16 samples from pp and forward. Segment ii consists of 5 interpolated samples from pi and forward and po and forward, and the interpolation weights are the same throughout the procedure. Segment iii consists of 19 samples from pp and forward.
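The construction of one augmented vector, as described for sample delays 20 and 21 above, can be sketched in C as follows. The sketch assumes that "sample k" counts backwards from the end of the codebook memory, so that sample 5 is mem[lMem-5]; the function name augmented_cb_vector is hypothetical.

```c
#define SUBL 40   /* length of an augmented codebook vector */

/*
 * Build the augmented codebook vector for one sample delay (20..39).
 *   mem, lMem: codebook memory (base or expanded), lMem >= delay + 5
 *   cbvec:     output vector of SUBL samples
 */
void augmented_cb_vector(const float *mem, int lMem, int delay,
                         float cbvec[SUBL])
{
    const float *pp = mem + lMem - delay;   /* "sample delay" */
    const float *pi = pp - 5;               /* "sample delay+5" */
    const float *po = mem + lMem - 5;       /* "sample 5" */

    /* Segment i: delay-5 samples from pp and forward. */
    for (int j = 0; j < delay - 5; j++)
        cbvec[j] = pp[j];

    /* Segment ii: 5 samples interpolated between po and pi. */
    for (int k = 0; k < 5; k++) {
        float w_pi = 0.2f * k;              /* 0.0, 0.2, ..., 0.8 */
        float w_po = 1.0f - w_pi;           /* 1.0, 0.8, ..., 0.2 */
        cbvec[delay - 5 + k] = w_pi * pi[k] + w_po * po[k];
    }

    /* Segment iii: SUBL-delay samples from pp and forward. */
    for (int j = 0; j < SUBL - delay; j++)
        cbvec[delay + j] = pp[j];
}
```

Calling the function for delay=20,...,39 yields the 20 additional vectors of one codebook section.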
The same procedure of moving the two pointers is continued until the last augmented vector corresponding to sample delay 39 has been created. This gives a total of 20 new codebook vectors to each of the two sections. Thus the total number of codebook vectors for each of the two sections, when including the augmented codebook, becomes lMem-SUBL+1+SUBL/2. This is provided that augmentation is invoked, i.e., that lTarget=SUBL.

3.6.4 Codebook Search

The codebook search uses the codebooks described in the sections above to find the best match of the perceptually weighted target, see section 3.6.2. The search method is a multi-stage gain-shape matching performed as follows. At each stage the best shape vector is identified, then the gain is calculated and quantized, and finally the target is updated in preparation for the next codebook search stage. The number of stages is CB_NSTAGES=3.

If the target is the [-22-] {+23/22+} sample vector, the codebooks are searched in the order: base codebook followed by the expanded codebook. If the target is 40 samples the order is: base codebook, augmented base codebook, expanded codebook and finally augmented expanded codebook. The size of each codebook section and its corresponding augmented section is given by table 3.1 in section 3.6.3.

For example, when coding the second 40 sample sub-block, indices 0-107 correspond to the base codebook, 108-127 correspond to the augmented base codebook, 128-235 correspond to the expanded codebook and finally indices 236-255 correspond to the augmented expanded codebook. The indices are divided in the same fashion for all stages in the example. Only in the case of coding the first 40 sample sub-block is there a difference between stages.

3.6.4.1 The Codebook Search at Each Stage

The codebooks are searched to find the best match to the target at each stage.
When the best match is found, the target is updated and the next-stage search is started. The three chosen codebook vectors and their corresponding gains constitute the encoded sub-block. The best match is decided by the following three criteria:

   1. Compute the measure

         (target*cbvec)^2 / ||cbvec||^2

      for all codebook vectors, cbvec, and choose the codebook vector maximizing the measure. The expression (target*cbvec) is the dot product between the target vector to be coded and the codebook vector for which we compute the measure.

   2. The absolute value of the gain corresponding to the chosen codebook vector, cbvec, must be smaller than a fixed limit, CB_MAXGAIN=1.3:

         |gain| < CB_MAXGAIN

      where the gain is computed in the following way:

         gain = (target*cbvec) / ||cbvec||^2

   3. For the first stage, the dot product of the chosen codebook vector and target must be positive:

         target*cbvec > 0

In practice the above criteria are used in a sequential search through all codebook vectors. The best match is found by registering a new max measure and index whenever the previously registered max measure is surpassed and all other criteria are fulfilled.

3.6.4.2 The Gain Quantization at Each Stage

The gain follows as a result of the computation:

   gain = (target*cbvec) / ||cbvec||^2

for the optimal codebook vector that was found by the procedure from section 3.6.4.1.

The three stages quantize the gain using 5, 4 and 3 bits respectively. In the first stage, the gain is limited to positive values. This gain is quantized by finding the nearest value in the quantization table gain_sq5Tbl.

   gain_sq5Tbl[32]={0.037476, 0.075012, 0.112488, 0.150024, 0.187500,
                  0.224976, 0.262512, 0.299988, 0.337524, 0.375000,
                  0.412476, 0.450012, 0.487488, 0.525024, 0.562500,
                  0.599976, 0.637512, 0.674988, 0.712524, 0.750000,
                  0.787476, 0.825012, 0.862488, 0.900024, 0.937500,
                  0.974976, 1.012512, 1.049988, 1.087524, 1.125000,
                  1.162476, 1.200012}

The gains of the subsequent two stages can be either positive or negative. The gains are quantized using a quantization table times a scale factor. The second stage uses the table gain_sq4Tbl and the third stage uses gain_sq3Tbl. The scale factor equals 0.1 or the absolute value of the quantized gain representation value obtained in the previous stage, whichever is the larger. Again, the resulting gain index is the index to the nearest value of the quantization table times the scale factor.

   gainQ = scaleFact * gain_sqXTbl[index]

   gain_sq4Tbl[16]={-1.049988, -0.900024, -0.750000, -0.599976,
                  -0.450012, -0.299988, -0.150024, 0.000000, 0.150024,
                  0.299988, 0.450012, 0.599976, 0.750000, 0.900024,
                  1.049988, 1.200012}

   gain_sq3Tbl[8]={-1.000000, -0.659973, -0.330017, 0.000000,
                  0.250000, 0.500000, 0.750000, 1.00000}

3.6.4.3 Preparation of Target for Next Stage

Before performing the search for the next stage, the target vector is updated by subtracting from it the selected codebook vector (from the perceptually weighted codebook) times the corresponding quantized gain.

   target[i] = target[i] - gainQ * selected_vec[i];

A reference implementation of the codebook encoding is found in Appendix A.34.

3.7 Gain Correction Encoding

The start state is quantized in a relatively model independent manner using 3 bits per sample. In contrast to this, the remaining parts of the block are encoded using an adaptive codebook.
This codebook will produce high matching accuracy whenever there is a
high correlation between the target and the best codebook vector. For
unvoiced speech segments and background noises, this is not
necessarily so, which, due to the nature of the squared error
criterion, results in a coded signal with less power than the target
signal. As the coded start state has good power matching to the
target, the result is power fluctuation within the encoded frame.
Perceptually, the main problem with this is that the time envelope of
the signal energy becomes unsteady. To overcome this problem, the
gains for the codebooks are re-scaled after the codebook encoding by
searching for a new gain factor for the first stage codebook that
provides better power matching.

First the energy for the target signal, tene, is computed along with
the energy for the coded signal, cene, given by the addition of the 3
gain scaled codebook vectors. Since the gains of the 2nd and 3rd
stage scale with the gain of the first stage, by changing the first
stage gain from gains[0] to gain_sq5Tbl[i], the energy of the coded
signal changes from cene to

   cene*(gain_sq5Tbl[i]*gain_sq5Tbl[i])/(gains[0]*gains[0])

where gains[0] is the gain for the first stage found in the original
codebook search. A refined search is performed by testing the gain
indices i=0 to 31, and as long as the new codebook energy as given
above is less than tene, the gain index for stage 1 is increased. A
restriction is applied so that the new gain value for stage 1 cannot
be more than 2 times higher than the original value found in the
codebook search. Note that by using this method the shape of the
encoded vector is not changed, only the gain or amplitude.
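
The refined search described in section 3.7 can be sketched as
follows. This is a minimal illustration and not the reference
implementation: the function name and argument layout are
hypothetical, the search is assumed to start at the index found in
the original codebook search (the text only says the index "is
increased"), and gains[0] is assumed to be the quantized, positive
first-stage gain.

```python
# Quantization table for the first-stage gain, copied from section 3.6.3.2.
GAIN_SQ5_TBL = [
    0.037476, 0.075012, 0.112488, 0.150024, 0.187500, 0.224976,
    0.262512, 0.299988, 0.337524, 0.375000, 0.412476, 0.450012,
    0.487488, 0.525024, 0.562500, 0.599976, 0.637512, 0.674988,
    0.712524, 0.750000, 0.787476, 0.825012, 0.862488, 0.900024,
    0.937500, 0.974976, 1.012512, 1.049988, 1.087524, 1.125000,
    1.162476, 1.200012,
]


def refine_first_stage_gain(tene, cene, start_index):
    """Return a corrected stage-1 gain index (hypothetical helper).

    tene        -- energy of the target signal
    cene        -- energy of the coded signal from the original search
    start_index -- stage-1 gain index from the original search (assumed
                   starting point of the refined search)
    """
    gain0 = GAIN_SQ5_TBL[start_index]  # gains[0] in the text
    best = start_index
    for i in range(start_index, len(GAIN_SQ5_TBL)):
        candidate = GAIN_SQ5_TBL[i]
        # Energy of the coded signal if stage 1 used gain_sq5Tbl[i].
        scaled = cene * (candidate * candidate) / (gain0 * gain0)
        # Raise the index while the coded energy stays below the target
        # energy and the gain is at most twice the original gain.
        if scaled < tene and candidate <= 2.0 * gain0:
            best = i
    return best
```

Because only the first-stage index changes, the shape of the coded
vector is untouched; the sketch simply picks the largest admissible
table entry.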
3.8 Bitstream Definition

The total number of bits used to describe one [-block-] {+frame of
20 ms speech is 303, which fits in 38 bytes and results in a bit rate
of 15.20 kbit/s. For the case with a frame length+} of 30 ms speech
{+the total number of bits used+} is 399, which fits in 50 bytes and
results in a bit rate of 13.33 kbit/s. In the bitstream definition
the bits are distributed into three classes according to their bit
error or loss sensitivity. The most sensitive bits (class 1) are
placed first in the bitstream for each frame. The less sensitive bits
(class 2) are placed after the class 1 bits. The least sensitive bits
(class 3) are placed at the end of the bitstream for each frame.

{+Looking at the 20/30 ms frame length cases for each class:+} The
class 1 bits occupy a total of [-8-] {+6/8+} bytes [-(64-] {+(48/64+}
bits), the class 2 bits occupy [-12-] {+8/12+} bytes [-(96-]
{+(64/96+} bits), and the class 3 bits occupy [-30-] {+24/30+} bytes
[-(239-] {+(191/239+} bits). This distribution of the bits enables
the use of uneven level protection (ULP) as is exploited in the
payload format definition for iLBC [1]. The detailed bit allocation
is shown in the table below. When a quantization index is distributed
between more classes the more significant bits belong to the lowest
class.

Bitstream structure:

------------------------------------------------------------------+
Parameter                         |       Bits Class <1,2,3>      |
                                  |  20 ms frame  |  30 ms frame  |
----------------------------------+---------------+---------------+
                         Split 1  |   6 <6,0,0>   |   6 <6,0,0>   |
                LSF 1    Split 2  |   7 <7,0,0>   |   7 <7,0,0>   |
LSF                      Split 3  |   7 <7,0,0>   |   7 <7,0,0>   |
                ------------------+---------------+---------------+
                         Split 1  | NA (Not Appl.)|   6 <6,0,0>   |
                LSF 2    Split 2  |      NA       |   7 <7,0,0>   |
                         Split 3  |      NA       |   7 <7,0,0>   |
                ------------------+---------------+---------------+
                Sum               |  20 <20,0,0>  |  40 <40,0,0>  |
----------------------------------+---------------+---------------+
Block Class.                      |   2 <2,0,0>   |   3 <3,0,0>   |
----------------------------------+---------------+---------------+
Position 22 sample segment        |   1 <1,0,0>   |   1 <1,0,0>   |
----------------------------------+---------------+---------------+
Scale Factor State Coder          |   6 <6,0,0>   |   6 <6,0,0>   |
----------------------------------+---------------+---------------+
                Sample 0          |   3 <0,1,2>   |   3 <0,1,2>   |
Quantized       Sample 1          |   3 <0,1,2>   |   3 <0,1,2>   |
Residual           :              |   :    :      |   :    :      |
State              :              |   :    :      |   :    :      |
Samples            :              |   :    :      |   :    :      |
                Sample 56         |   3 <0,1,2>   |   3 <0,1,2>   |
                Sample 57         |      NA       |   3 <0,1,2>   |
                ------------------+---------------+---------------+
                Sum               | 171 <0,57,114>| 174 <0,58,116>|
----------------------------------+---------------+---------------+
                         Stage 1  |   7 <6,0,1>   |   7 <4,2,1>   |
CB for 22/23             Stage 2  |   7 <0,0,7>   |   7 <0,0,7>   |
sample block             Stage 3  |   7 <0,0,7>   |   7 <0,0,7>   |
                ------------------+---------------+---------------+
                Sum               |  21 <6,0,15>  |  21 <4,2,15>  |
----------------------------------+---------------+---------------+
                         Stage 1  |   5 <2,0,3>   |   5 <1,1,3>   |
Gain for 22/23           Stage 2  |   4 <1,1,2>   |   4 <1,1,2>   |
sample block             Stage 3  |   3 <0,0,3>   |   3 <0,0,3>   |
                ------------------+---------------+---------------+
                Sum               |  12 <3,1,8>   |  12 <2,2,8>   |
----------------------------------+---------------+---------------+
                         Stage 1  |   8 <7,0,1>   |   8 <6,1,1>   |
            sub-block 1  Stage 2  |   7 <0,0,7>   |   7 <0,0,7>   |
                         Stage 3  |   7 <0,0,7>   |   7 <0,0,7>   |
                ------------------+---------------+---------------+
                         Stage 1  |   8 <0,0,8>   |   8 <0,7,1>   |
            sub-block 2  Stage 2  |   8 <0,0,8>   |   8 <0,0,8>   |
Indices                  Stage 3  |   8 <0,0,8>   |   8 <0,0,8>   |
for CB          ------------------+---------------+---------------+
sub-blocks               Stage 1  |      NA       |   8 <0,7,1>   |
            sub-block 3  Stage 2  |      NA       |   8 <0,0,8>   |
                         Stage 3  |      NA       |   8 <0,0,8>   |
                ------------------+---------------+---------------+
                         Stage 1  |      NA       |   8 <0,7,1>   |
            sub-block 4  Stage 2  |      NA       |   8 <0,0,8>   |
                         Stage 3  |      NA       |   8 <0,0,8>   |
                ------------------+---------------+---------------+
                Sum               |  46 <7,0,39>  |  94 <6,22,66> |
----------------------------------+---------------+---------------+
                         Stage 1  |   5 <1,2,2>   |   5 <1,2,2>   |
            sub-block 1  Stage 2  |   4 <1,1,2>   |   4 <1,2,1>   |
                         Stage 3  |   3 <0,0,3>   |   3 <0,0,3>   |
                ------------------+---------------+---------------+
                         Stage 1  |   5 <1,1,3>   |   5 <0,2,3>   |
            sub-block 2  Stage 2  |   4 <0,2,2>   |   4 <0,2,2>   |
                         Stage 3  |   3 <0,0,3>   |   3 <0,0,3>   |
Gains for       ------------------+---------------+---------------+
sub-blocks               Stage 1  |      NA       |   5 <0,1,4>   |
            sub-block 3  Stage 2  |      NA       |   4 <0,1,3>   |
                         Stage 3  |      NA       |   3 <0,0,3>   |
                ------------------+---------------+---------------+
                         Stage 1  |      NA       |   5 <0,1,4>   |
            sub-block 4  Stage 2  |      NA       |   4 <0,1,3>   |
                         Stage 3  |      NA       |   3 <0,0,3>   |
                ------------------+---------------+---------------+
                Sum               |  24 <3,6,15>  |  48 <2,12,34> |
-------------------------------------------------------------------
SUM                                303 <48,64,191>  399 <64,96,239>

Table 3.2. The bitstream definition for [-iLBC.-] {+iLBC for both the
20 ms frame size mode and the 30 ms frame size mode.+}

When packetized into the payload the bits MUST be sorted as: All the
class 1 bits in the order (from top and down) as they were specified
in the table, all the class 2 bits (from top and down) and finally
all the class 3 bits in the same sequential order. The last unused
bit of the payload SHOULD be set to zero.

4. DECODER PRINCIPLES

This section describes the principles of each component of the
decoder algorithm.

           +-------------+    +--------+    +---------------+
payload -> | 1. Get para | -> | 2. LPC | -> | 3. Sc Dequant | ->
           +-------------+    +--------+    +---------------+

           +-------------+    +------------------+
        -> | 4. Mem setup| -> | 5. Construct res |------->
        |  +-------------+    +-------------------   |
        ---------<-----------<-----------<------------
              Sub frame [-0...4-] {+0...2/4 (20ms/30ms)+}

           +----------------+    +----------+
        -> | 6. Enhance res | -> | 7. Synth | ------------>
           +----------------+    +----------+

           +-----------------+
        -> | 8. Post Process | ----------------> decoded speech
           +-----------------+

Figure 4.1. Flow chart of the iLBC decoder. If a frame was lost steps
1 to 5 SHOULD be replaced by a PLC algorithm.

1. Extract the parameters from the bitstream

2. Decode the LPC and interpolate (section 4.1)

3. Construct the [-58-] {+57/58+} sample start state (section 4.2)

4. Set up the memory using data from the decoded residual. This
   memory is used for codebook construction. For blocks preceding the
   start state, both the decoded residual and the target are time
   reversed. Subframes are decoded in the same order as they were
   encoded

5. Construct the residual of this subframe (gain[0]*cbvec[0] +
   gain[1]*cbvec[1] + gain[2]*cbvec[2]). Repeat steps 4 and 5 until
   the residual of all sub blocks has been constructed

6. Enhance the residual with the post filter (section 4.6)

7. Synthesis of the residual (section 4.7)

8. Post process with HP filter if desired (section 4.8)

4.1 LPC Filter Reconstruction

The decoding of the LP filter parameters is very straightforward. For
a set of [-six-] {+three/six+} indices the corresponding LSF
[-vectors-] {+vector(s)+} are found by simple table look up. For each
of the LSF vectors the three split vectors are concatenated to obtain
qlsf1 and qlsf2, [-respectively.-] {+respectively (in the 20 ms mode
only one LSF vector, qlsf, is constructed).+} The next step is the
stability check described in Section 3.2.5 followed by the
interpolation scheme described in Section [-3.2.6.-] {+3.2.6 (3.2.7
for 20 ms frames).+} The only difference is that only the quantized
LSFs are known at the decoder and hence the unquantized LSFs are not
processed. {+Andersen et. al.
Experimental - Expires September 3rd, 2003 29 Internet Low Bit Rate Codec March 2003+}

A reference implementation of the LPC filter reconstruction is given
in Appendix A.36.

4.2 Start State Reconstruction

The scalar encoded STATE_SHORT_LEN=58 {+(STATE_SHORT_LEN=57 in the
20 ms mode)+} state samples are reconstructed by

1) forming a set of samples (by table look-up) from the index stream
   idxVec[n]

2) multiplying the set with 1/scal=(10^qmax)/4.5

3) time reversing the [-58-] {+57/58+} samples

4) filtering the time-reversed block with the dispersion (all-pass)
   filter used in the encoder (as described in section 3.5.2). This
   compensates for the phase distortion of the earlier filter
   operation.

5) reversing the [-58-] {+57/58+} samples from the previous step

   [-in(0..57)-] {+in(0..(STATE_SHORT_LEN-1))+} = time reversed
      samples from table look-up,
      [-idxVecDec(57..0)-] {+idxVecDec((STATE_SHORT_LEN-1)..0)+}

   [-in(58..115)-] {+in(STATE_SHORT_LEN..(2*STATE_SHORT_LEN-1))+} = 0

   Pk(z) = Ãrk(z)/Ãk(z), where
                                   ___
                                   \
   Ãrk(z)= z^(-LPC_FILTERORDER) +   >  ãki*z^(i-(LPC_FILTERORDER-1))
                                   /__
                               i=0...(LPC_FILTERORDER-1)

   and Ãk(z) is taken from the block where the start state begins

   in -> Pk(z) -> filtered

   out(k) = filtered(STATE_SHORT_LEN-1-k) +
            filtered(2*STATE_SHORT_LEN-1-k),
            k=0..(STATE_SHORT_LEN-1)

The remaining [-22-] {+23/22+} samples in the state are reconstructed
by the same adaptive codebook technique as described in section 4.3.
The location bit determines whether these are the first or the last
[-22-] {+23/22+} samples of the 80 sample state vector. If the
remaining [-22-] {+23/22+} samples are the first samples of the state
vector, then the scalar encoded STATE_SHORT_LEN state samples are
time-reversed before initialization of the adaptive codebook memory
vector.
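
Steps 2) to 5) of the scalar start state reconstruction can be
sketched as follows. This is an illustration under stated
assumptions, not the reference implementation: the function names are
hypothetical, the table look-up of step 1) is assumed to have been
done already, and the LPC polynomial is assumed to be passed as
a = [1, a1, ..., aM], so that the numerator of Pk(z) is simply the
reversed coefficient list.

```python
def allpass_filter(x, a):
    """Filter x with P(z) = Ar(z)/A(z) for A(z) = a[0] + a[1]z^-1 + ...

    Assumes a[0] == 1.0. The numerator Ar(z) uses the coefficients of
    A(z) in reversed order, which gives an all-pass response.
    """
    order = len(a) - 1
    b = a[::-1]
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(order + 1):      # moving-average (numerator) part
            if n - i >= 0:
                acc += b[i] * x[n - i]
        for i in range(1, order + 1):   # recursive (denominator) part
            if n - i >= 0:
                acc -= a[i] * y[n - i]
        y.append(acc)
    return y


def reconstruct_start_state(dequantized, qmax, a):
    """Rebuild the scalar-coded state samples (hypothetical helper).

    dequantized -- samples after table look-up, idxVecDec(0..LEN-1)
    qmax        -- decoded scale exponent, so that 1/scal = 10^qmax / 4.5
    a           -- LPC polynomial of the block where the start state begins
    """
    length = len(dequantized)
    scale = (10.0 ** qmax) / 4.5
    scaled = [s * scale for s in dequantized]
    # in(0..LEN-1) is the time-reversed block, in(LEN..2*LEN-1) is zero.
    padded = scaled[::-1] + [0.0] * length
    filtered = allpass_filter(padded, a)
    # Fold the filter tail back and reverse again.
    return [filtered[length - 1 - k] + filtered[2 * length - 1 - k]
            for k in range(length)]
```

With the trivial polynomial a = [1.0] the filter is an identity and
the function returns the scaled input unchanged, which is a
convenient sanity check for the two time reversals.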
A reference implementation of the start state reconstruction is given
in Appendix A.44.

4.3 Excitation Decoding Loop

The decoding of the LPC excitation vector proceeds in the same order
in which the residual was encoded at the encoder. That is, after the
decoding of the entire 80 sample state vector, the forward subblocks
(corresponding to samples occurring after the state vector samples)
are decoded, and then the backward subblocks (corresponding to
samples occurring before the state vector) are decoded, resulting in
a fully decoded block of excitation signal samples.

In particular, each subblock is decoded using the multistage adaptive
codebook decoding module which is described in section 4.4. This
module relies upon an adaptive codebook memory that is constructed
before each run of the adaptive codebook decoding. The construction
of the adaptive codebook memory in the decoder is identical to the
method outlined in section 3.6.3, except that it is done on the
codebook memory without perceptual weighting.

For the initial forward subblock, the last STATE_LEN=80 samples of
the length CB_LMEM=147 adaptive codebook memory are filled with the
samples of the state vector. For subsequent forward subblocks, the
first SUBL=40 samples of the adaptive codebook memory are discarded,
the remaining samples are shifted by SUBL samples towards the
beginning of the vector, while the newly decoded SUBL=40 samples are
placed at the end of the adaptive codebook memory. For backward
subblocks, the construction is similar except that every vector of
samples involved is first time-reversed.

A reference implementation of the excitation decoding loop is found
in Appendix A.5.
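
The memory handling for forward and backward subblocks in section 4.3
can be sketched as follows. The function names are hypothetical, and
zero-filling the part of the memory that the state vector does not
cover is an assumption; the text above does not say how those leading
samples are initialized.

```python
STATE_LEN = 80   # samples in the state vector
CB_LMEM = 147    # length of the adaptive codebook memory
SUBL = 40        # samples per subblock


def init_forward_memory(state_vector):
    """Memory for the initial forward subblock: state in the last 80 slots."""
    assert len(state_vector) == STATE_LEN
    # Assumption: the uncovered leading samples start out as zeros.
    return [0.0] * (CB_LMEM - STATE_LEN) + list(state_vector)


def push_subblock(memory, decoded):
    """Drop the oldest SUBL samples and append the newly decoded SUBL."""
    assert len(memory) == CB_LMEM and len(decoded) == SUBL
    return memory[SUBL:] + list(decoded)


def init_backward_memory(state_vector):
    """Same construction on the time-reversed state vector."""
    return init_forward_memory(state_vector[::-1])
```

For backward subblocks the decoded subblock would likewise be
time-reversed before it is passed to push_subblock.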
4.4 Multistage Adaptive Codebook Decoding

The Multistage Adaptive Codebook Decoding module is used at both the
sender (encoder) and the receiver (decoder) ends to produce a
synthetic signal in the residual domain that is eventually used to
produce synthetic speech. The module takes the index values used to
construct vectors that are scaled and summed together to produce a
synthetic signal that is the output of the module.

4.4.1 Construction of the Decoded Excitation Signal

The unpacked index values provided at the input to the module are
references to extended codebooks, which are constructed as described
in Section 3.6.3 with the only difference that they are based on the
codebook memory without the perceptual weighting. The unpacked 3
indices are used to look up 3 codebook vectors. The unpacked 3 gain
indices are used to decode the corresponding 3 gains. In this
decoding the successive rescaling as described in Section 3.6.3.2 is
applied.

A reference implementation of the adaptive codebook decoding is
listed in Appendix A.32.

4.5 Packet Loss Concealment

If packet loss occurs, the decoder receives a signal saying that
information regarding a block is lost. For such blocks it is
RECOMMENDED to use a Packet Loss Concealment (PLC) unit to create a
decoded signal that masks the effect of that packet loss. In the
following we will describe an example of a PLC unit that can be used
with the iLBC codec. As the PLC unit is used only at the decoder, the
PLC unit does not affect interoperability between implementations.
Other PLC implementations MAY therefore be used.
The example PLC described operates on the LP filters and the
excitation signals and is based on the following principles:

4.5.1 Block Received Correctly and Previous Block also Received

If the block is received correctly, the PLC only records state
information of the current block that can be used in case the next
block is lost. The LP filter coefficients for each subblock and the
entire decoded excitation signal are all saved in the decoder state
structure. All this information will be needed if the following block
is lost.

4.5.2 Block Not Received

If the block is not received, the block substitution is based on
doing a pitch synchronous repetition of the excitation signal which
is filtered by modified versions of the previous block's LP filters.
The previous block's information is stored in the decoder state
structure.

First, the previous block's LP filters are bandwidth expanded (the
effect of which is to pull the roots away from the unit circle to
mute the resonance of the filters) to produce the LP filters that are
used in the synthesis of the substituted block.

A correlation analysis is performed on the previous block's
excitation signal in order to detect the amount of pitch periodicity
and a pitch value. The correlation measure is also used to decide on
the voicing level (the degree to which the previous block's
excitation was a voiced or roughly periodic signal). The excitation
in the previous block is used to create an excitation for the block
to be substituted such that the pitch of the previous block is
maintained. Therefore, the new excitation is constructed in a pitch
synchronous manner. In order to avoid a buzzy sounding substituted
block, a random excitation is mixed with the new pitch periodic
excitation and the relative use of the two components is computed
from the correlation measure (voicing level).
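
The construction of the substitute excitation can be illustrated with
the sketch below. It is hypothetical throughout: this section does
not give the mixing rule, so a plain linear blend controlled by a
voicing level in [0, 1] is assumed, the random component is assumed
to be Gaussian noise at the RMS level of the previous excitation, and
the function name and arguments are invented for the example.

```python
import math
import random


def conceal_excitation(prev_excitation, pitch_lag, voicing, length, seed=0):
    """Build a substitute excitation for a lost block (hypothetical helper).

    prev_excitation -- decoded excitation of the previous block
    pitch_lag       -- pitch value from the correlation analysis, in samples
    voicing         -- assumed voicing level in [0, 1]; 1 means fully periodic
    length          -- number of excitation samples to generate
    """
    if not 0 < pitch_lag <= len(prev_excitation):
        raise ValueError("pitch_lag must lie within the previous excitation")
    rng = random.Random(seed)
    last_period = prev_excitation[-pitch_lag:]
    rms = math.sqrt(sum(v * v for v in prev_excitation) / len(prev_excitation))
    out = []
    for n in range(length):
        # Pitch synchronous repetition of the last pitch period.
        periodic = last_period[n % pitch_lag]
        # Random component, mixed in to avoid a buzzy sounding block.
        noise = rng.gauss(0.0, rms)
        out.append(voicing * periodic + (1.0 - voicing) * noise)
    return out
```

With voicing equal to 1.0 the output is a pure repetition of the last
pitch period; lower values give the random component more weight.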
For the block to be substituted, the newly constructed excitation
signal is then passed through the newly constructed LP filters to
produce the speech that will be substituted for the lost block.

For several consecutive lost blocks, the packet loss concealment
continues in a similar manner. The correlation measure of the last
received block is still used along with the same pitch value. The LP
filters of the last received block are also used again, but the
bandwidth expansion is increased for consecutive lost blocks (as the
length in time from the last received block increases). This
increases the muting of the resonance of the spectral envelope. The
energy of the substituted excitation for consecutive lost blocks is
decreased, leading to a dampened excitation, and therefore dampened
speech.

4.5.3 Block Received Correctly When Previous Block Not Received

For the case in which a block is received correctly when the previous
block was not received, the correctly received block's directly
decoded speech (based solely on the received block) is not used as
the actual output. The reason for this is that the directly decoded
speech does not necessarily smoothly merge into the synthetic speech
generated for the previous lost block. If the two signals are not
smoothly merged, an audible discontinuity is produced. Therefore, a
correlation analysis between the two blocks of excitation signal (the
excitation of the previous concealed block and the excitation of the
current received block) is performed to find the best phase match.
Then a simple overlap-add procedure is performed to smoothly merge
the previous excitation into the current block's excitation.

The exact implementation of the packet loss concealment does not
influence interoperability of the codec.

A reference implementation of the packet loss concealment is
suggested in Appendix A.14.
Exact compliance with this suggested algorithm is not needed for a
reference implementation to be fully compatible with the overall
codec specification.

4.6 Enhancement

The decoder contains an enhancement unit that operates on the
reconstructed excitation signal. The enhancement unit increases the
perceptual quality of the reconstructed signal by reducing the
speech-correlated noise in the voiced speech segments. Compared to
traditional postfilters, the enhancer has the advantage that it can
only modify the excitation signal slightly. This means that there is
no risk of over enhancement. {+The enhancer works very similarly for
both the 20 ms frame size mode and for the 30 ms frame size mode.

For the mode with 20 ms frame size, the enhancer uses a memory of six
80 sample excitation blocks prior in time plus the two new 80 sample
excitation blocks. For each block of 160 new unenhanced excitation
samples, 160 enhanced excitation samples are produced. The enhanced
excitation is 40 sample delayed compared to the unenhanced excitation
since the enhancer algorithm uses lookahead.+}

[-The-] {+For the mode with 30 ms frame size, the+} enhancer uses a
memory of five 80 sample excitation blocks prior in time plus the
three new 80 sample excitation blocks. For each block of 240 new
unenhanced excitation samples, 240 enhanced excitation samples are
produced. The enhanced excitation is 80 sample delayed compared to
the unenhanced excitation since the enhancer algorithm uses
lookahead.

Outline of Enhancer

The speech enhancement unit operates on sub blocks of 80 samples,
which means that there are [-3-] {+2/3 80 sample+} sub blocks per
frame. Each of [-the three-] {+these two/three+} sub blocks is
enhanced separately, but in an analogous manner.

unenhanced residual
        |
        |   +---------------+    +--------------+
        +-> | 1. Pitch Est  | -> | 2. Find PSSQ | -------->
            +---------------+  | +--------------+
                               +-----<-------<------<--+
            +------------+    enh block [-0..2-] {+0..1/2+} |
         -> | 3. Smooth  |                             |
            +------------+                             |
              \                                        |
              /\                                       |
             /  \   Already                            |
            / 4. \----------->----------->-----------+ |
            \Crit/ Fulfilled                         | |
             \? /                                    v |
              \/                                     | |
               \  +-----------------+    +---------+ | |
           Not +->| 5. Use Constr.  | -> | 6. Mix  | ----->
        Fulfilled +-----------------+    +---------+

         ---------------> enhanced residual

Figure 4.2. Flow chart of the enhancer

1. Pitch estimation of each of the [-three-] {+two/three+} new 80
   sample blocks

2. Find the pitch-period-synchronous sequence n (for block k) by a
   search around the estimated pitch value. Do this for
   n=1,2,3,-1,-2,-3

3. Calculate the smoothed residual generated by the 6
   pitch-period-synchronous sequences from the prior step

4. Check if the smoothed residual satisfies the criterion (section
   4.6.4)

5. Use constraint to calculate mixing factor (section 4.6.5)

6. Mix smoothed signal with unenhanced residual (pssq(n) n=0)

The main idea of the enhancer is to find three 80 sample blocks
before and three 80 sample blocks after the analyzed unenhanced sub
block and use these to improve the quality of the excitation in that
sub block. The six blocks are chosen so that they have the highest
possible correlation with the unenhanced sub block that is being
enhanced. In other words the 6 blocks are pitch-period-synchronous
sequences to the unenhanced sub block.

A linear combination of the six pitch-period-synchronous sequences is
calculated that approximates the sub block. If the squared error
between the approximation and the unenhanced sub block is small
enough, the enhanced residual is set equal to this approximation.
For the cases when the squared error criterion is not fulfilled, a
linear combination of the approximation and the unenhanced residual
forms the enhanced residual.

4.6.1 Estimating the pitch

In order to determine the locations of the pitch-period-synchronous
sequences in a complexity efficient way, pitch estimates are needed.
For each of the new 3 sub blocks a pitch estimate is calculated by
finding the maximum correlation in the range from lag 20 to lag 120.
These pitch estimates are used to narrow down the search for the best
possible pitch-period-synchronous sequences.

4.6.2 Determination of the Pitch-Synchronous Sequences

Upon receiving the pitch estimates from the prior step, the enhancer
analyzes and enhances one 80 sample sub block at a time. The
pitch-period-synchronous sequences pssq(n) can be viewed as vectors
of length 80 samples each shifted n*lag samples from the current sub
block. The six pitch-period-synchronous sequences, pssq(-3) to
pssq(-1) and pssq(1) to pssq(3), are found one at a time by the steps
below:

1) Calculate the estimate of the position of the pssq(n). For pssq(n)
   in front of pssq(0) (n > 0), the location of the pssq(n) is
   estimated by moving one pitch estimate forward in time from the
   exact location of pssq(n-1). Similarly, the location of pssq(n)
   behind pssq(0) (n < 0) is estimated by moving one pitch estimate
   backward in time from the exact location of pssq(n+1). If the
   estimated pssq(n) vector location is totally within the enhancer
   memory (figure 4.3), steps 2, 3, and 4 are performed, otherwise
   the pssq(n) is set to zeros.

2) Compute the correlation between the unenhanced excitation and
   vectors around the estimated location interval of pssq(n). The
   correlation is calculated in the interval estimated location +/- 2
   samples. This results in 5 correlation values.

3) The 5 correlation values are upsampled by a factor 4, using sinc
   upsampling filters (four MA filters with coefficients upsFilter1
   .. upsFilter4).
   Within these the maximum value is found, which specifies the best
   pitch-period with a resolution of a quarter of a sample.

   upsFilter1[7]={0.000000, 0.000000, 0.000000, 1.000000,
                  0.000000, 0.000000, 0.000000}
   upsFilter2[7]={0.015625, -0.076904, 0.288330, 0.862061,
                  -0.106445, 0.018799, -0.015625}
   upsFilter3[7]={0.023682, -0.124268, 0.601563, 0.601563,
                  -0.124268, 0.023682, -0.023682}
   upsFilter4[7]={0.018799, -0.106445, 0.862061, 0.288330,
                  -0.076904, 0.015625, -0.018799}

4) Generate the pssq(n) vector by upsampling of the excitation memory
   and extracting the sequence that corresponds to the lag delay that
   was calculated in the prior step.

With the steps above all the pssq(n) can be found in an iterative
manner, first moving backward in time from pssq(0) and then forward
in time from pssq(0).

0              159             319             479             639
+---------------------------------------------------------------+
|  -5   |  -4   |  -3   |  -2   |  -1   |   0   |   1   |   2   |
+---------------------------------------------------------------+
                                            |pssq 0 |
                                       |pssq -1| |pssq 1 |
                                    |pssq -2|       |pssq 2 |
                                 |pssq -3|             |pssq 3 |

Figure 4.3. Pitch-period-synchronous sequences in the enhancement of
{+the first 80 sample+} block [-1.-] {+in the 20 ms frame size
mode.+}

The unenhanced signal input is stored in {+the two last sub-blocks
(1-2), and the six other sub-blocks contain unenhanced residual
prior-in-time. We perform the enhancement algorithm on two blocks of
80 samples, where the first of the two blocks consists of the last 40
samples of sub-block 0 and the first 40 samples of sub-block 1. The
second 80 sample+} block [-1, 2, 3.-] {+consists of the last 40
samples of sub-block 1 and the first 40 samples of sub-block 2.

0              159             319             479             639
+---------------------------------------------------------------+
|  -4   |  -3   |  -2   |  -1   |   0   |   1   |   2   |   3   |
+---------------------------------------------------------------+
                                |pssq 0 |
                           |pssq -1| |pssq 1 |
                        |pssq -2|       |pssq 2 |
                     |pssq -3|             |pssq 3 |

Figure 4.4. Pitch-period-synchronous sequences in the enhancement of
the first 80 sample block in the 30 ms frame size mode.

The unenhanced signal input is stored in the three last sub-blocks
(1-3).+} The five other [-blocks-] {+sub-blocks+} contain unenhanced
residual prior-in-time. The enhancement algorithm is performed on the
three 80 sample [-blocks-] {+sub-blocks+} 0, 1 and 2.

4.6.3 Calculation of the smoothed excitation

A linear combination of the six pssq(n) (n!=0) forms a smoothed
approximation, z, of pssq(0). Most of the weight is put on the
sequences that are close to pssq(0) since these are most likely to be
most similar to pssq(0). The smoothed vector is also rescaled, so
that the energy of z is the same as the energy of pssq(0).

       ___
       \
   y =  >  pssq(i) * pssq_weight(i)
       /__
   i=-3,-2,-1,1,2,3

   pssq_weight(i) = 0.5*(1-cos(2*pi*(i+4)/(2*3+2)))

   z = C * y, where C = sqrt(||pssq(0)||/||y||)

4.6.4 Enhancer criterion

The criterion of the enhancer is that the enhanced excitation is not
allowed to differ much from the unenhanced excitation. This criterion
is checked for each 80 sample sub block.

   e < (b * ||pssq(0)||), where b=0.05 and        (Constraint 1)

   e = (pssq(0)-z)*(pssq(0)-z), and "*" means the dot product

4.6.5 Enhancing the excitation

From the criterion in the previous section it is clear that the
excitation is not allowed to change much. The purpose of this
constraint is to prevent the creation of an enhanced signal that is
significantly different from the original signal.
This also means that the constraint limits the numerical size of the errors that the enhancement procedure can make. That is especially important in unvoiced segments and background noise segments where increased periodicity could lead to lower perceived quality. When the constraint in the prior section is not met, the enhanced residual is instead calculated through a constrained optimization using the Lagrange multiplier technique. The new constraint is that: e = (b * ||pssq(0)||) (Constraint 2) We distinguish two solution regions for the optimization: 1) the region where the first constraint is fulfilled and 2) the region where the first constraint is not fulfilled so the second constraint must be used. In the first case, where the second constraint is not needed, the optimized re-estimated vector is simply z, the energy scaled version of y. {+Andersen et. al. Experimental - Expires September 3rd, 2003 37 Internet Low Bit Rate Codec March 2003+} In the second case, where the second constraint is activated and becomes an equality constraint, we have that z= A*y + B*pssq(0) where A = sqrt((b-b^2/4)*(w00*w00)/ (w11*w00 + w10*w10)) and w11 = pssq(0)*pssq(0) w00 = y*y w01 = y*pssq(0) (* symbolizes the dot product) and B = 1 - b/2 - A * w10/w00 [-Andersen et. al. Experimental - Expires March 20th, 2003 35 Internet Low Bit Rate Codec September 2002-] Appendix A.16 contains a listing of a reference implementation for the enhancement method. 4.7 Synthesis Filtering Upon decoding or PLC of the LP excitation block, the decoded speech block is obtained by running the decoded LP synthesis filter, [-1/Èk(z),-] {+1/Ãk(z),+} over the block. The synthesis filters have to be shifted [-two sub blocks-] to compensate for the delay in the [-enhancer-] {+enhancer. For 20 ms frame size mode they SHOULD be shifted one 40 sample sub block+} and [-the-] {+for 30 ms frame size mode they SHOULD be shifted two 40 sample sub blocks. 
The+} LP coefficients [-should-] {+SHOULD+} be changed at the first sample of every sub block while keeping the filter state. For PLC blocks, one solution is to apply the last LP coefficients of the last decoded speech block for all sub blocks.

The reference implementation for the synthesis filtering can be found in Appendix A.48.

4.8 Post Filtering

If desired, the decoded block can be filtered by a high-pass filter. This removes the low frequencies of the decoded signal. A reference implementation of this, with cut-off at 65 Hz, is shown in Appendix A.30.

5. SECURITY CONSIDERATIONS

This algorithm for the coding of speech signals is not subject to any known security consideration; however, its RTP payload format [1] is subject to several considerations, which are addressed there.

6. REFERENCES

[1] A. Duric and S. V. Andersen, "RTP Payload Format for iLBC Speech", [-draft-duric-rtp-ilbc-01.txt, July 2002.-] {+draft-avt-rtp-ilbc-01.txt, March 2003.+}

[2] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[3] ITU-T Recommendation G.711, available online from the ITU bookstore at http://www.itu.int.

7. ACKNOWLEDGEMENTS

The authors wish to thank Henry Sinnreich and Patrik Faltstrom for great support of the iLBC initiative and for the valuable feedback and comments.

8.
AUTHOR'S ADDRESSES

{+Soren Andersen+}
Department of Communication Technology
Aalborg University
Fredrik Bajers Vej 7A
9200 Aalborg
Denmark
Phone: ++45 9 6358627
Email: sva@kom.auc.dk

Henrik Åström
Global IP Sound AB
Rosenlundsgatan 54
Stockholm, S-11863
Sweden
Phone: +46 8 54553040
Email: henrik.astrom@globalipsound.com

Alan Duric
Global IP Sound AB
Rosenlundsgatan 54
Stockholm, S-11863
Sweden
Phone: +46 8 54553040
Email: alan.duric@globalipsound.com

Fredrik Galschiödt
Global IP Sound AB
Rosenlundsgatan 54
Stockholm, S-11863
Sweden
Phone: +46 8 54553040
Email: fredrik.galschiodt@globalipsound.com

Roar Hagen
Global IP Sound AB
Rosenlundsgatan 54
Stockholm, S-11863
Sweden
Phone: +46 8 54553040
Email: roar.hagen@globalipsound.com

W. Bastiaan Kleijn
Global IP Sound AB
Rosenlundsgatan 54
Stockholm, S-11863
Sweden
Phone: +46 8 54553040
Email: bastiaan.kleijn@globalipsound.com

Jan Linden
Global IP Sound Inc.
900 Kearny Street, suite 500
San Francisco, CA-94133
USA
Phone: +1 415 397 2555
Email: jan.linden@globalipsound.com

Manohar N. Murthi
Department of Electrical and Computer Engineering
University of Miami
McArthur Engineering Building 406
1251 Memorial Dr
Coral Gables, FL 33146-0640
USA
Phone: +1 (305) 284-3342
Email: mmurthi@miami.edu

Jan Skoglund
Global IP Sound Inc.
900 Kearny Street, suite 500
San Francisco, CA-94133
USA
Phone: +1 415 397 2555
Email: jan.skoglund@globalipsound.com

Julian Spittka
Global IP Sound Inc.
900 Kearny Street, suite 500
San Francisco, CA-94133
USA
Phone: +1 415 397 2555
Email: julian.spittka@globalipsound.com

APPENDIX A REFERENCE IMPLEMENTATION

This appendix contains the complete C code for a reference implementation of the encoder and decoder for the specified codec.

The C code consists of the following files with highest level functions:

      iLBC_test.c: main function for evaluation purpose
      iLBC_encode.h: encoder header
      iLBC_encode.c: encoder function
      iLBC_decode.h: decoder header
      iLBC_decode.c: decoder function

the following files containing global defines and constants:

      iLBC_define.h: global defines
      constants.h: global constants header
      constants.c: global constants memory allocations

and the following files containing subroutines:

      anaFilter.h: lpc analysis filter header
      anaFilter.c: lpc analysis filter function
      createCB.h: codebook construction header
      createCB.c: codebook construction function
      doCPLC.h: packet loss concealment header
      doCPLC.c: packet loss concealment function
      enhancer.h: signal enhancement header
      enhancer.c: signal enhancement function
      filter.h: general filter header
      filter.c: general filter functions
      FrameClassify.h: start state classification header
      FrameClassify.c: start state classification function
      gainquant.h: gain quantization header
      gainquant.c: gain quantization function
      getCBvec.h: codebook vector construction header
      getCBvec.c: codebook vector construction function
      helpfun.h: general purpose header
      helpfun.c: general purpose functions
      hpInput.h: input high pass filter header
      hpInput.c: input high pass filter function
      hpOutput.h: output high pass filter header
      hpOutput.c: output high pass filter function
      iCBConstruct.h: excitation decoding header
      iCBConstruct.c: excitation decoding function
      iCBSearch.h: excitation encoding header
      iCBSearch.c: excitation encoding function
      LPCdecode.h: lpc decoding header
      LPCdecode.c: lpc decoding function
      LPCencode.h: lpc encoding header
      LPCencode.c: lpc encoding function
      lsf.h: line spectral frequencies header
      lsf.c: line spectral frequencies functions
      packing.h: bitstream packetization header
      packing.c: bitstream packetization functions
      StateConstructW.h: state decoding header
      StateConstructW.c: state decoding functions
      StateSearchW.h: state encoding header
      StateSearchW.c: state encoding function
      syntFilter.h: lpc synthesis filter header
      syntFilter.c: lpc synthesis filter function

The implementation is portable and should work on many different platforms. However, it is not difficult to optimize the implementation on particular platforms, an exercise left to the reader.

A.1 iLBC_test.c

   /******************************************************************

       iLBC Speech Coder ANSI-C Source Code

       iLBC_test.c

       Copyright (c) 2001,
       Global IP Sound AB.
       All rights reserved.

   ******************************************************************/

   #include <math.h>
   #include <stdlib.h>
   #include <stdio.h>
   #include <string.h>
   #include "iLBC_define.h"
   #include "iLBC_encode.h"
   #include "iLBC_decode.h"

   /* Runtime statistics */
   #include <time.h>

   #define [-TIME_PER_FRAME 30
   #define ILBCNOOFWORDS (NO_OF_BYTES/2)-] {+ILBCNOOFWORDS_MAX (NO_OF_BYTES_30MS/2)+}

   /*----------------------------------------------------------------*
    *  Encoder interface function
    *---------------------------------------------------------------*/

   short encode(   /* (o) Number of bytes encoded */
       iLBC_Enc_Inst_t *iLBCenc_inst,  /* (i/o) Encoder instance */
       short *encoded_data,            /* (o) The encoded bytes */
       short *data                     /* (i) The signal block to encode */
   ){
       float [-block[BLOCKL];-] {+block[BLOCKL_MAX];+}
       int k;

       /* convert signal to float */

       [-for(k=0;k<BLOCKL;k++)-] {+for (k=0; k<iLBCenc_inst->blockl; k++)+}
           block[k] = (float)data[k];

       /* do the actual encoding */

       iLBC_encode((unsigned char *)encoded_data, block, iLBCenc_inst);

       return [-(NO_OF_BYTES);-] {+(iLBCenc_inst->no_of_bytes);+}
   }

   /*----------------------------------------------------------------*
    *  Decoder interface function
    *---------------------------------------------------------------*/

   short decode(   /* (o) Number of decoded samples */
       iLBC_Dec_Inst_t *iLBCdec_inst,  /* (i/o) Decoder instance */
       short *decoded_data, /* (o) De