Use of cosine transform in audio compression


Dean
03-29-2002, 08:29 PM
Eqn (12.3.22) in "NR in C++" gives a form of the Discrete Cosine Transform. I believe this is used in audio compression (such as MP3). In a simple scheme, you could take the cosine transform of a sequence of N points of audio data (one "frame"), and then compress the transform (e.g., by representing the coefficients in a small number of bits). To recreate the audio, you calculate the inverse transform. For a long stretch of audio, you do the above process for a large number of successive frames. However, this processing causes a discontinuity to occur at the boundary between successive frames, generating an irritating sequence of clicks.
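That simple per-frame scheme can be sketched numerically. Below is a naive O(N^2) DCT-II pair written straight from the defining sums (not the fast algorithm), with a made-up uniform quantizer step standing in for the "small number of bits" — real coders allocate bits far more cleverly:

```python
import numpy as np

def dct2(x):
    """Naive DCT-II of a length-N sequence (O(N^2), for illustration only)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)), axis=1)

def idct2(X):
    """Inverse of dct2 above (a scaled DCT-III)."""
    N = len(X)
    k = np.arange(N)
    n = k.reshape(-1, 1)
    return (X[0] / 2 + np.sum(X[1:] * np.cos(np.pi * (2 * n + 1) * k[1:] / (2 * N)),
                              axis=1)) * 2 / N

# One "frame" of audio, transformed, coarsely quantized, and reconstructed.
frame = np.sin(2 * np.pi * 5 * np.arange(32) / 32)
coeffs = dct2(frame)
step = 0.25                           # hypothetical quantizer step
quantized = np.round(coeffs / step) * step
reconstructed = idct2(quantized)      # inverse transform recreates the audio
```

Since the transform pair is linear, each coefficient's quantization error (at most step/2) maps back into a bounded error in the reconstructed frame.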

To handle such "framing artifacts", various techniques can be imagined. I believe there's a technique of interleaving successive frames that deals with this problem. Presumably each frame has an accompanying window function, so that when two frames are joined, the sum of the two adjoining window functions is one everywhere. Then, the signal of one frame merges smoothly into the signal of the subsequent frame with no discontinuity.
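The "windows summing to one" idea is easy to check numerically. Here is a sketch with 50% overlap and a sin^2 (Hann-shaped) window — my choice for illustration, not necessarily what any particular codec uses — whose two overlapped halves do sum to one everywhere:

```python
import numpy as np

N = 8                                    # frame length (assumed, for illustration)
hop = N // 2                             # 50% overlap
n = np.arange(N)
w = np.sin(np.pi * (n + 0.5) / N) ** 2   # sin^2 window: w[i] + w[i + hop] == 1

signal = np.random.default_rng(0).standard_normal(40)

# Window each frame, then overlap-add the frames back together.
out = np.zeros_like(signal)
for s in range(0, len(signal) - N + 1, hop):
    out[s:s + N] += w * signal[s:s + N]

# Away from the edges, the two overlapping windows sum to one at every
# sample, so the signal merges smoothly with no discontinuity.
```

Only the first and last half-frame are attenuated, since they are covered by a single window.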

However, because the frames overlap, the same data points are recorded twice, which works against the goal: to compress the data. Therefore, I believe a technique has been worked out wherein the cosine transform of a sequence of N points results in N/2 coefficients. I'm wondering how this is done. One way I can imagine is to take the transform of every other point in a frame. The omitted points would then be obtained from the overlapping preceding and following frames. [This requires 50% overlap, so that every point falls in two frames.] Is this how it's usually done? If it is, it would appear that, upon decompression, inaccuracy in the transform coefficients would cause successive points in the time domain to jump back and forth; in other words, around a frame boundary, the sequence of even points would be "continuous", and so would the sequence of odd points, but there would be a discontinuity from each point to the next. How is this handled? Or is the method of interleaving frames completely different from the way I've outlined?

mathwiz
03-30-2002, 11:34 AM
I'm no expert, but I think it's more complicated than just windowing and framing on DCTs. DCTs are indeed used in the "filter banks", but there is a lot of "perceptual coding" on their output. I presume that the inverse (playback) is also able to shift the phases of the individual filter banks to avoid clicks, etc., since the human ear is quite insensitive to phase.
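One concrete piece I do know of: as far as I can tell, the transform with the "2N overlapped samples in, N coefficients out" property asked about above is the modified DCT (MDCT). Inverting only N coefficients introduces time-domain aliasing within each frame, but with a suitable window applied both before the forward and after the inverse transform (a sine window here; other windows satisfying w[i]^2 + w[i+N]^2 = 1 also work), the aliasing from adjacent frames cancels in the overlap-add. A sketch, using the standard formulas as I understand them:

```python
import numpy as np

def mdct(x):
    """MDCT: 2N samples -> N coefficients."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N).reshape(-1, 1)
    return np.sum(x * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)), axis=1)

def imdct(X):
    """Inverse MDCT: N coefficients -> 2N (time-aliased) samples."""
    N = len(X)
    k = np.arange(N)
    n = np.arange(2 * N).reshape(-1, 1)
    return (2 / N) * np.sum(X * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)),
                            axis=1)

N = 8                                      # half frame; each frame is 2N long
n2 = np.arange(2 * N)
w = np.sin(np.pi * (n2 + 0.5) / (2 * N))   # sine window: w[i]^2 + w[i+N]^2 == 1

signal = np.random.default_rng(1).standard_normal(6 * N)

out = np.zeros_like(signal)
for s in range(0, len(signal) - 2 * N + 1, N):   # hop by N: 50% overlap
    coeffs = mdct(w * signal[s:s + 2 * N])       # 2N samples -> N coefficients
    out[s:s + 2 * N] += w * imdct(coeffs)        # aliasing cancels in overlap-add
# Interior samples (away from the first and last half-frame) come back exactly,
# so despite the 50% overlap, only N coefficients are stored per N new samples.
```

So the bookkeeping matches the goal: 50% overlap, yet one stored coefficient per input sample on average, with no frame-boundary discontinuity.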

You might look at any of the following URLs:
http://www.iis.fhg.de/amm/techinf/layer3/
http://www.mpeg.org/MPEG/audio.html
http://www.mp3-tech.org/programmer/programmers.html (this has source code for encoders and decoders)

Good luck!