Thursday, 24 December 2015

Algorithm(s) to mix audio signals without clipping


I'd like to mix two or more PCM audio channels (e.g. recorded samples) digitally in an acoustically faithful manner, preferably in near-real-time (meaning little or no look-ahead).


The physically "correct" way to do this is to sum the samples. However, when you add two arbitrary samples, the resulting value can be up to twice the maximum representable value.


For example, if your samples are signed 16-bit values (range -32768 to 32767), the sum of two samples can reach ±65534, well outside the representable range. This results in clipping.
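To make the failure mode concrete, here is a minimal sketch in plain Python (names and the sample values are illustrative, not from the original post):

```python
# Naive summation of two 16-bit samples can exceed the representable
# range, so the result must be hard-clipped, which distorts the wave.
INT16_MAX = 32767
INT16_MIN = -32768

def mix_sum(a, b):
    """Physically 'correct' mix: plain summation."""
    return a + b

def hard_clip(x):
    """Clamp a sample back into the 16-bit range (audible distortion)."""
    return max(INT16_MIN, min(INT16_MAX, x))

loud_a, loud_b = 30000, 25000
mixed = mix_sum(loud_a, loud_b)   # 55000: outside the int16 range
clipped = hard_clip(mixed)        # 32767: the waveform top is flattened
```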


The naive solution here is to divide by N, where N is the number of channels being mixed. However, this results in each sample being 1/Nth as loud, which is completely unrealistic. In the real world, when two instruments play simultaneously, each instrument does not become half as loud.
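The divide-by-N approach can be sketched as follows (illustrative names, assuming integer PCM samples):

```python
def mix_average(samples):
    """Naive mix: sum the channels, then divide by the channel count N.
    This never clips, but each source ends up only 1/N as loud as it
    would be on its own."""
    return sum(samples) // len(samples)

mix_average([30000, 25000])  # 27500: in range, but both sources are attenuated
```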


From reading around, a common method of mixing is: result = A + B - AB, where A and B are the two normalized samples being mixed, and AB is a term to ensure louder sounds are increasingly "soft-clipped".
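As a sketch, assuming samples normalized to [0, 1] (signed-sample variants adjust the sign of the product term; none of this is from the original post):

```python
def mix_soft(a, b):
    """A + B - A*B for samples normalized to [0, 1].
    The output always stays within [0, 1]: even both inputs at
    full scale yield exactly 1.0 rather than 2.0."""
    return a + b - a * b

mix_soft(0.5, 0.5)  # 0.75: louder than either input, but below full scale
mix_soft(1.0, 1.0)  # 1.0: never exceeds the range
```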



However, this introduces a distortion of the signal. Is this level of distortion acceptable in high-quality audio synthesis?


What other methods are there to solve this problem? I'm interested in efficient lesser-quality algorithms as well as less-efficient high-quality algorithms.


I'm asking my question in the context of digital music synthesis, for the purpose of mixing multiple instrument tracks together. The tracks could be synthesised audio, pre-recorded samples, or real-time microphone input.



Answer



It's very hard to point you to relevant techniques without knowing any context for your problem.


The obvious answer would be to tell you to adjust the gain of each sample so that clipping rarely occurs. It is not that unrealistic to assume that musicians would play softer in an ensemble than when asked to play solo.


The distortion introduced by A + B - AB is just not acceptable. It creates mirror images of A on each side of B's harmonics - equivalent to ring-modulation - which is pretty awful if A and B have a rich spectrum with harmonics which are not at integer ratios. Try it on two square waves at 220 and 400 Hz for example.
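The ring-modulation claim follows from the product term: A*B is a multiplication of the two signals, and for sinusoidal components the product-to-sum identity turns that into sum and difference frequencies (sidebands). A quick numerical check of the identity, at an arbitrary, hypothetical time instant:

```python
import math

# cos(x)*cos(y) == 0.5*(cos(x - y) + cos(x + y)):
# multiplying two sinusoids yields components at f_b - f_a and f_b + f_a.
f_a, f_b = 220.0, 400.0  # the frequencies used in the answer's example
t = 0.00123              # arbitrary time instant (illustrative value)
x = 2 * math.pi * f_a * t
y = 2 * math.pi * f_b * t
product = math.cos(x) * math.cos(y)
sidebands = 0.5 * (math.cos(x - y) + math.cos(x + y))
assert abs(product - sidebands) < 1e-12  # identity holds numerically
```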


A more "natural" clipping function, which works on a sample-by-sample basis, is the tanh function; it actually matches the soft-limiting behavior of some analog elements. Beyond that, you can look into classic dynamic-range compression techniques; if your system can look ahead and see peaks coming in advance, even better.

