Department of Computer Science and Engineering, University of Fudan

Shanghai, P. R. China, 200433

weili_fd@yahoo.com

Department of Computer Science and Engineering, University of Fudan

Shanghai, P. R. China, 200433

xyxue@fudan.edu.cn

In this paper, we propose a novel synchronization-invariant audio watermarking scheme based on statistical feature manipulation in the wavelet domain. Experimental results demonstrate that the algorithm is robust to MP3 compression, low-pass filtering, equalization, echo addition, resampling, noise addition, pitch shifting, and random cropping, and that it also shows limited resistance to time-scale modification. The watermarked audio has very high perceptual quality and is indistinguishable from the original signal. A blind detection technique, which requires neither the original signal nor the original watermark, is developed to extract the embedded watermark image under various types of attack. To ensure the security of the watermark, a random chaotic sequence is employed in both embedding and detection.

Digital audio watermarking, Discrete wavelet transform, Statistical feature, Random cropping, Chaos

In this research, we propose a novel synchronization-invariant audio watermarking scheme based on a statistical feature in the wavelet domain. Special attention is paid to the synchronization attack caused by casual audio editing or malicious random cropping, which is a low-cost yet effective attack on most existing watermarking algorithms based on the classical spread-spectrum technique. Although several audio watermarking methods have been developed [1,2], most of them are vulnerable to random cropping such as jittering, and very few publications report experiments against this attack as extensive as those in this paper.

In audio analysis and classification, the extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in the time and frequency domains. To further reduce the dimensionality of the extracted feature vectors, statistics over a set of wavelet coefficients can also be used to represent the statistical characteristics of the texture, or the music surface, of the audio piece [3].

In this paper, for convenience of watermark embedding, we adopt the mean of the coefficient values, rather than the mean of their absolute values, at the coarsest approximation subband as the statistical feature. Because this feature is calculated from the wavelet coefficients at the coarsest approximation subband, which represents the perceptually most significant low-frequency components of the audio signal, it is expected to be relatively stable under common signal processing such as MP3 compression, low-pass filtering, equalization, noise addition, echo addition, and resampling. Moreover, because adjacent audio samples or small blocks are highly correlated, randomly cropping a small clip of audio will not change this statistical feature greatly, although individual coefficients may change considerably. The feature can therefore also be expected to be nearly invariant to small random cropping in the time domain. For these reasons, the mean of the wavelet coefficients at the coarsest approximation subband is a good candidate for watermark embedding.
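The feature described above can be sketched in a few lines. This is a minimal illustration that assumes the Haar basis and a frame length divisible by 8; the function names `coarsest_approximation` and `mean_feature` are hypothetical and do not come from the paper.

```python
import math


def coarsest_approximation(frame, levels=3):
    """Haar approximation coefficients after `levels` decompositions.

    Assumes len(frame) is divisible by 2**levels.
    """
    approx = list(frame)
    for _ in range(levels):
        # Orthonormal Haar low-pass step: scaled sum of each sample pair.
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2.0)
                  for i in range(0, len(approx), 2)]
    return approx


def mean_feature(frame, levels=3):
    """Mean of the signed coefficients at the coarsest approximation subband."""
    ca = coarsest_approximation(frame, levels)
    return sum(ca) / len(ca)
```

With the Haar basis and three levels, each coarsest coefficient is the sum of eight consecutive samples divided by 2√2, so this feature equals 2√2 times the sample mean of the frame. That relation is specific to Haar and is not claimed here for 'db4'.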

(1). The input audio signal is first segmented into overlapping frames. At a sampling frequency of 44100 Hz, each frame contains 2048 samples, with 75% overlap (1536 samples) between adjacent frames. Each frame is then Hamming-windowed to minimize the Gibbs effect. The frame size is a trade-off between perceptual transparency (small frame sizes) and detection reliability (large frame sizes); our experiments show that a frame size of 2048 samples with 75% overlap between adjacent frames is a good compromise.

(2). For each audio frame, a three-level wavelet decomposition is performed with the 'db4' or 'haar' wavelet basis, and the mean of all the wavelet coefficients at the coarsest approximation subband (i.e. ca3) is calculated. This mean is then subtracted from all coefficients in the ca3 subband to facilitate the embedding process.

(3). The watermark data, which in our experiment is the 32 by 32 binary logo image shown in Figure 1, is transformed into a one-dimensional sequence of ones and zeros and then encrypted using a random chaotic sequence generated according to [4]. Most other existing watermarking algorithms use a pseudorandom sequence as the watermark, which is not as intuitive as an image and whose correlation detection depends heavily on the choice of threshold. Next, each watermark bit is embedded into the corresponding audio frame as follows: if *w(i)=1*, *a* is added to all the ca3 wavelet coefficients in the i-th frame; if *w(i)=0*, *a* is subtracted from them. Here *a* is a small number of the same order of magnitude as the mean of each frame, adjusted so as not to introduce any audible artifacts into the watermarked audio.

(4). The inverse discrete wavelet transform (IDWT) is applied to the modified wavelet coefficients in each frame to transform them back to the time domain.

(5). Steps 2 through 4 are repeated until all the watermark bits are embedded. Finally, all the modified frames are merged together to form the whole watermarked audio signal in time domain.
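The embedding steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: it uses the Haar basis, it assumes a logistic map with XOR as the chaotic encryption (the text only cites [4] and gives no details), and it leaves out the final merging of overlapping frames because the text does not specify how that is done. All function names and the parameters `x0` and `r` are hypothetical.

```python
import math

FRAME_SIZE = 2048            # samples per frame (from the text)
HOP_SIZE = FRAME_SIZE // 4   # 75% overlap leaves a hop of 512 samples
SQRT2 = math.sqrt(2.0)


def split_frames(signal):
    """Step 1: overlapping, Hamming-windowed frames."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * k / (FRAME_SIZE - 1))
              for k in range(FRAME_SIZE)]
    return [[s * w for s, w in zip(signal[i:i + FRAME_SIZE], window)]
            for i in range(0, len(signal) - FRAME_SIZE + 1, HOP_SIZE)]


def haar_forward(x, levels=3):
    """Step 2: Haar analysis; returns (ca, [cd1, cd2, ...])."""
    approx, details = list(x), []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(p - q) / SQRT2 for p, q in pairs])
        approx = [(p + q) / SQRT2 for p, q in pairs]
    return approx, details


def haar_inverse(approx, details):
    """Step 4: Haar synthesis, undoing haar_forward."""
    x = list(approx)
    for detail in reversed(details):
        x = [v for a, d in zip(x, detail)
             for v in ((a + d) / SQRT2, (a - d) / SQRT2)]
    return x


def chaotic_key(n, x0=0.3141, r=3.99):
    """Key bits from a logistic-map orbit (assumed map, not from the paper)."""
    bits, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        bits.append(1 if x > 0.5 else 0)
    return bits


def encrypt(bits, key):
    """Step 3, first half: XOR with the key (assumed cipher); also decrypts."""
    return [b ^ k for b, k in zip(bits, key)]


def embed_bit(frame, bit, a):
    """Steps 2-4 for one frame: centre ca3, shift it by +a or -a, invert."""
    ca3, details = haar_forward(frame)
    mean = sum(ca3) / len(ca3)
    shift = a if bit == 1 else -a
    return haar_inverse([c - mean + shift for c in ca3], details)
```

With these frame parameters, a 15-second clip at 44.1 kHz (661,500 samples) yields 1288 complete frames, which is enough for the 1024 bits of a 32 by 32 logo.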

**Figure 1. The original watermark: a binary logo image**

The detection algorithm is straightforward and blind: it does not require the original audio signal. For each segmented frame, if the mean of the wavelet coefficients at the coarsest approximation subband is greater than zero, a bit '1' is extracted; if the mean is less than zero, a bit '0' is extracted. This process is repeated until all watermark bits are detected. Finally, the detected bits are decrypted and rearranged to form the extracted watermark image.
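A minimal sketch of this sign-based detection is shown below. It assumes the Haar basis, XOR decryption with the same key bits used at embedding, and row-major image layout; none of these details are specified in the text, and the function names are hypothetical. The text leaves a mean of exactly zero undefined, so the sketch arbitrarily maps it to '0'.

```python
import math


def ca3_mean(frame, levels=3):
    """Mean of the Haar approximation coefficients at the coarsest level."""
    approx = list(frame)
    for _ in range(levels):
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2.0)
                  for i in range(0, len(approx), 2)]
    return sum(approx) / len(approx)


def detect_bits(frames):
    """One bit per frame: 1 if the ca3 mean is positive, otherwise 0."""
    return [1 if ca3_mean(frame) > 0 else 0 for frame in frames]


def decrypt(bits, key):
    """Undo an XOR encryption (assumed cipher) with the same key bits."""
    return [b ^ k for b, k in zip(bits, key)]


def to_image(bits, width=32):
    """Rearrange a flat bit list into rows of `width` pixels (row-major)."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]
```

Because the decision depends only on the sign of one statistic per frame, no threshold has to be tuned and neither the host audio nor the original watermark is needed.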

The algorithm was tested on a set of audio signals including pop, saxophone, rock, piano, electronic organ, guitar, and violin. Each piece is 15 seconds long, mono, 16 bits/sample, and sampled at 44.1 kHz. Figure 2 shows the waveforms of the original and watermarked piano music along with their difference.

**Figure 2. The original and watermarked piano waveform and the difference between them**

The test conditions and results are listed in Table 1. Randomly cropping samples causes a disastrous synchronization problem for most time-domain or spread-spectrum watermarking methods. Our approach, however, is rather insensitive to synchronization because adjacent audio samples or small blocks are highly correlated. Even if several thousand samples are cropped at random positions, the mean of the wavelet coefficients at the coarsest approximation subband is not seriously affected; in particular, the sign of the mean is preserved in most frames, so the binary watermark image can still be extracted and identified. Pitch-invariant time-scale modification was also tested on the watermarked audio, but the approach has only limited resistance to this type of attack, up to +2% or -2%. Repetition coding is expected to improve the result.
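For illustration, the bit error rate reported in Table 1 and the repetition coding suggested above can be sketched as follows. The repetition factor of 3 and the majority-vote decoder are assumptions chosen for the example; the text does not say which repetition scheme would be used, and the function names are hypothetical.

```python
def bit_error_rate(original, extracted):
    """Percentage of positions where the extracted bit differs."""
    if len(original) != len(extracted):
        raise ValueError("bit sequences must have equal length")
    errors = sum(1 for o, e in zip(original, extracted) if o != e)
    return 100.0 * errors / len(original)


def repeat_encode(bits, times=3):
    """Repeat every watermark bit `times` times before embedding."""
    return [b for b in bits for _ in range(times)]


def majority_decode(received, times=3):
    """Recover each bit by majority vote over its `times` copies."""
    decoded = []
    for i in range(0, len(received), times):
        group = received[i:i + times]
        decoded.append(1 if 2 * sum(group) > len(group) else 0)
    return decoded
```

The trade-off is capacity: repeating each bit k times needs k times as many frames, and therefore a longer host signal, for the same watermark.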

- T. Muntean, E. Grivel, I. Nafornita, and M. Najim. Audio Digital Watermarking Based on Hybrid Spread Spectrum. IEEE Wedel Music, 2002.
- D. Kirovski and H. S. Malvar. Robust Spread-Spectrum Audio Watermarking. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1345-1348, 2001.
- George Tzanetakis, Georg Essl, and Perry Cook. Audio Analysis using the Discrete Wavelet Transform. Int. Conf. Acoustics and Music: Theory and Applications, 2001.
- Michael Peter Kennedy. Digital communication using chaos. Signal Processing 80, pp. 1307-1320, 2000.

**Table 1. BER, Similarity, and Extracted Watermark for Piano**

| No | Type of attack | BER (%) | Sim | Extracted Image |
|---|---|---|---|---|
| 1 | MP3 (wav->mp3->wav, 64kbps, 11:1) | 5.73 | 0.9662 | [image] |
| 2 | Echo (delay: 100ms, decay: 40%) | 12.85 | 0.9260 | [image] |
| 3 | Noise (clearly audible) | 14.93 | 0.9119 | [image] |
| 4 | Equalization ('BaseBoost' of GoldWave) | 9.72 | 0.9424 | [image] |
| 5 | Lowpass (cutoff frequency: 4 kHz) | 15.10 | 0.9105 | [image] |
| 6 | Resample (44100->22050->44100 Hz) | 14.93 | 0.9116 | [image] |
| 7 | Crop1 (crop 500 samples at 5 random positions) | 16.5 | 0.9033 | [image] |
| 8 | Crop2 (crop 1000 samples at 10 random positions) | 18.06 | 0.8931 | [image] |
| 9 | Crop3 (crop 5000 samples at 10 random positions) | 26.86 | 0.8431 | [image] |
| 10 | Crop4 (crop 10000 samples at 10 random positions) | 26.27 | 0.8453 | [image] |
| 11 | Jittering1 (crop one sample every 100 samples) | 24.80 | 0.8552 | [image] |
| 12 | Jittering2 (crop one sample every 500 samples) | 16.50 | 0.9038 | [image] |
| 13 | Time Scale Modification (pitch reserved, +2%) | 24.61 | 0.8553 | [image] |
| 14 | Time Scale Modification (pitch reserved, -2%) | 25.00 | 0.8465 | [image] |
| 15 | Pitch Shift (tempo reserved +10%) | 16.89 | 0.9013 | [image] |
| 16 | Pitch Shift (tempo reserved -10%) | 19.92 | 0.8840 | [image] |