Lab Overview

Learning Objectives

After completing this lab, students will be able to:

  1. Apply Fourier Transform to real audio signals for frequency analysis
  2. Implement a simplified version of Shazam's audio fingerprinting algorithm
  3. Understand how spectrograms are used in audio recognition systems
  4. Generate and compare audio fingerprints using peak extraction
  5. Evaluate the robustness of fingerprinting techniques to noise

Background

Shazam's audio recognition technology relies on the Fast Fourier Transform (FFT) to convert time-domain audio signals into frequency-domain representations. By identifying unique patterns in the frequency spectrum (audio fingerprints), Shazam can match short audio samples against a database of millions of songs.

In this lab, you will implement the core components of this system, focusing on the signal processing aspects relevant to electrical engineering.

Pre-Lab Preparation

Complete these tasks before the lab session:

  1. Review Fourier Transform theory and properties
  2. Understand the difference between the DFT and FFT algorithms (a short self-check sketch follows this list)
  3. Install Python with NumPy, SciPy, and Matplotlib libraries
  4. Download the sample audio files provided for the lab
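
For item 2, a quick self-check: the naive O(N²) DFT below computes the same transform as NumPy's FFT, which needs only O(N log N) operations.

import numpy as np

# Naive O(N^2) DFT: build the full DFT matrix and multiply
def naive_dft(x):
    n = len(x)
    k = np.arange(n)
    W = np.exp(-2j * np.pi * np.outer(k, k) / n)  # W[k, m] = e^(-2j*pi*k*m/n)
    return W @ x

x = np.random.randn(256)
print(np.allclose(naive_dft(x), np.fft.fft(x)))  # True: identical results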

Pre-Lab Questions

1. Explain why frequency-domain analysis (using FFT) is more effective than time-domain analysis for audio fingerprinting.

2. What is the purpose of applying a window function (like Hann or Hamming) before performing FFT on audio signals?

3. Calculate the frequency resolution of an FFT with N=4096 points for audio sampled at 44.1 kHz.

Frequency resolution: Δf = Fs / N

where Fs is the sampling frequency (Hz) and N is the FFT size.
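
As a worked example with different numbers: audio sampled at 8 kHz analyzed with N = 1024 gives Δf = 8000 / 1024 ≈ 7.8 Hz, so each FFT bin spans about 7.8 Hz.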

Lab Procedure

Part 1: Audio Signal Generation

First, we'll generate synthetic audio signals to understand the FFT process. Create a Python script with the following functions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

# Generate a test audio signal with multiple frequencies
def generate_test_signal(duration=2, fs=44100):
    t = np.linspace(0, duration, int(fs * duration), endpoint=False)
    # Create signal with three frequency components
    freqs = [440, 880, 1320] # A4, A5, E6
    signal = np.zeros_like(t)
    for f in freqs:
        signal += 0.5 * np.sin(2 * np.pi * f * t)
    return t, signal, fs

# Add white noise to simulate real recording conditions
def add_noise(signal, snr_db=20):
    signal_power = np.mean(signal**2)
    noise_power = signal_power / (10**(snr_db/10))
    noise = np.random.normal(0, np.sqrt(noise_power), len(signal))
    return signal + noise
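
A quick way to exercise these functions, and to save the noisy signal for later parts, is sketched below (the filename is an arbitrary choice):

# Generate the test signal, degrade it, and write it to disk
t, sig, fs = generate_test_signal()
noisy = add_noise(sig, snr_db=20)
scaled = (noisy / np.max(np.abs(noisy)) * 32767).astype(np.int16)  # 16-bit PCM
wavfile.write('test_signal.wav', fs, scaled)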

Part 2: FFT Implementation & Analysis

Implement FFT calculation and analyze the frequency components of the audio signal.

# Compute FFT and generate frequency axis
def compute_fft(signal, fs, apply_window=True):
    n = len(signal)
    # Apply Hann window to reduce spectral leakage
    if apply_window:
        window = np.hanning(n)
        signal = signal * window
    
    # Compute FFT
    fft_result = np.fft.fft(signal)
    fft_magnitude = np.abs(fft_result[:n//2])
    fft_freq = np.fft.fftfreq(n, 1/fs)[:n//2]
    return fft_freq, fft_magnitude

# Identify frequency peaks (simplified Shazam approach)
def find_peaks(frequencies, magnitude, threshold=0.1, min_distance=5):
    peaks = []
    max_mag = np.max(magnitude)
    last_peak = -min_distance  # index of the most recently accepted peak
    for i in range(1, len(magnitude)-1):
        if (magnitude[i] > magnitude[i-1] and
            magnitude[i] > magnitude[i+1] and
            magnitude[i] > threshold * max_mag and
            i - last_peak >= min_distance):  # enforce minimum bin spacing
            peaks.append((frequencies[i], magnitude[i]))
            last_peak = i
    return peaks
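
Running the analysis on the clean test signal should recover its three components; with the 2 s signal the frequency resolution is 0.5 Hz, so the peaks land essentially on the component frequencies:

t, sig, fs = generate_test_signal()
freqs, mags = compute_fft(sig, fs)
for f, m in find_peaks(freqs, mags):
    print(f"peak at {f:.1f} Hz")  # expect ~440, ~880, ~1320 Hz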

Part 3: Spectrogram Generation

Create a spectrogram: a time-frequency representation essential for audio fingerprinting.

# Generate spectrogram using Short-Time Fourier Transform (STFT)
def generate_spectrogram(signal, fs, window_size=1024, hop_size=512):
    n_windows = (len(signal) - window_size) // hop_size + 1
    spectrogram = np.zeros((window_size//2, n_windows))
    window = np.hanning(window_size)  # build the Hann window once, reuse per frame
    
    for i in range(n_windows):
        start = i * hop_size
        segment = signal[start:start + window_size] * window
        
        # Keep the magnitude of the positive-frequency half of the FFT
        fft_result = np.fft.fft(segment)[:window_size//2]
        spectrogram[:, i] = np.abs(fft_result)
    
    time_axis = np.arange(n_windows) * hop_size / fs
    freq_axis = np.fft.fftfreq(window_size, 1/fs)[:window_size//2]
    return time_axis, freq_axis, spectrogram

Note: The spectrogram is a 2D representation with time on the x-axis and frequency on the y-axis. Color intensity represents magnitude at each time-frequency point.
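
One way to visualize the spectrogram with Matplotlib is sketched below; plotting in decibels makes weaker components visible (the small offset avoids taking the log of zero):

t, sig, fs = generate_test_signal()
time_axis, freq_axis, spec = generate_spectrogram(sig, fs)
plt.pcolormesh(time_axis, freq_axis, 20 * np.log10(spec + 1e-10), shading='auto')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.colorbar(label='Magnitude (dB)')
plt.show()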


Part 4: Audio Fingerprint Generation

Implement the core Shazam fingerprinting algorithm by identifying peak constellations in the spectrogram.

# Find peaks in spectrogram (Shazam's approach)
def find_spectrogram_peaks(spectrogram, time_axis, freq_axis, threshold=0.3):
    peaks = []
    max_val = np.max(spectrogram)
    rows, cols = spectrogram.shape
    
    for t in range(1, cols-1):
        for f in range(1, rows-1):
            val = spectrogram[f, t]
            # Check if it's a local maximum
            if (val > threshold * max_val and
                val > spectrogram[f-1, t] and
                val > spectrogram[f+1, t] and
                val > spectrogram[f, t-1] and
                val > spectrogram[f, t+1]):
                peaks.append((time_axis[t], freq_axis[f], val))
    return peaks

# Create fingerprint hashes from peak pairs (simplified)
def create_fingerprint_hashes(peaks, max_time_diff=1.0, max_freq_diff=500):
    hashes = []
    n = len(peaks)
    for i in range(n):
        t1, f1, m1 = peaks[i]
        for j in range(i+1, min(i+5, n)): # Limit pairs for efficiency
            t2, f2, m2 = peaks[j]
            time_diff = t2 - t1
            freq_diff = abs(f2 - f1)
            # Only pair peaks that are close in time and frequency
            if time_diff > max_time_diff or freq_diff > max_freq_diff:
                continue
            # Hash the pair: anchor frequency, target frequency, time offset
            hash_val = hash((int(f1), int(f2), int(time_diff*1000)))
            hashes.append(hash_val)
    return hashes

Generated Fingerprint Hashes

These hash values represent unique features of the audio signal.
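
A minimal end-to-end sketch that produces such hashes from the synthetic signal (the exact values will vary from run to run because of the random noise):

t, sig, fs = generate_test_signal()
noisy = add_noise(sig, snr_db=20)
time_axis, freq_axis, spec = generate_spectrogram(noisy, fs)
peaks = find_spectrogram_peaks(spec, time_axis, freq_axis)
hashes = create_fingerprint_hashes(peaks)
print(f"{len(peaks)} peaks -> {len(hashes)} hashes")
print(hashes[:5])  # a sample of the generated hash values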

Key Concept: Shazam stores these hashes in a database. When you record audio, it generates similar hashes and looks for matches in the database. The matching process is efficient because it compares hashes rather than the full audio signal.
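
As a rough illustration of that lookup step, the toy in-memory index below stands in for the real database; this is a sketch of the idea, not Shazam's actual implementation:

from collections import Counter

# Toy index: each hash maps to the IDs of the songs that contain it
def build_index(song_hashes):
    index = {}
    for song_id, hashes in song_hashes.items():
        for h in hashes:
            index.setdefault(h, []).append(song_id)
    return index

# Count how many sample hashes appear under each song;
# the song with the most matches is the best candidate
def match_sample(sample_hashes, index):
    votes = Counter()
    for h in sample_hashes:
        for song_id in index.get(h, []):
            votes[song_id] += 1
    return votes.most_common(1)[0] if votes else None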

Data Analysis & Results

Analysis Questions

1. How does the window size affect the spectrogram? Discuss the trade-off between time resolution and frequency resolution.

2. What happens to the fingerprint when you add noise to the signal? Test with different SNR values.

3. How many unique fingerprint hashes were generated from your test signal? How might this scale for a full song?

Experimental Results

Test Condition            Peaks Found    Fingerprint Hashes    Computation Time (ms)
Clean Signal              -              -                     -
With Noise (SNR = 20 dB)  -              -                     -
Different Window Size     -              -                     -