Do not use for text-based phishing (email/SMS); use email header analysis or URL detonation tools instead.
Normalize and prepare audio samples for feature extraction:
```python
import librosa
import numpy as np

# Load audio, resample to 16 kHz mono
y, sr = librosa.load("suspect_call.wav", sr=16000, mono=True)

# Trim silence from beginning and end
y_trimmed, _ = librosa.effects.trim(y, top_db=25)

# Peak-normalize amplitude to [-1, 1] (guard against an all-silent clip)
peak = np.max(np.abs(y_trimmed))
y_norm = y_trimmed / peak if peak > 0 else y_trimmed
```
Audio preprocessing ensures consistent feature extraction across different recording conditions, microphones, and codec artifacts.
Extract the feature set that distinguishes real from synthetic speech:
Mel-Frequency Cepstral Coefficients (MFCCs):
```python
# Extract 20 MFCCs plus delta and delta-delta features
mfccs = librosa.feature.mfcc(y=y_norm, sr=sr, n_mfcc=20)
mfcc_delta = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)
```
MFCCs capture the spectral envelope of speech, representing how the vocal tract shapes sound. Deepfake audio often shows unnatural smoothness in higher-order MFCCs because neural vocoders approximate but do not perfectly replicate the acoustic resonance of a physical vocal tract.
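That smoothness can be quantified directly. A minimal sketch (the function name, the coefficient cutoff, and the demo matrices are illustrative, not from the original workflow): compute the per-coefficient standard deviation over the higher-order MFCC rows and compare against a genuine-speech baseline. Random matrices stand in for real MFCC output here.

```python
import numpy as np

def high_order_mfcc_variability(mfccs: np.ndarray, start_coeff: int = 12) -> float:
    """Mean per-coefficient standard deviation over the higher-order MFCC
    rows (start_coeff and above). Low values suggest the unnatural
    smoothness often seen in neural vocoder output."""
    return float(np.mean(np.std(mfccs[start_coeff:], axis=1)))

# Illustrative only: a noisy matrix stands in for genuine speech,
# a heavily smoothed low-amplitude one for synthetic speech.
rng = np.random.default_rng(0)
genuine = rng.normal(size=(20, 200))
synthetic = np.repeat(rng.normal(size=(20, 20)), 10, axis=1) * 0.3

print(high_order_mfcc_variability(genuine))    # noticeably higher
print(high_order_mfcc_variability(synthetic))  # noticeably lower
```

In practice the baseline would come from known-genuine recordings of the same speaker under similar conditions, not from synthetic matrices.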
Spectral Features:
```python
spectral_centroid = librosa.feature.spectral_centroid(y=y_norm, sr=sr)
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y_norm, sr=sr)
spectral_contrast = librosa.feature.spectral_contrast(y=y_norm, sr=sr)
spectral_rolloff = librosa.feature.spectral_rolloff(y=y_norm, sr=sr)
zero_crossing_rate = librosa.feature.zero_crossing_rate(y_norm)
```
Key indicators of deepfake audio:
- Unnaturally low variance in the higher-order MFCCs (overly smooth spectral envelope)
- Reduced spectral contrast in the mid-to-high frequency bands
- Abnormally low zero-crossing-rate variability
- A hard energy cutoff below the Nyquist frequency, at the vocoder's ceiling
Aggregate frame-level features into a fixed-length vector and classify:
```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def build_feature_vector(y, sr):
    """Summarize frame-level features into a fixed-length vector."""
    features = []
    # MFCC summary statistics (mean/std/min/max per coefficient)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    for coeff in mfccs:
        features.extend([np.mean(coeff), np.std(coeff), np.min(coeff), np.max(coeff)])
    # Spectral feature summary statistics
    for feat in [librosa.feature.spectral_centroid(y=y, sr=sr),
                 librosa.feature.spectral_bandwidth(y=y, sr=sr),
                 librosa.feature.spectral_rolloff(y=y, sr=sr),
                 librosa.feature.zero_crossing_rate(y)]:
        features.extend([np.mean(feat), np.std(feat), np.min(feat), np.max(feat)])
    # Spectral contrast: mean/std per frequency sub-band
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    for band in contrast:
        features.extend([np.mean(band), np.std(band)])
    return np.array(features)
```
Classification uses an ensemble approach: Random Forest for robustness and Gradient Boosting for accuracy, with a voting mechanism to reduce false positives.
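The voting mechanism can be sketched with scikit-learn's `VotingClassifier` in soft-voting mode, which averages the two models' class probabilities (matching the RF/GBT average shown in the report below). The training data here is synthetic for illustration; in practice `X` would hold rows from `build_feature_vector()` with genuine/deepfake labels.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for labeled training data:
# 0 = genuine, 1 = deepfake (labels follow a simple rule for the demo).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbt", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across the two models
)
scores = cross_val_score(ensemble, X, y, cv=5)
print(f"cross-val accuracy: {scores.mean():.2f}")
```

Soft voting is preferred here over hard voting because the averaged probability doubles as the confidence score reported to analysts.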
Examine time-domain artifacts that neural vocoders leave behind:
```python
# Pitch stability analysis: deepfakes often have an unnaturally stable F0
f0, voiced_flag, voiced_probs = librosa.pyin(y_norm, fmin=50, fmax=500, sr=sr)
f0_clean = f0[~np.isnan(f0)]
pitch_std = np.std(f0_clean) if len(f0_clean) > 0 else 0.0
pitch_jitter = np.mean(np.abs(np.diff(f0_clean))) if len(f0_clean) > 1 else 0.0
```
Real human speech exhibits natural pitch jitter (micro-variations in fundamental frequency) and shimmer (amplitude perturbations). Deepfake audio generated by Tacotron 2, VALL-E, or ElevenLabs typically shows reduced jitter and shimmer compared to genuine speech.
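librosa has no built-in shimmer measure. One rough proxy, assumed here purely for illustration (true shimmer is measured cycle-to-cycle at the glottal period, e.g. with Praat), is the mean relative change in short-frame RMS amplitude:

```python
import numpy as np

def rms_shimmer(y: np.ndarray, frame_len: int = 400, hop: int = 160) -> float:
    """Rough shimmer proxy: mean relative change in frame RMS amplitude.
    This frame-level approximation is an assumption, not a standard
    clinical shimmer measure."""
    n = 1 + (len(y) - frame_len) // hop
    rms = np.array([np.sqrt(np.mean(y[i*hop:i*hop+frame_len]**2)) for i in range(n)])
    rms = rms[rms > 1e-8]  # keep non-silent frames only
    return float(np.mean(np.abs(np.diff(rms)) / rms[:-1]))

# Demo: a tone with natural-looking amplitude wobble vs. a perfectly flat tone.
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(1)
wobble = 1 + 0.1 * rng.standard_normal(t.size).cumsum() / np.sqrt(t.size)
natural = np.sin(2 * np.pi * 150 * t) * wobble
flat = np.sin(2 * np.pi * 150 * t)
print(rms_shimmer(natural), rms_shimmer(flat))  # wobbly tone scores higher
```

As with jitter, the absolute value matters less than the comparison against genuine recordings of the same speaker.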
Generate spectrograms for manual forensic review:
```python
import librosa.display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y_norm, sr=sr))
librosa.display.specshow(mel_db, sr=sr, ax=axes[0], x_axis='time', y_axis='mel')
axes[0].set_title('Mel Spectrogram')
librosa.display.specshow(mfccs, sr=sr, ax=axes[1], x_axis='time')
axes[1].set_title('MFCCs')
plt.tight_layout()
```
Visual inspection reveals banding artifacts in mel spectrograms, unnatural energy cutoffs above the vocoder's frequency ceiling, and periodic noise patterns in the high-frequency range that are characteristic of neural speech synthesis.
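The energy cutoff can also be estimated programmatically rather than by eye. A sketch in plain NumPy (the -60 dB threshold and frame size are assumptions, and a codec low-pass can trigger the same signature): average the framed magnitude spectrum, then report the highest frequency still within the threshold of the spectral peak.

```python
import numpy as np

def energy_ceiling_hz(y, sr, frame=1024, thresh_db=-60.0):
    """Estimate the highest frequency that still carries signal energy:
    average the windowed magnitude spectrum over frames, then find the
    last bin within `thresh_db` of the spectrum's peak. A ceiling well
    below Nyquist can indicate a band-limited vocoder (or a codec)."""
    n = len(y) // frame
    frames = y[: n * frame].reshape(n, frame) * np.hanning(frame)
    spec = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    db = 20 * np.log10(spec / spec.max() + 1e-12)
    active = np.nonzero(db > thresh_db)[0]
    return active[-1] * sr / frame

# Demo: white noise hard band-limited at 6 kHz should show a ~6 kHz ceiling.
sr = 16000
rng = np.random.default_rng(0)
noise = rng.standard_normal(sr * 2)
spectrum = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(noise.size, 1 / sr)
spectrum[freqs > 6000] = 0          # zero out everything above 6 kHz
bandlimited = np.fft.irfft(spectrum, n=noise.size)
print(energy_ceiling_hz(bandlimited, sr))  # close to 6000 Hz
```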
Compile findings into an actionable report:
```
DEEPFAKE AUDIO ANALYSIS REPORT
================================
File: suspect_executive_call.wav
Duration: 47.3 seconds
Sample Rate: 16000 Hz
Analysis Date: 2026-03-19

CLASSIFICATION RESULT
Verdict: LIKELY DEEPFAKE (confidence: 94.2%)
Ensemble Score: RF=0.91, GBT=0.97, Avg=0.94

FEATURE ANOMALIES DETECTED
- MFCC variance in coefficients 13-20: 62% below genuine baseline
- Spectral contrast (4-8 kHz): 0.23 (genuine avg: 0.41)
- Pitch jitter: 0.8 Hz (genuine avg: 2.4 Hz)
- Zero-crossing rate std: 0.003 (genuine avg: 0.011)

SPECTROGRAM ARTIFACTS
- Energy cutoff above 7.8 kHz (consistent with neural vocoder ceiling)
- Banding pattern at 50ms intervals in mel spectrogram
- Missing formant transitions at 12.4s, 23.1s, 35.7s timestamps

RECOMMENDATION
High confidence of AI-generated audio. Recommend out-of-band
verification with the purported speaker. Preserve original audio
file with chain of custody documentation for potential legal action.
```
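A hypothetical helper (not part of the original toolchain; the function name and parameters are illustrative) can assemble this report from the computed values, so the template stays consistent across cases:

```python
def build_report(filename, duration_s, sr, rf_score, gbt_score, anomalies):
    """Format classifier scores and anomaly notes into the report template."""
    avg = (rf_score + gbt_score) / 2
    verdict = "LIKELY DEEPFAKE" if avg >= 0.5 else "LIKELY GENUINE"
    lines = [
        "DEEPFAKE AUDIO ANALYSIS REPORT",
        "=" * 32,
        f"File: {filename}",
        f"Duration: {duration_s:.1f} seconds",
        f"Sample Rate: {sr} Hz",
        "",
        "CLASSIFICATION RESULT",
        f"Verdict: {verdict} (confidence: {avg:.1%})",
        f"Ensemble Score: RF={rf_score:.2f}, GBT={gbt_score:.2f}, Avg={avg:.2f}",
        "",
        "FEATURE ANOMALIES DETECTED",
        *[f"- {a}" for a in anomalies],
    ]
    return "\n".join(lines)

report = build_report("suspect_executive_call.wav", 47.3, 16000,
                      0.91, 0.97, ["Pitch jitter below genuine baseline"])
print(report)
```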
| Term | Definition |
|---|---|
| MFCC | Mel-Frequency Cepstral Coefficients; representation of the short-term power spectrum on a mel (perceptual) frequency scale |
| Spectral Centroid | Weighted mean of frequencies present in the signal; indicates perceived brightness of a sound |
| Spectral Contrast | Difference in amplitude between peaks and valleys in the spectrum across frequency sub-bands |
| Vocoder | Signal processing component that synthesizes audio waveforms from acoustic features; used in TTS and voice cloning |
| Pitch Jitter | Cycle-to-cycle variation in fundamental frequency; natural in human speech, reduced in synthetic speech |
| Vishing | Voice phishing; social engineering attack conducted via phone calls, increasingly using AI-cloned voices |
| Formant | Resonant frequencies of the vocal tract that define vowel sounds; transitions between formants are difficult for AI to replicate perfectly |
Context: CFO receives a phone call appearing to be from the CEO requesting an urgent wire transfer of $2.3M. The call came from an unknown number but the voice sounded identical to the CEO. IT security was able to obtain a recording of the call from the phone system.
Approach:
1. Normalize, resample, and trim the recording, then extract MFCC, spectral, and prosodic features.
2. Classify with the Random Forest / Gradient Boosting ensemble.
3. Check pitch jitter and shimmer, and inspect spectrograms for vocoder artifacts.
4. Report findings and verify out-of-band with the purported speaker before acting on the request.

Pitfalls:
- This workflow applies only to voice recordings; text-based phishing (email/SMS) needs different tooling.
- Phone codecs and recording conditions introduce their own artifacts, so compare against genuine baselines captured under similar conditions.
- A high ensemble score is probabilistic evidence, not proof; preserve the original file with chain-of-custody documentation.