Audio Quality Metrics: A Comprehensive Reference
Introduction
Audio quality evaluation spans multiple domains, each with specialized metrics designed for specific acoustic scenarios. This living reference consolidates metrics across speech processing, music analysis, spatial audio, and environmental soundscapes.
Last updated: January 14, 2025
Status: 🟢 Actively maintained
Speech Level Variation
Metrics for assessing speech level consistency and dynamics.
Active Speech Level (ASL)
Description: Measures the active speech level excluding pauses and silent segments.
How it works: Applies voice activity detection (VAD) to isolate speech regions, then calculates RMS level of active segments.
Formula: \(\text{ASL} = 10 \log_{10} \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right)\) where \(x_i\) are active speech samples.
Libraries:
- Python: pydub, librosa
- MATLAB: Audio Toolbox
References:
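A minimal numpy sketch of the idea, using a fixed energy threshold as a stand-in for a real VAD (the standardized method, ITU-T P.56, uses an adaptive envelope-based activity detector instead; the function name here is illustrative):

```python
import numpy as np

def active_speech_level(x, frame_len=400, threshold_db=-40.0):
    """Rough ASL estimate: RMS level (dB re full scale) over frames
    whose energy exceeds a fixed threshold. A production implementation
    should use a proper VAD (e.g. the ITU-T P.56 activity detector)."""
    n_frames = len(x) // frame_len
    frames = np.asarray(x, dtype=float)[: n_frames * frame_len].reshape(n_frames, frame_len)
    frame_power = np.mean(frames**2, axis=1)
    frame_db = 10 * np.log10(frame_power + 1e-12)
    active = frames[frame_db > threshold_db]   # keep only "speech" frames
    if active.size == 0:
        return -np.inf
    return 10 * np.log10(np.mean(active**2))
```

Because silent frames are excluded, the ASL of a half-silent recording sits about 3 dB above its overall RMS level.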
Loudness (ITU-R BS.1770)
Description: Perceptually weighted loudness measurement for broadcast audio.
How it works: Applies K-weighting filter to approximate human loudness perception, integrates over time.
Libraries:
- Python: pyloudnorm
- C++: libebur128
Datasets:
References:
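An ungated sketch of the K-weighted measurement, assuming mono input at 48 kHz; the biquad coefficients below are the values published in ITU-R BS.1770 for 48 kHz, and the full standard additionally applies 400 ms blocking with absolute and relative gating, omitted here:

```python
import numpy as np
from scipy.signal import lfilter

# K-weighting for fs = 48 kHz, per ITU-R BS.1770:
# stage 1 models the acoustic effect of the head (high shelf),
# stage 2 is the RLB weighting (high-pass).
SHELF_B = [1.53512485958697, -2.69169618940638, 1.19839281085285]
SHELF_A = [1.0, -1.69065929318241, 0.73248077421585]
HPF_B = [1.0, -2.0, 1.0]
HPF_A = [1.0, -1.99004745483398, 0.99007225036621]

def ungated_loudness(x):
    """Ungated K-weighted loudness (LUFS) of a mono 48 kHz signal.
    The -0.691 constant cancels the filter gain at 997 Hz."""
    y = lfilter(SHELF_B, SHELF_A, np.asarray(x, dtype=float))
    z = lfilter(HPF_B, HPF_A, y)
    return -0.691 + 10 * np.log10(np.mean(z**2))
```

For production use, pyloudnorm implements the complete gated measurement: pyln.Meter(rate).integrated_loudness(data). A full-scale 997 Hz sine should read close to -3.01 LUFS.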
Overall Audio Quality
Broad metrics for general audio fidelity.
Signal-to-Noise Ratio (SNR)
Description: Ratio of signal power to noise power, expressed in dB.
How it works: \(\text{SNR} = 10 \log_{10} \left( \frac{P_{\text{signal}}}{P_{\text{noise}}} \right)\)
Limitations: Does not correlate well with perceptual quality.
Libraries:
- Python: scipy.signal, numpy
- MATLAB: built-in snr() function
References:
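The formula above translates directly to a few lines of numpy, assuming separate recordings (or estimates) of the signal and the noise are available; the function name is illustrative:

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB from separate signal and noise arrays."""
    p_signal = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10 * np.log10(p_signal / p_noise)
```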
Perceptual Evaluation of Audio Quality (PEAQ)
Description: ITU standard for objective audio quality measurement, designed for codec evaluation.
How it works: Psychoacoustic model comparing reference and degraded signals across frequency bands.
Libraries:
- C: GstPEAQ (GStreamer plugin)
- MATLAB: Third-party implementations available
Datasets:
References:
Speech Quality
Metrics specifically for telephony and VoIP.
Perceptual Evaluation of Speech Quality (PESQ)
Description: ITU standard for predicting speech quality in telecom networks.
How it works: Time-aligned comparison of reference and degraded signals through a perceptual model. Output: a raw PESQ score from -0.5 to 4.5, usually mapped to MOS-LQO (roughly 1.0 to 4.5) via ITU-T P.862.1.
Libraries:
- Python: pesq (via pip)
- C: Official ITU implementation
Datasets:
References:
- ITU-T P.862: Perceptual evaluation of speech quality (PESQ)
- Rix, A.W., et al. (2001). “Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs”
Perceptual Objective Listening Quality Assessment (POLQA)
Description: Successor to PESQ, supporting wideband and super-wideband speech.
How it works: Advanced perceptual model with improved handling of time warping and codec artifacts.
Libraries:
- Commercial: POLQA by OPTICOM
- Python: Limited open-source implementations
References:
Speech Enhancement
Metrics for evaluating noise suppression and enhancement algorithms.
Short-Time Objective Intelligibility (STOI)
Description: Predicts speech intelligibility in noisy conditions.
How it works: Correlates time-frequency representations of clean and processed speech.
Libraries:
- Python: pystoi
- MATLAB: STOI Toolbox
Datasets:
References:
- Taal, C.H., et al. (2011). “An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech”
Perceptual Contrast Using Spectrograms (PCSS)
Description: Measures perceptual contrast enhancement in processed speech spectrograms.
How it works: Computes contrast ratio in time-frequency domain weighted by auditory masking.
Libraries:
- Custom implementations (research-specific)
References:
- Healy, E.W., et al. (2013). “An algorithm to increase speech intelligibility for hearing-impaired listeners”
DNSMOS (Deep Noise Suppression MOS)
Description: Deep learning-based predictor of subjective MOS for noise suppression systems.
How it works: Neural network trained on large-scale listening tests to predict MOS directly from audio.
Libraries:
- Python: DNSMOS by Microsoft
Datasets:
References:
- Reddy, C.K.A., et al. (2021). “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric”
Speech Intelligibility
Metrics correlating with human speech understanding.
Speech Intelligibility Index (SII)
Description: ANSI standard for predicting speech intelligibility based on audibility.
How it works: Weights audible speech bands according to their importance for intelligibility.
Libraries:
- MATLAB: SII Toolbox
References:
Extended Short-Time Objective Intelligibility (ESTOI)
Description: Extension of STOI with improved prediction accuracy for speech degraded by highly modulated noise maskers and non-linear processing.
How it works: Applies intermediate intelligibility measure to improve correlation with subjective scores.
Libraries:
- Python: pystoi (includes ESTOI)
- MATLAB: ESTOI Toolbox
References:
- Jensen, J., & Taal, C.H. (2016). “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers”
Speech in Reverberation
Metrics for reverberant environments.
Speech-to-Reverberation Modulation Energy Ratio (SRMR)
Description: Non-intrusive metric estimating intelligibility degradation due to reverberation.
How it works: Analyzes modulation spectrum energy ratio across frequency bands.
Libraries:
- MATLAB: SRMR Toolbox
Datasets:
References:
- Falk, T.H., et al. (2010). “A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech”
Room Acoustics Quality
Metrics related to spatial and architectural acoustics.
Reverberation Time (RT60)
Description: Time for sound to decay by 60 dB after source stops.
How it works: Measures decay slope of impulse response in frequency bands.
Libraries:
- Python: pyroomacoustics
- MATLAB: ITA-Toolbox
Datasets:
References:
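A compact numpy sketch of the standard estimation procedure (ISO 3382-1 style): Schroeder backward integration of the impulse response, a linear fit of the decay between -5 and -25 dB (a T20 measurement), then extrapolation to 60 dB. The function name and the two-point slope fit are simplifications; a least-squares fit over the whole range is more robust:

```python
import numpy as np

def rt60_from_ir(ir, fs):
    """Estimate RT60 from an impulse response via Schroeder
    backward integration and a T20 slope extrapolated to -60 dB."""
    energy = np.asarray(ir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]          # Schroeder integral
    edc_db = 10 * np.log10(edc / edc[0])
    i_hi = np.argmax(edc_db <= -5.0)             # first sample below -5 dB
    i_lo = np.argmax(edc_db <= -25.0)            # first sample below -25 dB
    slope = (edc_db[i_lo] - edc_db[i_hi]) / ((i_lo - i_hi) / fs)  # dB/s
    return -60.0 / slope
```

pyroomacoustics ships a measure_rt60 helper (in pyroomacoustics.experimental) that performs this measurement on real impulse responses.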
Clarity (C50, C80)
Description: Ratio of early to late arriving sound energy.
How it works: \(C_{50} = 10 \log_{10} \left( \frac{\int_0^{50ms} p^2(t) dt}{\int_{50ms}^{\infty} p^2(t) dt} \right)\)
Applications: Speech clarity (C50), music clarity (C80).
Libraries:
- Python: pyroomacoustics
References:
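The C50/C80 integrals reduce to two energy sums over the impulse response once it is sampled; a sketch (function name illustrative, split point parameterized so the same code gives C50 and C80):

```python
import numpy as np

def clarity_db(ir, fs, split_ms=50.0):
    """Clarity index from an impulse response: early-to-late energy
    ratio in dB. Use split_ms=50 for C50 (speech), 80 for C80 (music)."""
    n_split = int(fs * split_ms / 1000.0)
    energy = np.asarray(ir, dtype=float) ** 2
    early = np.sum(energy[:n_split])
    late = np.sum(energy[n_split:])
    return 10 * np.log10(early / late)
```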
Speech in Noise
Metrics for speech masked by background noise.
Hearing Aid Speech Perception Index (HASPI)
Description: Predicts speech intelligibility for hearing-impaired listeners with and without hearing aids.
How it works: Models auditory processing including hearing loss and amplification.
Libraries:
- MATLAB: HASPI Toolbox
References:
- Kates, J.M., & Arehart, K.H. (2014). “The Hearing-Aid Speech Perception Index (HASPI)”
Blind Source Separation
Metrics for evaluating source separation quality.
Signal-to-Distortion Ratio (SDR)
Description: Measures separation quality as the ratio of target-signal energy to total distortion energy.
How it works: The BSS Eval framework decomposes the estimation error into interference, noise, and artifact components.
Libraries:
- Python: mir_eval.separation
Datasets:
References:
- Vincent, E., et al. (2006). “Performance measurement in blind audio source separation”
Scale-Invariant SDR (SI-SDR)
Description: Scale-invariant version of SDR, more robust to amplitude differences.
How it works: Projects estimated signal onto reference, computes distortion ratio.
Libraries:
- Python: torch_mir_eval, or a custom implementation
References:
- Le Roux, J., et al. (2019). “SDR – half-baked or well done?”
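SI-SDR is short enough to implement directly, which is why custom implementations are common; a sketch following the Le Roux et al. definition (zero-mean both signals, project the estimate onto the reference, compare target energy to residual energy):

```python
import numpy as np

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB (Le Roux et al., 2019)."""
    s = np.asarray(reference, dtype=float)
    x = np.asarray(estimate, dtype=float)
    s = s - s.mean()                       # zero-mean as the paper requires
    x = x - x.mean()
    alpha = np.dot(x, s) / np.dot(s, s)    # optimal scaling of the target
    target = alpha * s
    residual = x - target
    return 10 * np.log10(np.dot(target, target) / np.dot(residual, residual))
```

The projection makes the metric invariant to rescaling the estimate: si_sdr(s, 2 * est) equals si_sdr(s, est).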
Music Quality
Metrics for music fidelity and artifact detection.
PEAQ (Perceptual Evaluation of Audio Quality)
Description: See “Overall Audio Quality” section above.
ViSQOL (Virtual Speech Quality Objective Listener)
Description: Perceptual quality metric for speech and audio, supporting music mode.
How it works: Spectrogram-based similarity using neurogram representation.
Libraries:
- C++/Python: ViSQOL by Google
Datasets:
- Custom music test sets (check ViSQOL repo)
References:
- Hines, A., et al. (2015). “ViSQOL: an objective speech quality model”
Distance-Based Metrics
Metrics measuring spectral or waveform distance.
Log-Spectral Distance (LSD)
Description: Euclidean distance between log-magnitude spectra.
How it works: \(\text{LSD} = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( 20 \log_{10} |X_k| - 20 \log_{10} |\hat{X}_k| \right)^2 }\)
Libraries:
- Python: custom implementation with librosa or numpy
References:
- Gray, A., & Markel, J. (1976). “Distance measures for speech processing”
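A custom numpy implementation is only a few lines, which is why no dedicated package exists; this sketch operates on magnitude spectra (e.g. one STFT frame each) and uses 20·log10 of the magnitudes, with a small floor to avoid log of zero:

```python
import numpy as np

def log_spectral_distance(mag_ref, mag_est, eps=1e-10):
    """RMS distance in dB between two magnitude spectra."""
    ref_db = 20 * np.log10(np.asarray(mag_ref, dtype=float) + eps)
    est_db = 20 * np.log10(np.asarray(mag_est, dtype=float) + eps)
    return np.sqrt(np.mean((ref_db - est_db) ** 2))
```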
Mel-Cepstral Distortion (MCD)
Description: Distance between mel-frequency cepstral coefficients (MFCCs).
How it works: \(\text{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{K} (c_k - \hat{c}_k)^2}\)
Applications: Voice conversion, TTS evaluation.
Libraries:
- Python: librosa, scipy
References:
- Kubichek, R. (1993). “Mel-cepstral distance measure for objective speech quality assessment”
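The MCD formula applied frame-by-frame and averaged, as is conventional in TTS and voice conversion evaluation; this sketch assumes the two MFCC matrices are already time-aligned and that c0 (the energy coefficient) has been dropped beforehand:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_est):
    """Mean frame-wise MCD in dB between two aligned MFCC matrices
    of shape (frames, coefficients), c0 excluded."""
    diff = np.asarray(c_ref, dtype=float) - np.asarray(c_est, dtype=float)
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff**2, axis=-1))
    return float(np.mean(per_frame))
```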
ASR & NLP-Based Metrics
Metrics using automatic speech recognition and language models.
Word Error Rate (WER)
Description: Percentage of word errors (substitutions, deletions, insertions) in ASR output.
How it works: \(\text{WER} = \frac{S + D + I}{N} \times 100\%\) where S=substitutions, D=deletions, I=insertions, N=total words.
Libraries:
- Python: jiwer
Datasets:
References:
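WER is the word-level Levenshtein distance divided by the reference length; a dependency-free sketch with all three edit types at unit cost (function name illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER as word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

In practice jiwer.wer(reference, hypothesis) does this (plus text normalization); note that WER can exceed 100% when insertions are numerous.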
BERT Score
Description: Contextual embedding similarity between reference and hypothesis transcriptions.
How it works: Greedily matches BERT token embeddings between reference and hypothesis by cosine similarity, then aggregates the matches into precision, recall, and F1 scores.
Libraries:
- Python: bert-score
References:
- Zhang, T., et al. (2020). “BERTScore: Evaluating Text Generation with BERT”
Hearing Aid Metrics
Metrics specific to hearing aid and assistive listening device evaluation.
Hearing Aid Speech Quality Index (HASQI)
Description: Predicts speech quality (not just intelligibility) for hearing aid users.
How it works: Models auditory processing with hearing loss, computes quality along multiple dimensions.
Libraries:
- MATLAB: HASQI Toolbox
References:
- Kates, J.M., & Arehart, K.H. (2010). “The Hearing-Aid Speech Quality Index (HASQI)”
Binaural Intelligibility Level Difference (BILD)
Description: Improvement in intelligibility from binaural vs. monaural listening.
How it works: Compares predicted intelligibility under binaural and monaural conditions.
Applications: Bilateral hearing aid fitting, spatial audio benefits.
References:
- Culling, J.F., et al. (2004). “The role of head-induced interaural time and level differences”
Soundscape Indices
Metrics for environmental and ecological acoustics.
Acoustic Complexity Index (ACI)
Description: Measures temporal variability in soundscapes, correlates with biodiversity.
How it works: Computes intensity differences across adjacent time frames in frequency bands.
Libraries:
- R: soundecology
- Python: scikit-maad
Datasets:
- Xeno-canto (bird recordings)
- AudioSet
References:
- Pieretti, N., et al. (2011). “A new methodology to infer the singing activity of an avian community”
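The core computation over a magnitude spectrogram can be sketched in numpy; this simplified version computes one value over the whole clip, whereas Pieretti et al. additionally divide the recording into temporal sub-steps and sum the per-step values (the function name is illustrative):

```python
import numpy as np

def acoustic_complexity_index(spectrogram):
    """Simplified ACI over a magnitude spectrogram of shape
    (freq_bins, frames): per bin, the sum of absolute intensity
    differences between adjacent frames, normalized by the bin's
    total intensity, summed over bins."""
    S = np.asarray(spectrogram, dtype=float)
    diffs = np.sum(np.abs(np.diff(S, axis=1)), axis=1)
    totals = np.sum(S, axis=1) + 1e-12   # floor avoids division by zero
    return float(np.sum(diffs / totals))
```

A steady tone contributes nothing (no frame-to-frame variation), while fluctuating biological sounds raise the index.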
Normalized Difference Soundscape Index (NDSI)
Description: Ratio comparing biophony (2-11 kHz, biological sounds) with anthrophony (1-2 kHz, human-generated noise).
How it works: \(\text{NDSI} = \frac{\text{Biophony} - \text{Anthrophony}}{\text{Biophony} + \text{Anthrophony}}\)
Libraries:
- R: soundecology
- Python: scikit-maad
References:
- Kasten, E.P., et al. (2012). “The remote environmental assessment laboratory’s acoustic library”
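A sketch of the band-energy version of the formula, taking the two band energies from a plain FFT power spectrum; the reference implementations (soundecology, scikit-maad) use a Welch PSD and per-band normalization instead, so treat this as an approximation:

```python
import numpy as np

def ndsi(x, fs, anthro=(1000, 2000), bio=(2000, 11000)):
    """NDSI = (biophony - anthrophony) / (biophony + anthrophony),
    with band energies summed from the FFT power spectrum."""
    spec = np.abs(np.fft.rfft(np.asarray(x, dtype=float))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    a = np.sum(spec[(freqs >= anthro[0]) & (freqs < anthro[1])])
    b = np.sum(spec[(freqs >= bio[0]) & (freqs < bio[1])])
    return (b - a) / (b + a)
```

Values range from -1 (purely anthropogenic) to +1 (purely biological).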
Bioacoustic Index (BI)
Description: Area under the spectrum curve within frequency range of biological sounds.
How it works: Integrates spectral energy in 2-8 kHz range (typical for birds/insects).
Libraries:
- R: soundecology
References:
- Boelman, N.T., et al. (2007). “Multi-trophic invasion resistance in Hawaii”
Soundscape Pleasantness (ISO 12913-3)
Description: Subjective assessment of soundscape quality in urban environments.
How it works: Perceptual attributes evaluated via listening tests (pleasantness, eventfulness, etc.).
Standards:
Datasets:
References:
- Aletta, F., et al. (2016). “Soundscape descriptors and a conceptual framework for developing predictive soundscape models”
References
Standards
- ITU-T P.862: PESQ
- ITU-T P.863: POLQA
- ITU-R BS.1387: PEAQ
- ITU-R BS.1770: Loudness
- ANSI S3.5: Speech Intelligibility Index
- ISO 3382-1: Room acoustics
- ISO 12913: Soundscape assessment
Key Papers
- Rix, A.W., et al. (2001). “Perceptual evaluation of speech quality (PESQ)”
- Taal, C.H., et al. (2011). “An Algorithm for Intelligibility Prediction”
- Kates, J.M., & Arehart, K.H. (2014). “The Hearing-Aid Speech Perception Index”
- Vincent, E., et al. (2006). “Performance measurement in blind audio source separation”
- Zhang, T., et al. (2020). “BERTScore”
Toolboxes & Libraries
Last updated: January 14, 2025
This is a living document. Suggestions? Email me.