Audio Quality Metrics: A Comprehensive Reference

Randy Frans Fela
GN Group, Copenhagen
fransfela.github.io

Introduction

Audio quality evaluation spans multiple domains, each with specialized metrics designed for specific acoustic scenarios. This living reference consolidates metrics across speech processing, music analysis, spatial audio, and environmental soundscapes.

Last updated: January 14, 2025
Status: 🟢 Actively maintained

Speech Level Variation

Metrics for assessing speech level consistency and dynamics.

Active Speech Level (ASL)

Description: Measures the active speech level excluding pauses and silent segments.

How it works: Applies voice activity detection (VAD) to isolate speech regions, then calculates the RMS level of the active segments; the standardized measurement procedure is ITU-T P.56.

Formula: \(\text{ASL} = 10 \log_{10} \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right)\) where \(x_i\) are active speech samples.

Libraries:

  • Python: pydub, librosa
  • MATLAB: Audio Toolbox
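
A minimal numpy sketch is shown below; it assumes a simple energy-threshold VAD in place of the standardized ITU-T P.56 procedure, and the frame length and threshold are illustrative:

    import numpy as np

    def active_speech_level(x, frame_len=1024, threshold_db=-40.0):
        """Rough ASL in dB re full scale for a float signal in [-1, 1]."""
        n_frames = max(1, len(x) // frame_len)
        frames = np.array_split(x[:n_frames * frame_len], n_frames)
        rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
        level_db = 20 * np.log10(rms + 1e-12)
        active = rms[level_db > threshold_db]  # crude VAD: energy gate
        if active.size == 0:
            return -np.inf
        return 10 * np.log10(np.mean(active ** 2))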

References:

  • ITU-T P.56: Objective measurement of active speech level

Loudness (ITU-R BS.1770)

Description: Perceptually weighted loudness measurement for broadcast audio.

How it works: Applies K-weighting filter to approximate human loudness perception, integrates over time.

Libraries:

  • Python: pyloudnorm
  • C++: libebur128
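
For example, with pyloudnorm (the filename is hypothetical):

    import soundfile as sf
    import pyloudnorm as pyln

    data, rate = sf.read("speech.wav")          # hypothetical input file
    meter = pyln.Meter(rate)                    # K-weighted meter per BS.1770
    loudness = meter.integrated_loudness(data)  # integrated loudness in LUFS
    print(f"Integrated loudness: {loudness:.1f} LUFS")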

Datasets:

References:


Overall Audio Quality

Broad metrics for general audio fidelity.

Signal-to-Noise Ratio (SNR)

Description: Ratio of signal power to noise power, expressed in dB.

How it works: \(\text{SNR} = 10 \log_{10} \left( \frac{P_{\text{signal}}}{P_{\text{noise}}} \right)\)

Limitations: Does not correlate well with perceptual quality.

Libraries:

  • Python: scipy.signal, numpy
  • MATLAB: Built-in snr() function
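
A minimal numpy sketch, assuming the clean signal is available and the noisy signal is clean plus noise, time-aligned and equal length:

    import numpy as np

    def snr_db(clean, noisy):
        noise = noisy - clean                  # residual treated as noise
        p_signal = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12  # guard against division by zero
        return 10 * np.log10(p_signal / p_noise)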

References:

Perceptual Evaluation of Audio Quality (PEAQ)

Description: ITU standard for objective audio quality measurement, designed for codec evaluation.

How it works: Psychoacoustic model comparing reference and degraded signals across frequency bands.

Libraries:

Datasets:

References:


Speech Quality

Metrics specifically for telephony and VoIP.

Perceptual Evaluation of Speech Quality (PESQ)

Description: ITU standard for predicting speech quality in telecom networks.

How it works: Time-aligned comparison of reference and degraded signals through a perceptual model. Output: MOS-LQO (raw PESQ scores range from -0.5 to 4.5; the P.862.1 MOS-LQO mapping spans roughly 1.0 to 4.5).
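
A minimal sketch with the pesq package (filenames are hypothetical; the sampling rate must be 8 kHz for narrowband or 16 kHz for wideband mode):

    from scipy.io import wavfile
    from pesq import pesq  # pip install pesq

    fs, ref = wavfile.read("reference.wav")  # hypothetical files
    _, deg = wavfile.read("degraded.wav")
    print(pesq(fs, ref, deg, "wb"))          # wideband MOS-LQO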

Libraries:

  • Python: pesq (PyPI)

Datasets:

References:

Perceptual Objective Listening Quality Assessment (POLQA)

Description: Successor to PESQ, supporting wideband and super-wideband speech.

How it works: Advanced perceptual model with improved handling of time warping and codec artifacts.

Libraries:

References:


Speech Enhancement

Metrics for evaluating noise suppression and enhancement algorithms.

Short-Time Objective Intelligibility (STOI)

Description: Predicts speech intelligibility in noisy conditions.

How it works: Correlates time-frequency representations of clean and processed speech.
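
A minimal sketch with the pystoi package (random placeholder signals stand in for real clean and processed speech):

    import numpy as np
    from pystoi import stoi  # pip install pystoi

    fs = 16000
    clean = np.random.randn(3 * fs)                    # placeholder signals;
    processed = clean + 0.3 * np.random.randn(3 * fs)  # use real speech here
    d = stoi(clean, processed, fs, extended=False)     # roughly in [0, 1]
    print(d)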

Libraries:

  • Python: pystoi

Datasets:

References:

  • Taal, C.H., et al. (2011). “An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech”

Perceptual Contrast Using Spectrograms (PCSS)

Description: Measures perceptual contrast enhancement in processed speech spectrograms.

How it works: Computes contrast ratio in time-frequency domain weighted by auditory masking.

Libraries:

  • Custom implementations (research-specific)

References:

  • Healy, E.W., et al. (2013). “An algorithm to increase speech intelligibility for hearing-impaired listeners”

DNSMOS (Deep Noise Suppression MOS)

Description: Deep learning-based predictor of subjective MOS for noise suppression systems.

How it works: Neural network trained on large-scale listening tests to predict MOS directly from audio.

Libraries:

Datasets:

References:

  • Reddy, C.K.A., et al. (2021). “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric”

Speech Intelligibility

Metrics correlating with human speech understanding.

Speech Intelligibility Index (SII)

Description: ANSI standard for predicting speech intelligibility based on audibility.

How it works: Weights audible speech bands according to their importance for intelligibility.
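
A toy illustration of the idea; the band importance weights and per-band SNRs below are made up, and the real ANSI S3.5 procedure defines standardized importance functions plus level and distortion corrections:

    import numpy as np

    # Hypothetical band importance weights (sum to 1) and per-band speech SNRs.
    importance = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
    band_snr_db = np.array([20.0, 10.0, 0.0, -5.0, 15.0])

    # Band audibility: SNR mapped from [-15, +15] dB onto [0, 1], clipped.
    audibility = np.clip((band_snr_db + 15.0) / 30.0, 0.0, 1.0)
    sii = float(np.sum(importance * audibility))  # 0 = inaudible, 1 = fully audible
    print(sii)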

Libraries:

References:

Extended Short-Time Objective Intelligibility (ESTOI)

Description: Extension of STOI with improved accuracy for highly modulated noise maskers and non-linearly processed speech (e.g., spectral subtraction).

How it works: Correlates short-time spectro-temporal envelopes of clean and processed speech without STOI's band-independence assumption, improving correlation with subjective scores.
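
pystoi also provides ESTOI; a minimal sketch, reusing clean, processed, and fs from the STOI example above:

    from pystoi import stoi

    d = stoi(clean, processed, fs, extended=True)  # ESTOI instead of STOI
    print(d)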

Libraries:

  • Python: pystoi (extended=True)

References:

  • Jensen, J., & Taal, C.H. (2016). “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers”

Speech in Reverberation

Metrics for reverberant environments.

Speech-to-Reverberation Modulation Energy Ratio (SRMR)

Description: Non-intrusive metric estimating intelligibility degradation due to reverberation.

How it works: Analyzes modulation spectrum energy ratio across frequency bands.

Libraries:

Datasets:

References:

  • Falk, T.H., et al. (2010). “A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech”

Room Acoustics Quality

Metrics related to spatial and architectural acoustics.

Reverberation Time (RT60)

Description: Time for sound to decay by 60 dB after source stops.

How it works: Measures decay slope of impulse response in frequency bands.
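
A minimal sketch of the usual Schroeder backward-integration approach (broadband, T30-style fit; band-wise RT60 would filter the impulse response first):

    import numpy as np

    def rt60_schroeder(ir, fs, decay_db=30):
        # Schroeder energy decay curve: backward-integrated squared IR.
        edc = np.cumsum(ir[::-1] ** 2)[::-1]
        edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
        t = np.arange(len(ir)) / fs
        # Fit the decay slope between -5 dB and -(5 + decay_db) dB ...
        mask = (edc_db <= -5) & (edc_db >= -5 - decay_db)
        slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
        # ... and extrapolate to a 60 dB decay.
        return -60.0 / slope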

Libraries:

Datasets:

References:

Clarity (C50, C80)

Description: Ratio of early to late arriving sound energy.

How it works: \(C_{50} = 10 \log_{10} \left( \frac{\int_0^{50\,\text{ms}} p^2(t)\, dt}{\int_{50\,\text{ms}}^{\infty} p^2(t)\, dt} \right)\)

Applications: Speech clarity (C50), music clarity (C80).

Libraries:

  • Python: pyroomacoustics
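
Computed directly from a measured impulse response, a sketch might look like this (t_ms = 50 for C50, 80 for C80; assumes the IR starts at the direct sound):

    import numpy as np

    def clarity_db(ir, fs, t_ms=50):
        n = int(fs * t_ms / 1000)              # early/late split point
        early = np.sum(ir[:n] ** 2)
        late = np.sum(ir[n:] ** 2) + 1e-12
        return 10 * np.log10(early / late)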

References:


Speech in Noise

Metrics for speech masked by background noise.

Hearing Aid Speech Perception Index (HASPI)

Description: Predicts speech intelligibility for hearing-impaired listeners with and without hearing aids.

How it works: Models auditory processing including hearing loss and amplification.

Libraries:

References:

  • Kates, J.M., & Arehart, K.H. (2014). “The Hearing-Aid Speech Perception Index (HASPI)”

Blind Source Separation

Metrics for evaluating source separation quality.

Signal-to-Distortion Ratio (SDR)

Description: Measures separation quality as ratio of target signal to artifacts.

How it works: Decomposes error into target distortion, interference, and noise.

Libraries:

  • Python: mir_eval.separation
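
For example, with mir_eval's bss_eval_sources (random placeholder sources shown):

    import numpy as np
    import mir_eval

    rng = np.random.default_rng(0)
    reference = rng.standard_normal((2, 16000))   # 2 sources, 1 s @ 16 kHz
    estimate = reference + 0.1 * rng.standard_normal((2, 16000))
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
    print(sdr)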

Datasets:

References:

  • Vincent, E., et al. (2006). “Performance measurement in blind audio source separation”

Scale-Invariant SDR (SI-SDR)

Description: Scale-invariant version of SDR, more robust to amplitude differences.

How it works: Projects estimated signal onto reference, computes distortion ratio.

Libraries:

  • Python: torch_mir_eval or custom implementation
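
A self-contained numpy implementation following the definition in Le Roux et al. (2019):

    import numpy as np

    def si_sdr(reference, estimate):
        # Zero-mean both signals, then project the estimate onto the reference.
        reference = reference - reference.mean()
        estimate = estimate - estimate.mean()
        alpha = np.dot(estimate, reference) / np.dot(reference, reference)
        target = alpha * reference   # scaled reference ("target" component)
        noise = estimate - target    # everything else counts as distortion
        return 10 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + 1e-12))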

References:

  • Le Roux, J., et al. (2019). “SDR – half-baked or well done?”

Music Quality

Metrics for music fidelity and artifact detection.

PEAQ (Perceptual Evaluation of Audio Quality)

Description: See “Overall Audio Quality” section above.

ViSQOL (Virtual Speech Quality Objective Listener)

Description: Perceptual quality metric for speech and audio, supporting music mode.

How it works: Spectrogram-based similarity using neurogram representation.

Libraries:

Datasets:

  • Custom music test sets (check ViSQOL repo)

References:

  • Hines, A., et al. (2015). “ViSQOL: an objective speech quality model”

Distance-Based Metrics

Metrics measuring spectral or waveform distance.

Log-Spectral Distance (LSD)

Description: Root-mean-square distance between log power spectra.

How it works: \(\text{LSD} = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( 10 \log_{10} |X_k|^2 - 10 \log_{10} |\hat{X}_k|^2 \right)^2 }\)

Libraries:

  • Python: Custom with librosa or numpy
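
A typical custom implementation averages the per-frame distance over an STFT; the FFT parameters here are illustrative:

    import numpy as np
    import librosa

    def log_spectral_distance(x, x_hat, n_fft=1024, hop=256, eps=1e-10):
        # Log power spectrograms of reference and estimate.
        X = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop)) ** 2
        X_hat = np.abs(librosa.stft(x_hat, n_fft=n_fft, hop_length=hop)) ** 2
        diff = 10 * np.log10(X + eps) - 10 * np.log10(X_hat + eps)
        # RMS over frequency per frame, then mean over frames.
        return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))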

References:

  • Gray, A., & Markel, J. (1976). “Distance measures for speech processing”

Mel-Cepstral Distortion (MCD)

Description: Distance between mel-frequency cepstral coefficients (MFCCs).

How it works: \(\text{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{K} (c_k - \hat{c}_k)^2}\)

Applications: Voice conversion, TTS evaluation.

Libraries:

  • Python: librosa, scipy
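
A rough sketch using librosa MFCCs, assuming time-aligned, equal-length signals; production TTS evaluation usually adds DTW alignment and uses vocoder-derived mel cepstra rather than MFCCs:

    import numpy as np
    import librosa

    def mel_cepstral_distortion(x, y, sr=16000, n_mfcc=13):
        cx = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc)[1:]  # drop c0
        cy = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)[1:]
        n = min(cx.shape[1], cy.shape[1])
        diff = cx[:, :n] - cy[:, :n]
        # Per-frame MCD in dB, averaged over frames.
        mcd = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
        return float(np.mean(mcd))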

References:

  • Kubichek, R. (1993). “Mel-cepstral distance measure for objective speech quality assessment”

ASR & NLP-Based Metrics

Metrics using automatic speech recognition and language models.

Word Error Rate (WER)

Description: Percentage of word errors (substitutions, deletions, insertions) in ASR output.

How it works: \(\text{WER} = \frac{S + D + I}{N} \times 100\%\) where S=substitutions, D=deletions, I=insertions, N=total words.

Libraries:

  • Python: jiwer
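
For example, with jiwer:

    import jiwer  # pip install jiwer

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"
    print(jiwer.wer(reference, hypothesis))  # fraction; multiply by 100 for %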

Datasets:

References:

BERT Score

Description: Contextual embedding similarity between reference and hypothesis transcriptions.

How it works: Computes cosine similarity of BERT embeddings token-by-token.

Libraries:

  • Python: bert-score
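
For example, with the bert-score package (downloads a pretrained model on first use):

    from bert_score import score  # pip install bert-score

    cands = ["the quick brown fox jumped over a lazy dog"]
    refs = ["the quick brown fox jumps over the lazy dog"]
    P, R, F1 = score(cands, refs, lang="en")
    print(F1.mean().item())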

References:

  • Zhang, T., et al. (2020). “BERTScore: Evaluating Text Generation with BERT”

Hearing Aid Metrics

Metrics specific to hearing aid and assistive listening device evaluation.

Hearing Aid Speech Quality Index (HASQI)

Description: Predicts speech quality (not just intelligibility) for hearing aid users.

How it works: Models auditory processing with hearing loss, computes quality along multiple dimensions.

Libraries:

References:

  • Kates, J.M., & Arehart, K.H. (2010). “The Hearing-Aid Speech Quality Index (HASQI)”

Binaural Intelligibility Level Difference (BILD)

Description: Improvement in intelligibility from binaural vs. monaural listening.

How it works: Compares predicted intelligibility under binaural and monaural conditions.

Applications: Bilateral hearing aid fitting, spatial audio benefits.

References:

  • Culling, J.F., et al. (2004). “The role of head-induced interaural time and level differences”

Soundscape Indices

Metrics for environmental and ecological acoustics.

Acoustic Complexity Index (ACI)

Description: Measures temporal variability in soundscapes, correlates with biodiversity.

How it works: Computes intensity differences across adjacent time frames in frequency bands.

Libraries:

  • R: soundecology
  • Python: scikit-maad
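
A minimal from-scratch sketch of the core computation; scikit-maad and soundecology implement the published version, including handling of temporal steps:

    import numpy as np
    from scipy import signal

    def acoustic_complexity_index(x, fs, nperseg=512):
        # Spectrogram intensities per frequency bin over time.
        _, _, Sxx = signal.spectrogram(x, fs, nperseg=nperseg)
        # Summed absolute intensity change between adjacent frames, per bin,
        # normalized by total intensity in that bin, then summed over bins.
        diff = np.abs(np.diff(Sxx, axis=1)).sum(axis=1)
        total = Sxx.sum(axis=1) + 1e-12
        return float(np.sum(diff / total))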

Datasets:

References:

  • Pieretti, N., et al. (2011). “A new methodology to infer the singing activity of an avian community”

Normalized Difference Soundscape Index (NDSI)

Description: Normalized difference between biophony (2-11 kHz) and anthrophony (1-2 kHz) band energy.

How it works: \(\text{NDSI} = \frac{\text{Biophony} - \text{Anthrophony}}{\text{Biophony} + \text{Anthrophony}}\)
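
A minimal sketch from Welch band powers, with band limits following Kasten et al. (2012):

    import numpy as np
    from scipy import signal

    def ndsi(x, fs, anthro=(1000, 2000), bio=(2000, 11000)):
        f, psd = signal.welch(x, fs, nperseg=4096)
        a = psd[(f >= anthro[0]) & (f < anthro[1])].sum()  # anthrophony power
        b = psd[(f >= bio[0]) & (f < bio[1])].sum()        # biophony power
        return (b - a) / (a + b + 1e-12)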

Libraries:

  • R: soundecology
  • Python: scikit-maad

References:

  • Kasten, E.P., et al. (2012). “The remote environmental assessment laboratory’s acoustic library”

Bioacoustic Index (BI)

Description: Area under the spectrum curve within frequency range of biological sounds.

How it works: Integrates spectral energy in 2-8 kHz range (typical for birds/insects).

Libraries:

  • R: soundecology

References:

  • Boelman, N.T., et al. (2007). “Multi-trophic invasion resistance in Hawaii”

Soundscape Pleasantness (ISO 12913-3)

Description: Subjective assessment of soundscape quality in urban environments.

How it works: Perceptual attributes evaluated via listening tests (pleasantness, eventfulness, etc.).

Standards:

Datasets:

References:

  • Aletta, F., et al. (2016). “Soundscape descriptors and a conceptual framework for developing predictive soundscape models”

References

Standards

  • ITU-T P.56: Active speech level
  • ITU-T P.862: PESQ
  • ITU-T P.863: POLQA
  • ITU-R BS.1387: PEAQ
  • ITU-R BS.1770: Loudness
  • ANSI S3.5: Speech Intelligibility Index
  • ISO 3382-1: Room acoustics
  • ISO 12913: Soundscape assessment

Key Papers

  • Rix, A.W., et al. (2001). “Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs”
  • Taal, C.H., et al. (2011). “An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech”
  • Kates, J.M., & Arehart, K.H. (2014). “The Hearing-Aid Speech Perception Index (HASPI)”
  • Vincent, E., et al. (2006). “Performance measurement in blind audio source separation”
  • Zhang, T., et al. (2020). “BERTScore: Evaluating Text Generation with BERT”

Toolboxes & Libraries

  • Python: librosa, pyloudnorm, pystoi, pesq, mir_eval, pyroomacoustics, scikit-maad, jiwer, bert-score, scipy/numpy
  • C++: libebur128
  • R: soundecology
  • MATLAB: Audio Toolbox

This is a living document. Suggestions? Email me.