Audio Quality Metrics: A Comprehensive Reference

Randy Frans Fela
GN Group, Copenhagen
fransfela.github.io

Introduction

Audio quality evaluation spans multiple domains, each with specialized metrics designed for specific acoustic scenarios. This living reference consolidates metrics across speech processing, music analysis, spatial audio, and environmental soundscapes.

Last updated: January 14, 2025
Status: 🟢 Actively maintained

Speech Level Variation

Metrics for assessing speech level consistency and dynamics.

Active Speech Level (ASL)

Description: Measures the active speech level excluding pauses and silent segments.

How it works: Applies voice activity detection (VAD) to isolate speech regions, then calculates the RMS level of the active segments; the standardized measurement procedure is ITU-T P.56.

Formula: \(\text{ASL} = 10 \log_{10} \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right)\) where \(x_i\) are active speech samples.

Libraries:

  • Python: pydub, librosa
  • MATLAB: Audio Toolbox
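
A minimal numpy sketch is shown below; it assumes a simple energy-threshold VAD in place of the standardized ITU-T P.56 procedure, and the frame length and threshold are illustrative:

    import numpy as np

    def active_speech_level(x, frame_len=1024, threshold_db=-40.0):
        """Rough ASL in dB re full scale for a float signal in [-1, 1]."""
        n_frames = max(1, len(x) // frame_len)
        frames = np.array_split(x[:n_frames * frame_len], n_frames)
        rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
        level_db = 20 * np.log10(rms + 1e-12)
        active = rms[level_db > threshold_db]  # crude VAD: energy gate
        if active.size == 0:
            return -np.inf
        return 10 * np.log10(np.mean(active ** 2))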

References:

  • ITU-T P.56: Objective measurement of active speech level

Loudness (ITU-R BS.1770)

Description: Perceptually weighted loudness measurement for broadcast audio.

How it works: Applies K-weighting filter to approximate human loudness perception, integrates over time.

Libraries:

  • Python: pyloudnorm
  • C++: libebur128
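
For example, with pyloudnorm (the filename is hypothetical):

    import soundfile as sf
    import pyloudnorm as pyln

    data, rate = sf.read("speech.wav")          # hypothetical input file
    meter = pyln.Meter(rate)                    # K-weighted meter per BS.1770
    loudness = meter.integrated_loudness(data)  # integrated loudness in LUFS
    print(f"Integrated loudness: {loudness:.1f} LUFS")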

Datasets:

References:


Overall Audio Quality

Broad metrics for general audio fidelity.

Signal-to-Noise Ratio (SNR)

Description: Ratio of signal power to noise power, expressed in dB.

How it works: \(\text{SNR} = 10 \log_{10} \left( \frac{P_{\text{signal}}}{P_{\text{noise}}} \right)\)

Limitations: Does not correlate well with perceptual quality.

Libraries:

  • Python: scipy.signal, numpy
  • MATLAB: Built-in snr() function
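
A minimal numpy sketch, assuming the clean signal is available and the noisy signal is clean plus noise, time-aligned and equal length:

    import numpy as np

    def snr_db(clean, noisy):
        noise = noisy - clean                  # residual treated as noise
        p_signal = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12  # guard against division by zero
        return 10 * np.log10(p_signal / p_noise)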

References:

Perceptual Evaluation of Audio Quality (PEAQ)

Description: ITU standard for objective audio quality measurement, designed for codec evaluation.

How it works: Psychoacoustic model comparing reference and degraded signals across frequency bands.

Libraries:

Datasets:

References:


Speech Quality

Metrics specifically for telephony and VoIP.

Perceptual Evaluation of Speech Quality (PESQ)

Description: ITU standard for predicting speech quality in telecom networks.

How it works: Time-aligned comparison of reference and degraded signals through a perceptual model. Output: MOS-LQO (raw PESQ scores range from -0.5 to 4.5; the P.862.1 MOS-LQO mapping spans roughly 1.0 to 4.5).
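
A minimal sketch with the pesq package (filenames are hypothetical; the sampling rate must be 8 kHz for narrowband or 16 kHz for wideband mode):

    from scipy.io import wavfile
    from pesq import pesq  # pip install pesq

    fs, ref = wavfile.read("reference.wav")  # hypothetical files
    _, deg = wavfile.read("degraded.wav")
    print(pesq(fs, ref, deg, "wb"))          # wideband MOS-LQO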

Libraries:

  • Python: pesq (PyPI)

Datasets:

References:

Perceptual Objective Listening Quality Assessment (POLQA)

Description: Successor to PESQ, supporting wideband and super-wideband speech.

How it works: Advanced perceptual model with improved handling of time warping and codec artifacts.

Libraries:

References:


Speech Enhancement

Metrics for evaluating noise suppression and enhancement algorithms.

Short-Time Objective Intelligibility (STOI)

Description: Predicts speech intelligibility in noisy conditions.

How it works: Correlates time-frequency representations of clean and processed speech.
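
A minimal sketch with the pystoi package (random placeholder signals stand in for real clean and processed speech):

    import numpy as np
    from pystoi import stoi  # pip install pystoi

    fs = 16000
    clean = np.random.randn(3 * fs)                    # placeholder signals;
    processed = clean + 0.3 * np.random.randn(3 * fs)  # use real speech here
    d = stoi(clean, processed, fs, extended=False)     # roughly in [0, 1]
    print(d)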

Libraries:

  • Python: pystoi

Datasets:

References:

  • Taal, C.H., et al. (2011). “An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech”

Perceptual Contrast Using Spectrograms (PCSS)

Description: Measures perceptual contrast enhancement in processed speech spectrograms.

How it works: Computes contrast ratio in time-frequency domain weighted by auditory masking.

Libraries:

  • Custom implementations (research-specific)

References:

  • Healy, E.W., et al. (2013). “An algorithm to increase speech intelligibility for hearing-impaired listeners”

DNSMOS (Deep Noise Suppression MOS)

Description: Deep learning-based predictor of subjective MOS for noise suppression systems.

How it works: Neural network trained on large-scale listening tests to predict MOS directly from audio.

Libraries:

Datasets:

References:

  • Reddy, C.K.A., et al. (2021). “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric”

Speech Intelligibility

Metrics correlating with human speech understanding.

Speech Intelligibility Index (SII)

Description: ANSI standard for predicting speech intelligibility based on audibility.

How it works: Weights audible speech bands according to their importance for intelligibility.
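
A toy illustration of the idea; the band importance weights and per-band SNRs below are made up, and the real ANSI S3.5 procedure defines standardized importance functions plus level and distortion corrections:

    import numpy as np

    # Hypothetical band importance weights (sum to 1) and per-band speech SNRs.
    importance = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
    band_snr_db = np.array([20.0, 10.0, 0.0, -5.0, 15.0])

    # Band audibility: SNR mapped from [-15, +15] dB onto [0, 1], clipped.
    audibility = np.clip((band_snr_db + 15.0) / 30.0, 0.0, 1.0)
    sii = float(np.sum(importance * audibility))  # 0 = inaudible, 1 = fully audible
    print(sii)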

Libraries:

References:

Extended Short-Time Objective Intelligibility (ESTOI)

Description: Extension of STOI with improved accuracy for highly modulated noise maskers and non-linearly processed speech (e.g., spectral subtraction).

How it works: Correlates short-time spectro-temporal envelopes of clean and processed speech without STOI's band-independence assumption, improving correlation with subjective scores.
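
pystoi also provides ESTOI; a minimal sketch, reusing clean, processed, and fs from the STOI example above:

    from pystoi import stoi

    d = stoi(clean, processed, fs, extended=True)  # ESTOI instead of STOI
    print(d)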

Libraries:

  • Python: pystoi (extended=True)

References:

  • Jensen, J., & Taal, C.H. (2016). “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers”

Speech in Reverberation

Metrics for reverberant environments.

Speech-to-Reverberation Modulation Energy Ratio (SRMR)

Description: Non-intrusive metric estimating intelligibility degradation due to reverberation.

How it works: Analyzes modulation spectrum energy ratio across frequency bands.

Libraries:

Datasets:

References:

  • Falk, T.H., et al. (2010). “A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech”

Room Acoustics Quality

Metrics related to spatial and architectural acoustics.

Reverberation Time (RT60)

Description: Time for sound to decay by 60 dB after source stops.

How it works: Measures decay slope of impulse response in frequency bands.
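
A minimal sketch of the usual Schroeder backward-integration approach (broadband, T30-style fit; band-wise RT60 would filter the impulse response first):

    import numpy as np

    def rt60_schroeder(ir, fs, decay_db=30):
        # Schroeder energy decay curve: backward-integrated squared IR.
        edc = np.cumsum(ir[::-1] ** 2)[::-1]
        edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
        t = np.arange(len(ir)) / fs
        # Fit the decay slope between -5 dB and -(5 + decay_db) dB ...
        mask = (edc_db <= -5) & (edc_db >= -5 - decay_db)
        slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
        # ... and extrapolate to a 60 dB decay.
        return -60.0 / slope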

Libraries:

Datasets:

References:

Clarity (C50, C80)

Description: Ratio of early to late arriving sound energy.

How it works: \(C_{50} = 10 \log_{10} \left( \frac{\int_0^{50\,\text{ms}} p^2(t)\, dt}{\int_{50\,\text{ms}}^{\infty} p^2(t)\, dt} \right)\)

Applications: Speech clarity (C50), music clarity (C80).

Libraries:

  • Python: pyroomacoustics
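
Computed directly from a measured impulse response, a sketch might look like this (t_ms = 50 for C50, 80 for C80; assumes the IR starts at the direct sound):

    import numpy as np

    def clarity_db(ir, fs, t_ms=50):
        n = int(fs * t_ms / 1000)              # early/late split point
        early = np.sum(ir[:n] ** 2)
        late = np.sum(ir[n:] ** 2) + 1e-12
        return 10 * np.log10(early / late)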

References:


Speech in Noise

Metrics for speech masked by background noise.

Hearing Aid Speech Perception Index (HASPI)

Description: Predicts speech intelligibility for hearing-impaired listeners with and without hearing aids.

How it works: Models auditory processing including hearing loss and amplification.

Libraries:

References:

  • Kates, J.M., & Arehart, K.H. (2014). “The Hearing-Aid Speech Perception Index (HASPI)”

Blind Source Separation

Metrics for evaluating source separation quality.

Signal-to-Distortion Ratio (SDR)

Description: Measures separation quality as ratio of target signal to artifacts.

How it works: Decomposes error into target distortion, interference, and noise.

Libraries:

  • Python: mir_eval.separation
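
For example, with mir_eval's bss_eval_sources (random placeholder sources shown):

    import numpy as np
    import mir_eval

    rng = np.random.default_rng(0)
    reference = rng.standard_normal((2, 16000))   # 2 sources, 1 s @ 16 kHz
    estimate = reference + 0.1 * rng.standard_normal((2, 16000))
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
    print(sdr)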

Datasets:

References:

  • Vincent, E., et al. (2006). “Performance measurement in blind audio source separation”

Scale-Invariant SDR (SI-SDR)

Description: Scale-invariant version of SDR, more robust to amplitude differences.

How it works: Projects estimated signal onto reference, computes distortion ratio.

Libraries:

  • Python: torch_mir_eval or custom implementation
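
A self-contained numpy implementation following the definition in Le Roux et al. (2019):

    import numpy as np

    def si_sdr(reference, estimate):
        # Zero-mean both signals, then project the estimate onto the reference.
        reference = reference - reference.mean()
        estimate = estimate - estimate.mean()
        alpha = np.dot(estimate, reference) / np.dot(reference, reference)
        target = alpha * reference   # scaled reference ("target" component)
        noise = estimate - target    # everything else counts as distortion
        return 10 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + 1e-12))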

References:

  • Le Roux, J., et al. (2019). “SDR – half-baked or well done?”

Music Quality

Metrics for music fidelity and artifact detection.

PEAQ (Perceptual Evaluation of Audio Quality)

Description: See “Overall Audio Quality” section above.

ViSQOL (Virtual Speech Quality Objective Listener)

Description: Perceptual quality metric for speech and audio, supporting music mode.

How it works: Spectrogram-based similarity using neurogram representation.

Libraries:

Datasets:

  • Custom music test sets (check ViSQOL repo)

References:

  • Hines, A., et al. (2015). “ViSQOL: an objective speech quality model”

Distance-Based Metrics

Metrics measuring spectral or waveform distance.

Log-Spectral Distance (LSD)

Description: Root-mean-square distance between log power spectra.

How it works: \(\text{LSD} = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( 10 \log_{10} |X_k|^2 - 10 \log_{10} |\hat{X}_k|^2 \right)^2 }\)

Libraries:

  • Python: Custom with librosa or numpy
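
A typical custom implementation averages the per-frame distance over an STFT; the FFT parameters here are illustrative:

    import numpy as np
    import librosa

    def log_spectral_distance(x, x_hat, n_fft=1024, hop=256, eps=1e-10):
        # Log power spectrograms of reference and estimate.
        X = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop)) ** 2
        X_hat = np.abs(librosa.stft(x_hat, n_fft=n_fft, hop_length=hop)) ** 2
        diff = 10 * np.log10(X + eps) - 10 * np.log10(X_hat + eps)
        # RMS over frequency per frame, then mean over frames.
        return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))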

References:

  • Gray, A., & Markel, J. (1976). “Distance measures for speech processing”

Mel-Cepstral Distortion (MCD)

Description: Distance between mel-frequency cepstral coefficients (MFCCs).

How it works: \(\text{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{K} (c_k - \hat{c}_k)^2}\)

Applications: Voice conversion, TTS evaluation.

Libraries:

  • Python: librosa, scipy
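
A rough sketch using librosa MFCCs, assuming time-aligned, equal-length signals; production TTS evaluation usually adds DTW alignment and uses vocoder-derived mel cepstra rather than MFCCs:

    import numpy as np
    import librosa

    def mel_cepstral_distortion(x, y, sr=16000, n_mfcc=13):
        cx = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc)[1:]  # drop c0
        cy = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)[1:]
        n = min(cx.shape[1], cy.shape[1])
        diff = cx[:, :n] - cy[:, :n]
        # Per-frame MCD in dB, averaged over frames.
        mcd = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
        return float(np.mean(mcd))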

References:

  • Kubichek, R. (1993). “Mel-cepstral distance measure for objective speech quality assessment”

ASR & NLP-Based Metrics

Metrics using automatic speech recognition and language models.

Word Error Rate (WER)

Description: Percentage of word errors (substitutions, deletions, insertions) in ASR output.

How it works: \(\text{WER} = \frac{S + D + I}{N} \times 100\%\) where S=substitutions, D=deletions, I=insertions, N=total words.

Libraries:

  • Python: jiwer
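
For example, with jiwer:

    import jiwer  # pip install jiwer

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"
    print(jiwer.wer(reference, hypothesis))  # fraction; multiply by 100 for %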

Datasets:

References:

BERT Score

Description: Contextual embedding similarity between reference and hypothesis transcriptions.

How it works: Computes cosine similarity of BERT embeddings token-by-token.

Libraries:

  • Python: bert-score
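
For example, with the bert-score package (downloads a pretrained model on first use):

    from bert_score import score  # pip install bert-score

    cands = ["the quick brown fox jumped over a lazy dog"]
    refs = ["the quick brown fox jumps over the lazy dog"]
    P, R, F1 = score(cands, refs, lang="en")
    print(F1.mean().item())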

References:

  • Zhang, T., et al. (2020). “BERTScore: Evaluating Text Generation with BERT”

Hearing Aid Metrics

Metrics specific to hearing aid and assistive listening device evaluation.

Hearing Aid Speech Quality Index (HASQI)

Description: Predicts speech quality (not just intelligibility) for hearing aid users.

How it works: Models auditory processing with hearing loss, computes quality along multiple dimensions.

Libraries:

References:

  • Kates, J.M., & Arehart, K.H. (2010). “The Hearing-Aid Speech Quality Index (HASQI)”

Binaural Intelligibility Level Difference (BILD)

Description: Improvement in intelligibility from binaural vs. monaural listening.

How it works: Compares predicted intelligibility under binaural and monaural conditions.

Applications: Bilateral hearing aid fitting, spatial audio benefits.

References:

  • Culling, J.F., et al. (2004). “The role of head-induced interaural time and level differences”

Soundscape Indices

Metrics for environmental and ecological acoustics.

Acoustic Complexity Index (ACI)

Description: Measures temporal variability in soundscapes, correlates with biodiversity.

How it works: Computes intensity differences across adjacent time frames in frequency bands.

Libraries:

  • R: soundecology
  • Python: scikit-maad
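
A minimal from-scratch sketch of the core computation; scikit-maad and soundecology implement the published version, including handling of temporal steps:

    import numpy as np
    from scipy import signal

    def acoustic_complexity_index(x, fs, nperseg=512):
        # Spectrogram intensities per frequency bin over time.
        _, _, Sxx = signal.spectrogram(x, fs, nperseg=nperseg)
        # Summed absolute intensity change between adjacent frames, per bin,
        # normalized by total intensity in that bin, then summed over bins.
        diff = np.abs(np.diff(Sxx, axis=1)).sum(axis=1)
        total = Sxx.sum(axis=1) + 1e-12
        return float(np.sum(diff / total))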

Datasets:

References:

  • Pieretti, N., et al. (2011). “A new methodology to infer the singing activity of an avian community”

Normalized Difference Soundscape Index (NDSI)

Description: Normalized difference between biophony (2-11 kHz) and anthrophony (1-2 kHz) band energy.

How it works: \(\text{NDSI} = \frac{\text{Biophony} - \text{Anthrophony}}{\text{Biophony} + \text{Anthrophony}}\)
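
A minimal sketch from Welch band powers, with band limits following Kasten et al. (2012):

    import numpy as np
    from scipy import signal

    def ndsi(x, fs, anthro=(1000, 2000), bio=(2000, 11000)):
        f, psd = signal.welch(x, fs, nperseg=4096)
        a = psd[(f >= anthro[0]) & (f < anthro[1])].sum()  # anthrophony power
        b = psd[(f >= bio[0]) & (f < bio[1])].sum()        # biophony power
        return (b - a) / (a + b + 1e-12)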

Libraries:

  • R: soundecology
  • Python: scikit-maad

References:

  • Kasten, E.P., et al. (2012). “The remote environmental assessment laboratory’s acoustic library”

Bioacoustic Index (BI)

Description: Area under the spectrum curve within frequency range of biological sounds.

How it works: Integrates spectral energy in 2-8 kHz range (typical for birds/insects).

Libraries:

  • R: soundecology

References:

  • Boelman, N.T., et al. (2007). “Multi-trophic invasion resistance in Hawaii”

Soundscape Pleasantness (ISO 12913-3)

Description: Subjective assessment of soundscape quality in urban environments.

How it works: Perceptual attributes evaluated via listening tests (pleasantness, eventfulness, etc.).

Standards:

Datasets:

References:

  • Aletta, F., et al. (2016). “Soundscape descriptors and a conceptual framework for developing predictive soundscape models”

References

Standards

  • ITU-T P.56: Active speech level
  • ITU-T P.862: PESQ
  • ITU-T P.863: POLQA
  • ITU-R BS.1387: PEAQ
  • ITU-R BS.1770: Loudness
  • ANSI S3.5: Speech Intelligibility Index
  • ISO 3382-1: Room acoustics
  • ISO 12913: Soundscape assessment

Key Papers

  • Rix, A.W., et al. (2001). “Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs”
  • Taal, C.H., et al. (2011). “An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech”
  • Kates, J.M., & Arehart, K.H. (2014). “The Hearing-Aid Speech Perception Index (HASPI)”
  • Vincent, E., et al. (2006). “Performance measurement in blind audio source separation”
  • Zhang, T., et al. (2020). “BERTScore: Evaluating Text Generation with BERT”

Toolboxes & Libraries

  • Python: librosa, pyloudnorm, pystoi, pesq, mir_eval, pyroomacoustics, scikit-maad, jiwer, bert-score, scipy/numpy
  • C++: libebur128
  • R: soundecology
  • MATLAB: Audio Toolbox

This is a living document. Suggestions? Email me.