Speech Processing Glossary
Acoustic model A model describing the
probabilistic behavior of the encoding of the linguistic information
in a speech signal. LVCSR systems use acoustic
units corresponding to phones or phones in context. The
predominant approach uses continuous density hidden Markov models (HMMs) to represent context-dependent phones.
Acoustic parametrization (or acoustic front-end)
see Speech Analysis
ASR Accuracy The speech recognition accuracy is
defined as 1 - WER (see Word Error Rate).
Automatic Language Recognition Process
by which a computer identifies the language being spoken in a speech signal.
Automatic Speaker Recognition Process
by which a computer identifies the speaker from a speech signal.
Automatic Speech Recognition (ASR) Process
by which a computer converts a speech signal into a sequence of words.
Backoff Mechanism for smoothing the estimates of the probabilities of rare
events by relying on less specific models (acoustic or language models)
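As an illustration for language models, here is a minimal sketch of backing off from bigram to unigram estimates. The function name backoff_score, the toy corpus, and the fixed weight alpha=0.4 are illustrative assumptions. The returned scores are not normalized probabilities; proper back-off schemes (e.g. Katz back-off) also discount the estimates of seen events.

```python
from collections import Counter

def backoff_score(history, word, bigrams, unigrams, alpha=0.4):
    """Score `word` after `history`, backing off to the unigram estimate.

    alpha is an arbitrary illustrative back-off weight (an assumption).
    """
    if bigrams[(history, word)] > 0:
        # Specific model: relative frequency of the observed bigram.
        return bigrams[(history, word)] / unigrams[history]
    # Less specific model: scaled unigram relative frequency.
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens[:-1], tokens[1:]))

seen = backoff_score("the", "cat", bigrams, unigrams)    # bigram was observed
unseen = backoff_score("cat", "mat", bigrams, unigrams)  # falls back to unigram
```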
CDHMM Continuous Density HMM (usually based on Gaussian mixtures)
Filler word Words like uhm, euh, ...
FIR filter A Finite Impulse Response (FIR) filter produces
an output that is the weighted sum of the current and past inputs.
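A minimal sketch of the FIR definition above, assuming inputs before the start of the signal are zero. The function name fir_filter and the example weights are illustrative choices.

```python
def fir_filter(x, b):
    """Return y with y[n] = sum over k of b[k] * x[n - k], taking x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        y.append(acc)
    return y

# Filtering a unit impulse returns the weights themselves, followed by zeros:
# the impulse response has finite length.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
h = fir_filter(impulse, [0.5, 0.25, 0.25])
```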
Frame An acoustic feature vector
(usually MFCC) estimated on a 20-30ms signal
window (see also Speech Analysis).
Frame Rate Number of frames per second (typically 100).
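The relation between window length and frame rate can be sketched as follows, assuming a 16 kHz signal, a 25 ms window, and 100 frames per second. The function name frame_signal and its defaults are illustrative; feature extraction (e.g. MFCC) on each window is left out.

```python
def frame_signal(samples, sample_rate=16000, window_ms=25, frame_rate=100):
    """Cut a signal into overlapping analysis windows (one per frame)."""
    win = sample_rate * window_ms // 1000   # samples per window (400 here)
    hop = sample_rate // frame_rate         # samples between frame starts (160 here)
    return [samples[s:s + win]
            for s in range(0, len(samples) - win + 1, hop)]

one_second = [0.0] * 16000
frames = frame_signal(one_second)
```

With these settings, consecutive windows overlap by 15 ms, and one second of signal yields slightly fewer than 100 complete windows because the last windows would run past the end of the signal.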
GMM Gaussian Mixture Model (i.e. a 1-state CDHMM)
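A minimal sketch of evaluating a GMM log density for one feature vector, assuming diagonal covariance matrices (a common choice, but an assumption here). The function name gmm_log_density and the argument layout are illustrative.

```python
import math

def gmm_log_density(x, weights, means, variances):
    """Log density of vector x under a diagonal-covariance Gaussian mixture."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        log_terms.append(ll)
    # Log-sum-exp over components for numerical stability.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

# One standard normal component evaluated at its mean.
single = gmm_log_density([0.0], [1.0], [[0.0]], [[1.0]])
# Two identical components with equal weights give the same density.
double = gmm_log_density([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```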
HMM Hidden Markov Model (or probabilistic function of a Markov chain)
HMM state Usually a GMM. An HMM contains one or more
states, typically 3 states for a phone model.
IIR filter An Infinite Impulse Response (IIR) filter
produces an output that is the weighted sum of the current and past
inputs, and past outputs.
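A minimal sketch of the IIR definition above. The sign convention (feedback weights are added, and a[0] weights y[n-1]) and the function name iir_filter are assumptions made for illustration; many textbooks subtract the feedback terms instead.

```python
def iir_filter(x, b, a):
    """y[n] = sum_k b[k] * x[n - k] + sum_j a[j - 1] * y[n - j], for j >= 1."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):            # current and past inputs
            if n - k >= 0:
                acc += bk * x[n - k]
        for j, aj in enumerate(a, start=1):   # past outputs
            if n - j >= 0:
                acc += aj * y[n - j]
        y.append(acc)
    return y

# One feedback weight of 0.5: the impulse response halves at every step
# and never becomes exactly zero.
impulse = [1.0, 0.0, 0.0, 0.0]
h = iir_filter(impulse, b=[1.0], a=[0.5])
```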
Language model A language model captures
the regularities in the spoken language and is used by the speech
recognizer to estimate the probability of word sequences. One of the
most popular methods is the so-called n-gram
model, which attempts to capture the syntactic and semantic
constraints of the language by estimating the frequencies of sequences
of n words.
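For n = 2, the frequency estimation can be sketched as below. The function name bigram_probs, the sentence boundary markers <s> and </s>, and the toy corpus are illustrative assumptions; no smoothing is applied, so unseen bigrams simply have no entry.

```python
from collections import Counter

def bigram_probs(sentences):
    """Relative-frequency estimates of P(word | previous word)."""
    history_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return {(h, w): c / history_counts[h]
            for (h, w), c in bigram_counts.items()}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
p = bigram_probs(corpus)
# "the" is followed once by "cat" and once by "dog" in this corpus.
```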
Lattice A word lattice is a weighted directed acyclic graph where
word labels are assigned either to the graph's edges (or links) or to
the graph's vertices (or nodes). Acoustic and language model weights are associated
with each edge, and a time position is associated with each vertex.
Lexicon or pronunciation dictionary A word list with pronunciations
LVCSR Large Vocabulary Continuous Speech Recognition (large vocabulary means 10k words or more).
The size of the recognizer vocabulary affects the processing requirements.
MAP estimation (Maximum A Posteriori) A training
procedure that attempts to maximize the posterior probability of the
model parameters (which are therefore seen as random variables)
Pr(M|X,W) (X is the speech signal, W is the word transcription, and M
represents the model parameters).
MAP decoding A decoding procedure (speech recognition)
which attempts to maximize the posterior probability Pr(W|X,M) of the
word transcription given the speech signal X and the model M.
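By Bayes' rule, the MAP decoding criterion can be rewritten as shown below. Reading f(X|W,M) as the acoustic model likelihood and Pr(W|M) as the language model probability is the usual interpretation and is stated here as an assumption. The term f(X|M) does not depend on W, so it drops out of the maximization.

```latex
\hat{W} = \arg\max_{W} \Pr(W \mid X, M)
        = \arg\max_{W} \frac{f(X \mid W, M)\,\Pr(W \mid M)}{f(X \mid M)}
        = \arg\max_{W} f(X \mid W, M)\,\Pr(W \mid M)
```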
MLE (Maximum Likelihood Estimation) A training procedure
(the estimation of the model parameters) that attempts to maximize the
training data likelihood given the model f(X|W,M) (X is the speech
signal, W is the word transcription, and M is the model).
MMIE (Maximum Mutual Information Estimation) A
discriminative training procedure that attempts to maximize the
posterior probability of the word transcription Pr(W|X,M) (X is the
speech signal, W is the word transcription, and M is the model). This
training procedure is also called Conditional Maximum Likelihood
Estimation.
MFCC Mel Frequency Cepstral Coefficients. The Mel scale
approximates the sensitivity of the human ear. Note that there are many
other frequency scales "approximating" the human ear (e.g. the Bark
scale).
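As an illustration, one commonly used analytic approximation of the Mel scale is mel = 2595 log10(1 + f/700). This glossary does not say which formula is intended, so the constants below are an assumption, and other variants exist.

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (assumed 2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is roughly linear at low frequencies and logarithmic above:
# equal steps in Hz cover fewer mels as the frequency grows.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)
```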
MF-PLP PLP coefficients obtained from a Mel frequency power spectrum
(see also MFCC and PLP).
MLP A Multi-Layer Perceptron is a class of artificial neural network. It
is a feedforward network that maps input data to a desired output representation. It is composed
of three or more layers with nonlinear activation functions (usually sigmoids).
N-Gram Probabilistic language model based on a Markov chain of order N-1
N-best Top N hypotheses
OOV word Out Of Vocabulary word -- Each
OOV word causes more than one recognition error (usually between 1.5
and 2 errors). An obvious way to reduce the error rate due to OOVs is
to increase the size of the vocabulary.
%OOV Out Of Vocabulary word rate.
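A minimal sketch of computing the OOV rate of a test text against a recognizer vocabulary, expressed as a percentage of running words. The function name oov_rate and the toy data are illustrative.

```python
def oov_rate(test_words, vocabulary):
    """Percentage of test words that are not in the vocabulary."""
    vocab = set(vocabulary)
    n_oov = sum(1 for w in test_words if w not in vocab)
    return 100.0 * n_oov / len(test_words)

text = "the cat sat on the mat".split()
rate = oov_rate(text, ["the", "cat", "on", "mat"])  # only "sat" is missing
```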
Perplexity The relevance of a language model is often measured in terms of test set
perplexity defined as pow(Prob(text|language-model),-1/n), where n
is the number of words in the test text. The test perplexity depends
on both the language being modeled and the model. It gives a combined
estimate of how good the model is and how complex the language is.
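The perplexity formula can be sketched as follows, working in the log domain to avoid underflow on long texts. The function name perplexity is illustrative, and the input is assumed to be the list of probabilities the language model assigns to each word of the test text given its history.

```python
import math

def perplexity(word_probs):
    """pow(Prob(text | model), -1/n), with Prob(text) the product of word_probs."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# A model that always hesitates uniformly between 4 words has perplexity 4.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```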
Phone Symbols used to represent the
pronunciations in the lexicon
Pitch or F0 The pitch is the fundamental
frequency of a (periodic or nearly periodic) speech signal. In
practice, the pitch period can be obtained from the position of the
maximum of the autocorrelation function of the signal. See also
degree of voicing, periodicity and harmonicity.
(In psychoacoustics the pitch is a subjective auditory attribute)
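A minimal sketch of the autocorrelation method on a synthetic tone. The function name estimate_pitch and the search range of 70 to 400 Hz are assumptions; restricting the lags is needed because the autocorrelation is always largest at lag zero. Practical pitch trackers add voicing decisions and smoothing that are omitted here.

```python
import math

def estimate_pitch(x, sample_rate, f0_min=70.0, f0_max=400.0):
    """Return the F0 (Hz) whose period maximizes the autocorrelation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    n_terms = len(x) - max_lag   # same number of products at every lag

    def autocorr(lag):
        return sum(x[n] * x[n + lag] for n in range(n_terms))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag

fs = 8000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]  # 100 Hz, 50 ms
f0 = estimate_pitch(tone, fs)   # expected to be close to 100 Hz
```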
PLP analysis Perceptual Linear
Prediction: Compute perceptual power spectral density (Bark scale),
perform equal loudness preemphasis and take the cube root of the
intensity (intensity-loudness power law), apply the IDFT to get the
equivalent of the autocorrelation function, fit an LP model and
transform into cepstral coefficients (LPCC analysis).
Quinphone (or pentaphone) Phone in
context where the context includes the 2 left phones and the 2 right
phones
Recording channel Means by which the
audio signal is recorded (direct microphone, telephone, radio, etc.)
Sampling Rate Number of samples
per second used to code the speech signal (usually 16000, i.e. 16 kHz
for a bandwidth of 8 kHz). Telephone speech is sampled at 8 kHz. 16
kHz is generally regarded as sufficient for speech recognition and
synthesis. The audio standards use sample rates of 44.1 kHz (Compact
Disc) and 48 kHz (Digital Audio Tape). Note that signals must be
filtered prior to sampling, and the maximum frequency that can be
represented is half the sampling frequency. In practice a higher
sample rate is used to allow for non-ideal filters.
Sampling Resolution Number
of bits used to code each signal sample. Speech is normally stored in
16 bits. Telephone-quality speech is sampled at 8 kHz with a 12-bit
dynamic range (stored in 8 bits with a non-linear function, i.e. A-law
or μ-law). The dynamic range of the ear is about 20 bits.
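The non-linear storage function can be illustrated with the continuous μ-law (U-law) curve, assuming μ = 255 and samples normalized to [-1, 1]. This is a sketch of the companding idea only: telephony codecs use a piecewise-linear approximation of this curve, and the function names below are illustrative.

```python
import math

MU = 255.0  # assumed value of the compression parameter

def mu_law_compress(x):
    """Map a sample in [-1, 1] to [-1, 1], expanding small amplitudes."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

quiet = mu_law_compress(0.01)        # small samples get a large share of the range
restored = mu_law_expand(mu_law_compress(-0.3))
```

Quantizing the compressed value uniformly to 8 bits then spends more code values on quiet samples than a uniform 8-bit quantizer would.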
Spectrogram A spectrogram is
a plot of the short-term power of the signal in different frequency
bands as a function of time.
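A minimal sketch that computes the values behind such a plot: one power spectrum per short, windowed segment. The function name spectrogram, the frame length, the hop size, the Hann window, and the direct DFT (used instead of an FFT to stay dependency-free) are all illustrative choices.

```python
import cmath
import math

def spectrogram(x, frame_len=64, hop=32):
    """Return a list of frames, each a list of power values per frequency bin."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        power = []
        for k in range(frame_len // 2 + 1):   # bins from 0 Hz to half the sample rate
            s = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            power.append(abs(s) ** 2)
        frames.append(power)
    return frames

fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
spec = spectrogram(tone)
# Bin k corresponds to k * fs / frame_len Hz, so a 1000 Hz tone falls in bin 8.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
```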
Speech Analysis
Feature vector extraction from a windowed signal (20-30ms). It is
assumed that speech has short time stationarity and that a feature
vector representation captures the needed information (depending on
the task) for future processing. The most popular features are
cepstrum coefficients obtained with a Mel Frequency Cepstral (MFC) analysis or with a Perceptual Linear Prediction
(PLP) analysis.
Speech-to-Text Transcription
A synonym of Automatic Speech Recognition.
Triphone (or Phone in context) A
context-dependent HMM phone model (the context usually includes the
left and right phones)
Voicing The degree of voicing is a
measure of the degree to which a signal is periodic (also called
periodicity, harmonicity or HNR). In practice, the degree of
periodicity can be obtained from the relative height of the maximum of
the autocorrelation function of the signal.
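A minimal sketch of this measure: the largest autocorrelation value over a range of candidate pitch lags, divided by the value at lag zero. The function name degree_of_voicing, the 70 to 400 Hz lag range, and the test signals are assumptions for illustration.

```python
import math
import random

def degree_of_voicing(x, sample_rate, f0_min=70.0, f0_max=400.0):
    """Relative height of the autocorrelation maximum (near 1 = strongly periodic)."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    n_terms = len(x) - max_lag

    def autocorr(lag):
        return sum(x[n] * x[n + lag] for n in range(n_terms))

    energy = autocorr(0)
    if energy == 0.0:
        return 0.0   # silent segment: treat as unvoiced
    return max(autocorr(lag) for lag in range(min_lag, max_lag + 1)) / energy

fs = 8000
voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
v_voiced = degree_of_voicing(voiced, fs)   # periodic signal: close to 1
v_noise = degree_of_voicing(noise, fs)     # white noise: much lower
```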
Word Error Rate The word error rate is the
commonly used metric for evaluating speech recognizers. It is a
measure of the average number of word errors taking into account three
error types: substitution (the reference word is replaced by
another word), insertion (a word is hypothesized that was not in
the reference) and deletion (a word in the reference
transcription is missed). The word error rate is defined as the sum of
these errors divided by the number of reference words. Given this
definition, the word error rate can exceed 100%.
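A minimal sketch of the computation, assuming the minimum number of substitutions, insertions and deletions is found with a standard edit-distance alignment over whitespace-separated words. The function name word_error_rate is illustrative; evaluation tools typically also normalize the text before scoring, which is not done here.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of errors aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[-1][-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer_typical = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Three insertions against a single reference word: the rate is 300%.
wer_over_100 = word_error_rate("hello", "well hello there now")
```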