This lecture is based on chapter 7 of [Quatieri, 2002]
Introduction to Speech Processing | Ricardo Gutierrez-Osuna | CSE@TAMU
Overview
• Recap from previous lectures
– Discrete time Fourier transform (DTFT)
• Taking the expression of the Fourier transform $X(j\Omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\Omega t}\, dt$, the DTFT can be derived by numerical integration
$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$
– where $x[n] = x(nT_s)$ and $\omega = 2\pi F / F_s$
– Discrete Fourier transform (DFT)
• The DFT is obtained by “sampling” the DTFT at $N$ discrete frequencies $\omega_k = 2\pi k / N$ (that is, $F_k = k F_s / N$), which yields the transform
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j \frac{2\pi}{N} k n}$$
• Why is another Fourier transform needed?
– The spectral content of speech changes over time (non-stationary)
• As an example, formants change as a function of the spoken phonemes
• Applying the DFT over a long window does not reveal transitions in spectral content
– To avoid this issue, we apply the DFT over short periods of time
• For short enough windows, speech can be considered to be stationary
• Remember, though, that there is a time-frequency tradeoff here
[Figure: signal x(t) vs. time (samples), its spectrum X(f) vs. frequency (Hz), and its STFT, frequency (Hz) vs. time (frames)]
• The short-time Fourier transform in a nutshell
– Define the analysis window (e.g., 30 ms narrowband, 5 ms wideband)
– Define the amount of overlap between windows (e.g., 30%)
– Define a windowing function (e.g., Hann, Gaussian)
– Generate windowed segments (multiply signal by windowing function)
– Apply the FFT to each windowed segment
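The recipe above can be sketched in a few lines. This is a minimal illustration, not the course's ex6p1.m: the sampling rate, window length, 50% overlap, and test tone are assumed values, and a direct DFT stands in for the FFT so that only the Python standard library is needed.

```python
import cmath
import math

def hann(n_w):
    """Hann window of length n_w (raised cosine, zero at both ends)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_w - 1))) for n in range(n_w)]

def dft(frame):
    """Direct O(N^2) DFT; an FFT would be used in practice."""
    n_fft = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft)]

def stft(x, n_w, hop):
    """Split x into overlapping frames, window each one, and take its DFT."""
    w = hann(n_w)
    frames = []
    for start in range(0, len(x) - n_w + 1, hop):
        segment = [x[start + n] * w[n] for n in range(n_w)]
        frames.append(dft(segment))
    return frames

fs = 8000      # assumed sampling rate (Hz)
n_w = 64       # assumed window length: 8 ms at 8 kHz
hop = 32       # assumed 50% overlap between consecutive windows
f0 = 1000.0    # assumed test tone (Hz)
x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(512)]

X = stft(x, n_w, hop)
# Locate the strongest bin of one frame and convert it to Hz.
peak_bin = max(range(n_w // 2), key=lambda k: abs(X[3][k]))
peak_hz = peak_bin * fs / n_w
print(len(X), peak_bin, peak_hz)
```

The bin spacing is `fs / n_w`, so a longer window gives finer frequency resolution at the cost of coarser time resolution, which is the tradeoff mentioned earlier.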
STFT: Fourier analysis view
• Windowing function
– To “localize” the speech signal in time, we define a windowing function $w[n, \tau]$, which is generally tapered at its ends to avoid unnatural discontinuities in the speech segment
– Any window affects the spectral estimate computed on it
• The window is selected to trade off the width of its main lobe and attenuation of its side lobes
– The most common are the Hann and Hamming windows (raised cosines)
Hann: $w[n, \tau] = 0.5 \left[ 1 - \cos\left( \frac{2\pi (n - \tau)}{N_w - 1} \right) \right]$
Hamming: $w[n, \tau] = 0.54 - 0.46 \cos\left( \frac{2\pi (n - \tau)}{N_w - 1} \right)$
[Figure: rectangular, Hann, and Hamming windows; source: http://en.wikipedia.org/wiki/Window_function]
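As a quick numerical check of the two raised-cosine windows, the sketch below evaluates them for a short, arbitrarily chosen length (`n_w = 9` is an assumption for readability, with $\tau = 0$). It shows that the Hann window tapers all the way to zero, while the Hamming window stops at 0.08.

```python
import math

def hann(n_w):
    """Hann window: 0.5 * (1 - cos(2 pi n / (N_w - 1)))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_w - 1))) for n in range(n_w)]

def hamming(n_w):
    """Hamming window: 0.54 - 0.46 * cos(2 pi n / (N_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]

n_w = 9  # assumed length, kept short so the values are easy to read
w_hann = hann(n_w)
w_hamm = hamming(n_w)

# Both windows peak at 1 in the middle; they differ at the ends.
print([round(v, 3) for v in w_hann])
print([round(v, 3) for v in w_hamm])
```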
• Discrete-time short-time Fourier transform
– The Fourier transform of the windowed speech waveform is defined as
$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x[m]\, w[n - m]\, e^{-j\omega m}$$
• where the sequence $f_n[m] = x[m]\, w[n - m]$ is a short-time section of the speech signal $x[m]$ at time $n$
• Discrete STFT
– By analogy with the DTFT/DFT, the discrete STFT is defined as
$$X(n, k) = X(n, \omega)\big|_{\omega = \frac{2\pi}{N} k}$$
– The spectrogram we saw in previous lectures is a graphical display of the magnitude of the discrete STFT, generally in log scale
$$S(n, k) = \log |X(n, k)|^2$$
• This can be thought of as a 2D plot of the relative energy content in frequency at different time locations
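The definitions of $X(n, k)$ and $S(n, k)$ can be evaluated literally, as in the sketch below. It assumes a causal window (nonzero for indices $0 \ldots N_w - 1$, so that $w[n - m]$ selects the samples $m = n - N_w + 1 \ldots n$), an assumed test cosine placed on bin 4, and a small constant added before the logarithm to avoid $\log 0$; none of these choices come from the slides.

```python
import cmath
import math

def stft_point(x, w, n, k, n_freq):
    """X(n, k) = sum_m x[m] w[n - m] exp(-j 2 pi k m / N), for a causal window w."""
    total = 0j
    for m in range(max(0, n - len(w) + 1), min(len(x), n + 1)):
        total += x[m] * w[n - m] * cmath.exp(-2j * math.pi * k * m / n_freq)
    return total

def log_spectrogram(x, w, n_freq, hop):
    """S(n, k) = log |X(n, k)|^2 for k = 0..N/2, at times n = N_w - 1, N_w - 1 + hop, ..."""
    eps = 1e-12  # guards against log(0); an implementation detail, not part of the definition
    return [[math.log(abs(stft_point(x, w, n, k, n_freq)) ** 2 + eps)
             for k in range(n_freq // 2 + 1)]
            for n in range(len(w) - 1, len(x), hop)]

n_freq = 32
w = [0.5 * (1 - math.cos(2 * math.pi * n / (n_freq - 1))) for n in range(n_freq)]
x = [math.cos(2 * math.pi * 4 * m / n_freq) for m in range(128)]  # tone on bin 4

S = log_spectrogram(x, w, n_freq, hop=16)
# For a stationary tone, every time slice should peak at the same frequency bin.
peaks = [max(range(len(row)), key=row.__getitem__) for row in S]
print(peaks)
```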
– For a long window $w[n]$, the result is the narrowband spectrogram, which exhibits the harmonic structure in the form of horizontal striations
– For a short window $w[n]$, the result is the wideband spectrogram, which exhibits periodic temporal structure in the form of vertical striations
[Figure: waveform x(t) vs. time (samples), with its wideband and narrowband spectrograms, STFT frequency (Hz) vs. time (frames)]
STFT: filtering view
• The STFT can also be interpreted as a filtering operation
– In this case, the analysis window $w[n]$ plays the role of the filter impulse response
– To illustrate this view, we fix the value of $\omega$ at $\omega_0$, and rewrite
$$X(n, \omega_0) = \sum_{m=-\infty}^{\infty} \left( x[m]\, e^{-j\omega_0 m} \right) w[n - m]$$
• which can be interpreted as the convolution of the signal $x[n]\, e^{-j\omega_0 n}$ with the sequence $w[n]$:
$$X(n, \omega_0) = \left( x[n]\, e^{-j\omega_0 n} \right) * w[n]$$
• and the product $x[n]\, e^{-j\omega_0 n}$ can be interpreted as a modulation of $x[n]$ that shifts its spectrum by $\omega_0$, moving the content at $\omega_0$ down to zero frequency (per the frequency shift property of the FT), where the lowpass filter $w[n]$ selects it
[Quatieri, 2002]
– Alternatively, we can rearrange as (without proof)
$$X(n, \omega_0) = e^{-j\omega_0 n} \left( x[n] * w[n]\, e^{j\omega_0 n} \right)$$
• In this case, the sequence $x[n]$ is first passed through the same filter (with a linear phase factor $e^{j\omega_0 n}$), and the filter output is demodulated by $e^{-j\omega_0 n}$
[Quatieri, 2002]
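The two orderings of the filtering view (modulate, then lowpass filter; or bandpass filter, then demodulate) can be checked against each other numerically. The sketch below is only an illustration: the test signal, the causal Hamming window of length 16, and the analysis frequency `omega0` are assumed values.

```python
import cmath
import math

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0j] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

n_w = 16
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]
x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
omega0 = 0.3  # assumed analysis frequency (rad/sample)

# View 1: modulate x[n] by exp(-j w0 n), then filter with the lowpass window w[n].
modulated = [x[n] * cmath.exp(-1j * omega0 * n) for n in range(len(x))]
view1 = convolve(modulated, w)

# View 2: filter x[n] with the bandpass filter w[n] exp(j w0 n), then demodulate.
bandpass = [w[n] * cmath.exp(1j * omega0 * n) for n in range(n_w)]
filtered = convolve(x, bandpass)
view2 = [filtered[n] * cmath.exp(-1j * omega0 * n) for n in range(len(filtered))]

# Both views compute X(n, w0), so they should agree up to rounding error.
max_diff = max(abs(a - b) for a, b in zip(view1, view2))
print(max_diff)
```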
– This latter rearrangement allows us to interpret the discrete STFT as the output of a filter bank
$$X(n, k) = e^{-j \frac{2\pi}{N} k n} \left( x[n] * w[n]\, e^{j \frac{2\pi}{N} k n} \right)$$
• Note that each filter is acting as a bandpass filter centered around its selected frequency
– Thus, the discrete STFT can be viewed as a collection of sequences, each corresponding to the frequency components of $x[n]$ falling within a particular frequency band
• This filtering view is shown in the next slide, both from the analysis side and from the synthesis (reconstruction) side
[Figure: filter-bank view of the STFT, analysis side and synthesis side (Quatieri, 2002)]
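A minimal sketch of the analysis side of the filter bank is given below. It is not the course's ex6p2.m: the number of bands, the causal Hann window of length 16, and the test tone are assumptions. Each band is produced by a bandpass filter $w[n]\, e^{j 2\pi k n / N}$ followed by demodulation.

```python
import cmath
import math

def filterbank_stft(x, w, n_bands):
    """Return X[k][n]: x filtered by w[n] exp(j 2 pi k n / N), then demodulated."""
    bands = []
    for k in range(n_bands):
        wk = 2 * math.pi * k / n_bands
        h = [w[m] * cmath.exp(1j * wk * m) for m in range(len(w))]  # bandpass filter k
        band = []
        for n in range(len(x)):
            acc = sum(x[n - m] * h[m] for m in range(len(h)) if n - m >= 0)
            band.append(cmath.exp(-1j * wk * n) * acc)              # demodulate
        bands.append(band)
    return bands

n_bands = 8
n_w = 16
w = [0.5 * (1 - math.cos(2 * math.pi * n / (n_w - 1))) for n in range(n_w)]
x = [math.cos(2 * math.pi * 2 * n / n_bands) for n in range(64)]  # tone centred on band 2

X = filterbank_stft(x, w, n_bands)
# Once the filters have filled up, the tone should appear mainly in band 2.
n_check = 40
strongest = max(range(n_bands // 2 + 1), key=lambda k: abs(X[k][n_check]))
print(strongest, [round(abs(X[k][n_check]), 3) for k in range(n_bands // 2 + 1)])
```

Each `X[k]` is a complex time sequence, which matches the description of the discrete STFT as a collection of band-limited sequences.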
• Examples
– ex6p1.m: Generate STFT using Matlab functions
– ex6p2.m: Generate filterbank outputs using the filtering view of the STFT
Short-time synthesis
• Under what conditions is the STFT invertible?
– The discrete-time STFT $X(n, \omega)$ is generally invertible
• Recall that
$$X(n, \omega) = \sum_{m=-\infty}^{\infty} f_n[m]\, e^{-j\omega m}$$
with $f_n[m] = x[m]\, w[n - m]$
• Evaluating $f_n[m]$ at $m = n$ we obtain $f_n[n] = x[n]\, w[0]$
• So assuming that $w[0] \neq 0$, we can estimate $x[n]$ as
$$x[n] = \frac{1}{2\pi\, w[0]} \int_{-\pi}^{\pi} X(n, \omega)\, e^{j\omega n}\, d\omega$$
– This is known as a synthesis equation for the DT STFT
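The synthesis equation can be checked numerically by replacing the integral with a Riemann sum over a dense frequency grid. In the sketch below, the Hamming window centred on $n = 0$ (so that $w[0] = 1$), the 64-point frequency grid, and the test signal are assumptions made for the illustration.

```python
import cmath
import math

HALF = 4  # assumed half-length: the window spans n = -4..4

def w(n):
    """Hamming window centred on n = 0, so that w(0) = 1 is nonzero."""
    if abs(n) > HALF:
        return 0.0
    return 0.54 + 0.46 * math.cos(math.pi * n / HALF)

def stft(x, n, omega):
    """X(n, w) = sum_m x[m] w[n - m] exp(-j w m)."""
    return sum(x[m] * w(n - m) * cmath.exp(-1j * omega * m) for m in range(len(x)))

def synthesize(x, n, n_freq=64):
    """Riemann-sum approximation of (1 / (2 pi w[0])) * integral X(n, w) exp(j w n) dw."""
    d_omega = 2 * math.pi / n_freq
    total = 0j
    for k in range(n_freq):
        omega = -math.pi + k * d_omega
        total += stft(x, n, omega) * cmath.exp(1j * omega * n)
    return (total * d_omega / (2 * math.pi * w(0))).real

x = [math.sin(0.5 * n) + 0.2 * n for n in range(20)]
x_hat = [synthesize(x, n) for n in range(len(x))]
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
print(max_err)
```

Because the window is short, the integrand is a low-order trigonometric polynomial and the Riemann sum is exact up to rounding; for a long window, a denser grid would be needed.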
– Redundancy of the discrete-time STFT
• There are many synthesis equations that map $X(n, \omega)$ uniquely to $x[n]$
• Therefore, the STFT is very redundant if we move the analysis window one sample at a time, $n = 1, 2, 3, \ldots$
• For this reason, the STFT is generally computed by decimating over time, that is, at integer multiples $n = L, 2L, 3L, \ldots$
– For large $L$, however, the DT STFT may become non-invertible
• As an example, assume that $w[n]$ is nonzero over its length $N_w$
• In this case, when $L > N_w$, there are some samples of $x[n]$ that are not included in the computation of $X(nL, \omega)$
• Thus, these samples can have arbitrary values yet yield the same $X(nL, \omega)$
• Since $x[n]$ is then not uniquely determined by $X(nL, \omega)$, the decimated STFT is not invertible
[Figure: amplitude vs. $n$; analysis windows of length $N_w$ placed at $L$, $2L$, $3L$, with the unaccounted temporal samples between them]
– Likewise, the discrete STFT $X(n, k)$ is not always invertible
• Consider the case where $w[n]$ is band-limited with bandwidth $B$
• If the frequency sampling interval $2\pi/N$ is greater than $B$, some of the frequency components in $x[n]$ do not pass through any of the filters of the STFT
• Thus, those frequency components can have any arbitrary values yet produce the same discrete STFT
• In consequence, depending on the frequency sampling resolution, the discrete STFT may become non-invertible
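The time-decimation case ($L > N_w$) is easy to reproduce. The sketch below assumes a causal Hamming window of length 4 and a hop of 6 samples, so two samples between consecutive windows are never analyzed; changing one of them leaves the decimated STFT untouched.

```python
import cmath
import math

def decimated_stft(x, w, hop, n_freq):
    """X(pL, k) for p = 1, 2, ...; w is causal, so frame n uses x[n - N_w + 1 .. n]."""
    frames = []
    for n in range(hop, len(x), hop):
        frames.append([sum(x[m] * w[n - m] * cmath.exp(-2j * math.pi * k * m / n_freq)
                           for m in range(n - len(w) + 1, n + 1))
                       for k in range(n_freq)])
    return frames

n_w, hop, n_freq = 4, 6, 8  # assumed values with hop > window length
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

x_a = [math.sin(0.7 * n) for n in range(20)]
x_b = list(x_a)
x_b[7] += 5.0  # sample 7 lies between the windows ending at n = 6 and n = 12

X_a = decimated_stft(x_a, w, hop, n_freq)
X_b = decimated_stft(x_b, w, hop, n_freq)
same_stft = all(abs(a - b) < 1e-12
                for fa, fb in zip(X_a, X_b) for a, b in zip(fa, fb))
print(same_stft, x_a == x_b)
```

Two different signals with identical decimated STFTs mean that no synthesis rule can recover the signal from this STFT alone.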
Synthesis: filter bank summation
• FBS is based on the filtering interpretation of the STFT
– As we saw earlier, according to this interpretation the discrete STFT is considered to be the set of outputs from a bank of filters
– In the FBS method, the output of each filter is modulated with a complex exponential, and these outputs are summed to recover the original signal
$$y[n] = \frac{1}{N\, w[0]} \sum_{k=0}^{N-1} X(n, k)\, e^{j \frac{2\pi}{N} k n}$$
[Quatieri, 2002]
– Under which conditions does FBS yield exact synthesis?
• It can be shown that $y[n] = x[n]$ if either
1. The length of $w[n]$ is less than or equal to the number of filters ($N_w \leq N$), or
2. For $N_w > N$:
$$\sum_{k=0}^{N-1} W\left( \omega - \frac{2\pi}{N} k \right) = N\, w[0]$$
• The latter is known as the FBS constraint, and states that the frequency response of the analysis filters should sum to a constant across the entire bandwidth
[Quatieri, 2002]
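The FBS formula and its constraint can be tried on a small case. The sketch below assumes a causal Hamming window of length 8 (so $w[0] = 0.08 \neq 0$) and compares $N = 8$ filters, which satisfies $N_w \leq N$, with $N = 4$ filters, which does not; the test signal and the frequency used to probe the constraint are arbitrary.

```python
import cmath
import math

def stft(x, w, n, k, n_freq):
    """X(n, k) = sum_m x[m] w[n - m] exp(-j 2 pi k m / N), causal window w."""
    lo = max(0, n - len(w) + 1)
    return sum(x[m] * w[n - m] * cmath.exp(-2j * math.pi * k * m / n_freq)
               for m in range(lo, n + 1))

def fbs(x, w, n_freq):
    """y[n] = (1 / (N w[0])) * sum_k X(n, k) exp(j 2 pi k n / N)."""
    y = []
    for n in range(len(x)):
        total = sum(stft(x, w, n, k, n_freq) * cmath.exp(2j * math.pi * k * n / n_freq)
                    for k in range(n_freq))
        y.append((total / (n_freq * w[0])).real)
    return y

def fbs_constraint_sum(w, n_freq, omega):
    """sum_k W(omega - 2 pi k / N), with W the DTFT of the window."""
    def dtft(om):
        return sum(w[m] * cmath.exp(-1j * om * m) for m in range(len(w)))
    return sum(dtft(omega - 2 * math.pi * k / n_freq) for k in range(n_freq))

n_w = 8
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]
x = [math.sin(0.4 * n) + 0.3 * math.cos(1.3 * n) for n in range(32)]

err_8 = max(abs(a - b) for a, b in zip(x, fbs(x, w, 8)))  # N_w <= N: exact
err_4 = max(abs(a - b) for a, b in zip(x, fbs(x, w, 4)))  # N_w > N: time aliasing
constraint_8 = fbs_constraint_sum(w, 8, 0.7)              # should equal N * w[0]
print(err_8, err_4, abs(constraint_8 - 8 * w[0]))
```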
Synthesis: Overlap-add
• OLA is based on the Fourier transform view of the STFT
– In the OLA method, we take the inverse DFT for each fixed time in the discrete STFT
– In principle, we could then divide by the analysis window
• This method is not used, however, as small perturbations in the STFT can become amplified in the estimated signal $y[n]$
– Instead, we perform an OLA operation between the sections
• This works provided that $w[n]$ is designed such that the OLA effectively eliminates the analysis windows from the synthesized sequence
• The intuition is that the redundancy within overlapping segments and the averaging of the redundant samples averages out the effect of windowing
– Thus, the OLA method can be expressed as
$$y[n] = \frac{1}{W(0)} \sum_{p=-\infty}^{\infty} \left[ \frac{1}{N} \sum_{k=0}^{N-1} X(p, k)\, e^{j \frac{2\pi}{N} k n} \right]$$
– where the term inside the square brackets is the IDFT and $W(0) = \sum_n w[n]$
– Under which conditions does OLA yield exact synthesis?
• It can be shown that if the discrete STFT has been decimated by a factor $L$, the condition $y[n] = x[n]$ is met when
$$\sum_{p=-\infty}^{\infty} w[pL - n] = \frac{W(0)}{L}$$
• which holds when either
1. The analysis window has finite bandwidth with maximum frequency $\omega_c$ less than $2\pi/L$, or
2. The sum of all the analysis windows (obtained by sliding $w[n]$ with $L$-point increments) adds up to a constant
• In this case, $x[n]$ can then be resynthesized as
$$x[n] = \frac{L}{W(0)} \sum_{p=-\infty}^{\infty} \left[ \frac{1}{N} \sum_{k=0}^{N-1} X(pL, k)\, e^{j \frac{2\pi}{N} k n} \right]$$
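The OLA condition and the resynthesis formula can be verified on a toy case. The sketch below makes several assumptions that are not in the slides: a periodic Hann window (denominator $N_w$ rather than $N_w - 1$, so that copies shifted by half a window sum exactly to one), $N_w = N = 8$, $L = 4$, and a signal treated as zero outside its range.

```python
import cmath
import math

N_W, HOP, N_FREQ = 8, 4, 8
# Periodic Hann (denominator N_W, not N_W - 1): an assumption made so that
# copies shifted by HOP = N_W / 2 add up to exactly one.
w = [0.5 * (1 - math.cos(2 * math.pi * n / N_W)) for n in range(N_W)]
W0 = sum(w)  # W(0), the DC gain of the window

def window_sum(n, n_frames):
    """sum_p w[pL - n]; should equal W(0) / L for every n."""
    return sum(w[p * HOP - n] for p in range(n_frames) if 0 <= p * HOP - n < N_W)

def analysis(x, n_frames):
    """X(pL, k) = sum_m x[m] w[pL - m] exp(-j 2 pi k m / N); x is zero outside its range."""
    X = []
    for p in range(n_frames):
        n0 = p * HOP
        X.append([sum(x[m] * w[n0 - m] * cmath.exp(-2j * math.pi * k * m / N_FREQ)
                      for m in range(max(0, n0 - N_W + 1), min(len(x), n0 + 1)))
                  for k in range(N_FREQ)])
    return X

def ola(X, length):
    """Inverse DFT of each frame over its own support, scaled by L / W(0) and added."""
    y = [0.0] * length
    for p, frame in enumerate(X):
        n0 = p * HOP
        for n in range(max(0, n0 - N_W + 1), min(length, n0 + 1)):
            idft = sum(frame[k] * cmath.exp(2j * math.pi * k * n / N_FREQ)
                       for k in range(N_FREQ)) / N_FREQ
            y[n] += (HOP / W0) * idft.real
    return y

x = [math.sin(0.3 * n) + 0.1 * n for n in range(32)]
n_frames = (len(x) + N_W) // HOP + 1  # enough frames to cover every sample fully
sums = [window_sum(n, n_frames) for n in range(len(x))]
y = ola(analysis(x, n_frames), len(x))
max_err = max(abs(a - b) for a, b in zip(x, y))
print(W0 / HOP, min(sums), max(sums), max_err)
```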
STFT magnitude
• The spectrogram (STFT magnitude) is widely used in speech
– For one, evidence suggests that the human ear extracts information strictly from a spectrogram representation of the speech signal
– Likewise, trained researchers can visually “read” spectrograms, which further indicates that the spectrogram retains most of the information in the speech signal (at least at the phonetic level)
– Hence, one may question whether the original signal $x[n]$ can be recovered from $|X(n, \omega)|$, that is, by ignoring phase information
• Inversion of the STFTM
– Several methods may be used to estimate $x[n]$ from the STFTM
– Here we focus on a fairly intuitive least-squares approximation
• Least-squares estimation from the STFT magnitude
– In this approach, we seek to estimate a sequence $x_e[n]$ whose STFT magnitude $|X_e(n, \omega)|$ is “closest” (in a least-squared-error sense) to the known STFT magnitude $|X(n, \omega)|$
– The iteration takes place as follows
• An arbitrary sequence (usually white noise) is selected as the first estimate $x_e^1[n]$
• We then compute the STFT of $x_e^1[n]$ and modify it by replacing its magnitude by that of $X(n, \omega)$
$$X^1(n, \omega) = |X(n, \omega)|\, \frac{X_e^1(n, \omega)}{|X_e^1(n, \omega)|}$$
• From this, we obtain a new signal estimate as
$$x_e^i[n] = \frac{\sum_{m=-\infty}^{\infty} w[m - n]\, g_m^{i-1}[n]}{\sum_{m=-\infty}^{\infty} w^2[m - n]}$$
where $g_m^{i-1}[n]$ is the inverse DFT of $X^{i-1}(m, \omega)$
• And the process continues iteratively until convergence or a stopping criterion is met
• It can be shown that this process reduces the distance between $|X_e(n, \omega)|$ and $|X(n, \omega)|$ at each iteration
• Thus, the process converges to a local minimum, though not necessarily a global minimum
– All steps in the iteration can be summarized as (Quatieri, 2002; p. 342)
$$x_e^{i+1}[n] = \frac{\sum_{m=-\infty}^{\infty} w[m - n]\, \frac{1}{2\pi} \int_{-\pi}^{\pi} X^i(m, \omega)\, e^{j\omega n}\, d\omega}{\sum_{m=-\infty}^{\infty} w^2[m - n]}$$
where $X^i(m, \omega) = |X(m, \omega)|\, \frac{X_e^i(m, \omega)}{|X_e^i(m, \omega)|}$
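The iteration can be sketched compactly as follows. This is an illustration under stated assumptions, not the course's ex6p4.m: it uses a Hamming window of length 16, a hop of 4 samples, frame-local DFTs of the same length as the window, a seeded Gaussian noise start, and a fixed number of iterations as the stopping criterion. The recorded distance is the summed squared difference between the estimated and target magnitudes.

```python
import cmath
import math
import random

N_W, HOP = 16, 4  # assumed window length and hop
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N_W - 1)) for n in range(N_W)]

def dft(v, sign):
    """Unnormalized DFT (sign = -1) or inverse DFT kernel (sign = +1)."""
    n = len(v)
    return [sum(v[i] * cmath.exp(sign * 2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]

def stft(x):
    """One DFT per windowed frame; frames start at 0, HOP, 2*HOP, ..."""
    return [dft([x[s + i] * w[i] for i in range(N_W)], -1)
            for s in range(0, len(x) - N_W + 1, HOP)]

def ls_synthesis(X, length):
    """x[n] = sum_m w_m[n] g_m[n] / sum_m w_m[n]^2, with g_m the inverse DFT of frame m."""
    num, den = [0.0] * length, [0.0] * length
    for p, frame in enumerate(X):
        g = [v.real / N_W for v in dft(frame, +1)]
        for i in range(N_W):
            num[p * HOP + i] += w[i] * g[i]
            den[p * HOP + i] += w[i] ** 2
    return [a / b if b > 0 else 0.0 for a, b in zip(num, den)]

def distance(X_est, mag):
    """Summed squared difference between estimated and target STFT magnitudes."""
    return sum((abs(a) - m) ** 2 for fa, fm in zip(X_est, mag) for a, m in zip(fa, fm))

def estimate_from_magnitude(mag, length, n_iter=30, seed=0):
    rng = random.Random(seed)
    x_e = [rng.gauss(0.0, 1.0) for _ in range(length)]  # arbitrary first estimate
    history = []
    for _ in range(n_iter):
        X_e = stft(x_e)
        history.append(distance(X_e, mag))
        # Keep the phase of the current estimate, impose the known magnitude.
        X_mod = [[m * (a / abs(a)) if abs(a) > 0 else m for a, m in zip(fa, fm)]
                 for fa, fm in zip(X_e, mag)]
        x_e = ls_synthesis(X_mod, length)
    return x_e, history

x = [math.sin(0.2 * n) + 0.5 * math.sin(0.45 * n) for n in range(64)]
target_mag = [[abs(v) for v in frame] for frame in stft(x)]
x_hat, history = estimate_from_magnitude(target_mag, len(x))
print(history[0], history[-1])
```

The distance history should be non-increasing, in line with the convergence statement above. The recovered `x_hat` need not match `x` sample by sample, since the magnitude alone does not fix the sign of the signal and the iteration may stop at a local minimum.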
• Example
ex6p4.m Estimate a signal from its STFT magnitude