Mel Spectrogram

In short: A mel spectrogram is a time-frequency representation of audio whose frequency axis is warped to the mel scale to approximate human pitch perception. It is commonly used as the audio input to lip sync neural networks.

About Mel Spectrogram

A mel spectrogram turns a raw audio waveform into a 2D representation: the x-axis is time, the y-axis is frequency bands spaced along the mel scale (which approximates human auditory perception), and color intensity shows the energy in each band at each moment. Most neural lip sync models use mel spectrograms rather than raw waveforms as their primary audio input, because the mel scale gives finer resolution to the lower frequencies, where most of the information relevant to speech perception lies.
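As a rough illustration of that conversion, here is a minimal sketch in plain Python. It assumes the common formula mel = 2595 * log10(1 + f / 700) for the mel scale, a Hann window, and triangular filters. The function names and the default frame size, hop size, and band count are illustrative choices, not settings taken from any particular lip sync model. It uses a naive DFT for readability, so it is only practical for short signals.

```python
import cmath
import math


def hz_to_mel(f_hz):
    # One widely used mel-scale formula (an assumption of this sketch).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale,
    # covering 0 Hz up to the Nyquist frequency.
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    # Hann-windowed power spectrum computed with a naive O(n^2) DFT.
    n = len(frame)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
        for i, s in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    # Returns frames[t][m]: log energy (dB) in mel band m at time step t.
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spectrum = power_spectrum(signal[start:start + n_fft])
        energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
        frames.append([10.0 * math.log10(e + 1e-10) for e in energies])
    return frames


# Usage: a short 440 Hz test tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(1024)]
spec = mel_spectrogram(tone, sample_rate)
```

Each inner list is one time step (a column of the image) and each entry is one mel band (a row). In practice an audio library with a fast Fourier transform would replace the naive DFT used here.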

This representation helps models focus on the phonetically meaningful components of speech that determine mouth shapes.

How Mel Spectrogram Connects to Lip Sync

Mel Spectrogram relates to other concepts in the AI lip sync pipeline, including Phoneme and Audio-Driven Animation.
