Mel Spectrogram
In short: A mel spectrogram is a visual representation of audio frequency content over time, scaled to match human hearing perception, commonly used as input to lip sync neural networks.
About Mel Spectrogram
Mel spectrograms convert raw audio waveforms into a 2D representation where the x-axis represents time, the y-axis represents frequency bands spaced on the mel scale (approximately linear below about 1 kHz and logarithmic above, mirroring how listeners perceive pitch), and the color intensity represents the energy in each band at each moment. Many neural lip sync models use mel spectrograms as their primary audio input rather than raw waveforms, because the mel scale allocates more resolution to the lower frequency ranges most relevant to speech perception.
This representation helps models focus on the phonetically meaningful components of speech that determine mouth shapes.
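To make the axes concrete, here is a minimal sketch of the computation in plain Python. It is illustrative only and not the pipeline of any particular lip sync model. The HTK-style mel formula, the 256-sample frame, the 128-sample hop, the 20 mel bands, the 8 kHz sample rate, and the 440 Hz test tone are all assumptions chosen for the example. In practice, a library such as librosa or torchaudio would be used instead of the slow hand-written DFT shown here.

```python
import cmath
import math


def hz_to_mel(hz):
    """Hz to mel, using the HTK-style formula (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge points: filter m spans edges m, m + 1 (center), m + 2.
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        left, center, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a naive O(n^2) DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(total) ** 2)
    return spectrum


def mel_spectrogram(samples, sample_rate, n_fft=256, hop=128, n_mels=20):
    """Return a list of time frames, each a list of n_mels log-energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        mel_energy = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # Small offset avoids log(0) in silent bands.
        frames.append([math.log(e + 1e-10) for e in mel_energy])
    return frames


# Hypothetical input: a 440 Hz sine tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1024)]
spec = mel_spectrogram(tone, sample_rate)
print(len(spec), len(spec[0]))  # 7 time frames x 20 mel bands
```

Each inner list is one column of the spectrogram image: a moment in time, with one log-energy value per mel band. A model consuming this input would typically receive a short window of consecutive columns for each video frame.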
How Mel Spectrogram Connects to Lip Sync
Mel Spectrogram relates to other concepts in the AI lip sync pipeline: Phoneme and Audio-Driven Animation.