Jun 14, 2024: We propose Whisper-Flamingo, which integrates visual features into the Whisper speech recognition and translation model with gated cross attention.
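To make "gated cross attention" concrete, here is a minimal sketch of a Flamingo-style gated residual cross-attention step in plain Python. It rests on several assumptions that are not from the snippets: a single attention head, no learned query/key/value projections, no layer norm or feed-forward sublayer, and visual features that have already been projected to the same width as the hidden states `x`. The function names (`cross_attention`, `gated_cross_attention`) and the scalar gate `alpha` are hypothetical and illustrative only. The idea it shows is that the output is `x + tanh(alpha) * attention(x, visual)`, so with `alpha = 0` the layer is an identity and the pretrained audio-only model behaves as before.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # (n x k) @ (k x m) -> (n x m), using lists of lists.
    return [[sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    # Single-head scaled dot-product attention: each query row attends over all key rows.
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def gated_cross_attention(x: Matrix, visual: Matrix, alpha: float) -> Matrix:
    # Hypothetical gated residual: x + tanh(alpha) * attn(x, visual).
    # With alpha = 0.0 the gate is exactly 0, so the output equals x.
    attended = cross_attention(x, visual, visual)
    gate = math.tanh(alpha)
    return [[xi + gate * ai for xi, ai in zip(x_row, a_row)]
            for x_row, a_row in zip(x, attended)]
```

For example, with `x = [[1.0, 0.0], [0.0, 1.0]]` and three visual frames, `gated_cross_attention(x, visual, 0.0)` returns `x` unchanged, while a nonzero `alpha` mixes in visual information. Starting the gate at zero is the usual motivation for this design: the new layers can be trained without disturbing the pretrained model at initialization.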
In this work, we propose to integrate visual features from AV-HuBERT into Whisper [1], an audio-only model trained on 680k hours of speech with a strong…
Video for Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation.
Jun 15, 2024: [Interspeech 2024] Whisper-Flamingo: Integrating Visual Features into Whisper…
Duration: 10:04
Posted: Jun 15, 2024
Our audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy conditions.
We convert Whisper into an audio-visual speech recognition model so that it can use both audio and lip-based video as input.
Sep 4, 2024: Our audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy…
Jun 16, 2024: Whisper-Flamingo is a new artificial intelligence (AI) model that combines visual information with audio data to improve speech recognition and translation.
The enhanced audio features are fused with the visual features and fed into an encoder-decoder model composed of a Conformer and a Transformer for speech…
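The snippet above does not say how the audio and visual features are fused. One simple, commonly used option is to align the two streams to the same number of frames and concatenate them per frame; the sketch below illustrates that option under explicit assumptions. It assumes that video has fewer frames than audio, that nearest-neighbour repetition is an acceptable way to align them, and that concatenation is the fusion operator. None of these choices is confirmed by the source, and the function names `align_by_repetition` and `concat_fusion` are hypothetical.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per frame


def align_by_repetition(visual: Frames, target_len: int) -> Frames:
    # Nearest-neighbour upsampling: map each target frame index to a source frame index.
    n = len(visual)
    return [visual[min(i * n // target_len, n - 1)] for i in range(target_len)]


def concat_fusion(audio: Frames, visual: Frames) -> Frames:
    # Hypothetical fusion: align the visual stream to the audio frame count,
    # then concatenate the audio and visual feature vectors frame by frame.
    aligned = align_by_repetition(visual, len(audio))
    return [a + v for a, v in zip(audio, aligned)]
```

With 4 audio frames of width 2 and 2 visual frames of width 1, each visual frame is repeated twice, and the fused sequence has 4 frames of width 3. A learned alternative, such as cross attention, would avoid the hand-chosen alignment, at the cost of extra parameters.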