AlignNet: A Unifying Approach to Audio-Visual Alignment

Jianren Wang* (CMU)
Zhaoyuan Fang* (University of Notre Dame)
Hang Zhao (MIT)
WACV 2020
[Download Paper]
[GitHub Code]
[Data (Dance50)]
[Slides]
[Poster]



We present AlignNet, a model that synchronizes a video with a reference audio under non-uniform and irregular misalignment. AlignNet learns an end-to-end dense correspondence between each frame of a video and the audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and an affinity function. Together with the model, we release Dance50, a dance dataset for training and evaluation. Qualitative, quantitative, and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms state-of-the-art methods.
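
To make the principles above concrete, here is a minimal, hypothetical PyTorch sketch of the affinity-plus-warping idea at a single temporal scale: an affinity function scores every (video frame, audio step) pair, attention over the audio axis turns the scores into a dense correspondence, and a soft warp pulls each frame's matched audio feature onto the video timeline. The function name, the cosine-similarity affinity, and the temperature are illustrative assumptions, not the released AlignNet code, which additionally applies this processing pyramidally from coarse to fine.

import torch
import torch.nn.functional as F

def soft_align(video_feats, audio_feats, temperature=0.07):
    """Soft-align video frames to audio steps (illustrative sketch).

    video_feats: (T_v, C) per-frame video embeddings
    audio_feats: (T_a, C) per-step audio embeddings
    Returns audio features warped onto the video time axis and the
    dense correspondence (attention) matrix.
    """
    # Affinity function: cosine similarity between every video frame
    # and every audio step (an assumed choice of affinity).
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    affinity = v @ a.t()                        # (T_v, T_a)

    # Attention over the audio axis yields a dense correspondence.
    attn = F.softmax(affinity / temperature, dim=-1)

    # Warping: each video frame gathers its soft-matched audio feature.
    warped_audio = attn @ audio_feats           # (T_v, C)
    return warped_audio, attn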




Paper and Bibtex

Citation
 
Jianren Wang*, Zhaoyuan Fang*, Hang Zhao. AlignNet: A Unifying Approach to Audio-Visual Alignment.
In WACV 2020.

[Bibtex] [Paper] [ArXiv]
@inproceedings{jianren20alignnet,
    Author = {Wang, Jianren and Fang, Zhaoyuan and Zhao, Hang},
    Title = {AlignNet: A Unifying Approach to Audio-Visual Alignment},
    Booktitle = {WACV},
    Year = {2020}
}


Acknowledgements

We would like to thank David Held, Antonio Torralba, and members of the CMU RPAD lab and MIT CSAIL for fruitful discussions. The work was carried out while JW/ZF were at CMU and HZ was at MIT. This work was supported by a PanGU Young Investigator Award to JW.