SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion

Inkyu Park · Jaewoong Cho

Preprint

Abstract

Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets despite extensive research. Most prior works, which typically learn regression models on a small dataset using the method of least squares, have difficulty generating diverse lip movements from speech and require substantial effort to refine the generated outputs. To address these issues, we propose SAiD, a speech-driven 3D facial animation method based on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual modalities to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior lip synchronization to baselines, ensures more diverse lip movements, and streamlines the animation editing process.
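The cross-modality alignment bias can be pictured as an additive attention bias that keeps each animation frame attending to temporally nearby audio features. The following is a minimal sketch of that idea, assuming a hard attention window and simple (frames x features) tensors; the function names, shapes, and bias form are illustrative and are not the actual SAiD implementation.

import torch

def alignment_bias(num_frames: int, num_audio: int, width: int = 1) -> torch.Tensor:
    """Additive bias of shape (num_frames, num_audio): 0 near the temporally
    aligned audio position of each frame, -inf elsewhere (illustrative form)."""
    frame_idx = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)  # (T, 1)
    audio_idx = torch.arange(num_audio, dtype=torch.float32).unsqueeze(0)   # (1, S)
    aligned = frame_idx * (num_audio - 1) / max(num_frames - 1, 1)          # expected audio position per frame
    bias = torch.zeros(num_frames, num_audio)
    bias[(audio_idx - aligned).abs() > width] = float("-inf")
    return bias

def biased_cross_attention(query, key, value, bias):
    """query: (T, d) blendshape-frame features; key, value: (S, d) audio features."""
    scores = query @ key.T / query.shape[-1] ** 0.5 + bias
    return torch.softmax(scores, dim=-1) @ value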

Comparison with baseline methods

[Video comparison grid: GT, SAiD (Ours), end2end_AU_speech, VOCA+QP, MeshTalk+QP, FaceFormer+QP, and CodeTalker+QP, each driven by FaceTalk_170731_00024_TA/sentence01.wav and FaceTalk_170809_00138_TA/sentence02.wav]
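The "+QP" suffix marks baselines that output mesh vertices, whose frames are converted into blendshape coefficients via quadratic programming before comparison. The sketch below shows one such conversion under an assumed delta-blendshape model with coefficients bounded in [0, 1]; the formulation and names are assumptions rather than the paper's exact procedure.

from scipy.optimize import lsq_linear

def mesh_to_blendshape_coeffs(vertices, neutral, blendshape_deltas):
    """vertices, neutral: (V, 3) arrays; blendshape_deltas: (B, V, 3) per-blendshape offsets.
    Solves min_w ||A w - b||^2 subject to 0 <= w <= 1, i.e. a quadratic program."""
    A = blendshape_deltas.reshape(blendshape_deltas.shape[0], -1).T  # (3V, B)
    b = (vertices - neutral).reshape(-1)                             # (3V,)
    return lsq_linear(A, b, bounds=(0.0, 1.0)).x                     # (B,) coefficients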

Ablation studies

[Video comparison grid: SAiD (Base) vs. training with squared error, without velocity loss, without alignment bias, and with fine-tuning of the pre-trained Wav2Vec 2.0, each driven by FaceTalk_170731_00024_TA/sentence01.wav and FaceTalk_170809_00138_TA/sentence02.wav]
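For reference, the "velocity loss" ablation refers to a temporal term on frame-to-frame differences of the coefficients. The sketch below shows one plausible form, assuming tensors of shape (batch, frames, blendshapes) and the absolute-error style suggested by the "squared error" row; this is an assumed form, not the authors' exact objective.

import torch

def velocity_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: blendshape coefficients of shape (batch, frames, num_blendshapes)."""
    pred_vel = pred[:, 1:] - pred[:, :-1]        # frame-to-frame differences
    target_vel = target[:, 1:] - target[:, :-1]
    return torch.mean(torch.abs(pred_vel - target_vel))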

Diversity

We visualize the per-vertex position differences between each SAiD output and the mean output, using the viridis colormap over the range [0, 0.001].
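A minimal sketch of this visualization, assuming the sampled outputs have already been converted to per-frame vertex positions; the array names are illustrative.

import numpy as np
import matplotlib.cm as cm

def diversity_colors(outputs: np.ndarray) -> np.ndarray:
    """outputs: (num_samples, num_frames, num_vertices, 3) vertex positions.
    Returns viridis RGBA colors, clipping per-vertex distances to [0, 0.001]."""
    mean_output = outputs.mean(axis=0, keepdims=True)
    dist = np.linalg.norm(outputs - mean_output, axis=-1)  # distance to the mean output
    normalized = np.clip(dist / 0.001, 0.0, 1.0)           # map [0, 0.001] -> [0, 1]
    return cm.viridis(normalized)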

[Video grid: Outputs 1-5 for FaceTalk_170731_00024_TA/sentence01.wav and FaceTalk_170809_00138_TA/sentence02.wav]

Editability

We visualize SAiD's editing results in two settings:

  1. Motion in-betweening
  2. Motion generation with blendshape-specific constraints, obtained by masking the coefficients of certain blendshapes

Hatched boxes indicate the masked areas that must remain invariant during editing (see the sketch below).
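Such editing can be pictured as diffusion inpainting: at each reverse-diffusion step, the masked (fixed) region is replaced by a re-noised copy of the reference animation, while the free region is denoised as usual. The step below is a rough, RePaint-style sketch under that assumption; the function names and scheduler interface are hypothetical and are not the actual SAiD code.

import torch

def edit_step(x_t, reference, mask, t, denoise_fn, add_noise_fn):
    """One mask-guided reverse-diffusion step (illustrative).
    x_t:          noisy blendshape coefficients, shape (frames, num_blendshapes)
    reference:    known coefficients to preserve in the masked region
    mask:         1 where the output must stay equal to `reference`, 0 elsewhere
    denoise_fn:   (x_t, t) -> x_{t-1}, an assumed reverse-diffusion step
    add_noise_fn: (x_0, t) -> x_t, an assumed forward-diffusion noising step
    """
    x_prev_free = denoise_fn(x_t, t)               # generate the unmasked region
    x_prev_known = add_noise_fn(reference, t - 1)  # re-noise the reference for the masked region
    return mask * x_prev_known + (1.0 - mask) * x_prev_free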

[Video grid: in-betweening and blendshape-specific constraint results for FaceTalk_170731_00024_TA/sentence01.wav and FaceTalk_170809_00138_TA/sentence02.wav]

Visualization on different blendshape facial models

Since MetaHuman does not support the mouthClose blendshape, we use the editing algorithm to force the corresponding blendshape coefficients of the outputs to zero.
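Under the assumed editing interface sketched above, this amounts to masking the mouthClose channel and supplying an all-zero reference for it; the blendshape name list below is hypothetical.

import torch

num_frames = 120
blendshape_names = ["jawOpen", "mouthClose", "mouthFunnel"]  # hypothetical subset
reference = torch.zeros(num_frames, len(blendshape_names))   # desired values: all zeros
mask = torch.zeros(num_frames, len(blendshape_names))
mask[:, blendshape_names.index("mouthClose")] = 1.0          # keep mouthClose fixed at zero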

[Video grid: VOCASET - FaceTalk_170725_00137_TA, VRoid Studio - AvatarSample_A, MetaHuman - Ada, and Unity_ARKitFacialCapture - Sloth, each driven by FaceTalk_170731_00024_TA/sentence01.wav and FaceTalk_170809_00138_TA/sentence02.wav]

BibTeX

@misc{park2023said,
    title={SAiD: Speech-driven Blendshape Facial Animation with Diffusion},
    author={Inkyu Park and Jaewoong Cho},
    year={2023},
    eprint={2401.08655},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}