Despite extensive research, speech-driven 3D facial animation remains challenging due to the scarcity of large-scale audio-visual datasets. Most prior works, which typically learn regression models on small datasets via least squares, have difficulty generating diverse lip movements from speech and require substantial effort to refine the generated outputs. To address these issues, we propose SAiD, a speech-driven 3D facial animation method based on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual modalities to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of paired speech audio and blendshape facial model parameters, to address the scarcity of public resources. Our experiments demonstrate that the proposed approach achieves lip synchronization comparable or superior to baselines, generates more diverse lip movements, and streamlines the animation editing process.
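The cross-modality alignment bias can be understood as a score offset added to audio-visual attention that favors temporally aligned pairs. Below is a minimal, hypothetical PyTorch sketch of such a biased cross-attention; the distance-based form of the bias and all names are our illustrative assumptions, not SAiD's exact implementation.

```python
import torch
import torch.nn.functional as F


def aligned_cross_attention(q, k, v, bias_scale=8.0):
    """Single-head cross-attention from visual frames (queries) to audio
    features (keys/values) with an alignment bias that favors temporally
    close audio-visual pairs.

    q: (B, T_v, D) visual queries; k, v: (B, T_a, D) audio keys/values.
    NOTE: hypothetical sketch; the bias form in SAiD may differ.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d**0.5            # (B, T_v, T_a)

    # Alignment bias: penalize attention between temporally distant frames
    # by subtracting a term proportional to the normalized time gap.
    t_v = torch.linspace(0.0, 1.0, q.shape[1], device=q.device)[:, None]
    t_a = torch.linspace(0.0, 1.0, k.shape[1], device=q.device)[None, :]
    bias = -bias_scale * (t_v - t_a).abs()               # (T_v, T_a)

    attn = F.softmax(scores + bias, dim=-1)
    return attn @ v                                      # (B, T_v, D)
```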
Comparison with baseline methods:

| Audio clip | GT | SAiD (Ours) | end2end_AU_speech | VOCA+QP | MeshTalk+QP | FaceFormer+QP | CodeTalker+QP |
|---|---|---|---|---|---|---|---|
| FaceTalk_170731_00024_TA/sentence01.wav | (video) | (video) | (video) | (video) | (video) | (video) | (video) |
| FaceTalk_170809_00138_TA/sentence02.wav | (video) | (video) | (video) | (video) | (video) | (video) | (video) |
Ablation studies of SAiD:

| Audio clip | SAiD (Base) | train w/ squared error | train w/o velocity loss | train w/o alignment bias | finetune pre-trained Wav2Vec 2.0 |
|---|---|---|---|---|---|
| FaceTalk_170731_00024_TA/sentence01.wav | (video) | (video) | (video) | (video) | (video) |
| FaceTalk_170809_00138_TA/sentence02.wav | (video) | (video) | (video) | (video) | (video) |
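Among the ablations above, "train w/o velocity loss" removes a temporal-smoothness term on consecutive frames. Below is a minimal sketch of one common formulation of such a loss; the exact norm and weighting used in SAiD are our assumptions.

```python
import torch


def velocity_loss(pred, target):
    """Penalize mismatch between frame-to-frame differences (velocities)
    of predicted and ground-truth blendshape coefficients, shape (B, T, C).
    A generic formulation; SAiD's exact norm/weighting may differ.
    """
    pred_vel = pred[:, 1:] - pred[:, :-1]        # (B, T-1, C)
    target_vel = target[:, 1:] - target[:, :-1]  # (B, T-1, C)
    return (pred_vel - target_vel).abs().mean()
```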
We visualize the per-vertex position difference of each SAiD output from the mean output, using the viridis colormap over the range [0, 0.001]; a sketch of this computation follows the table.
| Audio clip | Output 1 | Output 2 | Output 3 | Output 4 | Output 5 |
|---|---|---|---|---|---|
| FaceTalk_170731_00024_TA/sentence01.wav | (video) | (video) | (video) | (video) | (video) |
| FaceTalk_170809_00138_TA/sentence02.wav | (video) | (video) | (video) | (video) | (video) |
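A minimal sketch of how such a per-vertex difference visualization can be computed with NumPy and matplotlib (function and array names are ours; the figure's exact rendering pipeline is not specified):

```python
import numpy as np
from matplotlib import cm
from matplotlib.colors import Normalize


def vertex_diff_colors(outputs):
    """outputs: (N, V, 3) vertex positions of N sampled animations
    (one frame). Returns per-output RGB colors encoding each vertex's
    distance from the mean output, using viridis over the fixed range
    [0, 0.001] as stated in the caption.
    """
    mean = outputs.mean(axis=0)                      # (V, 3) mean output
    dists = np.linalg.norm(outputs - mean, axis=-1)  # (N, V) distances
    norm = Normalize(vmin=0.0, vmax=0.001, clip=True)
    return cm.viridis(norm(dists))[..., :3]          # (N, V, 3) RGB
```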
We visualize the editing results of SAiD in two cases (in-betweening and blendshape-specific constraints). Hatched boxes indicate the masked regions that should remain unchanged during editing; see the sketch after the table.
| Audio clip | In-betweening | Blendshape-specific constraints |
|---|---|---|
| FaceTalk_170731_00024_TA/sentence01.wav | (video) | (video) |
| FaceTalk_170809_00138_TA/sentence02.wav | (video) | (video) |
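Both editing modes can be realized with mask-conditioned denoising: at each reverse-diffusion step, the invariant region is overwritten with an appropriately noised copy of the reference coefficients, so only the unmasked region is regenerated. The following RePaint-style sketch assumes a diffusers-like scheduler interface; all names are illustrative, and this is not necessarily SAiD's exact procedure. For example, setting `keep_mask` to 1 on the mouthClose channel with `ref` zeroed there reproduces the constraint described in the note below.

```python
import torch


@torch.no_grad()
def edit_blendshapes(model, scheduler, ref, keep_mask, audio_feat):
    """Regenerate only the unmasked entries of a blendshape sequence.

    ref:       (B, T, C) reference coefficients (e.g., a SAiD output).
    keep_mask: (B, T, C) 1 where values stay fixed (hatched regions),
               0 where the model is free to regenerate.
    RePaint-style sketch assuming a diffusers-like scheduler; SAiD's
    actual editing algorithm may differ.
    """
    x = torch.randn_like(ref)
    for t in scheduler.timesteps:
        # Predict noise conditioned on the audio features, then denoise.
        eps = model(x, t, audio_feat)
        x = scheduler.step(eps, t, x).prev_sample
        # Re-impose the known region at (approximately) the current
        # noise level so it stays consistent with the reference.
        noised_ref = scheduler.add_noise(ref, torch.randn_like(ref), t)
        x = keep_mask * noised_ref + (1 - keep_mask) * x
    # Finally, clamp the kept region exactly to the reference values.
    return keep_mask * ref + (1 - keep_mask) * x
```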
Since MetaHuman does not support the mouthClose blendshape, we use the editing algorithm to constrain the corresponding blendshape coefficients of the outputs to zero.
Retargeting the same SAiD outputs to different characters:

| Audio clip | VOCASET - FaceTalk_170725_00137_TA | VRoid Studio - AvatarSample_A | MetaHuman - Ada | Unity_ARKitFacialCapture - Sloth |
|---|---|---|---|---|
| FaceTalk_170731_00024_TA/sentence01.wav | (video) | (video) | (video) | (video) |
| FaceTalk_170809_00138_TA/sentence02.wav | (video) | (video) | (video) | (video) |
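Retargeting is possible because SAiD predicts rig-agnostic blendshape coefficients: any character with a matching blendshape set can be driven by evaluating the standard linear blendshape model with the same coefficients. A minimal NumPy sketch (names are ours):

```python
import numpy as np


def apply_blendshapes(neutral, deltas, coeffs):
    """Evaluate a linear blendshape model.

    neutral: (V, 3) neutral-pose vertices of the target character.
    deltas:  (C, V, 3) per-blendshape vertex offsets from the neutral pose.
    coeffs:  (T, C) SAiD-style animation coefficients in [0, 1].
    Returns (T, V, 3) animated vertex positions:
        v(t) = neutral + sum_c coeffs[t, c] * deltas[c]
    """
    return neutral[None] + np.einsum("tc,cvd->tvd", coeffs, deltas)
```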
```bibtex
@misc{park2023said,
  title={SAiD: Speech-driven Blendshape Facial Animation with Diffusion},
  author={Inkyu Park and Jaewoong Cho},
  year={2023},
  eprint={2401.08655},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```