Practical lip syncing models have existed for several years. Wav2Lip, for example, although not diffusion based, does a decent job of syncing lips in any spoken language. I recently built a simple interface for it in the aiTransformer app, called Speech Synthesizer (https://aiTransformer.net/SpeechSynthesizer), so anyone can do lip syncing easily. That said, I'll try the new model once it's open sourced to see whether it does a better job in terms of result quality and performance. I also like the potential uses of this technology that the author mentioned; they do get me thinking.
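
For anyone who'd rather run Wav2Lip directly than go through an app, here's a minimal Python sketch that shells out to the inference script from the official repo. It assumes you've cloned the Wav2Lip repo and downloaded a pretrained checkpoint; the file names (speaker.mp4, speech.wav, checkpoints/wav2lip_gan.pth) are just illustrative placeholders.

    # Minimal sketch: drive Wav2Lip's inference script on a video + audio pair.
    # Assumes the Wav2Lip repo is cloned locally and a pretrained checkpoint
    # has been downloaded into its checkpoints/ folder.
    import subprocess

    subprocess.run(
        [
            "python", "inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained weights
            "--face", "speaker.mp4",   # video (or still image) containing the face
            "--audio", "speech.wav",   # driving audio, in any spoken language
        ],
        cwd="Wav2Lip",  # path to the cloned repo
        check=True,     # raise if inference fails
    )

The synced result lands in the repo's results/ folder by default. The nice thing about this approach is that the audio track can be in any language, since the model works on raw speech rather than text.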