In what has become a familiar refrain about artificial intelligence (AI)-enabled technologies, voice cloning enables beneficial advances in accessibility and creativity but also makes possible increasingly sophisticated scams and deepfakes. To combat the potential negative impacts of voice cloning technology, the U.S. Federal Trade Commission (FTC) challenged researchers and technology experts to develop breakthrough ideas on preventing, monitoring and evaluating malicious voice cloning.
Ning Zhang, an assistant professor of computer science and engineering in the McKelvey School of Engineering at Washington University in St. Louis, was one of three winners of the FTC’s Voice Cloning Challenge announced April 8. Zhang explained his winning project, DeFake, which deploys a kind of watermarking for voice recordings. DeFake embeds carefully crafted distortions that are imperceptible to the human ear into recordings, making criminal cloning more difficult by eliminating usable voice samples.
“DeFake uses a technique of adversarial AI that was originally part of the cybercriminals’ toolbox, but now we’re using it to defend against them,” Zhang said. “Voice cloning relies on the use of pre-existing speech samples to clone a voice, which are generally collected from social media and other platforms. By perturbing the recorded audio signal just a little bit, just enough that it still sounds right to human listeners, but it’s completely different to AI, DeFake obstructs cloning by making criminally synthesized speech sound like other voices, not the intended victim.”
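The general idea Zhang describes, a small bounded perturbation that barely changes the waveform but shifts what a model "hears," can be illustrated with a toy sketch. This is not DeFake's implementation: the random linear `encoder` stands in for a real speaker-embedding model, and the names `embed`, `cosine` and `protect`, along with the values of `eps` and `steps`, are assumptions made for illustration only.

```python
import numpy as np


def embed(weights, audio):
    """Toy 'speaker encoder': a fixed linear projection (an assumption, not a real model)."""
    return weights @ audio


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def protect(audio, weights, eps=0.01, steps=20, seed=0):
    """Find a perturbation with max |delta| <= eps that lowers the cosine
    similarity between the perturbed and original embeddings."""
    rng = np.random.default_rng(seed)
    target = embed(weights, audio)
    # Random start: the gradient of the similarity is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=audio.shape)
    step = eps / 4
    for _ in range(steps):
        emb = embed(weights, audio + delta)
        na, nb = np.linalg.norm(emb), np.linalg.norm(target)
        sim = emb @ target / (na * nb)
        # Gradient of cosine similarity w.r.t. the embedding, then the chain
        # rule through the linear encoder back to the audio samples.
        grad_emb = target / (na * nb) - sim * emb / na**2
        grad_delta = weights.T @ grad_emb
        # Signed descent step, then clip so no sample moves by more than eps.
        delta = np.clip(delta - step * np.sign(grad_delta), -eps, eps)
    return delta


# Hypothetical input: a short 220 Hz tone standing in for a voice sample.
rng = np.random.default_rng(42)
t = np.arange(512) / 16000.0
voice = 0.5 * np.sin(2 * np.pi * 220.0 * t)
encoder = rng.standard_normal((16, 512))

delta = protect(voice, encoder)
print("max |delta|:", np.abs(delta).max())
print("similarity after:", cosine(embed(encoder, voice), embed(encoder, voice + delta)))
```

In this sketch the cap `eps` is a crude stand-in for "imperceptible to the human ear." A real system would need a perceptual model of hearing and an actual voice-cloning or speaker-recognition network in place of the random matrix.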