The Center for Data Innovation spoke with Hao Li, associate professor of computer science at the University of Southern California and CEO of Pinscreen, a digital avatar startup based in Los Angeles. Li discussed his work in building deepfake detection technology and how his company uses AI to create 3D avatars from 2D pictures.
This interview has been lightly edited.
Joshua New: You recently developed a digital forensics technique that can detect if a video is genuine or a deepfake. Can you explain how this works?
Hao Li: We collaborated with Hany Farid from UC Berkeley on an approach that looks at facial motion and expressions of specific subjects instead of image signals. The idea is to recognize particular motion signatures of known people like world leaders and celebrities and to determine whether or not there is another person puppeteering the target subject.
To identify these motion signatures, we developed software based on facial tracking technology, which records the movement of someone’s face and analyzes how individual parts correlate with each other. For example, someone might blink when they start smiling, or nod their head when they start to speak. By analyzing how movements in different parts of the face relate to each other, we can build a statistical profile of someone’s motion signature that is unique to that person. Consequently, if an impersonator tries to puppeteer President Trump, he may not be able to replicate all the subtle motions and coordinate his facial expressions as naturally as President Trump does. In this way, regardless of how the video has been manipulated, we can tell from the puppeteer’s movements that the motion cannot be President Trump’s. However, in the future it may even be possible to generate videos that replicate facial movements accurately matching the target subject.
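As an illustration of the correlation idea described above (not the software Li's team actually built), here is a minimal Python sketch. It assumes a face tracker has already produced one number per frame for each facial feature. The feature names (`smile`, `blink`, `nod`), the helper functions, and the toy values are all hypothetical, and the plain Euclidean comparison at the end is a stand-in for whatever statistical model a real system would use.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def motion_signature(tracks):
    """Map each pair of feature names to the correlation of their per-frame values.

    `tracks` is a dict: feature name -> list of per-frame measurements
    (hypothetical tracker output, one value per video frame).
    """
    names = sorted(tracks)
    return {(a, b): pearson(tracks[a], tracks[b]) for a, b in combinations(names, 2)}


def signature_distance(sig_a, sig_b):
    """Euclidean distance between two signatures over the feature pairs they share."""
    shared = sorted(set(sig_a) & set(sig_b))
    return math.sqrt(sum((sig_a[k] - sig_b[k]) ** 2 for k in shared))


# Toy data (made up): the known subject blinks more as they smile and nods less.
reference = {"smile": [0, 1, 2, 3, 4], "blink": [0, 1, 2, 3, 4], "nod": [4, 3, 2, 1, 0]}
# Another clip of the same subject: different magnitudes, same coordination.
same_person = {"smile": [1, 2, 3, 4, 5], "blink": [2, 4, 6, 8, 10], "nod": [5, 4, 3, 2, 1]}
# A puppeteered clip: the features move, but they are coordinated differently.
impersonator = {"smile": [0, 1, 2, 3, 4], "blink": [4, 3, 2, 1, 0], "nod": [0, 1, 2, 3, 4]}

ref_sig = motion_signature(reference)
print(signature_distance(ref_sig, motion_signature(same_person)))
print(signature_distance(ref_sig, motion_signature(impersonator)))
```

In this toy setup the second clip of the same subject sits at distance zero from the reference, while the puppeteered clip sits far from it. A real detector would need many clips per subject and a proper statistical model to decide how much deviation is too much.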
New: Deepfake technology is new and by all accounts will continue to improve. Do you think it’s possible for the technology to get so good that a deepfake video is indistinguishable from a genuine video, even with the best digital forensics technology?
Li: Yes, I believe this will be possible, and in some ways it already is. We can easily degrade the quality of certain videos and also add more manual resources with the help of a visual effects team to professionally alter the content. In the near future, this is likely to be producible by end-users without much effort, and the fidelity of video manipulations will increase. Since everything we watch is digital, it should be possible to create a perfect fake.
I wouldn’t be surprised if completely convincing deepfakes can be consistently generated in a year or two. I would also expect other capabilities to be possible, such as complete head replacement, or even the generation of entire scenes surrounding someone’s head. Recent research publications already demonstrate early results showing that these scenarios are possible. While the quality and resolution aren’t there yet, the rapid advancements in deep learning are likely to make this possible.
New: From a technical perspective, how similar is the technology you’ve developed at Pinscreen to the kind of software used to make deepfakes?
Li: The technology we developed at Pinscreen is based on deep generative models, in particular the use of GANs (generative adversarial networks) to significantly increase the photorealism of synthesized facial expressions. Some deepfake solutions use a similar approach, such as faceswap-gan. While deepfakes are optimized to swap and blend one person’s face onto another’s, what we develop at Pinscreen is used to generate a complete 3D avatar of a person from a single input picture. Our technology at Pinscreen also has additional advantages: we do not require hours of training or large amounts of training data. A single input photo is sufficient, and the synthesis happens in real time.
Our goal at Pinscreen is also not to create deepfakes. This is mostly a side project for us to create awareness of these issues, but our goal is to create a platform for digitizing humans. In the near future, our technology will enable people to easily create their own avatars for immersive telepresence using AR/VR, new forms of shopping experiences—imagine trying out new clothes without going to the store but by simply uploading a picture to a virtual fitting room—personalized gaming, as well as the creation of personalized live bloggers and virtual assistants driven by AI.
New: A lot of your research focuses on improving 3D digitization of hair. Why is this such a complex challenge compared to other aspects of 3D modeling?
Li: Compared to faces, the digitization of 3D hair poses additional challenges. Hair can have an extremely wide range of style variations: long, short, curly, straight, wavy, fringes, and so on. Hairstyles are highly convoluted volumetric structures, and hair is highly specular, so it reflects light in a wide range of ways under different lighting conditions. In order to create a 3D model from a 2D photo, one needs a model that can not only represent all the different styles, but also determine and predict the right 3D shape from the 2D picture. Only through recent advances in deep learning were we able to build an effective data representation for this complex inference task.
New: Can you describe the research you’re leading on virtual reality as director of the Vision and Graphics Lab at the USC Institute for Creative Technologies (USC ICT)? How valuable has deep learning been to VR development?
Li: At USC ICT, we are developing fundamental algorithms and end-to-end systems at the intersection of graphics, vision, and machine learning. We focus on human digitization, AR/VR-based telepresence, photorealistic facial capture, deepfake technologies, motion synthesis algorithms, and general 3D object modeling techniques. We have a highly sophisticated multi-view photometric capture system called Light Stage, developed originally by Dr. Paul Debevec, who is now at Google. We use it to scan celebrities for digital clones in blockbuster movies, and also to build high-fidelity datasets for next-generation AI algorithms.