
A New System for Temporally Consistent Stable Diffusion Video Characters

By Viral Trending Content

A new initiative from the Alibaba Group offers one of the best methods I have seen for generating full-body human avatars from a Stable Diffusion-based foundation model.

Contents
  • Method
  • Data and Tests
  • Conclusion

Titled MIMO (MIMicking with Object Interactions), the system uses a range of popular technologies and modules, including CGI-based human models and AnimateDiff, to enable temporally consistent character replacement in videos, or to drive a character with a user-defined skeletal pose.

Here we see characters interpolated from a single image source, and driven by a predefined motion:


From single source images, three diverse characters are driven by a 3D pose sequence (far left) using the MIMO system. See the project website and the accompanying YouTube video (embedded at the end of this article) for more examples and superior resolution. Source: https://menyifang.github.io/projects/MIMO/index.html

Generated characters, which can also be sourced from frames in videos and in diverse other ways, can be integrated into real-world footage.

MIMO is a novel system that generates three discrete encodings, one each for character, scene, and occlusion (i.e., matting, where an object or person passes in front of the character being depicted). These encodings are integrated at inference time.
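The three-layer idea can be illustrated with a toy sketch (this is not the actual MIMO code; the "encodings" here are just latent grids, and the occlusion layer carries an alpha channel deciding where it covers the character):

```python
import numpy as np

# Toy illustration of recombining three separately encoded spatial
# components at inference time. Shapes and the additive composition
# rule are invented for the example.
H, W, C = 4, 4, 8
rng = np.random.default_rng(0)

char_code  = rng.normal(size=(H, W, C))   # character layer
scene_code = rng.normal(size=(H, W, C))   # background layer
occ_code   = rng.normal(size=(H, W, C))   # occluding-object layer
occ_alpha  = np.zeros((H, W, 1))
occ_alpha[:2, :2] = 1.0                   # occluder covers the top-left

def compose(char, scene, occ, alpha):
    """Back-to-front composite: scene, then character, then occluder."""
    out = scene + char                    # character conditioned on the scene
    return alpha * occ + (1 - alpha) * out

composite = compose(char_code, scene_code, occ_code, occ_alpha)
print(composite.shape)  # (4, 4, 8)
```

The point of the decomposition is exactly this separability: any one layer can be swapped (a new character, a new scene) without disturbing the other two.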


MIMO can replace original characters with photorealistic or stylized characters that follow the motion from the target video. See the project website and the accompanying YouTube video (embedded at the end of this article) for more examples and superior resolution.

The system is trained over the Stable Diffusion V1.5 model, using a custom dataset curated by the researchers, and composed equally of real-world and simulated videos.

The great bugbear of diffusion-based video is temporal stability, where the content of the video either flickers or ‘evolves’ in ways that are not desired for consistent character representation.

MIMO, instead, effectively uses a single image as a map for consistent guidance, which can be orchestrated and constrained by the interstitial SMPL CGI model.

Since the source reference is consistent, and the base model over which the system is trained has been enhanced with adequate representative motion examples, the system’s capabilities for temporally consistent output are well above the general standard for diffusion-based avatars.


Further examples of pose-driven MIMO characters. See the project website and the accompanying YouTube video (embedded at the end of this article) for more examples and superior resolution.

It is becoming more common for single images to be used as a source for effective neural representations, either by themselves, or in a multimodal way, combined with text prompts. For example, the popular LivePortrait facial-transfer system can also generate highly plausible deepfaked faces from single face images.

The researchers believe that the principles used in the MIMO system can be extended into other and novel types of generative systems and frameworks.

The new paper is titled MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling, and comes from four researchers at Alibaba Group’s Institute for Intelligent Computing. The work has a video-laden project page and an accompanying YouTube video, which is also embedded at the bottom of this article.

Method

MIMO achieves automatic and unsupervised separation of the aforementioned three spatial components, in an end-to-end architecture (i.e., all the sub-processes are integrated into the system, and the user need only provide the input material).

The conceptual schema for MIMO. Source: https://arxiv.org/pdf/2409.16160

Objects in source videos are translated from 2D to 3D, initially using the monocular depth estimator Depth Anything. The human element in any frame is extracted with methods adapted from the Tune-A-Video project.

These features are then translated into video-based volumetric facets via Facebook Research’s Segment Anything 2 architecture.

The scene layer itself is obtained by removing objects detected in the other two layers, effectively providing a rotoscope-style mask automatically.
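The layer-removal idea reduces to simple mask arithmetic. A hedged sketch, with hand-made masks standing in for real segmentation output:

```python
import numpy as np

# Once the human and any occluding objects have been segmented,
# the scene layer is simply everything belonging to neither mask.
human_mask = np.zeros((6, 6), dtype=bool)
human_mask[2:5, 2:4] = True               # detected person
occ_mask = np.zeros((6, 6), dtype=bool)
occ_mask[0:2, 0:3] = True                 # detected occluding object

# Rotoscope-style background matte: the complement of both layers.
scene_mask = ~(human_mask | occ_mask)

# No pixel may belong to the scene and to another layer at once.
assert not np.any(scene_mask & (human_mask | occ_mask))
```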

For the motion, a set of extracted latent codes for the human element are anchored to a default human CGI-based SMPL model, whose movements provide the context for the rendered human content.

A 2D feature map for the human content is obtained by a differentiable rasterizer derived from a 2020 initiative from NVIDIA. Combining the obtained 3D data from SMPL with the 2D data obtained by the NVIDIA method, the latent codes representing the ‘neural person’ have a solid correspondence to their eventual context.
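The 3D-to-2D correspondence rests on standard camera projection. A minimal pinhole sketch (illustrative only, not the differentiable rasterizer itself; the intrinsics are assumed values):

```python
import numpy as np

# 3D body vertices in camera space are mapped onto the 2D image
# plane -- the plane on which per-vertex latent features would be
# rasterized into a feature map.
K = np.array([[500.0,   0.0, 128.0],    # fx, cx (assumed intrinsics)
              [  0.0, 500.0, 128.0],    # fy, cy
              [  0.0,   0.0,   1.0]])

verts = np.array([[0.0,  0.0, 2.0],     # a few body vertices (x, y, z)
                  [0.1, -0.2, 2.5]])

def project(K, verts):
    uvw = (K @ verts.T).T               # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide -> pixels

px = project(K, verts)
print(px[0])   # a vertex on the optical axis lands at the principal point
```

A real differentiable rasterizer does far more (visibility, interpolation, gradients through the raster step), but the vertex-to-pixel mapping it differentiates through has this form.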

At this point, it is necessary to establish a reference commonly needed in architectures that use SMPL – a canonical pose. This is broadly similar to Leonardo da Vinci’s ‘Vitruvian Man’, in that it represents a zero-pose template which can accept content and then be deformed, bringing the (effectively) texture-mapped content with it.

These deformations, or ‘deviations from the norm’, represent human movement, while the SMPL model preserves the latent codes that constitute the human identity that has been extracted, and thus represents the resulting avatar correctly in terms of pose and texture.
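The deform-from-canonical mechanism can be sketched as linear blend skinning, the standard SMPL deformation rule (a toy two-joint version; real SMPL uses 24 joints plus pose-dependent blend shapes):

```python
import numpy as np

# Canonical-pose vertices are deformed by per-joint rigid transforms,
# weighted by skinning weights -- whatever identity "texture" is
# attached to the vertices rides along with them.
verts = np.array([[0.0, 1.0, 0.0],
                  [0.0, 2.0, 0.0]])          # canonical positions
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])             # vertex-to-joint weights

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

joint_rots = [rot_z(0.0), rot_z(np.pi / 2)]  # joint 1 bends 90 degrees

def skin(verts, weights, rots):
    # v'_i = sum_j w_ij * (R_j @ v_i)  -- plain LBS, no translations here
    posed = np.stack([(R @ verts.T).T for R in rots])   # (J, V, 3)
    return np.einsum('vj,jvd->vd', weights, posed)

posed = skin(verts, weights, joint_rots)
print(posed[1])   # second vertex swings from (0, 2, 0) to roughly (-2, 0, 0)
```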

An example of a canonical pose in an SMPL figure. Source: https://www.researchgate.net/figure/Layout-of-23-joints-in-the-SMPL-models_fig2_351179264

Regarding the issue of entanglement (the extent to which trained data can turn out to be inflexible when you stretch it beyond its trained confines and associations), the authors state*:

‘To fully disentangle the appearance from posed video frames, an ideal solution is to learn the dynamic human representation from the monocular video and transform it from the posed space to the canonical space.

‘Considering the efficiency, we employ a simplified method that directly transforms the posed human image to the canonical result in standard A-pose using a pretrained human repose model. The synthesized canonical appearance image is fed to ID encoders to obtain the identity.

‘This simple design enables full disentanglement of identity and motion attributes. Following [Animate Anyone], the ID encoders include a CLIP image encoder and a reference-net architecture to embed for the global and local feature, [respectively].’

For the scene and occlusion aspects, a shared and fixed Variational Autoencoder (VAE – in this case derived from a 2013 publication) is used to embed the scene and occlusion elements into the latent space. Incongruities are handled by an inpainting method from the 2023 ProPainter project.

Once assembled and retouched in this way, both the background and any occluding objects in the video will provide a matte for the moving human avatar.
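The matting step amounts to back-to-front alpha compositing: inpainted background, then the generated avatar, then any occluders on top. A minimal sketch with invented pixel values:

```python
import numpy as np

# Tiny 2x2 "frame": scene behind, avatar in the middle, occluder in front.
bg     = np.full((2, 2, 3), 0.2)                          # scene layer
avatar = np.full((2, 2, 3), 0.8)
av_a   = np.array([[1.0, 0.0], [1.0, 0.0]])[..., None]    # avatar alpha
occ    = np.full((2, 2, 3), 0.5)
occ_a  = np.array([[1.0, 1.0], [0.0, 0.0]])[..., None]    # occluder alpha

def over(fg, alpha, dst):
    """Standard 'over' operator: foreground composited onto destination."""
    return alpha * fg + (1 - alpha) * dst

frame = over(occ, occ_a, over(avatar, av_a, bg))
print(frame[1, 0, 0])   # avatar visible where no occluder covers it: 0.8
```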

These decomposed attributes are then fed into a U-Net backbone based on the Stable Diffusion V1.5 architecture. The complete scene code is concatenated with the host system’s native latent noise, and the human components are integrated via self-attention and cross-attention layers.
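At the shape level, the concatenation conditioning looks like this (channel counts are invented for illustration; they are not the paper's values):

```python
import numpy as np

# The full scene code is concatenated with the denoising latent along
# the channel axis, so the U-Net's first convolution sees both.
B, C_lat, C_scene, H, W = 1, 4, 4, 32, 32
noise_latent = np.random.default_rng(1).normal(size=(B, C_lat, H, W))
scene_code   = np.zeros((B, C_scene, H, W))   # placeholder scene encoding

unet_input = np.concatenate([noise_latent, scene_code], axis=1)
print(unet_input.shape)   # (1, 8, 32, 32)
```

Concatenation is the cheap, spatially aligned way to condition; attention-based injection (as used for the human components) is needed when the conditioning signal is not pixel-aligned with the latent.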

Then, the denoised result is output via the VAE decoder.

Data and Tests

For training, the researchers created a human video dataset titled HUD-7K, consisting of 5,000 real character videos and 2,000 synthetic animations created by the En3D system. The real videos required no annotation, due to the non-semantic nature of the figure extraction procedures in MIMO’s architecture. The synthetic data was fully annotated.

The model was trained on eight NVIDIA A100 GPUs (though the paper does not specify whether these were the 40GB or 80GB VRAM models), for 50 iterations, using 24 video frames and a batch size of four, until convergence.

The motion module for the system was trained on the weights of AnimateDiff. During the training process, the weights of the VAE encoder/decoder, and the CLIP image encoder were frozen (in contrast to full fine-tuning, which will have a much broader effect on a foundation model).
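Selective freezing can be sketched framework-agnostically: only parameters marked trainable receive gradient updates, while the rest stay fixed (names and values here are illustrative, not MIMO's actual parameter groups):

```python
# Only trainable parameters get gradient updates, mirroring how the
# VAE and CLIP encoders stay fixed while the motion module trains.
params = {
    "vae.encoder.w":   {"value": 1.0, "trainable": False},  # frozen
    "clip.encoder.w":  {"value": 2.0, "trainable": False},  # frozen
    "motion_module.w": {"value": 3.0, "trainable": True},
}

def sgd_step(params, grads, lr=0.1):
    for name, p in params.items():
        if p["trainable"]:                 # frozen weights are skipped
            p["value"] -= lr * grads[name]

grads = {name: 1.0 for name in params}     # pretend gradients
sgd_step(params, grads)
print(params["motion_module.w"]["value"])  # only this parameter moved
```

In PyTorch terms this is the familiar `requires_grad = False` pattern: frozen modules still run in the forward pass but are excluded from optimization.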

Though MIMO was not trialed against analogous systems, the researchers tested it on difficult out-of-distribution motion sequences sourced from AMASS and Mixamo. These movements included climbing, playing, and dancing.

They also tested the system on in-the-wild human videos. In both cases, the paper reports ‘high robustness’ for these unseen 3D motions, from different viewpoints.

Though the paper offers multiple static image results demonstrating the effectiveness of the system, the true performance of MIMO is best assessed with the extensive video results provided at the project page, and in the YouTube video embedded below (from which the videos at the start of this article have been derived).

The authors conclude:

‘Experimental results [demonstrate] that our method enables not only flexible character, motion and scene control, but also advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive scenes.

‘We also [believe] that our solution, which considers inherent 3D nature and automatically encodes the 2D video to hierarchical spatial components could inspire future researches for 3D-aware video synthesis.

‘Furthermore, our framework is not only well suited to generate character videos but also can be potentially adapted to other controllable video synthesis tasks.’

Conclusion

It’s refreshing to see an avatar system based on Stable Diffusion that appears capable of such temporal stability – not least because Gaussian Avatars seem to be gaining the high ground in this particular research sector.

The stylized avatars represented in the results are effective, and while the level of photorealism that MIMO can produce is not currently equal to what Gaussian Splatting is capable of, the diverse advantages of creating temporally consistent humans in a semantically-based Latent Diffusion Model (LDM) are considerable.

 

* My conversion of the authors’ inline citations to hyperlinks, and where necessary, external explanatory hyperlinks.

First published Wednesday, September 25, 2024
