The remarkable success of large-scale pretraining followed by task-specific fine-tuning for language modeling has established this approach as a standard practice. Similarly, computer vision methods are progressively embracing extensive data scales for pretraining. The emergence of large datasets, such as LAION-5B, Instagram-3.5B, JFT-300M, LVD-142M, Visual Genome, and YFCC100M, has enabled the exploration of a data corpus well beyond the scope of traditional benchmarks. Salient work in this domain includes DINOv2, MAWS, and AIM. DINOv2 achieves state-of-the-art performance in generating self-supervised features by scaling the iBOT method on the LVD-142M dataset. MAWS studies the scaling of masked autoencoders (MAE) to billions of images. AIM explores the scalability of autoregressive visual pretraining for vision transformers. In contrast to these methods, which mainly focus on general image pretraining or zero-shot image classification, Sapiens takes a distinctly human-centric approach: Sapiens’ models leverage a vast collection of human images for pretraining, subsequently fine-tuning for a range of human-related tasks. The pursuit of large-scale 3D human digitization remains a pivotal goal in computer vision.
Significant progress has been made within controlled or studio environments, yet challenges persist in extending these methods to unconstrained environments. To address these challenges, developing versatile models capable of multiple fundamental tasks, such as keypoint estimation, body-part segmentation, depth estimation, and surface normal prediction from images in natural settings, is crucial. In this work, Sapiens aims to develop models for these essential human vision tasks that generalize to in-the-wild settings. Currently, the largest publicly accessible language models contain upwards of 100B parameters, while the more commonly used language models contain around 7B parameters. In contrast, Vision Transformers (ViT), despite sharing a similar architecture, have not been scaled to this extent successfully. While there are notable endeavors in this direction, including the development of a dense ViT-4B trained on both text and images, and the formulation of techniques for the stable training of a ViT-22B, commonly utilized vision backbones still range between 300M and 600M parameters and are primarily pretrained at an image resolution of about 224 pixels. Similarly, existing transformer-based image generation models, such as DiT, use fewer than 700M parameters and operate on a highly compressed latent space. To address this gap, Sapiens introduces a collection of large, high-resolution ViT models that are pretrained natively at a 1024-pixel image resolution on millions of human images.
Sapiens presents a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Sapiens models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. Sapiens observes that, given the same computational budget, self-supervised pre-training on a curated dataset of human images significantly boosts performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. The simple model design also brings scalability—model performance across tasks improves as the number of parameters scales from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks, achieving significant improvements over prior state-of-the-art results: 7.6 mAP on Humans-5K (pose), 17.1 mIoU on Humans-2K (part-seg), 22.4% relative RMSE on Hi4D (depth), and 53.5% relative angular error on THuman2 (normal).
Recent years have witnessed remarkable strides toward generating photorealistic humans in 2D and 3D. The success of these methods is greatly attributed to the robust estimation of various assets such as 2D key points, fine-grained body-part segmentation, depth, and surface normals. However, robust and accurate estimation of these assets remains an active research area, and complicated systems to boost performance for individual tasks often hinder wider adoption. Moreover, obtaining accurate ground-truth annotation in-the-wild is notoriously difficult to scale. Sapiens’ goal is to provide a unified framework and models to infer these assets in-the-wild, unlocking a wide range of human-centric applications for everyone.
Sapiens argues that such human-centric models should satisfy three criteria: generalization, broad applicability, and high fidelity. Generalization ensures robustness to unseen conditions, enabling the model to perform consistently across varied environments. Broad applicability indicates the versatility of the model, making it suitable for a wide range of tasks with minimal modifications. High fidelity denotes the ability of the model to produce precise, high-resolution outputs, essential for faithful human generation tasks. This paper details the development of models that embody these attributes, collectively referred to as Sapiens.
Following these insights, Sapiens leverages large datasets and scalable model architectures, which are key for generalization. For broader applicability, Sapiens adopts the pretrain-then-finetune approach, enabling post-pretraining adaptation to specific tasks with minimal adjustments. This approach raises a critical question: what type of data is most effective for pretraining? Given computational limits, should the emphasis be on collecting as many human images as possible, or is it preferable to pretrain on a less curated set to better reflect real-world variability? Existing methods often overlook the pretraining data distribution in the context of downstream tasks. To study the influence of the pretraining data distribution on human-specific tasks, Sapiens collects the Humans-300M dataset, featuring 300 million diverse human images. These unlabeled images are used to pretrain a family of vision transformers from scratch, with parameter counts ranging from 300M to 2B.
Among various self-supervision methods for learning general-purpose visual features from large datasets, Sapiens chooses the masked-autoencoder (MAE) approach for its simplicity and efficiency in pretraining. Unlike contrastive or multi-inference strategies, MAE requires only a single pass per image, allowing a larger volume of images to be processed with the same computational resources. For higher fidelity, in contrast to prior methods, Sapiens increases the native input resolution of pretraining to 1024 pixels, resulting in approximately a 4× increase in FLOPs compared to the largest existing vision backbone. Each model is pretrained on 1.2 trillion tokens. For fine-tuning on human-centric tasks, Sapiens uses a consistent encoder-decoder architecture. The encoder is initialized with weights from pretraining, while the decoder, a lightweight and task-specific head, is initialized randomly. Both components are then fine-tuned end-to-end. Sapiens focuses on four key tasks: 2D pose estimation, body-part segmentation, depth estimation, and normal estimation, as demonstrated in the following figure.
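To make the pretrain-then-finetune setup concrete, below is a minimal PyTorch sketch of the consistent encoder-decoder pattern: a pretrained encoder producing patch tokens, followed by a randomly initialized lightweight head that is fine-tuned together with the encoder. The head layers, class names, and the square-token-grid simplification are illustrative assumptions, not the paper's exact decoder.

```python
import torch
import torch.nn as nn

class FineTuneModel(nn.Module):
    """Pretrained ViT encoder plus a lightweight task-specific head, tuned end-to-end."""

    def __init__(self, encoder: nn.Module, embed_dim: int, out_channels: int):
        super().__init__()
        self.encoder = encoder          # initialized from MAE pretraining
        # Hypothetical lightweight head: deconvolutions that upsample patch
        # tokens back toward pixel resolution, randomly initialized.
        self.head = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 256, kernel_size=4, stride=4),
            nn.GELU(),
            nn.ConvTranspose2d(256, out_channels, kernel_size=4, stride=4),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(images)                      # (B, N, C) patch tokens
        B, N, C = tokens.shape
        h = w = int(N ** 0.5)                              # square token grid assumed for simplicity
        feat = tokens.transpose(1, 2).reshape(B, C, h, w)  # tokens -> 2D feature map
        return self.head(feat)                             # dense task prediction
```

In this sketch, `out_channels` would be, for example, K keypoint heatmaps for pose estimation, 28 class logits for body-part segmentation, 1 channel for depth, or 3 channels for surface normals.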
Consistent with prior studies, Sapiens affirms the critical impact of label quality on the model’s in-the-wild performance. Public benchmarks often contain noisy labels, providing inconsistent supervisory signals during model fine-tuning. At the same time, it is important to utilize fine-grained and precise annotations to align closely with Sapiens’ primary goal of 3D human digitization. To this end, Sapiens proposes a substantially denser set of 2D whole-body key points for pose estimation and a detailed class vocabulary for body part segmentation, surpassing the scope of previous datasets. Specifically, Sapiens introduces a comprehensive collection of 308 key points encompassing the body, hands, feet, surface, and face. Additionally, Sapiens expands the segmentation class vocabulary to 28 classes, covering body parts such as the hair, tongue, teeth, upper/lower lip, and torso. To guarantee the quality and consistency of annotations and a high degree of automation, Sapiens utilizes a multi-view capture setup to collect pose and segmentation annotations. Sapiens also utilizes human-centric synthetic data for depth and normal estimation, leveraging 600 detailed scans from RenderPeople to generate high-resolution depth maps and surface normals. Sapiens demonstrates that the combination of domain-specific large-scale pretraining with limited, yet high-quality annotations leads to robust in-the-wild generalization. Overall, Sapiens’ method shows an effective strategy for developing highly precise discriminative models capable of performing in real-world scenarios without the need for collecting a costly and diverse set of annotations.
Sapiens : Method and Architecture
Sapiens follows the masked-autoencoder (MAE) approach for pretraining. The model is trained to reconstruct the original human image given its partial observation. Like all autoencoders, Sapiens’ model has an encoder that maps the visible image to a latent representation and a decoder that reconstructs the original image from this latent representation. The pretraining dataset consists of both single and multi-human images, with each image resized to a fixed size with a square aspect ratio. Similar to ViT, the image is divided into regular non-overlapping patches with a fixed patch size. A subset of these patches is randomly selected and masked, leaving the rest visible. The proportion of masked patches to visible ones, known as the masking ratio, remains fixed throughout training.
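The masking step described above can be sketched as follows, using the standard MAE recipe of shuffling patch indices and keeping a fixed visible fraction; this is a generic illustration rather than Sapiens' exact implementation.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly mask a fixed fraction of patch tokens, MAE-style.

    patches: (B, N, D) tensor of patch embeddings.
    Returns the visible subset, the binary mask, and indices to restore order.
    """
    B, N, D = patches.shape
    num_visible = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                        # one random score per patch
    ids_shuffle = noise.argsort(dim=1)              # lowest scores stay visible
    ids_restore = ids_shuffle.argsort(dim=1)

    ids_visible = ids_shuffle[:, :num_visible]
    visible = torch.gather(patches, 1, ids_visible.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)                         # 1 = masked, 0 = visible
    mask[:, :num_visible] = 0
    mask = torch.gather(mask, 1, ids_restore)       # back to the original patch order
    return visible, mask, ids_restore
```

Only the visible tokens are passed through the encoder; the decoder then reconstructs the masked patches from the latent representation and mask tokens.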
Sapiens’ models exhibit generalization across a variety of image characteristics, including scales, crops, the age and ethnicity of subjects, and the number of subjects. Each patch token in the model accounts for 0.02% of the image area, compared to 0.4% in standard ViTs, a 16× reduction that provides fine-grained inter-token reasoning. Even with an increased mask ratio of 95%, Sapiens’ model achieves a plausible reconstruction of human anatomy on held-out samples. The reconstruction of Sapiens’ pretrained model on unseen human images is demonstrated in the following image.
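As a quick arithmetic check of the per-token area claim, assuming a 16-pixel patch at a 1024-pixel input for Sapiens and a 14-pixel patch at a 224-pixel input for the standard ViT baseline (the exact baseline configuration is an assumption):

```python
# Fraction of the image covered by a single patch token.
sapiens_frac = (16 / 1024) ** 2    # ~0.024% of the image, i.e. ~0.02%
standard_frac = (14 / 224) ** 2    # ~0.39% of the image, i.e. ~0.4% (assumed baseline)
print(f"{sapiens_frac:.4%}  {standard_frac:.4%}  ratio ~ {standard_frac / sapiens_frac:.0f}x")
# prints: 0.0244%  0.3906%  ratio ~ 16x
```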
Furthermore, Sapiens curates its pretraining data from a large proprietary dataset of approximately 1 billion in-the-wild images, focusing exclusively on images containing humans. The preprocessing involves discarding images with watermarks, text, artistic depictions, or unnatural elements. Sapiens then uses an off-the-shelf person bounding-box detector to filter images, retaining those with a detection score above 0.9 and bounding-box dimensions exceeding 300 pixels. Over 248 million images in the dataset contain multiple subjects.
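The filtering rule can be expressed as a small sketch; the detector interface (a list of score and box-dimension tuples) is a hypothetical stand-in for whatever off-the-shelf person detector is used.

```python
def keep_image(detections, min_score=0.9, min_box_px=300):
    """Keep an image if at least one person detection passes both thresholds.

    detections: list of (score, box_width, box_height) tuples from a
    person detector (hypothetical interface).
    """
    return any(
        score > min_score and width > min_box_px and height > min_box_px
        for score, width, height in detections
    )
```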
2D Pose Estimation
For 2D pose estimation, Sapiens finetunes the encoder and decoder across multiple skeletons, including K = 17 (COCO), K = 133 (COCO-WholeBody), and a new highly detailed skeleton with K = 308 keypoints, as shown in the following figure.
Compared to existing formats with at most 68 facial keypoints, Sapiens’ annotations consist of 243 facial keypoints, including representative points around the eyes, lips, nose, and ears. This design is tailored to meticulously capture the nuanced details of facial expressions in the real world. Using this keypoint vocabulary, 1 million images at 4K resolution from an indoor capture setup were manually annotated. For surface normal estimation, the decoder output channels of the normal estimator N are set to 3, corresponding to the xyz components of the normal vector at each pixel, and the generated synthetic data is used as supervision.
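As a concrete illustration of the 3-channel output convention, here is a minimal sketch of a surface-normal head that projects decoder features to xyz components and normalizes them per pixel; the layer choice is an assumption, not the paper's exact decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalHead(nn.Module):
    """Illustrative dense head: 3 output channels = xyz of the per-pixel surface normal."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 3, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        normals = self.proj(features)        # (B, 3, H, W)
        return F.normalize(normals, dim=1)   # unit-length normal at each pixel
```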
Sapiens : Experiments and Results
Sapiens-2B is pretrained using 1024 A100 GPUs for 18 days with PyTorch. Sapiens uses the AdamW optimizer for all experiments. The learning schedule includes a brief linear warm-up, followed by cosine annealing for pretraining and linear decay for finetuning. All models are pretrained from scratch at a resolution of 1024 × 1024 with a patch size of 16. For finetuning, the input image is resized to a 4:3 ratio, i.e., 1024 × 768. Sapiens applies standard augmentations like cropping, scaling, flipping, and photometric distortions. A random background from non-human COCO images is added for segmentation, depth, and normal prediction tasks. Importantly, Sapiens uses differential learning rates to preserve generalization, with lower learning rates for initial layers and progressively higher rates for subsequent layers. The layer-wise learning rate decay is set to 0.85 with a weight decay of 0.1 for the encoder.
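The layer-wise learning-rate decay can be implemented by giving each encoder block its own parameter group; the sketch below assumes a ViT-style encoder exposing a `blocks` list, and the base learning rate is a placeholder rather than the paper's value.

```python
import torch

def build_param_groups(encoder, base_lr=1e-4, layer_decay=0.85, weight_decay=0.1):
    """Earlier encoder blocks get lower learning rates; the deepest block keeps base_lr."""
    num_layers = len(encoder.blocks)
    groups = []
    for i, block in enumerate(encoder.blocks):
        # Each earlier block is scaled down by an extra factor of layer_decay.
        scale = layer_decay ** (num_layers - 1 - i)
        groups.append({
            "params": block.parameters(),
            "lr": base_lr * scale,
            "weight_decay": weight_decay,
        })
    return groups

# optimizer = torch.optim.AdamW(build_param_groups(encoder))
```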
The design specifications of Sapiens are detailed in the following table. Sapiens prioritizes scaling models by width rather than depth. Notably, the Sapiens-0.3B model, while architecturally similar to the traditional ViT-Large, requires roughly twenty times more FLOPs due to its higher resolution.
Sapiens is fine-tuned for face, body, feet, and hand (K = 308) pose estimation using high-fidelity annotations. For training, Sapiens uses the train set with 1M images, and for evaluation, it uses the test set, named Humans-5K, with 5K images. The evaluation follows a top-down approach, where Sapiens uses an off-the-shelf detector for bounding boxes and conducts single-human pose inference. The following table compares Sapiens models with existing methods for whole-body pose estimation. All methods are evaluated on the 114 keypoints common to Sapiens’ 308-keypoint vocabulary and the 133-keypoint vocabulary from COCO-WholeBody. Sapiens-0.6B surpasses the current state of the art, DWPose-l, by +2.8 AP. Unlike DWPose, which utilizes a complex student-teacher framework with feature distillation tailored for the task, Sapiens adopts a general encoder-decoder architecture with large human-centric pretraining.
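In the top-down protocol, each detected person crop is passed through the model and the resulting K heatmaps are reduced to keypoints; a minimal argmax-based decoding sketch is shown below (the exact decoding used by Sapiens is not specified here, so this is an assumption).

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """Convert (K, H, W) heatmaps to K (x, y, score) keypoints via per-channel argmax."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    scores, idx = flat.max(dim=1)                 # peak value and flat index per keypoint
    xs = (idx % W).float()                        # column of the peak
    ys = (idx // W).float()                       # row of the peak
    return torch.stack([xs, ys, scores], dim=1)   # crop-space keypoint coordinates
```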
Interestingly, even with the same parameter count, Sapiens models demonstrate superior performance compared to their counterparts. For instance, Sapiens-0.3B exceeds ViTPose+-L by +5.6 AP, and Sapiens-0.6B outperforms ViTPose+-H by +7.9 AP. Within the Sapiens family, results indicate a direct correlation between model size and performance. Sapiens-2B sets a new state of the art with 61.1 AP, a significant improvement of +7.6 AP over the prior art. Despite being fine-tuned with annotations from an indoor capture studio, Sapiens demonstrates robust generalization to real-world scenarios, as shown in the following figure.
Sapiens is fine-tuned and evaluated using a segmentation vocabulary of 28 classes. The train set consists of 100K images, while the test set, Humans-2K, consists of 2K images. Sapiens is compared with existing body-part segmentation methods fine-tuned on the same train set, using each method's suggested pretrained checkpoints as initialization. Similar to pose estimation, Sapiens generalizes well in segmentation, as demonstrated in the following table.
Interestingly, the smallest model, Sapiens-0.3B, outperforms existing state-of-the-art segmentation methods like Mask2Former and DeepLabV3+ by 12.6 mIoU due to its higher resolution and large human-centric pretraining. Furthermore, increasing the model size further improves segmentation performance. Sapiens-2B achieves the best performance, with 81.2 mIoU and 89.4 mAcc on the test set. The following figure shows the qualitative results of Sapiens models.
Conclusion
Sapiens represents a significant step toward advancing human-centric vision models into the realm of foundation models. Sapiens models demonstrate strong generalization capabilities across a variety of human-centric tasks. The state-of-the-art performance is attributed to: (i) large-scale pretraining on a curated dataset specifically tailored to understanding humans, (ii) scaled high-resolution and high-capacity vision transformer backbones, and (iii) high-quality annotations on augmented studio and synthetic data. Sapiens models have the potential to become a key building block for a multitude of downstream tasks and provide access to high-quality vision backbones to a significantly wider part of the community.