Building a Synthetic Data Pipeline to Power Human Motion Intelligence

To train a robust, generalized video-to-animation model, our research team needs massive datasets of perfectly paired 3D human motion and 2D video. As a first step, we built a foundational dataset of hundreds of hours of real-world 3D motion data through extensive Motion Capture (MoCap) sessions.
However, manually filming matching 2D video for this data across enough diverse environments, body types, and lighting conditions to train a generalized AI is physically impossible.
So we built the Kinetix Synthetic Data Pipeline (SDP), which lets us create video pairs on demand while guaranteeing modular coverage of every parameter, so that our Human Pose Estimation (HPE) models would not be biased.
Collecting high-quality 3D human motion data
In machine learning, a model is strictly bounded by the quality of its ground truth. To establish this foundation, we recorded hundreds of hours of human motion using studio-grade optical tracking technology, capturing the actors' full bodies (including hands and fingers) with sub-centimeter precision.
Capturing high-fidelity data on a single body type, however, leads to AI bias. Models often overfit to "average" proportions, failing when estimating poses for unusually tall, short, or broad individuals. To prevent this, we collected motion from more than 75 actors with diverse body morphologies. By training on different torso heights, limb lengths, and shoulder widths, we force the AI to learn the universal physics of human movement rather than memorizing one specific body shape.
Beyond morphology, we required high movement diversity. We captured everything from casual walking to complex stunts, working with professional athletes, dancers, and comedians as well as everyday people. While everyday individuals provided a natural motion baseline, professionals pushed physical boundaries: athletes supplied edge-case kinematics such as high acceleration and complex balancing, while dancers added highly expressive, nuanced poses.
Finally, raw MoCap data needs contextual anchoring. We meticulously annotated our dataset with action types, action descriptions, marker positions, ground-contact flags, object- and obstacle-interaction labels, and more.
However, capturing this morphological diversity introduced a new challenge: motion-to-morphology entanglement. If we trained our HPE model directly on this raw 3D data, it risked developing body-type biases for specific movements. For example, if our dataset only featured tall actors performing backflips, the model would inextricably link the kinematics of a backflip to a tall skeleton. Later, if the AI processed a video of a shorter person doing the same stunt, it would hallucinate the limb lengths of a tall person, leading to severe geometric and projection errors.
Developing a proprietary Mesh-Aware Retargeting Technology
To solve this entanglement, we engineered a proprietary "mesh-aware" retargeting system.
Supervised HPE training requires two distinct streams of data to succeed:
First, the model needs a standardized target to predict against. We used our retargeting technology to transfer our entire library of diverse raw motion onto a single, unified 3D skeleton. By stripping away the varying builds and heights of the original actors, we provide the AI with a normalized baseline: a backflip is a backflip, regardless of the body performing it.
Second, to generate our synthetic video pairs, we had to do the exact opposite: take that raw motion and project it onto 1,000+ custom 3D characters with drastically different morphologies (tall, lean, heavy, short).
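Both streams rely on skeletal retargeting at their core. A minimal, joints-only sketch illustrates the principle: keep each bone's direction (the motion) but replace its length (the morphology). The joint chain and proportions below are invented for illustration, not our production rig.

```python
import numpy as np

# Illustrative 4-joint chain: pelvis -> spine -> shoulder -> hand.
# Parent indices (-1 = root) and canonical bone lengths in meters;
# the numbers are made up for this sketch.
PARENTS = [-1, 0, 1, 2]
CANONICAL_LENGTHS = [0.0, 0.50, 0.20, 0.55]

def retarget_to_canonical(positions, parents=PARENTS, lengths=CANONICAL_LENGTHS):
    """Keep each bone's direction but impose canonical bone lengths,
    normalizing body shape out of the pose."""
    positions = np.asarray(positions, dtype=float)
    out = positions.copy()
    for j, parent in enumerate(parents):
        if parent < 0:
            continue  # the root joint stays where it is
        bone = positions[j] - positions[parent]
        direction = bone / np.linalg.norm(bone)
        out[j] = out[parent] + direction * lengths[j]
    return out

# A tall actor (long bones) and a short actor (short bones) in the same pose:
tall = [[0, 0, 0], [0, 0, 0.6], [0, 0, 0.85], [0.7, 0, 0.85]]
short = [[0, 0, 0], [0, 0, 0.45], [0, 0, 0.62], [0.5, 0, 0.62]]
# After retargeting, both land on identical canonical joint positions.
assert np.allclose(retarget_to_canonical(tall), retarget_to_canonical(short))
```

Note that this naive version only maps joints to joints, which is exactly where it falls short for bodies with different volumes.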
This is where the "mesh-aware" aspect becomes necessary. Traditional retargeting only maps skeletal joints to joints. If you apply a slender actor's motion to a heavily built 3D character using standard retargeting, the resulting animation breaks: arms clip through the torso, thighs intersect, and feet float or sink into the floor.
Our mesh-aware technology considers the skeletal structure together with the actual surface geometry—the mesh—of the target character. It explicitly understands the physical volume and surface constraints of the new body. If a character has a wider torso, the algorithm dynamically adjusts the movement around shoulders and arms to wrap around that volume, preventing interpenetration and self-collision. By respecting these volumetric constraints, we maintain the semantic intent of the original motion (e.g., hands resting perfectly on hips) while ensuring geometric consistency and accurate ground contacts. This guarantees that the synthetic videos we feed our AI are physically flawless.
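To make the idea concrete, here is a toy reduction of the penetration-resolution step: a sphere stands in for the torso mesh, and a joint that sinks inside it is pushed back onto the surface along the outward normal. All geometry and numbers are illustrative; the real system operates on the full character mesh, not an analytic sphere.

```python
import numpy as np

def resolve_penetration(joint, center, radius, margin=0.01):
    """Toy mesh-aware adjustment: if a joint sits inside a body volume
    (a sphere standing in for the torso mesh), push it back onto the
    surface along the outward normal, plus a small clearance margin."""
    offset = joint - center
    dist = np.linalg.norm(offset)
    if dist >= radius + margin:
        return joint  # no interpenetration: keep the original motion
    # Degenerate case: joint exactly at the center has no defined normal.
    normal = offset / dist if dist > 1e-8 else np.array([1.0, 0.0, 0.0])
    return center + normal * (radius + margin)

torso_center = np.array([0.0, 0.0, 1.0])
hand = np.array([0.05, 0.0, 1.0])        # clips inside a 0.25 m torso
fixed = resolve_penetration(hand, torso_center, radius=0.25)
assert np.linalg.norm(fixed - torso_center) >= 0.25
```

The production system solves this jointly across the whole body and over time, so the corrected pose stays smooth and keeps the motion's semantic intent.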
Scaling raw 3D human motion with our proprietary synthetic data pipeline
With our motion data standardized and our retargeting engine ready, the final hurdle was generating the massive volume of corresponding 2D video pairs. To achieve this, we built a fully automated SDP driven by Unreal Engine, allowing us to procedurally generate millions of unique video pairs.
Here is how the SDP is architected, and why every step is critical for training an unbiased computer vision model:
Figure 4: The Kinetix Synthetic Data Pipeline
Step 1: Associating the 3D animation raw data with a specific 3D character
First, for each 3D animation in our MoCap dataset, we select a high-fidelity 3D character mesh from our Human Mesh library of 1,000+ characters with diverse morphologies, skin tones, and clothing styles. Thanks to our automated mesh-aware retargeting technology, we build a new library of retargeted animations.
Step 2: Placing the character into a specific location in a specific scene
Next, the animated character is procedurally placed in diverse Unreal Engine environments, spawning at pseudo-random locations across different maps. By dropping characters into cluttered rooms or dense urban streets, we teach the AI to flawlessly segment the human subject from complex visual noise. Furthermore, because the scene is fully 3D, we automatically extract precise depth data, teaching the model to resolve scale and distance ambiguities.
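The two halves of this step can be sketched in a few lines: pseudo-random spawning on a map's walkable area, and reading exact per-joint depth straight from the 3D scene. The grid of cells below is an illustrative stand-in for Unreal Engine's actual navigation data.

```python
import numpy as np

rng = np.random.default_rng(0)

def spawn_location(walkable_cells, cell_size_m=1.0):
    """Pick a pseudo-random spawn point from a map's walkable cells
    (a toy stand-in for the engine's real navigation data)."""
    x, y = walkable_cells[rng.integers(len(walkable_cells))]
    return np.array([x * cell_size_m, y * cell_size_m, 0.0])

def joint_depths(joints_world, camera_pos):
    """Exact per-joint depth (distance to the camera), read straight
    from the 3D scene instead of being estimated from pixels."""
    return np.linalg.norm(np.asarray(joints_world) - camera_pos, axis=1)

spawn = spawn_location([(2, 3), (5, 1), (7, 7)])
joints = spawn + np.array([[0, 0, 1.0], [0, 0, 1.6]])   # hip and head
depths = joint_depths(joints, camera_pos=spawn + np.array([3.0, 0, 1.5]))
```

Because the depth comes from the scene's own geometry, these labels are exact by construction, which is what lets the model learn to resolve scale and distance ambiguities.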
Step 3: Adjusting the lighting conditions
Once the character is placed, the pipeline dynamically alters the environmental lighting. We simulate everything from harsh midday sun and soft overcast skies to deep night and high-contrast artificial indoor lighting. Lighting fundamentally changes the pixel values of an image, and heavy shadows can completely obscure limbs. Training across these endless, extreme lighting scenarios guarantees our model can accurately estimate poses even in poorly lit or highly saturated conditions.
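As a toy illustration, lighting conditions can be drawn from presets with randomized parameters. The preset names and numeric ranges below are invented for the sketch, not the pipeline's actual Unreal Engine settings.

```python
import random

# Invented presets and ranges, for illustration only.
LIGHTING_PRESETS = {
    "midday_sun":      {"sun_elevation_deg": (60, 90),  "intensity_lux": (80_000, 120_000)},
    "overcast":        {"sun_elevation_deg": (20, 60),  "intensity_lux": (1_000, 10_000)},
    "night":           {"sun_elevation_deg": (-30, -5), "intensity_lux": (0, 10)},
    "indoor_contrast": {"sun_elevation_deg": (0, 0),    "intensity_lux": (200, 2_000)},
}

def sample_lighting(rng):
    """Draw one randomized lighting condition for a render."""
    preset = rng.choice(list(LIGHTING_PRESETS))
    lo_el, hi_el = LIGHTING_PRESETS[preset]["sun_elevation_deg"]
    lo_ix, hi_ix = LIGHTING_PRESETS[preset]["intensity_lux"]
    return {
        "preset": preset,
        "sun_elevation_deg": rng.uniform(lo_el, hi_el),
        "intensity_lux": rng.uniform(lo_ix, hi_ix),
    }

lighting = sample_lighting(random.Random(7))
```

Sampling a fresh condition per clip is what spreads the training set across the full range of pixel statistics, from blown-out highlights to limbs lost in shadow.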
Step 4: Setting up the camera parameters
Real-world user videos are rarely shot on steady tripods at perfect eye level. To make our model robust to actual user behavior, the Kinetix SDP randomizes camera parameters for every generation. We simulate all kinds of situations, including extreme high and low angles, dynamic panning, or the erratic shake of a handheld smartphone. This variation teaches the model to understand severe perspective distortion and depth, ensuring it doesn't break when a user uploads a video shot from an unusual angle.
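A minimal sketch of the camera randomization, with handheld shake modeled as per-frame positional jitter; all ranges are illustrative, not the SDP's real values.

```python
import math
import random

def sample_camera(n_frames, rng):
    """One randomized camera per clip: field of view, viewing angle,
    distance, and optional handheld shake as per-frame jitter."""
    handheld = rng.random() < 0.5
    shake_amp_m = rng.uniform(0.01, 0.04) if handheld else 0.0
    # Low-frequency sway plus random noise approximates a handheld phone.
    jitter = [
        shake_amp_m * math.sin(0.9 * t) + rng.gauss(0.0, shake_amp_m * 0.3)
        for t in range(n_frames)
    ]
    return {
        "fov_deg": rng.uniform(40, 100),
        "elevation_deg": rng.uniform(-60, 60),  # low angle to bird's-eye
        "distance_m": rng.uniform(1.5, 8.0),
        "handheld": handheld,
        "jitter_m": jitter,                     # per-frame shake offsets
    }

cam = sample_camera(n_frames=120, rng=random.Random(3))
```

Extreme angles and shake produce severe perspective distortion in the rendered video, while the paired 3D labels stay exact, which is precisely the robustness signal the model needs.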
All metadata and labels are, of course, preserved and enriched at every step of the process.
By combining these modular parameters, the SDP creates an infinite, perfectly paired dataset. Every rendered 2D video output is linked to the exact 3D animation data that generated it, providing our AI with the ultimate, unassailable ground truth.
Bridging the Reality Gap: Exploring Enhanced Realism with Kamo-1
While our Unreal Engine pipeline produces incredibly diverse and geometrically accurate data, it still faces what machine learning researchers call a "domain gap"—the visual mismatch between clean, computer-generated 3D environments and the messy reality of everyday video. Real-world footage contains heavy film grain, diverse visual styles, organic motion blur, and unpredictable lens noise that traditional 3D rendering engines simply cannot replicate. If an AI is only trained on crisp synthetic visuals, it can easily get confused by the chaos of the real world.
To make our training data indistinguishable from reality, we are actively exploring the potential integration of Kamo-1, our proprietary video diffusion model, at the very end of the SDP.
Kamo-1 is uniquely built for 3D-conditioning. Instead of relying solely on the final Unreal Engine visual render, we envision feeding the rich, modular metadata the SDP already generates—specifically the depth maps, randomized camera parameters, and the raw 3D animation—directly into Kamo-1 as control inputs. By conditioning the diffusion process on these exact 3D parameters, we could effectively "reskin" our synthetic 3D environments into highly photorealistic, lifelike footage.
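The envisioned interface can be sketched as a simple bundling of the SDP's existing metadata into conditioning signals. The function name and dictionary layout below are hypothetical placeholders, since Kamo-1's actual interface is internal; the point is that every signal already exists in the pipeline and nothing has to be estimated.

```python
def build_control_inputs(depth_maps, camera_params, pose_sequence):
    """Hypothetical conditioning bundle for a 3D-conditioned diffusion
    model: per-frame depth, camera parameters, and the raw 3D animation.
    Names and structure are placeholders, not Kamo-1's real API."""
    n = len(pose_sequence)
    if len(depth_maps) != n or len(camera_params) != n:
        raise ValueError("need one depth map and one camera per frame")
    return {
        "depth": depth_maps,        # per-frame depth from the 3D scene
        "camera": camera_params,    # randomized intrinsics/extrinsics
        "pose_3d": pose_sequence,   # the ground-truth 3D animation
    }

controls = build_control_inputs(
    depth_maps=["d0", "d1"], camera_params=["c0", "c1"],
    pose_sequence=["p0", "p1"],    # stand-in frame payloads
)
```

Conditioning on these exact signals is what would let the diffusion model change surface appearance while staying pinned to the ground-truth geometry.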
However, introducing a generative diffusion model into a strict ground-truth pipeline carries significant risks. Generative AI is inherently prone to hallucinations; it might spontaneously invent extra limbs, deteriorate the physical precision of the original movement, or alter the character's geometry so drastically that our perfectly paired 3D labels no longer match the video output. Balancing this enhanced visual realism with the absolute geometric accuracy of our SDP remains our current, most exciting research challenge.
The Kinetix SDP beyond Human Pose Estimation
While the Kinetix SDP was primarily engineered to solve the data bottleneck in Human Pose Estimation, annotated, physically grounded 3D data is a "holy grail" for two other major frontiers in AI: Embodied AI and World Modeling.
Motion Imitation Learning in Embodied AI
In the field of robotics, motion imitation is currently the leading technique for teaching embodied agents to execute realistic, human-like movement. To succeed, these models require massive volumes of cleanly annotated 3D motion capture data to act as expert demonstrations.
This is exactly what the Kinetix SDP provides. Our dataset offers ground-truth 3D human motion priors that are essential for replicating lifelike movement. Furthermore, because our pipeline is infinitely scalable, it allows researchers to train embodied agents on vast, diverse datasets within simulated digital environments before deploying them to physical hardware.
Crucially, our raw motion data, when paired with our mesh-aware retargeting technology, addresses a major quality bottleneck in robotics. By explicitly accounting for volumetric constraints and precise ground contacts, our pipeline instantly transfers expert human demonstrations onto virtually any target humanoid skeleton while maintaining strict physical plausibility.
By providing a ready-made, highly diverse dictionary of physically plausible motions, our SDP gives research teams a massive head start in teaching humanoid agents how to seamlessly navigate the world.
Generative World Modeling
World Models (such as V-JEPA or Genie-3-style architectures) are designed to understand the physics of our reality. They do this through action-conditioned modeling: predicting the next state of a video environment based on the specific actions taking place within it. The single greatest bottleneck in training these models is the severe scarcity of video data paired with precise, structural action labels (such as 3D joint states, depth, and camera trajectories).
Our SDP solves this from two angles:
First, the pipeline itself inherently generates the exact data world models crave: millions of diverse videos paired with ground-truth 3D motion, camera, and depth metadata.
Second, our HPE model can be deployed at scale to process unstructured internet video. By utilizing our HPE to extract precise 3D human motion representations from wild footage, we can effectively turn the vast, unlabeled internet into a structurally annotated database. This transforms raw pixels into the grounded action-state transitions necessary to train the next generation of physically accurate world models.
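The conversion from recovered motion to world-model supervision can be sketched in a few lines. Here the "action" is simply the frame-to-frame state delta, which is an illustrative choice; richer action labels are equally possible.

```python
def to_transitions(states):
    """Turn a per-frame sequence of states (e.g. 3D joint positions an
    HPE model recovered from wild footage) into (state, action,
    next_state) tuples -- the supervision an action-conditioned world
    model trains on. The 'action' here is the frame-to-frame delta."""
    transitions = []
    for t in range(len(states) - 1):
        state, next_state = states[t], states[t + 1]
        action = [b - a for a, b in zip(state, next_state)]
        transitions.append((state, action, next_state))
    return transitions

trajectory = [[0, 0], [1, 0], [2, 1]]      # toy 2-DoF joint trajectory
transitions = to_transitions(trajectory)   # 3 frames yield 2 transitions
```

Run at scale over internet video, this is what turns raw pixels into the grounded action-state transitions described above.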