Motion Retargeting: Building a Better Retraining Dataset

Learning-based retargeting methods represent the field’s most active research direction (e.g. R2ET, SMTNet). Rather than hand-crafting rules for each skeleton, these methods train neural networks to learn the distribution of motion semantics: what a walk looks like, how a jump should land, where hands meet during a clap. The promise is a system that generalizes across body types, automatically.

But these methods face a data problem. Training a retargeting network requires paired motion data: the same animation applied to characters with different morphologies, so the model can learn what "correct retargeting" looks like across body types. Without large-scale, high-quality paired datasets, public research has been forced to rely on unsupervised training frameworks, rather than direct supervision.

To fill this gap, the field adopted what is commonly referred to as "the Mixamo dataset." The foundational NKN retargeting paper (Villegas et al., 2018) was the first to assemble a subset of animations from Adobe's Mixamo platform and use it to train a retargeting network. Every major retargeting paper since, from SAN to R2ET to SMTNet, has built on this same dataset.

The problem is that this dataset, the only available source of paired retargeting data, is inherently flawed and closed-source.

Mixamo is not a fixed dataset. It is a web platform with roughly 95 characters and 2,500 animations, all of which anyone can download. But the field never used all of it. The NKN paper picked 7 characters for training and 6 for validation, all with similar skeleton structures, and that narrow subset became the standard dataset. Every major recent paper (e.g. SAN, R2ET, SMTNet) trains and evaluates on the same subset or something close to it. That is not enough diversity to generalize: out-of-distribution animations are almost impossible to generate. The motion library is also small (2,500 clips) compared to AMASS (11,400 sequences, 42 hours) or Motion-X (81,100 sequences, 144 hours). And the motions themselves are riddled with artifacts: ground penetration, floating feet, missing contacts, self-penetration.

Mixamo is also entirely closed-source. Adobe's retargeting algorithm is proprietary, and since June 2021, Adobe's terms of service explicitly prohibit using Mixamo content for machine learning purposes. Nevertheless, the field has been training retargeting neural networks with this opaque, restricted dataset for years, treating it as the ground-truth standard.

We built an alternative on industry-grade tools

We developed an industry-grade retargeting method that uses control rigs to generate large-scale, high-quality paired motion datasets. Rather than treating retargeting as a black box, our approach is fully explainable: every step of the motion transfer is transparent and reproducible.

We designed the method to meet three requirements: 

  1. It must accept any humanoid character as input, regardless of its morphology or the number of bones in its skeleton, as long as the character is bipedal. 

  2. It must faithfully replicate the source motion on the target character without introducing new structural failures (foot sliding, self-penetration, loss of contact).

  3. It must be explainable, so that it can reasonably be trusted to generate new training data with minimal curation.

To the best of our knowledge, our framework is the first method to build on industry-grade rigging techniques to perform motion retargeting in rig space.

A detailed step-by-step explanation of how our alternative works

Step 1

Automatic Control Rigs

Professional animators use control rigs: interfaces built on top of a skeleton that let them work with intuitive controls rather than individual joint rotations. A foot roll controller replaces the need to manually rotate the ankle, toe, and heel bones. An inverse kinematics (IK) handle lets an animator place a hand in space and let the system figure out how the shoulder and elbow should behave. Imagine reaching for a cup on your desk: you only think about where your hand should go, and your arm and shoulder naturally follow. That is how IK works.

Prior work operates at the bone level, copying or adapting joint rotations between skeletons. We work one level higher: at the control rig level, where animation semantics are encoded directly into the rig's mechanisms.

We built a template control rig in Blender and engineered a method to automatically adapt it to any input character. Given a source and target character (each with their mesh, bones, and skinning weights), the system automatically generates two control rigs adapted to their respective morphologies.

FIGURE 1

Control Rig Placement

Control rig automatically adapted to an input character.

These control rigs provide mechanisms for every major body region:

Heel bone placement. A mechanical bone is placed at the heel of each character, computed automatically from the foot mesh geometry. This serves as the anchor point for the foot roll system.

Foot roll mechanism. A standard dual-hinge system handles the full walking cycle, from first heel impact to last toe contact. The result is that a single controller rotation can model both the heel strike and the toe push-off of a walk cycle, preserving ground contact interaction while remaining consistent with the source motion.
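As a rough sketch, the dual-hinge behavior can be thought of as routing a single roll value to heel, ball, and toe pivots. The function name, angle convention, and toe limit below are illustrative assumptions, not our rig's actual parameters:

```python
def foot_roll(roll_deg, toe_limit_deg=45.0):
    """Map a single foot-roll controller value to heel/ball/toe rotations.

    Negative values pivot the foot around the heel (heel strike);
    positive values pivot around the ball of the foot (push-off), and
    anything beyond `toe_limit_deg` spills over into the toe bone.
    Angles are in degrees; names and limits are illustrative.
    """
    heel = ball = toe = 0.0
    if roll_deg < 0.0:
        heel = roll_deg                 # rock back on the heel
    elif roll_deg <= toe_limit_deg:
        ball = roll_deg                 # roll over the ball of the foot
    else:
        ball = toe_limit_deg            # ball saturates at its limit
        toe = roll_deg - toe_limit_deg  # remainder bends the toe
    return heel, ball, toe
```

A single animated value thus covers the whole heel-strike-to-toe-off cycle.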

FIGURE 2

Heel and Foot Rigging

Dual-hinge system that handles the full walking cycle.

Hip control. A control bone placed at the pelvis level controls the character's root position and rotation in space. The pelvis has its own controller, which parents only the upper body, so the feet stay grounded when the hips are rotated or moved.

Legs and arms inverse kinematics. Rather than setting each joint's rotation along a limb, we place a target controller at the hand or foot position and use IK to automatically compute the joint angles needed to reach it. IK works backward from a target position: given where the hand or foot should be, it solves for the limb configuration that reaches it.
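For intuition, a planar two-bone IK solve has a closed form via the law of cosines. This is an illustrative sketch only, not the solver our rig uses (which works in 3D with pole targets):

```python
import math

def two_bone_ik(target_x, target_y, l1, l2):
    """Analytic two-bone IK in the plane (law of cosines).

    Given a target for the end effector (hand/foot) and the two segment
    lengths (`l1` upper segment, `l2` lower segment), return shoulder and
    elbow angles in radians that place the effector on the target.
    """
    d = math.hypot(target_x, target_y)
    # Clamp targets that are out of reach or degenerately close.
    d = min(max(d, abs(l1 - l2) + 1e-9), l1 + l2 - 1e-9)
    # Interior elbow angle of the (l1, l2, d) triangle.
    cos_elbow = (l1**2 + l2**2 - d**2) / (2 * l1 * l2)
    elbow = math.pi - math.acos(max(-1.0, min(1.0, cos_elbow)))
    # Shoulder = direction to target minus the triangle's base angle.
    cos_base = (l1**2 + d**2 - l2**2) / (2 * l1 * d)
    shoulder = math.atan2(target_y, target_x) - math.acos(max(-1.0, min(1.0, cos_base)))
    return shoulder, elbow
```

Feeding the returned angles back through forward kinematics lands the effector on the requested target.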

Spine and neck rotation. The rig splits the spine controller's rotation equally among all spine bones in the chain. Despite its simplicity, this mechanism proved sufficient for producing natural curvature, and is what makes the rig portable across characters with different numbers of spine bones.
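The splitting itself is deliberately simple. An illustrative sketch (a real rig composes per-bone quaternions rather than dividing Euler angles):

```python
def split_spine_rotation(controller_euler, num_spine_bones):
    """Distribute a spine controller's rotation evenly along the chain.

    Each spine bone receives 1/N of the controller's Euler rotation, so
    the accumulated rotation of the chain matches the controller. This is
    what keeps the rig portable across characters with any bone count.
    """
    return [tuple(angle / num_spine_bones for angle in controller_euler)
            for _ in range(num_spine_bones)]
```

A character with 3 spine bones and one with 7 both end up with the same overall curvature from the same controller value.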

FIGURE 3

Spine Rotation Splitting

Rig controlling rotation for all spine bones.

Fingers. Each finger is controlled by a single rig placed at the proximal phalanx (closest bone to the hand of a finger). The controller's rotation drives the first knuckle, and its scale controls how curled the remaining phalanges are. This compact representation captures the full range of finger motion with minimal controller complexity.
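A toy version of this controller mapping, with the curl range as an invented constant:

```python
def finger_pose(controller_rotation_deg, controller_scale, num_phalanges=3,
                max_curl_deg=90.0):
    """One controller per finger: rotation drives the first knuckle,
    scale drives the curl of the remaining phalanges.

    A scale of 1.0 leaves the finger straight; smaller scales curl the
    distal phalanges toward `max_curl_deg`. The mapping and the 90-degree
    range are illustrative assumptions.
    """
    curl = (1.0 - controller_scale) * max_curl_deg
    return [controller_rotation_deg] + [curl] * (num_phalanges - 1)
```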

FIGURE 4

Finger Rigging

Single rig placed in the proximal phalanx for each finger.

Step 2

Rig Inversion Algorithm

With control rigs in place on both characters, the next step is to extract the source character's animation into control rig parameters. This is the task of a rig inversion algorithm: given a bone-level animation, automatically find the controller values (position, rotation, scale) that reproduce it.

FIGURE 5

Rig Inversion

Extracting plausible motion through rig inversion.

Rig inversion is inherently ambiguous. Multiple different controller configurations can produce the same bone pose.

We solve this with a set of opinionated rules: strict priorities that resolve every ambiguity based on what produces the best retargeting results. For instance, matching precise hand and foot positions takes precedence over preserving exact elbow or knee angles. Matching the first and last finger phalanges takes priority over the middle joints. Under these priorities, the algorithm computes mathematically optimal solutions that, unlike a neural network's, are completely inspectable.

Step 3

Rig-Space Retargeting

At this point, both characters have been rigged and the source animation has been transferred onto the source character's control rig. Now we transfer the motion from the source rig to the target rig.

FIGURE 6

Retargeting the Control Rig

Extracting motion semantics into a diverse morphology.

Working at the control rig level rather than the bone level allows the retargeting to operate on high-level motion concepts, such as precise foot placement, hand positions relative to each other, or spine curvature, rather than raw joint rotations. This abstraction is what makes it possible to transfer the most semantically important aspects of motion across different morphologies.

The first step is identifying which controllers are driving the motion at each frame.

In most retargeting methods, the entire body's movement is driven from the root bone, typically placed near the hips. But this does not reflect how real human motion works. When a person walks, the motion of the hips is a consequence of the feet pushing against the ground. The feet drive the hips, not the other way around.

FIGURE 7

Identification of Rig Controllers

Automatic identification of driving controllers on the source motion.

To capture this natural dynamic, we assign each controller a per-frame weight based on three factors: how close it is to the ground (height), how stationary it is (velocity), and how much of the body's weight it supports (estimated by its distance to the projected center of mass). These three factors are multiplied together, so a controller scores high only when it is simultaneously close to the ground, still, and load-bearing, like a planted foot during a walk cycle. The highest-weighted controller is then treated as the anchor from which the rest of the pose is computed.
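A minimal sketch of this weighting, assuming an exponential decay for each factor (the decay scales below are invented tuning constants, not values from our framework):

```python
import math

def controller_weight(height, speed, dist_to_com,
                      height_scale=0.2, speed_scale=0.5, com_scale=0.3):
    """Score one controller as an anchor candidate for one frame.

    A controller scores high only when it is simultaneously close to the
    ground (low height), stationary (low speed), and load-bearing (close
    to the projected center of mass). The three factors multiply, so any
    one failing drives the whole score toward zero.
    """
    w_ground = math.exp(-height / height_scale)
    w_still = math.exp(-speed / speed_scale)
    w_load = math.exp(-dist_to_com / com_scale)
    return w_ground * w_still * w_load
```

A planted foot (low height, near-zero velocity, under the center of mass) dominates a swinging hand, which is exactly the behavior the multiplicative form encodes.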

FIGURE 8

Frame by Frame Weight Calculation

Computing the weight for each candidate controller per frame.

For each frame, we compute a set of candidate controller parameters (position and rotation) under the hypothesis that a given source controller is the anchor. If, say, the source's left foot is the anchor, the target's right foot and hands are positioned relative to it.

This operation is repeated for each hand and foot of the source. With two feet and two hands, that yields four candidates for each target controller, each scored by the importance of its corresponding source anchor. We then aggregate the candidates into a weighted average over those scores. Only the most important anchor is preserved exactly, through a speed-transfer procedure adapted to variations in limb length (leg or arm, depending on the case).
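The aggregation step can be sketched as a weighted average over the per-anchor candidates. The names and data shapes below are illustrative, not the framework's actual API:

```python
def aggregate_candidates(candidates, anchor_weights):
    """Blend per-anchor candidate positions by anchor importance.

    `candidates` maps an anchor name to the candidate (x, y, z) it implies
    for one target controller; `anchor_weights` maps anchor names to their
    per-frame importance scores. Returns the normalized weighted average.
    """
    total = sum(anchor_weights[a] for a in candidates)
    return tuple(
        sum(anchor_weights[a] * candidates[a][i] for a in candidates) / total
        for i in range(3)
    )
```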

Step 4

Morphology-based Correction

By this step, the structural movement of the skeleton has been identified on the source character, inverted into rig space, and transferred into the target character's rig. But the notion of skin has been left out. The retargeting so far operates purely on the skeleton, ignoring the surface geometry of the characters. When body proportions change significantly, this creates semantic failures: arms clipping through a wider torso, hands passing through thighs, limbs that should be in contact floating apart.

To preserve the semantic quality of motion after retargeting, we correct the position of the effectors, the terminal points of the skeleton that interact with the world (feet, hands, head), whenever the source animation drives them close to the torso of the character.

Our correction algorithm begins by approximating the torso of both source and target characters. We select the mesh vertices skinned to the hip and spine bones, slice the torso evenly along the spine axis from top to bottom, and fit a circle to each slice using least-squares. The successive circles form a set of truncated cones: a lightweight geometric proxy that captures the essential shape of the torso without the computational cost of the full mesh. When the torso deforms during movement, we use the skinning weights of the source and target characters to compute a deformation of this proxy, keeping the approximation aligned with the deforming mesh.
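The per-slice fitting can be sketched with the classic Kasa least-squares circle fit; this is an illustrative reimplementation, not our production code:

```python
import numpy as np

def fit_slice_circle(points_xy):
    """Least-squares circle fit (Kasa method) for one torso slice.

    `points_xy` is an (N, 2) sequence of mesh vertices projected onto the
    slice plane. Writing the circle as x^2 + y^2 = 2ax + 2by + c turns the
    fit into a linear least-squares problem for center (a, b) and radius.
    Stacking the fitted circles along the spine yields the truncated-cone
    torso proxy.
    """
    p = np.asarray(points_xy, dtype=float)
    A = np.column_stack([2 * p[:, 0], 2 * p[:, 1], np.ones(len(p))])
    b = (p ** 2).sum(axis=1)
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    radius = np.sqrt(c + cx ** 2 + cy ** 2)
    return (cx, cy), radius
```

Because the fit is linear, it runs in microseconds per slice, which is what keeps the proxy cheap compared to the full mesh.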

FIGURE 9

Cone Primitives: Non-Corrected Target

Cone Primitives on Source Character

Cone Primitives on Non-Corrected Target Character

We chose the torso as the primary reference for collision correction because it is the most common site of unwanted interpenetration, particularly with the arms. When an effector is detected near or inside the torso approximation, we express its position relative to the closest torso slice on the source, find the corresponding slice on the target, and apply the same relative transform with respect to the target slice instead. This preserves the spatial relationship between the limb and the body while respecting the target's different proportions.
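A 2D per-slice sketch of this relative transfer, with purely radial scaling as a simplifying assumption:

```python
def transfer_relative_to_slice(effector_src, src_center, src_radius,
                               tgt_center, tgt_radius):
    """Re-express an effector relative to its closest torso slice.

    The effector's offset from the source slice center is scaled by the
    ratio of slice radii and re-applied at the target slice center, so a
    hand hovering just outside a narrow torso stays just outside a wide
    one. Works in the 2D slice plane; names are illustrative.
    """
    offset = (effector_src[0] - src_center[0], effector_src[1] - src_center[1])
    scale = tgt_radius / src_radius
    return (tgt_center[0] + offset[0] * scale,
            tgt_center[1] + offset[1] * scale)
```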

FIGURE 10

Cone Primitives: Corrected Character

Cone Primitives on Non-Corrected Target Character

Cone Primitives on Corrected Target Character

Our framework contributes to training retargeting models

Compared to Mixamo's retargeting, which the field currently treats as ground truth, our framework produces measurably better results.

Our method generates more precise hand contacts. In motions like cartwheels and clapping, our method maintains hand-floor and hand-hand contact where Mixamo's retargeting loses it. The control rig's foot roll mechanism allows precise heel and palm placement on the ground. Our torso approximation lets us adjust and refine limbs that overlap with the body regardless of the target character's morphology.

FIGURE 11

Kinetix vs Mixamo's Retargeting

Source Animation (Top), Mixamo's Target Character (Middle), Kinetix's Target Character (Bottom)

We are pushing the boundaries even further

Our framework handles the majority of humanoid motion well, but we are actively working to extend its capabilities.

Hybrid retargeting for edge cases. The driving controller assumptions work well when motion follows standard ground-contact dynamics, but in motions like pull-ups, where the hands drive the body from above rather than the feet from below, all controllers receive similar weights and no single one is identified as the privileged anchor. We are exploring kinematic-learning hybrid methods that use a learned component to identify driving controllers in these edge cases.

Physical realism through reinforcement learning. Today, our framework faithfully retargets the source motion as-is, including any minor artifacts it may contain. We are currently researching reinforcement-learning techniques that would go a step further: repairing broken input data and transforming it into clean, physically plausible motion that respects gravity, momentum, and balance, even when the source animation is imperfect. 

More expressive rig setups. The current spine mechanism splits rotation uniformly across bones. Spline Inverse Kinematics could replace this with curve-based control, allowing non-uniform bending that more accurately reflects natural spine deformation and finer control of limbs.

A foundation for what comes next

Motion retargeting is a quality bottleneck in downstream tasks across the field. In human pose estimation, retargeted motion serves as training data: artifacts in the data directly propagate into the model's predictions. In robotics, the stakes are higher. Human-to-robot motion transfer is increasingly used to teach robotic systems how to navigate and interact with their environment, through techniques such as imitation learning, teleoperation, and locomotion transfer. 

Retargeting human motion onto robots with fundamentally different morphologies demands artifact-free data: foot sliding becomes loss of balance, ground penetration becomes a motor error, and self-penetration becomes a physical collision. If the retargeting pipeline produces flawed motion, the downstream system inherits those flaws.

Learning-based methods are increasingly explored as a path toward capturing the full complexity of motion semantics, but progress remains limited by the lack of open, high-quality training data. These methods require large, high-quality datasets, and our framework represents a foundational contribution in building them. Because it relies solely on transparent, explainable rig operations, and because it integrates directly into industry-standard animation software like Blender, our framework can be used to generate reliable ground-truth motion for any humanoid character, providing a reproducible, open-source foundation for the next generation of retargeting research.

Want to find out more about our retargeting framework for building large-scale, contact-aware datasets? Read our MIRRORED-anims paper here: