Motion Retargeting: Beyond The Skeleton and Into the Mesh

Traditional retargeting methods operate on skeletons: they map bones from one character to another and call it done. But contact does not live on the skeleton. When a character crosses their arms, clasps their hands, or plants a foot on the ground, those interactions happen on the skin, the mesh. A skeleton-based method has no way to see, measure, or preserve them.
Kinetix developed its retargeting framework to solve a recurring problem in the field and industry: transferring movement generated by human pose estimation models onto characters with complex and widely varied morphologies. By reasoning about the characters' meshes instead of their skeletons, it preserves the semantic value of contact. Retargeting needs to work on any morphology, automatically, and in real time; this is what real-time mesh-aware retargeting was designed to do.
Kinetix’s proprietary contact-aware motion retargeting approach using optimal transport
Placing reference points on the skin rather than on the skeleton.
Rather than reasoning about bones, our retargeting method places its reference points (or key vertices) directly on the character's skin. We identify 41 key vertices on a generic humanoid template mesh, chosen to provide sparse but comprehensive coverage of the body's surface, with particular attention to areas prone to contact: hands, feet, head, and torso.
Our goal is to identify key vertices on both the source and target meshes. To do this, we manually select key vertices on the template mesh once, then deform that template to match the source and the target. This deformation process, which uses optimal transport, establishes a correspondence between the template's vertices and those of the source and target, allowing us to locate the key vertices on both.
This process is entirely automatic: given any humanoid character with a skeleton and a skinned mesh, our method can identify where its key vertices should be, regardless of the character's proportions or mesh density, as long as the character retains a bipedal structure.
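The template-to-character correspondence step can be illustrated with a minimal entropic-regularized Sinkhorn solver. This is a toy NumPy sketch, not Kinetix's actual deformation pipeline: the function name and parameters are ours, and it matches a handful of template points to mesh vertices by transport mass rather than deforming a full mesh.

```python
import numpy as np

def sinkhorn_correspondence(template_pts, mesh_pts, reg=0.05, n_iters=200):
    """Entropic-regularized optimal transport (Sinkhorn) between two point sets.

    Returns, for each template point, the index of the mesh vertex that
    receives the most transport mass (a hard correspondence)."""
    # Pairwise Euclidean cost matrix between template and mesh points.
    C = np.linalg.norm(template_pts[:, None, :] - mesh_pts[None, :, :], axis=-1)
    K = np.exp(-C / reg)                                      # Gibbs kernel
    a = np.full(len(template_pts), 1.0 / len(template_pts))   # uniform marginals
    b = np.full(len(mesh_pts), 1.0 / len(mesh_pts))
    u = np.ones_like(a)
    for _ in range(n_iters):                                  # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                           # transport plan
    return P.argmax(axis=1)                                   # hard assignment
```

In practice a correspondence like this would be computed once per character, so it sits outside the per-frame real-time loop.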
Focusing on what really matters each frame.
With its 41 key vertices, our method computes a set of motion descriptors that capture the relationships between the points: pairwise distances, directions, penetration depth, height relative to the ground, and horizontal sliding velocity. These descriptors capture what the motion looks like at each frame. But here is the critical insight: at any given moment in an animation, only a few key relationships carry all the semantic meaning.
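As an illustration, the per-frame descriptors could be computed roughly as follows. This is a NumPy sketch under our own assumptions (a y-up coordinate system, a flat ground plane, and illustrative function names), not Kinetix's implementation; here negative height stands in for ground penetration depth.

```python
import numpy as np

def motion_descriptors(kv, kv_prev, dt=1 / 30, ground_y=0.0):
    """Per-frame motion descriptors for an (N, 3) array of key-vertex positions.

    Returns pairwise distances, unit directions between pairs, signed height
    above the ground (negative = penetration), and horizontal sliding speed."""
    diff = kv[:, None, :] - kv[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)              # (N, N) pairwise distances
    dirs = diff / np.maximum(dist[..., None], 1e-8)   # unit directions, safe on diagonal
    height = kv[:, 1] - ground_y                      # signed height relative to ground
    vel = (kv - kv_prev) / dt                         # finite-difference velocity
    sliding = np.linalg.norm(vel[:, [0, 2]], axis=-1) # horizontal (xz) sliding speed
    return dist, dirs, height, sliding
```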
Our framework uses an optimization-based retargeting method: it defines a set of rules that the target character's pose must satisfy, then iteratively adjusts the pose until those rules are met as closely as possible. The more rules the system tries to satisfy at once, the slower and harder the optimization becomes.
Reducing the number of constraints therefore eases the optimization significantly. When a character is mid-stride, the foot-ground relationship is essential; the distance between the left hand and the right knee is irrelevant. Trying to preserve every relationship could also produce conflicting constraints, especially when source and target characters have very different morphologies.
Our framework solves this by automatically deciding what matters at each frame. At each frame, the system checks two things: which body parts are near each other, and which are near the ground. Only those relationships get flagged as important. Everything else is ignored. This means the system is only ever solving a small, focused problem, which is what makes it fast without losing accuracy.
This sparsity is not just a shortcut that sacrifices accuracy for speed. It is what makes the system both fast and accurate: fast because the optimizer works with a small, focused set of constraints, and accurate because the constraints it does enforce are precisely the ones that matter for preserving the perceived quality of the motion.
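The per-frame selection described above can be sketched as a simple proximity test. This is an illustrative NumPy sketch with made-up thresholds and names, not Kinetix's actual selection logic:

```python
import numpy as np

def select_active_constraints(kv, contact_eps=0.05, ground_eps=0.03, ground_y=0.0):
    """Flag which relationships matter at this frame.

    A pair of key vertices is active when they are nearly in contact; a vertex
    is ground-active when it is close to the ground plane (y-up assumed)."""
    dist = np.linalg.norm(kv[:, None, :] - kv[None, :, :], axis=-1)
    # Near-contact pairs, excluding each vertex paired with itself.
    pair_active = (dist < contact_eps) & ~np.eye(len(kv), dtype=bool)
    ground_active = (kv[:, 1] - ground_y) < ground_eps
    return pair_active, ground_active
```

Only the flagged relationships feed the optimizer, which is what keeps the per-frame problem small.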
Reaching real-time with three simple rules
Our framework achieves near real-time performance (a batch of 75 frames every 3 seconds) by treating retargeting as a lightweight optimization problem, solving for a small number of constraints rather than the full set. Given the sparse motion descriptors (e.g. distances between pairs, directions, penetration depth) and their adaptive weights, it optimizes the target character's pose to minimize three losses simultaneously:
A semantic loss that ensures the weighted motion descriptors of the target match those of the source, preserving the meaning of contacts, distances, directions, and ground interactions.
A regularization loss that keeps the result close to a plausible starting pose, preventing the optimizer from drifting into unrealistic configurations.
A smoothness loss that minimizes jerk across frames, ensuring the retargeted animation remains temporally coherent and free of artifacts.
The optimization runs on the GPU using differentiable computation, the same math that powers neural network training, processing all frames of an animation in batches. On a mid-range setup with an NVIDIA RTX 3060 GPU, the system sustains 67 frames per second for animations longer than three seconds. This makes it suitable for real-time applications such as motion capture pre-visualization, gaming pipelines, live avatar animation, and interactive retargeting workflows.
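The three losses can be demonstrated on a toy problem. The sketch below, which is our own simplification and not Kinetix's GPU implementation, optimizes key-vertex positions directly with finite-difference gradients instead of autodiff, and uses pairwise distances as the only semantic descriptor; weights, step sizes, and names are illustrative.

```python
import numpy as np

def total_loss(x, d_src, x_rest, w_sem=1.0, w_reg=0.1, w_smooth=0.5):
    """x: (T, N, 3) trajectory of N key vertices over T frames (toy problem)."""
    # Semantic: match the source's pairwise distances at each frame.
    d = np.linalg.norm(x[:, :, None, :] - x[:, None, :, :], axis=-1)
    sem = np.mean((d - d_src) ** 2)
    # Regularization: stay near a plausible rest pose.
    reg = np.mean((x - x_rest) ** 2)
    # Smoothness: minimize jerk, the third temporal difference.
    jerk = np.diff(x, n=3, axis=0) if len(x) > 3 else np.zeros(1)
    smooth = np.mean(jerk ** 2)
    return w_sem * sem + w_reg * reg + w_smooth * smooth

def optimize(x0, d_src, x_rest, lr=0.5, steps=300, eps=1e-4):
    """Plain gradient descent with central-difference gradients (toy scale only)."""
    x = x0.copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        flat, gflat = x.ravel(), g.ravel()   # views into x and g
        for i in range(flat.size):
            orig = flat[i]
            flat[i] = orig + eps; hi = total_loss(x, d_src, x_rest)
            flat[i] = orig - eps; lo = total_loss(x, d_src, x_rest)
            gflat[i] = (hi - lo) / (2 * eps)
            flat[i] = orig
        x -= lr * g
    return x
```

For example, starting two key vertices 0.5 apart when the source holds them 1.0 apart, the semantic term pulls them toward the source distance while the regularizer tempers the drift; the real system gets its speed from replacing the finite differences with batched autodiff on the GPU.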
Motion Aware Retargeting in Practice
In a typical animation pipeline, retargeting a single set of animations across a diverse character cast produces dozens of artifacts: arms clipping through torsos, feet sliding on the ground, hands that should be clasped floating apart. Each one requires manual cleanup, multiplied across every character and every animation. The system eliminates this class of problems at the source. Because it sees the mesh, contacts that hold on the source character hold on the target, regardless of body type.
This framework strongly accelerates retargeting cleanup for animators. When a motion is retargeted without mesh awareness, an animator must manually correct structural failures: self-penetrations, ground penetrations, and foot sliding. Fixing self-penetration alone can be time-consuming even for a short clip. Adding foot-sliding correction, ground-penetration fixes, and lost-contact restoration, a full cleanup pass on a 10-second complex motion clip, such as dance or martial arts, can take over three hours. This framework produces contact-correct output in seconds, reducing or eliminating the cleanup step entirely.
But the transformation does not stop at traditional pipelines. A new generation of animation workflows is emerging, driven by generative AI. Gaming studios, live avatar platforms, robotics pipelines, and synthetic data generation now consider contact-aware retargeting a foundational requirement. These new pipelines demand a retargeting layer that is precise enough to preserve the physical plausibility of generated motion, and flexible enough to handle any output morphology. Mesh-aware retargeting is not an improvement to these pipelines. It is a foundational step towards enabling them.
Motion retargeting is no longer a post-processing step. It is infrastructure. As character diversity grows across gaming, live avatars, and embodied AI, the ability to transfer motion accurately onto any morphology, in real time, and with contact preserved, becomes a foundational layer that everything else builds on.
Want to find out more about our Mesh-Aware retargeting method? Read our ReConForM paper here:

