Motion Retargeting: Beyond The Skeleton and Into the Mesh

Traditional retargeting methods operate on skeletons: they map bones from one character to another and call it done. But contact does not live on the skeleton. When a character crosses their arms, clasps their hands, or plants a foot on the ground, those interactions happen on the skin, the mesh. A skeleton-based method has no way to see, measure, or preserve them. 

Kinetix developed its retargeting framework to solve a recurring problem in research and industry: transferring movement generated by human pose estimation models onto characters with complex and widely varied morphologies. By reasoning about each character's mesh rather than its skeleton, the framework preserves the semantic value of contact. Retargeting needs to work on any morphology, automatically, and in real time; this is what real-time mesh-aware retargeting was designed to do.

FIGURE 1

Source to Target Motion Retargeting

Source

Target: Example #1

Target: Example #2

Kinetix’s proprietary contact-aware motion retargeting approach using optimal transport

Placing reference points on the skin, rather than on the skeleton.

Rather than reasoning about bones, our retargeting method places its reference points (or key vertices) directly on the character's skin. We identify 41 key vertices on a generic humanoid template mesh, chosen to provide sparse but comprehensive coverage of the body's surface, with particular attention to areas prone to contact: hands, feet, head, and torso.

Here, our goal is to identify key vertices on both the source and target meshes. To do this, we manually select key vertices on a template mesh and deform it to match each of them. This deformation, which uses optimal transport, establishes a correspondence between the template's vertices and those of the source and target, allowing us to locate the key vertices on both.
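The final correspondence step can be sketched as follows. This assumes the optimal-transport deformation has already aligned the template to the target; the function name and array shapes are illustrative, not Kinetix's actual API. Once the template is deformed, each key vertex is simply snapped to its nearest vertex on the target mesh:

```python
import numpy as np

def transfer_key_vertices(deformed_template, target_vertices, key_indices):
    """Map template key vertices onto a target mesh (illustrative sketch).

    deformed_template : (N, 3) template vertices after optimal-transport
                        deformation toward the target mesh (assumed given).
    target_vertices   : (M, 3) vertices of the target character's mesh.
    key_indices       : indices of the manually chosen key vertices on the
                        template (41 in our framework).
    Returns, for each key vertex, the index of the closest target vertex.
    """
    keys = deformed_template[key_indices]                        # (K, 3)
    # Pairwise squared distances between key points and target vertices.
    d2 = ((keys[:, None, :] - target_vertices[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                                     # (K,)
```

A k-d tree would replace the brute-force distance matrix for dense meshes, but the brute-force form keeps the correspondence idea explicit.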

This process is entirely automatic: given any humanoid character with a skeleton and a skinned mesh, our method can identify where its key vertices should be, regardless of the character's proportions or mesh density, as long as it retains a bipedal structure.

FIGURE 2

Key Point Transfer to Diverse Meshes

Focusing on what really matters at each frame.

With 41 key vertices, our method computes a set of motion descriptors that capture the relationships between these points: pairwise distances, directions, penetration depth, height relative to the ground, and horizontal sliding velocity. These descriptors capture what the motion looks like at each frame. But here is the critical insight: at any given moment in an animation, only a few key relationships carry all the semantic meaning.
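As an illustration, the per-frame descriptors listed above could be computed from key-vertex positions roughly like this. This is a minimal numpy sketch assuming a y-up world with the ground at y = 0; the function name and exact definitions are ours, not the paper's:

```python
import numpy as np

def motion_descriptors(pos, prev_pos, dt=1 / 30):
    """Per-frame motion descriptors from key-vertex positions (sketch).

    pos, prev_pos : (K, 3) key-vertex positions at the current and
                    previous frame; the ground is the plane y = 0.
    """
    diff = pos[:, None, :] - pos[None, :, :]        # pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)            # pairwise distances
    with np.errstate(invalid="ignore"):
        direction = diff / dist[..., None]          # unit directions (NaN on diagonal)
    height = pos[:, 1]                              # height above the ground
    penetration = np.maximum(-height, 0.0)          # depth below the floor
    vel = (pos - prev_pos) / dt
    sliding = np.linalg.norm(vel[:, [0, 2]], axis=-1)  # horizontal (x, z) speed
    return dist, direction, height, penetration, sliding
```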

Our retargeting method is optimization-based: it defines a set of rules that the target character's pose must satisfy, then iteratively adjusts the pose until those rules are met as closely as possible. The more rules the system tries to satisfy at once, the slower and harder the optimization becomes.

Reducing the number of constraints the framework enforces therefore eases the optimization significantly. When a character is mid-stride, the foot-ground relationship is essential, while the distance between the left hand and the right knee is irrelevant. Trying to preserve every relationship can also produce conflicting constraints, especially when the source and target characters have very different morphologies.

Our framework solves this by automatically deciding what matters. At each frame, the system checks two things: which body parts are near each other, and which are near the ground. Only those relationships are flagged as important; everything else is ignored. The system is therefore only ever solving a small, focused problem, which is what makes it fast without losing accuracy.

This sparsity is not just a shortcut that sacrifices accuracy for speed. It is what makes the system both fast and accurate: fast because the optimizer works with a small, focused set of constraints, and accurate because the constraints it does enforce are precisely the ones that matter for preserving the perceived quality of the motion.

FIGURE 3

Weighting Characteristics Formulas

Adaptive Weighting Formulas
How ReConForM decides which constraints matter at each frame
Limb interaction
Floor contact

Variables
d_ij : distance between key vertices i and j
h_i : height of key vertex i above the ground
d_min : 5% of character height; below this, full interaction weight
d_max : 15% of character height; above this, zero weight
h_min : near the ground; below this, full floor weight
h_max : above this, floor contact is ignored
clamp : restricts output to the range [0, 1]
w : weight; 0 = irrelevant, 1 = critical
How to read this
When two body parts are close together, their interaction weight approaches 1 and the system enforces their relationship. When they're far apart, the weight drops to 0 and they're ignored. The same logic applies to ground contact: a foot near the floor gets a high floor weight; a raised hand gets zero. This is what makes the system sparse: at any given frame, only a few weights are active.
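The clamp-based weights described in Figure 3 can be written out as code. The 5%/15% thresholds come from the figure; the linear ramp between the thresholds is an assumption, and the helper names are ours:

```python
import numpy as np

def clamp01(x):
    """Restrict a value to the range [0, 1]."""
    return np.clip(x, 0.0, 1.0)

def interaction_weight(d_ij, char_height):
    """Limb-interaction weight: 1 when two key vertices are close,
    0 when they are far apart. Thresholds follow Figure 3; the linear
    ramp between them is an assumption."""
    d_min = 0.05 * char_height   # below this distance: full weight
    d_max = 0.15 * char_height   # above this distance: zero weight
    return clamp01((d_max - d_ij) / (d_max - d_min))

def floor_weight(h_i, h_min, h_max):
    """Floor-contact weight: 1 near the ground, 0 once clearly lifted."""
    return clamp01((h_max - h_i) / (h_max - h_min))
```

With these ramps, most pairs sit at weight 0 in any given frame, which is exactly the sparsity the optimizer exploits.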

Reaching real-time with three simple rules

Our framework achieves near real-time performance (a batch of 75 frames every 3 seconds) by treating retargeting as a lightweight optimization problem, solving for a small number of constraints rather than the full set. Given the sparse motion descriptors (e.g. pairwise distances, directions, penetration depth) and their adaptive weights, it optimizes the target character's pose to minimize three losses simultaneously:

  • A semantic loss that matches the weighted motion descriptors of the target to those of the source, preserving the meaning of contacts, distances, directions, and ground interactions. 

  • A regularization loss that keeps the result close to a plausible starting pose, preventing the optimizer from drifting into unrealistic configurations.

  • A smoothness loss that minimizes jerk across frames, ensuring the retargeted animation remains temporally coherent and free of artifacts.
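A toy version of the combined objective might look like the following. The loss weights and exact descriptor definitions here are illustrative assumptions, not the paper's actual formulation; jerk is taken as the third temporal finite difference of the key-vertex trajectory:

```python
import numpy as np

def total_loss(tgt_desc, src_desc, weights, pose, rest_pose, positions,
               w_sem=1.0, w_reg=0.1, w_smooth=0.01):
    """Weighted sum of the three objectives (illustrative sketch).

    tgt_desc, src_desc : stacked motion descriptors for target and source.
    weights            : per-descriptor adaptive weights in [0, 1].
    pose, rest_pose    : target pose parameters and a plausible reference.
    positions          : (T, K, 3) key-vertex trajectory, for smoothness.
    """
    # Semantic: weighted descriptor mismatch between target and source.
    semantic = np.sum(weights * (tgt_desc - src_desc) ** 2)
    # Regularization: stay close to a plausible reference pose.
    reg = np.sum((pose - rest_pose) ** 2)
    # Smoothness: penalize jerk (third temporal finite difference).
    jerk = np.diff(positions, n=3, axis=0)
    smooth = np.sum(jerk ** 2)
    return w_sem * semantic + w_reg * reg + w_smooth * smooth
```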

The optimization runs on the GPU using differentiable computation, the same machinery that powers neural network training, processing all frames of an animation in batches. On a mid-range setup with an NVIDIA RTX 3060 GPU, the system sustains 67 frames per second for animations longer than three seconds. This makes it suitable for real-time applications such as motion capture pre-visualization, gaming pipelines, live avatar animation, and interactive retargeting workflows.
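Conceptually, each batch is optimized by gradient descent on the pose parameters. The real system relies on GPU automatic differentiation; this toy numpy version, with a hand-written gradient of a stand-in quadratic objective, only illustrates the loop:

```python
import numpy as np

def optimize_pose(pose0, grad_fn, lr=0.05, steps=200):
    """Plain gradient descent on pose parameters (toy stand-in for
    the autodiff-based GPU optimizer described above)."""
    pose = pose0.copy()
    for _ in range(steps):
        pose -= lr * grad_fn(pose)   # step against the loss gradient
    return pose

# Stand-in objective: pull the pose toward a target configuration.
target = np.array([0.3, -0.1, 0.7])
grad = lambda p: 2.0 * (p - target)  # gradient of ||p - target||^2
solution = optimize_pose(np.zeros(3), grad)
```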

FIGURE 4

Speed Rate and Framerate

Motion Aware Retargeting in Practice

In a typical animation pipeline, retargeting a single set of animations across a diverse character cast produces dozens of artifacts: arms clipping through torsos, feet sliding on the ground, hands that should be clasped floating apart. Each one requires manual cleanup, multiplied across every character and every animation. The system eliminates this class of problems at the source. Because it sees the mesh, contacts that hold on the source character hold on the target, regardless of body type.

FIGURE 5

Retargeting Cleanup Time Comparison

Retargeting cleanup time comparison
Traditional manual cleanup vs. mesh-aware retargeting on a morphologically different character

Simple motion (locomotion, gestures)
Manual retargeting cleanup: ~3 hours
Our framework: ~5 seconds (contact-correct output, no cleanup needed)

Complex motion (dance, martial arts)
Manual retargeting cleanup: ~6 hours
Our framework: ~45 minutes (contact-correct output, some cleanup needed)

Simple motion: 2,160x faster retargeting
Complex motion: 8x faster retargeting
Based on StableMotion (2025) manual cleanup estimates. Our framework at 67 FPS on RTX 3060.

This framework dramatically accelerates retargeting cleanup for animators. When a motion is retargeted without mesh awareness, an animator must manually correct structural failures: self-penetrations, ground penetrations, and foot sliding. Fixing self-penetration alone can take a long time even for a short clip. Accounting for foot-sliding correction, ground penetration, and lost-contact restoration, a full retargeting cleanup on a 10-second complex motion clip, such as dance or martial arts, can take over 3 hours. This framework produces contact-correct output in seconds, reducing or eliminating the cleanup step entirely.

FIGURE 6

Our Retargeting vs Mixamo's Retargeting

But the transformation does not stop at traditional pipelines. A new generation of animation workflows is emerging, driven by generative AI. Gaming studios, live avatar platforms, robotics pipelines, and synthetic data generation now consider contact-aware retargeting a core requirement. These new pipelines demand a retargeting layer that is precise enough to preserve the physical plausibility of generated motion, and flexible enough to handle any output morphology. Mesh-aware retargeting is not an improvement to these pipelines. It is a foundational step towards enabling them.

FIGURE 7

Emerging Animation Pipelines

Mesh-aware retargeting in emerging animation pipelines
How mesh-aware retargeting streamlines traditional animation workflows
Animation pipeline without mesh-aware retargeting (manual, per character):
Motion Source (video) → Motion Extraction (pose estimation) → Retargeting (skeleton-based bone mapping) → Manual Cleanup (fix penetrations, sliding, lost contacts) → Target Character

Animation pipeline with mesh-aware retargeting (automated, multiple characters):
Motion Source (video) → Motion Extraction (pose estimation) → Our Framework (mesh-aware, contact-preserving) → Target Character

Motion retargeting is no longer a post-processing step. It is infrastructure. As character diversity grows across gaming, live avatars, and embodied AI, the ability to transfer motion accurately onto any morphology, in real time, and with contact preserved, becomes a foundational layer that everything else builds on.

Want to find out more about our Mesh-Aware retargeting method? Read our ReConForM paper here: