The Physical Understanding Gap in Video Generation

Video generation has made extraordinary progress. Models like Veo 3.1, Kling 3.0, Seedance 2.0, and Wan 2.6 now routinely produce 1080p video with coherent lighting, natural motion, and recognizable human faces. In two years, the field went from producing jittery, uncanny clips to generating cinematic sequences that fool the untrained eye.

But look closer, and the same failures appear everywhere. Generated objects pass through surfaces instead of colliding. Characters reach for a cup but cannot grasp it. Physics breaks at the moment of interaction, and fine articulation collapses into pixelated blur. The models generate projections of a 3D world without understanding the world they are projecting.

FIGURE 1

Structural failures in current SOTA models

LTX 2.3: "A glass of water falls from the table onto the ground. However, the man tries to stop it with a tennis racket."

Veo 3.1: "Two people pass a ball back and forth while a third person walks between them, briefly occluding the ball."

Wan 2.6: "Two dancers perform a lift where one partner throws the other into the air and catches them."


The Structural Limits of Current Video Generation Models

These failures are not random glitches. They are structural consequences of models with no internal representation of contact, causality, or three-dimensional geometry. Current video models learn statistical patterns of pixel change across time. They learn that objects tend to fall, that faces have two eyes, that shadows follow light. But they do not learn why.

The most rigorous evidence comes from ByteDance Research and Tsinghua University at ICML 2025. Bingyi Kang and the Seed research team found that when video models encounter unfamiliar scenarios, they do not reason about physics but copy the closest training examples, grasping onto superficial cues like color and object size while ignoring what actually matters: how fast things move, where they go, and why. The conclusion was blunt: "scaling alone is insufficient for video generation models to uncover fundamental physical laws" (Kang et al., "How Far is Video Generation from World Model: A Physical Law Perspective", ICML 2025).
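This "case matching" behavior can be caricatured with a toy predictor (a deliberate oversimplification for intuition, not the paper's experimental setup): a model that memorizes training trajectories and copies the nearest one does fine in distribution, but its error grows with how far a query sits outside the training range.

```python
import numpy as np

# Toy caricature of "case matching" (not Kang et al.'s actual setup):
# a memorizing predictor copies the stored trajectory whose initial
# velocity is closest, instead of applying the underlying law.

def true_trajectory(v0, t):
    """Ground truth: uniform motion, x(t) = v0 * t."""
    return v0 * t

t = np.linspace(0.0, 1.0, 5)
train_v0 = np.array([1.0, 2.0, 3.0])                  # in-distribution speeds
memory = {v: true_trajectory(v, t) for v in train_v0}

def case_matching_predictor(v0_query):
    """Return the memorized trajectory with the nearest initial velocity."""
    nearest = train_v0[np.argmin(np.abs(train_v0 - v0_query))]
    return memory[nearest]

v0_test = 6.0  # out-of-distribution query
predicted = case_matching_predictor(v0_test)
actual = true_trajectory(v0_test, t)
# The predictor reproduces the v0 = 3 training case, so its error at t = 1
# is |3 - 6| = 3: it matched a superficially similar case, not the law.
print(np.max(np.abs(predicted - actual)))  # 3.0
```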

If the problem is structural, then surface-level fixes will not solve it. This realization has pushed the field toward the concept of world models. A model that truly understood the world would not need explicit conditioning to avoid physics violations. It would avoid them because it knows how physics works. But while researchers agree on the diagnosis, their proposed treatments are fundamentally different.

Two Paths to Physical Understanding

One methodological position holds that physical understanding can emerge from 2D video alone, given sufficient scale. Trained on billions of video frames, the model develops implicit motion priors from understanding how pixels change over time. It learns that objects tend to fall at a consistent rate, that shadows track light sources, that reflections mirror the scene above a water surface. None of these rules are programmed. They are absorbed from the statistical regularities of how pixels change across time. And the approach has produced real results: at sufficient scale, generated videos exhibit emergent 3D consistency as the camera rotates, objects persist through occlusion and re-emerge correctly, and simple state changes are tracked across frames. These properties emerge without any explicit 3D representation or physics engine. They are purely phenomena of scale.

In 2D-conditioned models, the ceiling is structural: the model learns that pixels corresponding to falling objects tend to move downward at rates consistent with the training data. It can produce output that looks physically plausible without any mechanism to verify whether it is. The model contains no structural information about why things happen the way they do. It has learned the appearance of physics, not its logic. Fei-Fei Li frames the problem precisely: current AI systems are "wordsmiths in the dark, eloquent but inexperienced, knowledgeable but ungrounded."

Physical reasoning requires spatial intelligence: the ability to reason about geometry, depth, and dynamics in three dimensions. That information does not exist in 2D projections. It must be provided.

3D conditioning addresses this directly. Instead of inferring spatial relationships from flat projections, models receive explicit three-dimensional information: the depth of every surface in the scene, the position of every object, and the geometric relationships between them. A 3D-conditioned model does not need to guess whether two objects occupy the same space; it knows their coordinates. It does not need to infer depth from visual cues; it receives depth as input. And it does not need to relearn the same physical interaction from every camera angle, because 3D coordinates are viewpoint-invariant: the spatial relationship between two objects is the same whether observed from the front or from above. 3D conditioning does not replace what 2D models have learned. It adds the structural layer they are missing: a coordinate system grounded in how the physical world is actually organized.
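The viewpoint-invariance claim can be made concrete in a few lines of numpy (an illustrative sketch, not code from any of the models discussed): as a camera orbits two objects, the distance between their 2D projections changes with every viewpoint, while their 3D separation never does.

```python
import numpy as np

# Illustration: the 3D relationship between two objects is viewpoint-
# invariant, while their 2D projections are not.

def rotation_y(theta):
    """Rotation about the vertical axis, i.e. a camera orbiting the scene."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project(point, focal=1.0):
    """Pinhole projection of a 3D camera-space point onto the image plane."""
    x, y, z = point
    return np.array([focal * x / z, focal * y / z])

# Two objects in world coordinates (say, a character and a chair).
a = np.array([0.0, 0.0, 5.0])
b = np.array([1.0, 0.0, 6.0])

for theta in (0.0, np.pi / 6, np.pi / 3):   # three camera viewpoints
    R = rotation_y(theta)
    a_cam, b_cam = R @ a, R @ b
    d3 = np.linalg.norm(a_cam - b_cam)      # identical from every viewpoint
    d2 = np.linalg.norm(project(a_cam) - project(b_cam))  # viewpoint-dependent
    print(f"theta={theta:.2f}  3D distance={d3:.3f}  2D distance={d2:.3f}")
```

A 2D model only ever sees the second quantity, which is why it must relearn the same spatial relationship from every angle.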

FIGURE 2

2D baseline vs. 3D-conditioned (Kamo-1)

Example #1: Spatial awareness regardless of viewpoint
Kling 3.0 Motion Control: camera movement ignores the object's spatial position
Kamo-1: model understands the exact location of the object

Example #2: Complex ground-contact interaction
Wan 2.2 Animate: model misunderstands the ground-contact interaction
Kamo-1: correct understanding of the ground-contact interaction

Example #3: Awareness of object coordinates
Kling 3.0 Motion Control: model wrongly estimates the relation between character and chair
Kamo-1: model is aware of the position of each object in space

Example #4: Physical plausibility
Runway Act Two: character inaccurately represented (proportions shift)
Kamo-1: conditioning keeps character proportions coherent

Example #5: Environment & object reasoning
Luma Ray 3: lacks object and environment understanding
Kamo-1: understands relations between character and environment


3D Conditioning as Infrastructure

FIGURE 3

Kamo-1 Conditioning Architecture

Conditioning signals: 3D body poses (joint configuration in world space), camera trajectory (observer path), depth maps (geometrically-aware scene reconstruction), and first frame (environment reference) feed Kamo-1, a 3D-conditioned video diffusion model with a viewpoint-invariant spatial representation, which generates 2D video. Character motion and camera trajectory are tracked independently in 3D.


Kamo-1, our 3D-conditioned video diffusion model, ingests multiple 3D conditioning signals simultaneously. 3D body poses define character movement through space. Camera trajectories specify the observer's exact path. Depth maps encode scene geometry frame by frame, allowing camera disentanglement and positioning the character accurately in the environment. The model generates 2D video, but with access to the geometric truth of what it is rendering: because it is aware of movement in 3D space, the camera is disentangled from the scene, and character motion remains precise in space and time regardless of viewpoint. If the camera orbits a walking character, the character continues in a straight line rather than warping or sliding with the viewpoint, because the model tracks each trajectory independently in 3D.
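As a rough sketch of what such a conditioning bundle might look like in code, the structure below groups the four signal types named above. The field names and array shapes are assumptions for illustration only, not Kamo-1's actual interface.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ConditioningSignals:
    """Hypothetical bundle of per-clip 3D conditioning inputs (shapes assumed)."""
    body_poses: np.ndarray         # (frames, joints, 3): joint positions in world space
    camera_trajectory: np.ndarray  # (frames, 4, 4): camera-to-world pose per frame
    depth_maps: np.ndarray         # (frames, H, W): per-pixel scene depth
    first_frame: np.ndarray        # (H, W, 3): RGB environment reference

    def validate(self):
        """Every per-frame signal must cover the same number of frames."""
        frames = self.body_poses.shape[0]
        assert self.camera_trajectory.shape == (frames, 4, 4)
        assert self.depth_maps.shape[0] == frames
        return frames

# Toy instance: 16 frames, a 24-joint skeleton, 32x32 depth maps.
frames, joints, h, w = 16, 24, 32, 32
cond = ConditioningSignals(
    body_poses=np.zeros((frames, joints, 3)),
    camera_trajectory=np.tile(np.eye(4), (frames, 1, 1)),
    depth_maps=np.ones((frames, h, w)),
    first_frame=np.zeros((h, w, 3)),
)
print(cond.validate())  # 16
```

The key design point the sketch captures is that body poses and camera poses are separate streams: disentangling them is what lets character motion stay fixed while the viewpoint changes.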

FIGURE 4

Camera orbit: character trajectory warped vs. character trajectory preserved

Baseline (2D-conditioned): Kling 2.6 Motion Control

Kamo-1 (3D-conditioned)


Kamo-1 currently conditions on character animation and camera in 3D. The environment, objects, and surfaces are still conditioned only on the first frame: for those elements, the model produces a convincing 2D effect without maintaining a true 3D position. In a fully 3D-conditioned environment, by contrast, the character exists within a persistent spatial world. When they walk away, the model knows their exact coordinates, their proportions remain consistent, and the reduction in apparent size is a natural consequence of perspective in a real coordinate system, not a learned visual trick. This is the difference between simulating depth and understanding space: one reproduces appearance, the other preserves the geometric truth of where things are, how large they are, and how they relate to everything around them.

Further out, the direction points toward conditioning on full persistent 3D worlds, complete environments with geometry, materials, and lighting, within which video generation operates. This is where the boundary between "3D-conditioned video model" and "world model" begins to dissolve.
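That "natural consequence of perspective" is just pinhole geometry, sketched below with arbitrary numbers: as a character walks away, their apparent size falls off as 1/depth, while their proportions never change.

```python
# Illustrative sketch of perspective in a real coordinate system:
# apparent size scales as 1/depth, proportions stay fixed.

def projected_size(size, depth, focal=1.0):
    """On-image extent of an object of a given physical size at a given depth."""
    return focal * size / depth

height, width = 1.8, 0.5  # a character's physical dimensions in meters

for depth in (2.0, 4.0, 8.0):  # the character walks away from the camera
    h = projected_size(height, depth)
    w = projected_size(width, depth)
    # Apparent height halves each time depth doubles, but the
    # height-to-width ratio is constant: proportions are preserved.
    print(f"depth={depth}  apparent height={h:.3f}  aspect ratio={h / w:.2f}")
```

A model with true coordinates gets this behavior for free; a 2D model has to learn it as a statistical pattern and can violate it at any frame.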

3D conditioning is a foundation, not a solution. You cannot reason about physics without first knowing where things are in space. 3D conditioning provides that spatial scaffold: the geometry and depth relationships on which physical reasoning depends. Without this layer, any understanding the model develops is built on ambiguous 2D projections. With it, the model has a coordinate system on which higher-level physical understanding can eventually be constructed. 3D conditioning does not teach physics. It builds the foundation that physics requires.

3D approaches come with real trade-offs. Current 3D conditioning and rendering methods tend to reduce visual quality: tighter spatial control can produce video that is geometrically accurate but visually rigid, and a video model that does not look realistic is in many cases unusable regardless of how precise its physics are. High-quality 3D data is rare and difficult to obtain. 3D generation technologies are not yet fully reliable, and feeding a video model inaccurate 3D reconstructions introduces errors that propagate through every frame. These are real constraints. But they are engineering problems, not structural ones. Visual quality improves as conditioning architectures mature. 3D data availability is growing through synthetic generation. And computational costs follow the same trajectory as every prior advance in the field: expensive at introduction, rapidly declining with optimization.

We are building toward a future where video generation produces not just images that look like reality, but representations that are grounded in how reality works.

The field is just beginning to discover what that requires.