The Physical Understanding Gap in Video Generation

Video generation has made extraordinary progress. Models like Veo 3.1, Kling 3.0, Seedance 2.0, and Wan 2.6 now routinely produce 1080p video with coherent lighting, natural motion, and recognizable human faces. In two years, the field went from producing jittery, uncanny clips to generating cinematic sequences that fool the untrained eye.
But look closer, and the same failures appear everywhere. Generated objects pass through surfaces instead of colliding. Characters reach for a cup but cannot grasp it. Physics breaks at the moment of interaction, and fine articulation collapses into pixelated blur. The models generate projections of a 3D world without understanding the world they are projecting.
The Structural Limits of Current Video Generation Models
Current video models learn statistical patterns of pixel change across time. They learn that objects tend to fall, that faces have two eyes, that shadows follow light. But they do not learn why. The failures above are structural consequences of this: models with no internal representation of contact, causality, or three-dimensional geometry.
The most rigorous evidence came from ByteDance Research and Tsinghua University at ICML 2025. Bingyi Kang and the Seed research team found that when video models encounter unfamiliar scenarios, they do not reason about physics but copy the closest training examples, grasping onto superficial cues like color and object size while ignoring what actually matters: how fast things move, where they go, and why. The conclusion was blunt: "scaling alone is insufficient for video generation models to uncover fundamental physical laws." (How Far is Video Generation from World Model: A Physical Law Perspective, Kang et al. 2025).
If the problem is structural, then surface-level fixes will not solve it. This realization has pushed the field toward the concept of world models. A model that truly understood the world would not need explicit conditioning to avoid physics violations. It would avoid them because it knows how physics works. But while researchers agree on the diagnosis, their proposed treatments are fundamentally different.
Two Paths to Physical Understanding
One methodological position holds that physical understanding can emerge from 2D video alone, given sufficient scale. Trained on billions of video frames, the model develops implicit motion priors from understanding how pixels change over time. It learns that objects tend to fall at a consistent rate, that shadows track light sources, that reflections mirror the scene above a water surface. None of these rules are programmed. They are absorbed from the statistical regularities of how pixels change across time. And the approach has produced real results: at sufficient scale, generated videos exhibit emergent 3D consistency as the camera rotates, objects persist through occlusion and re-emerge correctly, and simple state changes are tracked across frames. These properties emerge without any explicit 3D representation or physics engine. They are purely phenomena of scale.
In 2D-conditioned models, the ceiling is structural: the model learns that "pixels corresponding to falling objects tend to move downward at rates consistent with training data." It can produce output that looks physically plausible without developing any mechanism to verify whether it is. The model contains no structural information about why things happen the way they do. It has learned the appearance of physics, not its logic.

Fei-Fei Li frames the problem precisely: current AI systems are "wordsmiths in the dark, eloquent but inexperienced, knowledgeable but ungrounded." Physical reasoning requires spatial intelligence, the ability to reason about geometry, depth, and dynamics in three dimensions. That information does not exist in 2D projections. It must be provided.

3D conditioning addresses this directly. Instead of inferring spatial relationships from flat projections, models receive explicit three-dimensional information: the depth of every surface in the scene, the position of every object, and the geometric relationships between them. A 3D-conditioned model does not need to guess whether two objects occupy the same space; it knows their coordinates. It does not need to infer depth from visual cues; it receives depth as input. And it does not need to relearn the same physical interaction from every camera angle, because 3D coordinates are viewpoint-invariant: the spatial relationship between two objects is the same whether observed from the front or from above. 3D conditioning does not replace what 2D models have learned. It adds the structural layer they are missing: a coordinate system grounded in how the physical world is actually organized.
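The viewpoint-invariance argument can be made concrete with a few lines of geometry. In the sketch below (an illustration, not any model's actual pipeline), the 3D distance between two objects is unchanged as the camera orbits, while their projected 2D distance on the image plane varies with every viewpoint, which is exactly the ambiguity a model trained only on projections must cope with:

```python
import math

def rotate_y(p, theta):
    """Rotate a 3D point (x, y, z) about the vertical axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)

def project(p, focal=1.0):
    """Pinhole projection of a camera-space point onto the image plane."""
    x, y, z = p
    return (focal * x / z, focal * y / z)

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two objects in front of the camera (z = depth from the camera).
a, b = (0.0, 0.0, 4.0), (1.0, 0.0, 6.0)

print(f"3D distance (any viewpoint): {dist(a, b):.3f}")
for theta in (0.0, 0.5, 1.0):  # camera orbits the scene
    ra, rb = rotate_y(a, theta), rotate_y(b, theta)
    print(f"theta={theta:.1f}  3D={dist(ra, rb):.3f}"
          f"  projected={dist(project(ra), project(rb)):.3f}")
```

The 3D column prints the same value for every angle; the projected column does not. A model given coordinates sees the first; a model given only pixels sees the second.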
3D Conditioning as Infrastructure
Kamo-1, our 3D-conditioned video diffusion model, ingests multiple 3D conditioning signals simultaneously. 3D body poses define character movement through space. Camera trajectories specify the observer's exact path. Depth maps encode scene geometry frame by frame, allowing camera disentanglement and positioning the character accurately in the environment. The model still generates 2D video, but with access to the geometric truth of what it is rendering: because it is aware of movement in 3D space, the camera is disentangled from the scene, and character motion remains precise in space and time regardless of viewpoint. If the camera orbits a walking character, the character continues in a straight line rather than warping or sliding with the viewpoint, because the model tracks each trajectory independently in 3D.
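To make "multiple 3D conditioning signals" concrete, here is a minimal sketch of how per-frame signals of this kind might be bundled and flattened for a diffusion model. The shapes and field names are hypothetical illustrations; Kamo-1's actual conditioning format is not described here:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical layout -- not Kamo-1's real conditioning format.
Vec3 = Tuple[float, float, float]

@dataclass
class FrameConditioning:
    body_pose: List[Vec3]         # 3D joint positions for the character
    camera_position: Vec3         # observer position along the trajectory
    camera_target: Vec3           # where the camera is looking
    depth_map: List[List[float]]  # per-pixel scene depth, H x W

def to_feature_vector(frame: FrameConditioning) -> List[float]:
    """Flatten one frame's 3D signals into a single conditioning vector,
    of the kind a diffusion model could consume alongside noisy latents."""
    feats: List[float] = []
    for joint in frame.body_pose:
        feats.extend(joint)
    feats.extend(frame.camera_position)
    feats.extend(frame.camera_target)
    for row in frame.depth_map:
        feats.extend(row)
    return feats

# One toy frame: a 2-joint pose, a camera, and a 2x2 depth map.
frame = FrameConditioning(
    body_pose=[(0.0, 1.7, 5.0), (0.0, 0.0, 5.0)],
    camera_position=(0.0, 1.5, 0.0),
    camera_target=(0.0, 1.0, 5.0),
    depth_map=[[4.9, 5.1], [5.0, 5.2]],
)
vec = to_feature_vector(frame)
print(len(vec))  # 2*3 (pose) + 3 (camera pos) + 3 (target) + 4 (depth) = 16
```

The point of the sketch is structural: pose, camera, and depth are independent streams describing the same 3D scene, which is what lets the model disentangle camera motion from character motion.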
Kamo-1 currently conditions on character animation and camera in 3D. The environment, objects, and surfaces are still conditioned only on the first frame: for them, the model produces a convincing 2D effect without maintaining a true 3D position. In a fully 3D-conditioned environment, by contrast, the character exists within a persistent spatial world. When they walk away, the model knows their exact coordinates, their proportions remain consistent, and the reduction in apparent size is a natural consequence of perspective in a real coordinate system, not a learned visual trick. This is the difference between simulating depth and understanding space: one reproduces appearance, the other preserves the geometric truth of where things are, how large they are, and how they relate to everything around them. Further out, the direction points toward conditioning on full persistent 3D worlds, complete environments with geometry, materials, and lighting, within which video generation operates. This is where the boundary between "3D-conditioned video model" and "world model" begins to dissolve.
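The claim that shrinking with distance is "a natural consequence of perspective" is one line of math: under a pinhole camera, projected height is focal length times real height divided by depth. A minimal sketch (generic pinhole model, not any model's internals):

```python
def apparent_height(real_height_m, depth_m, focal=1.0):
    """Projected (image-plane) height of an object under a pinhole camera."""
    return focal * real_height_m / depth_m

character_height = 1.8  # metres, fixed in the 3D world
for depth in (2.0, 4.0, 8.0):
    h = apparent_height(character_height, depth)
    print(f"depth={depth:.0f}m  apparent height={h:.3f}")
# Doubling the depth halves the apparent height; the character's real
# proportions never change, only the projection of them does.
```

A model that tracks the character's 3D coordinates gets this shrinking for free from geometry; a purely 2D model has to learn it as a statistical pattern in pixels.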
3D conditioning is a foundation, not a solution. You cannot reason about physics without first knowing where things are in space. 3D conditioning provides that spatial scaffold: the geometry and depth relationships on which physical reasoning depends. Without this layer, any understanding the model develops is built on ambiguous 2D projections. With it, the model has a coordinate system on which higher-level physical understanding can eventually be constructed. 3D conditioning does not teach physics. It builds the foundation that physics requires.
3D approaches come with real trade-offs. Current 3D conditioning and rendering methods tend to reduce visual quality: tighter spatial control can produce video that is geometrically accurate but visually rigid, and a video model that is not realistic is often not usable regardless of how precise its physics are. 3D data is both rare and difficult to obtain at high quality. 3D generation technologies are not yet perfectly reliable, and feeding a video model inaccurate 3D reconstructions introduces errors that propagate through every frame. These are real constraints. But they are engineering problems, not structural ones. Visual quality improves as conditioning architectures mature. 3D data availability is growing through synthetic generation. And computational costs follow the same trajectory as every prior advance in the field: expensive at introduction, rapidly declining with optimization.
We are building toward a future where video generation produces not just images that look like reality, but representations that are grounded in how reality works.
The field is just beginning to discover what that requires.