Interactive world models for first-person shooter (FPS) games must resolve high-frequency, overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, so they fail under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern the stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. The module reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. Trained on CrossFPS, the model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.
SCOPE inserts a spatial action decoupling module into each DiT block. Discrete events use cross-attention to confine effects to in-scope regions; continuous controls use MLP fusion and temporal self-attention for smooth ego-motion. All output projections are zero-initialized for stable residual training.
Figure 1. SCOPE architecture with dual-pathway spatial action decoupling module.
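The per-pixel reshaping and the zero-initialized output projection can be illustrated with a small NumPy sketch. It is not the SCOPE implementation: it covers only a simplified version of the discrete-event pathway, and the function names, the single-head attention, the one-action-embedding-per-frame layout, and the toy tensor sizes are all assumptions made for illustration.

```python
import numpy as np

def to_pixel_sequences(x):
    """(T, H, W, C) features -> (H*W, T, C): one temporal sequence per spatial position."""
    T, H, W, C = x.shape
    return x.transpose(1, 2, 0, 3).reshape(H * W, T, C)

def from_pixel_sequences(seq, height, width):
    """Inverse of to_pixel_sequences: (H*W, T, C) -> (T, H, W, C)."""
    _, T, C = seq.shape
    return seq.reshape(height, width, T, C).transpose(2, 0, 1, 3)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discrete_action_branch(x, action_tokens, w_q, w_k, w_v, w_out):
    """Residual cross-attention from each pixel's temporal sequence to the action tokens.

    x: (T, H, W, C) features; action_tokens: (N, C) action embeddings (assumed layout).
    Queries come from local visual content, so each position computes its own response.
    """
    T, H, W, C = x.shape
    seq = to_pixel_sequences(x)                  # (H*W, T, C)
    q = seq @ w_q                                # (H*W, T, C)
    k = action_tokens @ w_k                      # (N, C)
    v = action_tokens @ w_v                      # (N, C)
    attn = softmax(q @ k.T / np.sqrt(C))         # (H*W, T, N)
    update = (attn @ v) @ w_out                  # (H*W, T, C)
    return x + from_pixel_sequences(update, H, W)

# Toy sizes (hypothetical): 4 frames, a 3x5 feature grid, 8 channels.
rng = np.random.default_rng(0)
T, H, W, C = 4, 3, 5, 8
x = rng.standard_normal((T, H, W, C))
actions = rng.standard_normal((T, C))            # one action embedding per frame
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))
w_out = np.zeros((C, C))                         # zero-initialized output projection

y = discrete_action_branch(x, actions, w_q, w_k, w_v, w_out)
print(np.array_equal(y, x))                      # the branch is an identity map at initialization
```

With `w_out` set to zeros, the residual update vanishes, so the pretrained backbone's features pass through unchanged at the start of training. This is the stability property that zero-initialized output projections are meant to provide.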
SCOPE is trained on CrossFPS, a multi-game dataset with 69K clips from 7 FPS titles and frame-aligned 10-DoF gamepad telemetry.
Figure 2. CrossFPS: 69K clips across 7 FPS titles with per-frame 10-DoF action annotations.
SCOPE supports dense per-frame 10-DoF control with simultaneous multi-action composition, enabling high playability across diverse unseen environments.
SCOPE generalizes zero-shot to unseen visual styles from a single context frame, without any fine-tuning.
Side-by-side comparison with Wan2.2, HY-World 1.5, LingBot-World, and Matrix-Game 3.0.
Table 1. Quantitative comparison on the CrossFPS test set. Columns are grouped as Visual Quality (JEPA, FVD, LPIPS), Motion Quality (Dyn.Deg., Flow, Smooth), and Consistency (Photo., Depth).
| Method | JEPA↑ | FVD↓ | LPIPS↓ | Dyn.Deg.↑ | Flow↑ | Smooth↑ | Photo.↓ | Depth↓ |
|---|---|---|---|---|---|---|---|---|
| Matrix-Game 3.0 | 0.366 | 1022.7 | 0.692 | 0.661 | 13.36 | 2.502 | 1.194 | 1.524 |
| LingBot-World (Act) | 0.615 | 954.4 | 0.627 | 0.868 | 15.50 | 2.215 | 0.626 | 1.454 |
| HY-World 1.5 | 0.464 | 1131.7 | 0.611 | 0.225 | 2.37 | 1.690 | 2.523 | 1.502 |
| SCOPE (Ours) | 0.806 | 690.3 | 0.601 | 0.910 | 18.24 | 2.383 | 0.198 | 1.299 |
Table 2. Visual quality on unseen scenes (50 clips per category; first frames generated with GPT-image-2).
| Scene Style | JEPA↑ | LPIPS↓ | Flow↑ | Photo.↓ | Smooth↑ |
|---|---|---|---|---|---|
| Stylized open-world | 0.772 | 0.618 | 17.45 | 0.235 | 2.341 |
| Cooperative adventure | 0.758 | 0.632 | 16.89 | 0.251 | 2.298 |
| Mythological action | 0.781 | 0.612 | 17.82 | 0.224 | 2.356 |
| Sci-fi corridor | 0.795 | 0.605 | 18.01 | 0.212 | 2.370 |
| Average (unseen) | 0.777 | 0.617 | 17.54 | 0.231 | 2.341 |
| In-distribution (ref.) | 0.806 | 0.601 | 18.24 | 0.198 | 2.383 |
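To read the generalization gap in Table 2 at a glance, the unseen-scene average can be expressed as a relative change against the in-distribution reference. The snippet below is only a convenience calculation over the two rows above. The helper name `relative_degradation` and the sign convention (positive means worse on unseen scenes) are assumptions, not part of the paper's evaluation protocol.

```python
# Rows copied from Table 2.
in_distribution = {"JEPA": 0.806, "LPIPS": 0.601, "Flow": 18.24, "Photo.": 0.198, "Smooth": 2.383}
unseen_average = {"JEPA": 0.777, "LPIPS": 0.617, "Flow": 17.54, "Photo.": 0.231, "Smooth": 2.341}

# Direction of each metric, taken from the arrows in the table header.
higher_is_better = {"JEPA": True, "LPIPS": False, "Flow": True, "Photo.": False, "Smooth": True}

def relative_degradation(reference, value, higher_better):
    """Relative change versus the reference, signed so that positive means worse."""
    change = (value - reference) / reference
    return -change if higher_better else change

gaps = {
    metric: relative_degradation(in_distribution[metric], unseen_average[metric], higher_is_better[metric])
    for metric in in_distribution
}

for metric, gap in gaps.items():
    print(f"{metric}: {gap:.1%} worse on unseen scenes")
```

Under this convention every metric degrades on unseen scenes, with photometric error showing the largest relative change and the other four metrics staying within a few percent of the reference.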
Table 3. Action controllability on unseen scenes, reported as completion rate (N=50 per task). Columns are grouped as Single Action (Fire, Scope), Multi-Action Composition (Scope+Fire, Move+Fire, Switch+Fire), and Action-Environment Interaction (Object, Environment, NPC).
| Method | Fire | Scope | Scope+Fire | Move+Fire | Switch+Fire | Object | Environment | NPC | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Matrix-Game 3.0 | 0% | 0% | 0% | 4% | 0% | 0% | 0% | 0% | 0.5% |
| HY-World 1.5 | 4% | 12% | 2% | 36% | 2% | 0% | 6% | 2% | 8.0% |
| LingBot-World (Act) | 82% | 74% | 42% | 18% | 26% | 12% | 32% | 20% | 38.3% |
| SCOPE (Ours) | 94% | 90% | 82% | 76% | 68% | 46% | 62% | 54% | 71.5% |
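The Avg. column in Table 3 is consistent with an unweighted mean over the eight tasks, which the short check below reproduces. That reading is inferred from the numbers rather than stated in the source, and the per-group means are an extra breakdown computed here, not values reported in the table.

```python
# Completion rates (%) from Table 3, in column order:
# Fire, Scope | Scope+Fire, Move+Fire, Switch+Fire | Object, Environment, NPC
completion = {
    "Matrix-Game 3.0":     ([0, 0, 0, 4, 0, 0, 0, 0], 0.5),
    "HY-World 1.5":        ([4, 12, 2, 36, 2, 0, 6, 2], 8.0),
    "LingBot-World (Act)": ([82, 74, 42, 18, 26, 12, 32, 20], 38.3),
    "SCOPE (Ours)":        ([94, 90, 82, 76, 68, 46, 62, 54], 71.5),
}

def mean(values):
    return sum(values) / len(values)

def group_means(tasks):
    """Means for Single Action, Multi-Action Composition, Action-Environment Interaction."""
    return mean(tasks[:2]), mean(tasks[2:5]), mean(tasks[5:])

for method, (tasks, reported_avg) in completion.items():
    overall = mean(tasks)
    # The reported value is rounded to one decimal, so allow half a unit in the last place.
    assert abs(overall - reported_avg) <= 0.05 + 1e-9, method
    single, multi, interaction = group_means(tasks)
    print(f"{method}: overall {overall:.2f}, single {single:.1f}, "
          f"multi {multi:.1f}, interaction {interaction:.1f}")
```

The per-group breakdown makes the difficulty ordering visible: for every method except HY-World 1.5, completion drops from single actions to multi-action composition, and drops again for action-environment interaction.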
@article{tong2026scope,
title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
author={Zizhao Tong and Hongfeng Lai and Zeqing Wang and Zhaohu Xing and Kexu Cheng and Haoran Xu and Zhao Pu and Shangwen Zhu and Ruili Feng and Jian Zhao and Yan Zhang and Hao Tang and Yeying Jin and Ling Shao},
year={2026}
}