AV Blog 7: Efficient Video Encoding with AutoGaze

Applying NVIDIA’s AutoGaze patch-selection model to Bench2Drive driving video — and finding that driving scenes are surprisingly compressible.

(Note: Sections of this post were produced with LLM assistance)

Vision transformers are powerful, but they’re expensive. A 224×224 image has 196 patches at 16×16 resolution — and if you’re processing 6 cameras at 10 Hz, that’s nearly 12,000 patch embeddings per second before you’ve even started thinking about temporal context. If we want E2E driving to run on lightweight edge devices, we will likely need to rebalance the compute budget around lighter perception encodings.
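The arithmetic behind those numbers, as a quick sanity check:

```python
# 224x224 image tokenized into 16x16 patches -> 14x14 = 196 patches per frame
patches_per_frame = (224 // 16) ** 2

# 6 cameras at 10 Hz -> patch embeddings per second, before temporal context
patches_per_second = patches_per_frame * 6 * 10
print(patches_per_frame, patches_per_second)  # 196 11760
```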

This post covers AutoGaze (CVPR 2026) — a lightweight patch selection model from NVIDIA Labs — and what happens when you point it at Bench2Drive driving video.

AutoGaze visualization: original, gazed patches, and VideoMAE reconstruction on Bench2Drive data
AutoGaze applied to a 64-frame front-camera clip from Bench2Drive. Left: original frame. Middle: patches selected by AutoGaze (non-selected regions darkened, selected patches outlined in red). Right: VideoMAE reconstruction from selected patches only. AutoGaze selected just 3.7% of available patches.

Chapter 1: What is AutoGaze?

AutoGaze is a patch selection model. Given an input video, it scans the frames causally and decides which patches are worth processing — and which ones can be dropped. The downstream vision encoder (SigLIP, VideoMAE, or whatever you’re using) then only processes the selected patches, skipping everything else.

The key ideas:

  • Causal design: AutoGaze processes frames left-to-right. It can cache previous frame context and process new frames one at a time — making it naturally compatible with streaming inference.
  • Two control knobs: gazing_ratio (maximum fraction of patches selected per frame) and task_loss_requirement (a reconstruction quality threshold — AutoGaze stops selecting patches once the remaining patches can be reconstructed below this loss). The model is trained to satisfy the reconstruction requirement with as few patches as possible.
  • Multi-scale: The model operates at four scales (32×32, 64×64, 112×112, 224×224), producing a combined 265 patch indices per frame.
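The 265 figure is just the sum of per-scale patch counts at a 16×16 patch size:

```python
# Number of 16x16 patches contributed by each of the four scales
scales = [32, 64, 112, 224]
per_scale = [(s // 16) ** 2 for s in scales]  # [4, 16, 49, 196]
total = sum(per_scale)
print(per_scale, total)  # [4, 16, 49, 196] 265
```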

The reconstruction model (VideoMAE_AutoGaze) is trained to reconstruct full frames from only the selected patches. This is both a training signal and a useful diagnostic — if the reconstruction looks good, the selected patches captured the essential information.


Chapter 2: Setup

The setup is straightforward but has a few gotchas.

git clone https://github.com/NVlabs/AutoGaze.git
uv venv .venv --python 3.11 --seed
source .venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

The main dependency pain point is flash-attn. The repo pins flash_attn==2.8.3, but there’s no prebuilt wheel for torch 2.11+cu128. The fix: install the flash_attn_4 beta wheel and update pyproject.toml to match:

pip install "https://github.com/Dao-AILab/flash-attention/releases/download/fa4-v4.0.0.beta4/flash_attn_4-4.0.0b4-py3-none-any.whl"
# change "flash_attn" → "flash_attn_4" in AutoGaze/pyproject.toml
uv pip install -e ./AutoGaze --no-build-isolation

The VideoMAE reconstruction model is stored as a raw PyTorch checkpoint (videomae.pt) under bfshi/VideoMAE_AutoGaze. It needs to be loaded manually — the checkpoint keys have a module.mae. prefix that needs to be stripped before calling load_state_dict.
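A minimal sketch of that key-stripping step, using a toy dict in place of the real torch.load result (the helper name and toy keys are illustrative; only the "module.mae." prefix comes from the checkpoint):

```python
def strip_prefix(state_dict, prefix="module.mae."):
    """Drop the wrapper prefix from checkpoint keys so they match the model."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}

# Toy stand-in for torch.load("videomae.pt", map_location="cpu")
ckpt = {"module.mae.encoder.layer.0.weight": 0, "module.mae.decoder_pred.bias": 1}
clean = strip_prefix(ckpt)
print(sorted(clean))  # ['decoder_pred.bias', 'encoder.layer.0.weight']
```

With the real checkpoint, the cleaned dict would then be passed to the reconstruction model’s load_state_dict.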


Chapter 3: Running AutoGaze on Bench2Drive

Our data is already in a convenient format: 224×224 JPEG frames extracted from CARLA, organized as {scenario}/camera/{camera}/{frame:05d}.jpg. AutoGaze runs on 16-frame chunks, so for 64 frames we process 4 chunks in sequence.
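The path construction and chunking can be sketched as follows (pure Python; the camera directory name "front" is an assumption, and actual frame loading is elided):

```python
# Build the 64 frame paths for one scenario/camera, then split into 16-frame chunks.
scenario = "Accident_Town04_Route159_Weather3"
camera = "front"          # assumed directory name
n_frames, chunk_size = 64, 16

paths = [f"{scenario}/camera/{camera}/{i:05d}.jpg" for i in range(n_frames)]
chunks = [paths[i:i + chunk_size] for i in range(0, n_frames, chunk_size)]
print(len(chunks), len(chunks[0]), chunks[0][0])
# 4 16 Accident_Town04_Route159_Weather3/camera/front/00000.jpg
```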

import torch

from autogaze.models.autogaze import AutoGaze, AutoGazeImageProcessor
from autogaze.datasets.video_utils import transform_video_for_pytorch

ag_transform = AutoGazeImageProcessor.from_pretrained("nvidia/AutoGaze")
ag_model = AutoGaze.from_pretrained("nvidia/AutoGaze").to("cuda").eval()

# 16 frames → (1, 16, 3, 224, 224)
video = transform_video_for_pytorch(raw_frames, ag_transform)[None].cuda()

with torch.inference_mode():
    gaze_out = ag_model(
        {"video": video},
        gazing_ratio=0.75,
        task_loss_requirement=0.7,
    )

# gaze_out["gazing_mask"][-1]  →  (B, T, 196)  per-frame patch mask at 224 scale
# gaze_out["gazing_pos"]       →  (B, K)       flat patch indices across full sequence

The reconstruction model takes the same gaze_out and reconstructs the frames:

mae_out = mae_model(
    pixel_values=video,
    gazing_info=gaze_out,
    frame_idx_to_reconstruct=torch.arange(16, device="cuda"),
    interpolate_pos_encoding=True,
)
# mae_out.reconstruction  →  (B, T, C, H, W)
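The coverage percentages quoted below are just selected-over-available ratios: 64 frames times 265 multi-scale patch slots per frame gives 16,960 available patches per clip. The bookkeeping is one line:

```python
# Coverage = selected patches / total available patches for a 64-frame clip.
def coverage(selected, n_frames=64, patches_per_frame=265):
    return selected / (n_frames * patches_per_frame)

print(f"{coverage(624):.1%}")   # 3.7%
print(f"{coverage(8950):.1%}")  # 52.8%
```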

Chapter 4: Results

We ran six settings of task_loss_requirement on the same 64-frame clip (Accident_Town04_Route159_Weather3, front camera, 4 × 16-frame chunks). Lower values demand better reconstruction and force more patches to be selected. Latency figures are measured on an NVIDIA RTX 3090. LPIPS is computed between the VideoMAE reconstruction and the original frame (lower is better).

| task_loss_requirement | Patches selected | AutoGaze | VideoMAE | Total | LPIPS |
| --- | --- | --- | --- | --- | --- |
| 0.3 (high coverage) | 8,950 / 16,960 (52.8%) | ~165 ms/frame | ~28 ms/frame | ~193 ms/frame | 0.096 |
| 0.5 (medium coverage) | 6,934 / 16,960 (40.9%) | ~129 ms/frame | ~24 ms/frame | ~153 ms/frame | 0.107 |
| 0.7 (low coverage) | 624 / 16,960 (3.7%) | ~42 ms/frame | ~12 ms/frame | ~54 ms/frame | 0.181 |
| 0.8 (minimal coverage) | 11 / 16,960 (<0.1%) | ~26 ms/frame | ~11 ms/frame | ~37 ms/frame | 0.232 |
| 0.9 (no coverage) | 0 / 16,960 (0%) | ~19 ms/frame | ~10 ms/frame | ~29 ms/frame | 0.270 |
| 1.0 (no coverage) | 0 / 16,960 (0%) | ~16 ms/frame | ~10 ms/frame | ~26 ms/frame | 0.300 |
AutoGaze low coverage (task_loss=0.7): 3.7% patches selected
Low coverage (task_loss_requirement=0.7): 3.7% of patches selected. AutoGaze finds the reconstruction requirement easy to satisfy — most of the scene is discarded.
AutoGaze medium coverage (task_loss=0.5): 40.9% patches selected
Medium coverage (task_loss_requirement=0.5): 40.9% of patches selected. Significantly more of the scene is retained, with the gaze distribution spreading more evenly across frames.
AutoGaze high coverage (task_loss=0.3): 52.8% patches selected
High coverage (task_loss_requirement=0.3): 52.8% of patches selected. Approaching half the available patches — diminishing returns compared to medium, with ~3× the latency of the low setting.

The 3.7% → 40.9% jump from lowering the threshold from 0.7 to 0.5 is striking. Driving scenes are apparently very compressible at loose reconstruction requirements — long stretches of uniform road, static sky, repetitive lane markings are easily satisfied. But demanding better fidelity quickly pulls in a much larger fraction of the scene.

Also notable: going from medium (40.9%) to high (52.8%) adds ~40ms/frame of total latency for just 12 extra percentage points of coverage. The curve is flattening — there’s diminishing return to tightening the requirement further.

At thresholds of 0.9 and 1.0, AutoGaze selects zero patches — the model is satisfied that the reconstruction requirement can be met without attending to anything. VideoMAE then reconstructs purely from its learned priors, with LPIPS of 0.27–0.30. The 0.8 threshold is a near-degenerate case (11 patches across 64 frames, ~0.17 per frame), with LPIPS of 0.23.

The LPIPS scores quantify what’s visible in the reconstructions: at 52.8% coverage the reconstruction is noticeably faithful (0.096), while at 3.7% the perceptual error is roughly double (0.181). The gap between 0.3 and 0.5 thresholds is small in LPIPS (0.096 vs 0.107) despite a 12-point difference in coverage, suggesting the marginal patches selected at 0.3 are mostly low-information regions.

The distribution within each chunk is also telling. The first frame receives nearly all the attention at low coverage, with later frames getting minimal patches — consistent with AutoGaze’s causal design. At higher coverage this evens out as AutoGaze has to look more carefully at each frame to meet the reconstruction target.


Chapter 5: How Few Patches Do You Need?

The task_loss_requirement sweep tells us about reconstruction quality at the thresholds AutoGaze naturally lands on — but it doesn’t give fine-grained control over how many patches are used. The second control knob, gazing_ratio, sets a hard per-frame cap on patch selection regardless of reconstruction quality. Sweeping it gives a clean patches-vs-LPIPS curve.

Line charts: LPIPS vs gazing_ratio (left) and LPIPS vs task_loss_requirement (right)
Left: reconstruction quality (LPIPS) as a function of gazing_ratio. Right: LPIPS as a function of task_loss_requirement. Annotations show actual patch coverage (%). Note the x-axes are inverted — left means more patches / stricter quality.
| gazing_ratio | Patches selected | AutoGaze | VideoMAE | LPIPS |
| --- | --- | --- | --- | --- |
| 0.75 | 8,950 / 16,960 (52.8%) | ~166 ms/frame | ~30 ms/frame | 0.096 |
| 0.40 | 2,988 / 16,960 (17.6%) | ~87 ms/frame | ~17 ms/frame | 0.135 |
| 0.20 | 132 / 16,960 (0.8%) | ~46 ms/frame | ~13 ms/frame | 0.176 |
| 0.10 | 23 / 16,960 (0.1%) | ~26 ms/frame | ~11 ms/frame | 0.263 |
| <0.10 | 0 / 16,960 (0%) | | | |

Below gazing_ratio=0.10 the model generates EOS immediately for every frame — the per-frame token budget (e.g. 1–13 slots) is apparently too small for AutoGaze to commit to any patch, so nothing is selected. The model has a floor.

The useful range is 0.10–0.75. Within it, the sharpest quality gain per patch comes between 0.20 and 0.40: jumping from 0.8% to 17.6% coverage cuts LPIPS from 0.176 to 0.135 — a meaningful improvement at a cost of ~40ms/frame.

gazing_ratio=0.10: 0.1% patches selected
gazing_ratio=0.10: 23 patches selected across 64 frames (~0.36 per frame). Reconstruction is sparse but retains some coarse scene structure.
gazing_ratio=0.20: 0.8% patches selected
gazing_ratio=0.20: 132 patches across 64 frames (~2 per frame). Reconstructions start showing road geometry clearly.
gazing_ratio=0.40: 17.6% patches selected
gazing_ratio=0.40: 2,988 patches across 64 frames (~47 per frame). Good reconstruction fidelity at roughly a third of the latency of full coverage.
gazing_ratio=0.75: 52.8% patches selected
gazing_ratio=0.75: 8,950 patches across 64 frames (~140 per frame). Best reconstruction quality in the sweep.

Chapter 6: What the Reconstructions Tell Us

Looking at the visualization above, the reconstruction captures the coarse structure of the scene — road layout, sky, general geometry — but misses fine details. That’s expected given 3.7% patch coverage. Whether this level of reconstruction fidelity is sufficient for a downstream driving model is an open question.

What the reconstruction is actually useful for here is as a diagnostic: it confirms that the selected patches aren’t garbage. The model is genuinely identifying the informative regions.

One thing that isn’t yet resolved: the VideoMAE checkpoint has scale_embed and time_embed weights that aren’t in the base facebook/vit-mae-large architecture. These are loaded via key-stripping from the checkpoint. Whether they’re loading correctly is worth verifying more carefully before using the reconstruction as a ground truth.


Chapter 7: Real-World Dashcam

To check how the model transfers beyond CARLA simulation, we applied it to a real-world front-facing dashcam clip. The video (768×432, 30fps) is downsampled to 8fps and resized to 224×224 before being passed to AutoGaze, matching the model’s native resolution. We ran the full gazing_ratio sweep on this clip.
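One simple way to do the 30fps-to-8fps downsampling is nearest-index striding; the post doesn’t specify the exact resampler, so this is an assumption:

```python
# Subsample a 30 fps clip to 8 fps by picking the nearest source frame index.
src_fps, dst_fps = 30, 8
n_src = 90  # stand-in source frame count (3 seconds of video)

n_dst = n_src * dst_fps // src_fps
keep = [round(i * src_fps / dst_fps) for i in range(n_dst)]
print(n_dst, keep[:4])  # 24 [0, 4, 8, 11]
```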

| gazing_ratio | Patches selected | AutoGaze | LPIPS |
| --- | --- | --- | --- |
| 0.75 | 54.4% | ~151 ms/frame | 0.097 |
| 0.40 | 22.7% | ~83 ms/frame | 0.123 |
| 0.20 | 3.9% | ~48 ms/frame | 0.158 |
| 0.10 | 0.4% | ~24 ms/frame | 0.210 |
| <0.10 | 0% | | |

The model floor (zero patches selected) sits at the same gazing_ratio<0.10 as on CARLA. Above the floor, coverage is notably higher on real dashcam footage than on CARLA at the same ratio — 22.7% vs 17.6% at ratio=0.40, and 3.9% vs 0.8% at ratio=0.20 — suggesting real road scenes have more spatial complexity that requires attending to more patches to meet reconstruction quality.

AutoGaze dashcam, gazing_ratio=0.10: 0.4% patches
Minimum useful coverage: gazing_ratio=0.10, 0.4% of patches selected. Just a handful of patches per frame — the model is barely looking at anything, yet the reconstruction retains coarse structure.
AutoGaze dashcam, gazing_ratio=0.20: 3.9% patches
gazing_ratio=0.20: 3.9% patches. Road geometry and the leading vehicle are already visible in the reconstruction.
AutoGaze dashcam, gazing_ratio=0.40: 22.7% patches
gazing_ratio=0.40: 22.7% patches. Good reconstruction fidelity — road markings, sky, and vehicles are clearly reconstructed.
AutoGaze dashcam, gazing_ratio=0.75: 54.4% patches
gazing_ratio=0.75: 54.4% patches. Best quality in the sweep.

Effect of chunk size (temporal context window)

The autoregressive gaze decoder attends to all prior frames within a chunk. Reducing chunk size limits how far back it looks, which should speed up inference — but how much does it matter in practice?

Chunk size vs latency and LPIPS
AutoGaze latency (orange) and LPIPS (blue dashed) as a function of chunk size on the dashcam clip. AutoGaze latency barely changes across chunk sizes — the bottleneck is per-frame gaze generation, not the context window. VideoMAE benefits more from smaller chunks.
| Chunk size | AutoGaze | VideoMAE | Total | LPIPS |
| --- | --- | --- | --- | --- |
| 2 | ~153 ms/frame | ~18 ms/frame | ~171 ms/frame | 0.095 |
| 4 | ~149 ms/frame | ~17 ms/frame | ~166 ms/frame | 0.094 |
| 8 | ~151 ms/frame | ~21 ms/frame | ~172 ms/frame | 0.095 |
| 16 | ~168 ms/frame | ~30 ms/frame | ~198 ms/frame | 0.096 |

AutoGaze latency is nearly flat across chunk sizes — the bottleneck is the number of gaze tokens generated per frame, not the context window. VideoMAE does benefit from smaller chunks (30ms→17ms/frame from chunk 16→4), because it reconstructs all frames in a chunk jointly. LPIPS is stable throughout, confirming that reducing the context window doesn’t hurt reconstruction quality. Chunk size 4 is the practical recommendation: same quality, ~30ms/frame faster total.


Chapter 8: What’s Next

The immediate application is building this into the E2E video pipeline. Rather than feeding full dense patch sequences to a vision backbone, we can run AutoGaze first and only process the selected patches — which at 3.7% selection on driving data would cut the backbone’s patch count by roughly 27×.

A few open questions:

  • Does the patch selection preserve what matters for driving? Vehicles, pedestrians, and lane markings might not be well-represented at 3.7% coverage. Worth checking by overlaying gaze masks on semantically labeled frames.
  • Does the causal streaming mode help at inference? The model supports KV-cache streaming — useful if we want per-frame gaze decisions without reprocessing the full 16-frame window.
  • Latency at scale. AutoGaze at chunk_size=4 runs at ~150ms/frame on an RTX 3090. With 6 cameras this would need to run in parallel or be optimized significantly for real-time use.
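The last bullet is easy to quantify with the numbers already in this post (hardware-dependent, of course):

```python
# Real-time budget: 6 cameras at 10 Hz means 60 frames must clear per second.
ms_per_frame = 150          # AutoGaze at chunk_size=4 on an RTX 3090
frames_per_second = 6 * 10  # 60 frames/s across all cameras

budget_ms = 1000 / frames_per_second       # ~16.7 ms allowed per frame
speedup_needed = ms_per_frame / budget_ms  # how far we are from real time
print(round(budget_ms, 1), round(speedup_needed))  # 16.7 9
```

So even before the downstream encoder, serialized AutoGaze inference would need roughly a 9× speedup, or per-camera parallelism, to hit real time at this sensor rate.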

The code for the inference pipeline and visualization is at vid_comp (GitHub).