AV Blog 6: [Paper Review: RAP 3D Rasterization Augmented End-to-End Planning]

[Three major contributions to e2e planning.]

Links: GitHub | Project Page | ArXiv

RAP is the state-of-the-art system for closed-loop E2E planning across four major benchmarks as of February 7, 2026. These impressive results come from three pieces: a scalable 3D rasterization pipeline which renders driving scenes using geometric primitives derived from annotations, an alignment module which unifies the perception of real and primitive visual features, and a rasterization-augmented E2E planning framework which builds on imitation learning.


Chapter 1: [Introduction / Background]

3D Rasterization Pipeline

Driving scenes are reconstructed from annotated primitives (lane polylines, agent cuboids); the rendering is training-free, fast, and controllable, while still encoding the cues necessary for driving. This allows additional training augmentations such as:

  1. Perturbing ego trajectories (in the sequence history) to simulate recovery maneuvers. This addresses the brittleness of imitation learning (IL) approaches.
  2. Cross-agent view synthesis re-renders scenes from other agents' viewpoints, expanding scale and interaction diversity.
  3. Raster-to-real alignment encodes both real and rasterized scenes into a shared latent of semantic and geometric features - allowing perception features learned on the lightweight rasterized training scenes to carry over to real driving.

The authors show that this pipeline, dubbed Rasterization-Augmented Planning (RAP), benefits multiple planners - making it agnostic to model design.

Research note: Since RAP is model-agnostic, maybe I can build an ablation over the planners that consume RAP features. Given this RAP-data paradigm, we can show how different planning architectures and approaches yield incremental improvements on closed-loop driving benchmarks.

Related Work

E2E planning: E2E motion planners map sensory observations directly to future trajectories. Reinforcement learning approaches [citations] remain bottlenecked by sample inefficiency and the need for scalable training environments. Imitation learning from real-world expert driving logs [citations] achieves strong open-loop (waypoint prediction) evaluations but struggles to recover from mistakes and fails to generalize to long-tail (rare) events. Many works [citations] introduce adversarial scene generation restricted to bird's-eye-view scenes - the good news is that RAP can generate many perspectives from each of these scenarios!

Rendering for Driving: MetaDrive (Li 2023b) synthesizes digital scenes from real-world logs. Neural rendering approaches like NeRF and 3D Gaussian Splatting reconstruct logs with high fidelity and support counterfactual replay. Editor's Note: counterfactuals are hypothetical scenes in which the ego could have taken another path. Practically, I believe this refers to modifying the past kinematics while retaining the desired future waypoints. DAgger (Ross 2011) motivated why off-expert states matter. Counterfactual scenarios with perturbed historical tracks (used to predict future tracks) can be generated by rendering the ego view from those alternative prior positions. NeRF and Gaussian Splatting could be used for this, but in RAP the generation is much cheaper. They may do this on-the-fly in the training loop?

Occupancy-based reconstructions predict whether each voxel in a 3D grid is occupied or unoccupied, and require dense labeling. Attempts at training planners with these renderings, like RAD, remain small-scale.

Research Note: Is RAP so light as to enable reinforcement learning or some kind of online, continuous learning in simulated scenes? Could you directly train for closed-loop performance in this way?

Chapter 2: [Method]

The 3D rasterization design prioritizes rendering speed and scalability. Instead of photorealism, the focus is on preserving the geometric and semantic cues which DO affect driving - rather than textures and lighting.

Map elements (road surfaces, crosswalks, lanes) are polylines (sets of line segments) in the 3D world. Traffic objects like vehicles, bicycles, and traffic cones are approximated by oriented cuboids: an xyz position and rotation at time T, plus length-height-width extents. Traffic lights are red/yellow/green cuboids.

The world scene can be projected to the image plane by modeling the ego camera. The points are then rasterized into an RGB canvas using depth-aware composition.
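As a concrete sketch of those two steps (my reconstruction, not the authors' code - the camera model, canvas size, and point-splat rendering are all assumptions): project camera-frame 3D points through pinhole intrinsics, then paint them far-to-near so nearer primitives overwrite farther ones, which is a simple form of depth-aware composition.

```python
import numpy as np

def project_points(points_cam, K):
    """Project Nx3 camera-frame points to pixel coords; return (uv, depth)."""
    depth = points_cam[:, 2]
    uv_h = (K @ points_cam.T).T          # Nx3 homogeneous pixel coords
    uv = uv_h[:, :2] / depth[:, None]    # perspective divide
    return uv, depth

def rasterize(points_cam, colors, K, h, w):
    """Depth-aware composition via painter's algorithm on a blank RGB canvas."""
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    uv, depth = project_points(points_cam, K)
    order = np.argsort(-depth)           # far points first, near painted last
    for i in order:
        if depth[i] <= 0:                # skip points behind the camera
            continue
        u, v = int(round(uv[i, 0])), int(round(uv[i, 1]))
        if 0 <= u < w and 0 <= v < h:
            canvas[v, u] = colors[i]
    return canvas
```

In a full renderer the primitives (polylines, cuboid faces) would be densified into many such points or filled polygons, but the depth-ordering idea is the same.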

Data Augmentations via 3D Rasterization

Perturb the logged ego trajectory - modify the trajectory, then re-render via rasterization to get counterfactual scenes where the ego vehicle drifted from the previous expert path. This teaches recovery from distribution shift and improves closed-loop evals. (Would make a good ablation result.)
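A hedged sketch of one such perturbation (my illustration; the paper's exact perturbation model may differ): ramp a lateral offset up over the logged history so the current pose sits off the expert path, while the supervision targets remain the expert future - forcing the planner to learn the recovery maneuver.

```python
import numpy as np

def perturb_history(history_xy, lateral_offset):
    """Laterally offset a (T, 2) ego history, ramping from zero at the
    oldest step to the full offset at the current step. The unmodified
    expert future then acts as a recovery target."""
    T = len(history_xy)
    ramp = np.linspace(0.0, 1.0, T)     # 0 at oldest step, 1 at current step
    perturbed = history_xy.copy()
    perturbed[:, 1] += lateral_offset * ramp
    return perturbed
```

The perturbed history is then re-rendered via the rasterizer from the shifted poses, which is what makes this augmentation cheap compared to neural rendering.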

Cross-agent view synthesis. Each nuPlan traffic scenario has N agent trajectories - which can be turned into N ego trajectories for training. The authors report over 500k rasterized training samples covering more viewpoints, interactions, and (rare!) recovery scenarios.

Raster-to-Real Alignment

A visual encoder extracts projected features from both a real sample and a paired rasterized rendering. (Are the "real samples" from a driving simulator, so we have ground-truth object positions?)

Under the image encoder, the raster features serve as the target for a projection of the real features. This means that at deployment time, incoming real features are projected to the same place the perception latents occupied during training. The MSE between the two latents is the loss for this training objective. This design is clever because raster features are high-quality, clean signals: they cut out irrelevant scene noise and contain only the important visual features!
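The objective above can be sketched in a few lines (the linear projection head and feature shapes are my assumptions; only the raster-as-target MSE is stated in the text):

```python
import numpy as np

def alignment_loss(real_feats, raster_feats, W):
    """MSE between projected real features and frozen raster targets.

    real_feats, raster_feats: (N, D) feature matrices from the encoder.
    W: (D, D) weights of a lightweight learned projection head.
    """
    projected = real_feats @ W           # project real features...
    diff = projected - raster_feats      # ...onto the raster feature targets
    return float(np.mean(diff ** 2))
```

Gradients would flow only through the real branch and the projection head, keeping the clean raster latents fixed as the anchor.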

Research note: NavSim2 (https://arxiv.org/pdf/2506.04218) looks very interesting - it makes a tradeoff between expensive closed-loop interactive evals and cheap but less meaningful open-loop model predictions. Super recent work. Documentation on GitHub and in the paper. Public leaderboard. Also: the WOD-E2E Driving Challenge (Waymo Open Dataset).

There's a section here on global alignment, which aligns rasterized and real images even when real images are lacking for some rasterized ones. They add a domain-adaptation auxiliary loss.


Chapter 4: [Experiments]

Rasterized data is built from OpenScene - a subset of nuPlan containing 1200 hours of annotated driving logs. 120 hours provide ego-centric real camera sensors (hence the global alignment step). Both ego AND other agents' trajectories are rasterized across all 1200 hours - and ego trajectories are perturbed in 10% of ego logs.

Extract 7-second clips where 2 seconds are input and 5 seconds are output. (This is a departure from how I was using Bench2Drive. I also wonder about varying both of these values to bootstrap more data for sequence-to-sequence prediction?)
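The clip split is simple enough to sketch (only the 2 s / 5 s figures come from the paper; the 2 Hz frame rate and the sliding hop are my assumptions for illustration):

```python
def extract_clips(frames, fps=2, ctx_s=2, tgt_s=5, hop=1):
    """Slide a (ctx_s + tgt_s)-second window over a frame sequence and
    split each window into (context, target) halves."""
    ctx_n, tgt_n = ctx_s * fps, tgt_s * fps
    win = ctx_n + tgt_n
    clips = []
    for start in range(0, len(frames) - win + 1, hop):
        window = frames[start:start + win]
        clips.append((window[:ctx_n], window[ctx_n:]))
    return clips
```

Varying `ctx_s`/`tgt_s` here is exactly the knob the note above wonders about for bootstrapping more sequence-to-sequence pairs.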

A Planning-Aware Driving Metric Score (PDMS) filtering strategy removes trivial cases where a constant-velocity baseline already scores well.

For a vision encoder, they build RAP-DINO, a DINOv3-H 888M-param model. For closed-loop inference on Bench2Drive, they introduce RAP-ResNet with 29M params. They also show RAP benefits RAP-iPAD and RAP-DiffusionDrive.

A multi-modal trajectory head predicts future waypoints. A trajectory scoring head scores candidate trajectories as an auxiliary loss, using PDMS scores as labels. (I should be doing this.)
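A hedged sketch of that auxiliary objective (the paper only states that predicted scores are supervised by PDMS labels; the per-mode MSE regression and argmax mode selection are my guesses at the details):

```python
import numpy as np

def scoring_loss(pred_scores, pdms_labels):
    """MSE between predicted per-mode scores and PDMS labels, both (M,)."""
    return float(np.mean((pred_scores - pdms_labels) ** 2))

def select_mode(trajectories, pred_scores):
    """At inference, return the (T, 2) waypoints of the best-scored mode."""
    return trajectories[int(np.argmax(pred_scores))]
```

The appeal is that the planner internalizes the eval metric: the scoring head learns to rank its own modes by expected PDMS rather than only imitating waypoints.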

Training on 4×H100 GPUs takes 80 hours. At about $8/hr for 80 hours, that's roughly $640 (vast.ai spot prices). Not bad!


Chapter 5: [Benchmarks, Leaderboards]

NAVSIM v1/v2: evaluates the ability to produce a 4-second future trajectory (at 2 Hz, that's 8 waypoints) from 2 seconds of historical ego states and multi-view cameras. The metric is PDMS (from above?), which aggregates sub-metrics like no collisions, drivable area compliance, time-to-collision, ego progress, and comfort (lol).

DAC: staying on the road, staying out of bike lanes

TTC: How close did you come (in terms of seconds to collision based on trajectories) to impacting?

Ego Progress (EP): Did you get somewhere? We need to actually drive to make progress. Not 1m/s, either!

Comfort: Jerkiness, slamming the brakes, swerving. Measured via kinematics.
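Putting the sub-metrics above together, a hedged sketch of how they aggregate into PDMS: hard-gate terms (collision, drivable area) multiply the score, soft terms enter a weighted average. The 5/5/2 weights follow the NAVSIM PDM score as I recall it - verify against the benchmark code before relying on them.

```python
def pdms(nc, dac, ttc, ep, comfort):
    """All inputs in [0, 1]. nc (no collision) and dac (drivable area
    compliance) act as multiplicative penalties; ttc, ep, comfort are
    averaged with assumed weights 5/5/2."""
    weighted = (5 * ttc + 5 * ep + 2 * comfort) / 12
    return nc * dac * weighted
```

The multiplicative gates mean a single at-fault collision zeroes the whole score, no matter how comfortable or fast the drive was.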

NavSim2 Adds:

Traffic light compliance, driving direction compliance, lane keeping, extended comfort (more comfort like speed oscillation and high acceleration).

NAVSIM v2 also introduces Gaussian splatting in a counterfactual-like synthetic world for evaluating out-of-distribution routes without full simulation.

Ablations

3D rasterization design

Ablation on recovery-oriented perturbations: they compare using vs. not using perturbations. I wonder about the effect of the number of perturbations, N? I could try it.

Ablation on R2R alignment.

Ablation on cross-agent view synthesis: adding 1k, 10k, 100k, 500k, 1M synthetic samples of other vehicles reveals a log-scaling trend - a strong positive relationship with diminishing returns.


Appendix

They note that you could align the visual features raster-to-real (make the raster embedding look like the real one), symmetric (make them share a latent, training both encoders), or real-to-raster (make the real embedding look like the rasterized one). They find real-to-raster performs best downstream.

The code is at [Repo Name] (GitHub).