AV Blog 3: End-to-End Trajectory Planning from Vision

19 Mar 2026 on Autonomous-vehicles, Computer-vision, Transformers, Trajectory-planning, Ablation-study

Designing and ablating a suite of neural planners — from a pure-kinematics MLP baseline through vision-augmented transformer architectures with auxiliary depth supervision.

With depth and segmentation extraction covered in the last post, it’s time to tackle the core problem: end-to-end trajectory planning. Given sensor history, where should the car go next?

This post introduces six model architectures of increasing complexity, and lays out the ablation study designed to understand which components of a vision-augmented planner actually matter.

Chapter 1: The Task

The planner takes in a window of recent sensor data and must predict the next 50 future waypoints of the ego vehicle. Inputs available at each timestep:

Input	Shape	Description
`past_traj`	(B, 41, 2)	Last ~4s of ego (x, y) positions
`speed`	(B, 41)	Ego speed per timestep
`acceleration`	(B, 41, 3)	3-axis acceleration per timestep
`command`	(B,)	High-level driving command (turn left/right, go straight, follow lane)
`images`	(B, C, 3, 224, 224)	Onboard camera frames — 1 (front) or 6 (surround)

The planner outputs a future trajectory: (B, 50, 2) waypoints over the next ~5 seconds.

Why 41 past steps and 50 future steps? The Bench2Drive dataset runs at ~10Hz. 41 frames ≈ 4 seconds of history; 50 frames ≈ 5 seconds of future — long enough to capture a full lane change or intersection crossing.

Chapter 2: The Architectures

Six architectures are implemented, progressively adding components. Each builds intuition for what contributes to planning performance.

Architecture 1: MLP Planner

The simplest possible baseline. All kinematic inputs are flattened into a single vector and fed through a deep MLP.

MLP Planner architecture diagram — Figure 1: The MLP Planner flattens all kinematic inputs — trajectory, speed, acceleration, and one-hot command — into a single 286-dim vector and passes it through 4 hidden layers to predict 50 future waypoints.

Key design choices:

Input: flatten past_traj (82) + speed (41) + acceleration (123) + one-hot command (4) = 286-dim vector
4× [Linear(256) → LayerNorm → GELU → Dropout(0.1)]
Output: Linear(100) reshaped to (50, 2) waypoints
Parameters: 289K

This model is vision-free. It asks: how well can we plan from kinematics alone? It also gives us a compute and accuracy floor for every subsequent architecture.

Architecture 2: Transformer Planner

The same kinematic inputs — but rather than flattening, we preserve the temporal structure of the 41-step history. Each timestep becomes a token; a transformer encoder learns temporal relationships, and a decoder with learned query embeddings generates the future trajectory.

Transformer Planner architecture diagram — Figure 2: The Transformer Planner preserves the temporal axis of kinematic data. Each past timestep is encoded as a 10-dim token (x, y, speed, acc_xyz, command), projected to d=128, and processed by a 3-layer transformer encoder. A 3-layer decoder with 50 learned query embeddings auto-regressively attends to the encoded history.

Key design choices:

Per-timestep token: [x, y, speed, acc_x, acc_y, acc_z, cmd_0..3] = 10-dim
Input projection: Linear(10 → 128) + sinusoidal positional encoding
3× TransformerEncoderLayer (4 heads, FFN=512)
3× TransformerDecoderLayer with 50 learned query embeddings
Output: Linear(128 → 2) per query
Parameters: 1.4M

Ablation question: Does temporal attention over kinematics outperform the MLP’s flattened view?

Architecture 3: ResNet Planner

Now we add vision. The front camera is encoded with a frozen, pretrained ResNet50 backbone. Visual features are projected to token embeddings and fused with kinematic memory inside a cross-attention decoder.

ResNet Planner architecture diagram — Figure 3: The ResNet Planner uses a frozen ResNet50 to extract visual features from the front camera. Feature maps at 7×7 (and optionally 56×56 through 7×7 multiscale) are projected to d=128 tokens. A kinematic encoder processes trajectory history, and a FlexDecoder cross-attends to both visual and kinematic sources to predict the trajectory.

Key design choices:

Frozen ResNet50: no gradient flows into the backbone — visual features are treated as a fixed prior
Supports single-scale (7×7 = 49 tokens from layer4) or multiscale (56²+28²+14²+7² = 4,410 tokens)
KinematicEncoder: 2-layer transformer encoder over the kinematic sequence
FlexDecoderLayer: self-attention on future queries + cross-attention to [visual tokens kinematic tokens]
Parameters: 33.2M (dominated by frozen ResNet50 ~25M)

Ablation questions:

Does vision from ResNet50 improve over kinematics-only?
Does multiscale visual context (small objects at higher resolution) improve planning?

Architecture 4: Front Camera Planner

Swaps the ResNet50 backbone for a frozen TinyViT — a lightweight Vision Transformer that generates richer spatial tokens with less compute than a full ViT. Still limited to the single front camera.

Front Camera Planner architecture diagram — Figure 4: The Front Camera Planner uses TinyViT as the visual backbone (frozen). Multiscale visual tokens are combined with kinematic encoder output and decoded by a FlexDecoder to predict 50 future waypoints. 2D sincos positional encodings are used for visual tokens; 1D sincos for kinematics.

Key design choices:

Frozen TinyViT: produces richer semantic tokens than ResNet’s convolutional features
2D sinusoidal positional encodings for spatial awareness in the visual tokens
Otherwise structurally identical to ResNetPlanner
Parameters: 29.6M (dominated by frozen TinyViT ~28M)

Ablation question: Does TinyViT’s attention-based feature extraction improve over ResNet50’s convolution-based features for trajectory planning?

Architecture 5: Front Camera + Depth Planner

Adds an auxiliary depth estimation head to the Front Camera Planner. The hypothesis: training to predict depth forces the visual encoder tokens to encode geometry, which then benefits trajectory planning.

Front Camera Depth Planner architecture diagram — Figure 5: The Front Camera Depth Planner adds a separate depth decoder head alongside the trajectory decoder. Both share the same frozen TinyViT visual tokens and kinematic encoder. Depth supervision is applied when ground-truth depth maps are available in the batch.

Key design choices:

Trajectory decoder and depth decoder are separate (no shared weights beyond the backbone tokens)
Depth head outputs (B, 1, 224, 224) — a full-resolution depth map
Depth loss is only applied when depth ground truth is available (curriculum-friendly)
At inference, only the trajectory head is used
Parameters: 35.7M (+6.1M depth decoder over Front Cam Planner)

Ablation questions:

Does geometric auxiliary supervision improve trajectory accuracy?
Is there a trade-off where depth training hurts or helps the planning objective?

Architecture 6: Vision Transformer Planner (Full Model)

The full architecture: six surround cameras, multiscale TinyViT features, kinematic encoder, and optional auxiliary depth and semantic segmentation heads — all unified under a shared decoder framework.

Vision Transformer Planner architecture diagram — Figure 6: The Vision Transformer Planner processes all 6 onboard cameras with a shared frozen TinyViT backbone. Multiscale features from all cameras are pooled and combined with kinematic encoding. Separate decoders predict the future trajectory (primary), depth map, and semantic segmentation (auxiliary).

Key design choices:

6 cameras: front, front-left, front-right, rear, rear-left, rear-right
Shared TinyViT backbone across all cameras (weight sharing)
Multiscale tokens: 4 levels (56×56, 28×28, 14×14, 7×7) from all cameras
3 decoder heads: trajectory, depth, segmentation
2D sincos pos enc applied per-camera, 1D sincos for kinematics
Auxiliary heads detached from trajectory decoder (no gradient leakage)

Ablation questions:

Does surround-view context (6 cameras vs 1) improve planning?
Does adding semantic segmentation as an auxiliary task help further?

Chapter 3: Ablation Study Design

The six architectures are not arbitrary — they form a controlled ablation ladder across three independent axes.

Axis 1: Visual Input

Model	Vision	Cameras	Parameters	Latency (ms)	FPS	Avg L2 (↓)
MLP Planner	None	—	289K	0.4	2550	—
Transformer Planner	None	—	1.4M	2.3	440	—
ResNet Planner	ResNet50 (frozen)	1 (front)	33.2M	7.7	130	—
Front Cam Planner	TinyViT (frozen)	1 (front)	29.6M	8.8	113	—
FrontCam+Depth Planner	TinyViT (frozen)	1 (front)	35.7M	11.9	84	—
ViT Planner (full)	TinyViT (frozen)	6 (surround)	41.8M	66.3	15	—
ThinkTwice (SOTA)	—	—	—	—	—	0.95

Avg L2: mean L2 distance (meters) between predicted and ground-truth waypoints on the Bench2Drive validation split. Lower is better. SOTA is 0.95m from ThinkTwice. Results populated as training completes. Latency measured at batch size 1 on an RTX 3090 (fp32); Bench2Drive runs at 10 Hz so ≥10 FPS is required for real-time inference.

Controls: does vision help planning? Does more coverage (surround vs front only) help?

Axis 2: Backbone Architecture

Backbone	Type	Parameters	Frozen?
None (MLP/Transformer)	—	—	—
ResNet50	CNN	~25M	Yes
TinyViT	ViT	~28M	Yes

Control: CNN features vs. attention-based features — which provides better planning cues?

Axis 3: Auxiliary Supervision

Model	Depth Head	Segmentation Head
Front Cam Planner	No	No
FrontCam+Depth Planner	Yes	No
ViT Planner (full)	Yes	Yes

Control: does geometric/semantic auxiliary supervision act as a regularizer that improves trajectory prediction?

Metrics

All models are evaluated on the Bench2Drive validation split using:

L2 displacement error at 1s, 2s, 3s, 5s horizons (meters)
Final Displacement Error (FDE) at 5s horizon
Collision rate (% of rollouts with ego-obstacle contact, when available)

Auxiliary task metrics (depth AbsRel, segmentation weighted IoU) are tracked separately to monitor auxiliary head quality and its correlation with trajectory performance.

Hypotheses

Vision helps — even a single front camera should give meaningful improvement over kinematics-only planners, especially in intersection and lane-change scenarios.
TinyViT > ResNet50 for this task — attention-based spatial features should encode scene geometry better than convolutional features.
Depth auxiliary supervision helps — geometric supervision should force the shared tokens to encode 3D structure, improving obstacle avoidance.
Surround cameras help at longer horizons — at 1s horizon, forward camera may be enough; at 5s, knowing what’s behind and to the side matters.
Segmentation auxiliary has diminishing returns — semantic labels are noisier than depth for trajectory tasks; this one we’re less sure about.

Chapter 4: What’s Next

With the architectures designed and the ablation storyboarded, the next steps are:

Train all six models to convergence under identical hyperparameters
Evaluate on the validation set across all displacement horizons
Visualize planned trajectories on held-out scenarios — especially failure cases
Quantify the contribution of each ablation axis in an aggregated table

The goal is not just a best-performing model, but a clear understanding of which components earn their compute.

Follow along with the code:

E2E Driving (GitHub)

Chapter 1: The Task

Chapter 2: The Architectures

Architecture 1: MLP Planner

Architecture 2: Transformer Planner

Architecture 3: ResNet Planner

Architecture 4: Front Camera Planner

Architecture 5: Front Camera + Depth Planner

Architecture 6: Vision Transformer Planner (Full Model)

Chapter 3: Ablation Study Design

Axis 1: Visual Input

Axis 2: Backbone Architecture

Axis 3: Auxiliary Supervision

Metrics

Hypotheses

Chapter 4: What’s Next

Templates:

Error