4. ● Kaggle: Lyft Motion Prediction for Autonomous Vehicles
● l5kit data home page: Data - Lyft
Competition/Dataset page
5. ● Focus on “Motion Prediction” part
○ Given a bird's-eye-view image (no natural images)
○ Predict 3 possible trajectories with confidences.
Competition introduction
Competition scope. Image from https://self-driving.lyft.com/level5/data/
6. ● It focused on the “Perception” part
○ https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles
○ Detect cars as 3D objects
Last year competition: Lyft 3D Object Detection
Image from https://self-driving.lyft.com/level5/data/ Image from https://www.kaggle.com/tarunpaparaju/lyft-competition-understanding-the-data
7. ● Information in the bird's-eye view
○ Labels of agents (e.g. car, bicycle, and pedestrian...)
○ Traffic light status
○ Road information (e.g. pedestrian crossings and directions)
○ Location and timestamp...
Competition introduction
This information can be gathered into a single image using the l5kit library
8. ● Total dataset size: 1118 hours, 26344 km
● Road length: 6.8 miles
● Train (89 GB), validation (11 GB), and test (3 GB) datasets:
○ Big data: approx. 200M / 190K / 71K agents whose motion must be predicted.
Lyft level5 Data description
Image from https://arxiv.org/pdf/2006.14480.pdf
“One Thousand and One Hours: Self-driving Motion Prediction Dataset”
10. ● Route on Google Maps
● Not a long distance, around the Lyft office (in fact, a CNN can “memorize” the place from the image)
EDA using google earth
Annotations: 1. Station, 2. Intersection (paper figure), 3. Signals
11. ● Many straight roads
● Some complicated intersections...
EDA using google earth
12. ● More and more EDA: train/valid/test statistics are almost the same!
No extrapolation found in this dataset…
○ Agent type distribution: CAR 91%, CYCLIST 2%, PEDESTRIAN 7%
○ Date: from October 2019 to March 2020
○ Time: daytime, from 7am to 7pm
○ Place: all roads are included in train/valid/test
● Less effort was needed on “how to handle & train the data”
→ Pure programming skill & ML techniques were what mattered.
More EDA, No extrapolation found in this dataset...
Time/Date distributions: https://www.kaggle.com/c/lyft-motion-prediction-autonomous-vehicles/discussion/189516
14. ● Structured numpy arrays + zarr are used to save the data on disk.
● structured array: https://numpy.org/doc/stable/user/basics.rec.html
Raw Data format
● zarr: https://zarr.readthedocs.io/en/stable/
○ It can store structured arrays on disk
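As a rough sketch of what a structured-array record looks like (field names and shapes here are illustrative, not l5kit's exact AGENT_DTYPE):

```python
import numpy as np

# Simplified sketch of an agent record as a numpy structured array
# (illustrative fields, not l5kit's exact AGENT_DTYPE).
AGENT_DTYPE = np.dtype([
    ("centroid", np.float64, (2,)),  # x, y in world coordinates
    ("extent", np.float32, (3,)),    # length, width, height
    ("yaw", np.float32),             # heading angle [rad]
    ("track_id", np.uint64),
])

agents = np.zeros(3, dtype=AGENT_DTYPE)
agents["yaw"] = [0.0, 0.5, -0.5]
agents["track_id"] = [1, 2, 3]

# zarr can persist such structured arrays on disk, e.g.:
#   z = zarr.open("agents.zarr", mode="w", shape=agents.shape, dtype=AGENT_DTYPE)
#   z[:] = agents
print(float(agents["yaw"][1]))
```

Fields are then accessed by name (`agents["yaw"]`), which is how l5kit exposes frames and agents.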
15. ● l5kit is provided as baseline: https://github.com/lyft/l5kit
○ The (complicated) data preprocessing part is already implemented
○ Rasterizer
■ Semantic → protocol buffers are used inside MapAPI to draw the semantic map
■ Satellite → draws the satellite image.
● Most Kaggle competitions: 0 → 1
This competition: 1 → 10
L5kit library
[Pipeline figure: raw data (zarr: world coordinates over time, extent (size), yaw) → Rasterizer (base implementation provided by Lyft) → CNN → predicted future coordinates (3 trajectories). Typical approach already supported by l5kit.]
18. ● 1. Use train_full.zarr
● 2. l5kit==1.1.0
● 3. Set min_history=0, min_future=10 in AgentDataset
● 4. Cosine annealing, decaying the LR to 0 over 1 epoch of training
→ That’s enough to win a prize! (Private LB: 10.274)
● 5. Ensemble with GMMs (Gaussian Mixture Models)
→ Further boosted the score by 0.8 (Private LB: 9.475)
Short Summary
20. ● How to predict probabilistic behavior?
● Suggested baseline kernel “Lyft: Training with multi-mode confidence”
○ A single model outputs 3 trajectories together with their confidences
○ Train directly with the competition evaluation metric as the loss
○ The 1st place solution also originated from our approach (link)
Approach/Solution:
21. Approach/Metric:
• In this competition, the model outputs 3 hypotheses (trajectories).
– ground truth: x_1, ..., x_T (2D coordinates at future timesteps)
– hypotheses: x̂^k_1, ..., x̂^k_T with confidences c^k (k = 1, 2, 3)
• Assume the ground truth positions to be modeled by a mixture of Normal distributions.
• The LB score is calculated by the following metric, and we directly used it as the loss function of the CNN:
loss = -log Σ_k c^k exp( -(1/2) Σ_t ||x_t - x̂^k_t||² )
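A numpy sketch of this metric as a loss (illustrative re-implementation assuming unit-variance Gaussians, not l5kit's exact code):

```python
import numpy as np

def neg_multi_log_likelihood(gt, preds, confidences):
    """Negative log-likelihood of the ground truth under a mixture of
    unit-variance Gaussians centred on each hypothesis (illustrative
    version of the competition metric, not l5kit's exact implementation).
    gt: (T, 2), preds: (K, T, 2), confidences: (K,), summing to 1."""
    err = np.sum((gt[None] - preds) ** 2, axis=(1, 2))  # (K,) squared errors
    log_terms = np.log(confidences) - 0.5 * err          # log c_k - err_k / 2
    m = log_terms.max()                                  # log-sum-exp trick
    return -(m + np.log(np.exp(log_terms - m).sum()))
```

The log-sum-exp trick matters in practice: with long horizons, the exponents are large and negative, and a naive `np.log(np.sum(...))` underflows.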
22. ● To utilize all possible data → let’s use train_full.zarr without downsampling
○ But it is big!
○ 89 GB
○ 191,177,863 records with the default setting
→ Need distributed training!
※ It was important to use all the data to get a good score in the competition.
Use train_full.zarr dataset
23. ● torch.distributed is used
○ 8 V100 GPUs * 5 days for 1 epoch
● In practice, we needed to modify AgentDataset to cache index arrays on disk
○ AgentDataset is copied into each DataLoader worker when num_workers is set.
■ 8 processes * 4 num_workers = 32 copies are created
■ The in-memory usage of AgentDataset is huge! It cannot fit in RAM.
● The cumulative_sizes attribute was the bottleneck.
○ Cache track_id, scene_index, state_index into zarr to
reduce in-memory usage.
Distributed training
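The caching idea can be sketched as a build-once, memory-map-afterwards helper (minimal sketch using .npy files; the actual solution cached into zarr, and the function name is hypothetical):

```python
import os
import numpy as np

def cached_array(path, build):
    """Build an index array once, persist it, and memory-map it on later
    calls, so DataLoader worker copies share disk-backed pages instead of
    each holding a full copy in RAM.
    (Sketch of the idea; the actual solution cached into zarr.)
    path: must end in .npy; build: zero-arg function returning the array."""
    if not os.path.exists(path):
        np.save(path, build())          # expensive: done only once
    return np.load(path, mmap_mode="r")  # cheap: memory-mapped, read-only
```

Each of track_id, scene_index, and state_index would be cached this way instead of being recomputed and held in memory by every worker.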
24. ● Pointed out in “We did it all wrong” discussion:
○ The target_positions values need to be rotated in the same way as the image,
as specified by the agent’s “yaw”
Use l5kit==1.1.0
[Figure: target_positions in l5kit==1.0.6 vs l5kit==1.1.0]
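The fix amounts to rotating the target displacements into the agent frame; a minimal sketch (the sign convention here is an assumption for illustration; l5kit>=1.1.0 applies the correct transform internally):

```python
import numpy as np

def rotate_into_agent_frame(points, yaw):
    """Rotate (N, 2) world-frame displacements by -yaw so targets are
    expressed in the agent's frame, consistent with the rotated image.
    (Illustrative helper; l5kit>=1.1.0 handles this itself.)"""
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s], [s, c]])   # 2D rotation matrix for angle -yaw
    return points @ rot.T
```

With yaw = 0 the targets are unchanged; for a nonzero yaw, the trajectory is rotated so that the image and the regression targets agree.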
25. ● Use the chopped dataset: only the 100th frame from each scene is used.
○ This is how the test data is made.
○ But it discards all ground truth data;
instead, set agent_mask in AgentDataset to make the validation data.
● Check the validation/test dataset carefully
○ We noticed that it contains at least 10 future frames & 0 history frames.
→ Next page
Validation strategy
26. ● Set min_history=0, min_future=10 in AgentDataset
○ MOST IMPORTANT!
○ Public LB Score jumps to 13.059 here.
Align training dataset to validation/test dataset
27. ● Tried several models
● Worked Well:
○ Resnet18
○ Resnet50
○ SEResNeXt50
○ ecaresnet18
● Not working well: bigger, deeper models tended to have worse performance...
○ ResNet101
○ ResNet152
CNN Models
28. ● Training hyperparameters
○ Batch size 12 * 8 processes
○ Adam optimizer
○ Cosine annealing over 1 epoch (better than exponential decay)
Training with cosine annealing
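The schedule can be sketched as a closed-form function (equivalent to PyTorch's CosineAnnealingLR with eta_min=0 and T_max set to the steps in one epoch; the numbers are illustrative):

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr):
    """Decay the learning rate from base_lr to 0 over one epoch,
    following half a cosine wave (illustrative sketch of the schedule)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

The LR starts at base_lr, decays slowly at first, then drops steeply toward 0 at the end of the single training epoch.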
29. ● Used the albumentations library, tried several augmentations.
○ Tried Cutout, Blur, Downscale
○ Other augmentations used for natural images, e.g. flip, were not appropriate this time
● Only Cutout was adopted for the final model.
Augmentation: 1. Image based augmentation
[Figure: original image vs Cutout / Blur / Downscale]
30. ● Modified BoxRasterizer to add augmentation
○ 1. Random Agent drop
○ 2. Agent extent size scaling
● We could not find a clear improvement in our experiments.
The final model does not use this augmentation...
Augmentation: 2. Rasterizer level augmentation
[Figure: several agents are dropped; the host car size is different]
31. ● How to ensemble models?
○ In this competition, we train the model to predict three trajectories (x1, x2, x3) and
three confidences (c1, c2, c3).
○ Simple ensemble methods such as averaging do not work.
● Consider the outputs as Gaussian mixture models
○ The outputs can be considered as confidence-weighted GMMs with
n_components=3
○ You can take the average of GMMs, and the average of N GMMs takes the form
of a GMM with n_components=3N
Ensemble by GMM and EM algorithm
32. ● You can get the ensembled outputs from the averaged GMM by
following the steps below.
○ Sample enough points (e.g. 1000N) from the averaged distribution.
○ Run the EM algorithm with n_components=3 on the sampled points
(we used sklearn.mixture.GaussianMixture).
○ Take the output of the EM algorithm as the ensembled prediction.
Ensemble by GMM and EM algorithm
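The steps above can be sketched as follows (a hedged sketch, not the exact solution code: the dimension is simplified, and unit-variance components are assumed to match the competition metric; in the real setting each trajectory is flattened into one vector):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_gmm(means, weights, n_components=3, n_samples=3000, seed=0):
    """Average several models' mixture outputs by sampling from the pooled
    GMM, then re-fitting a 3-component GMM with EM (sketch of the procedure).
    means: (M, D) component means from all models; weights: (M,) confidences."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize the averaged GMM
    comp = rng.choice(len(means), size=n_samples, p=weights)
    # unit-variance components assumed, matching the competition metric
    samples = means[comp] + rng.normal(size=(n_samples, means.shape[1]))
    gm = GaussianMixture(n_components=n_components, random_state=seed)
    gm.fit(samples)                              # EM on the sampled points
    return gm.means_, gm.weights_
```

Pooling two models' 3-component outputs gives a 6-component mixture; the EM re-fit compresses it back to the 3 trajectories + confidences the competition requires.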
37. ● CNN models: a smaller model was enough
○ ResNet18 was enough to get 4th place
○ Tried bigger ResNet101, ResNet152, etc., but performance was worse
● Only 1 epoch of training was enough!
○ Because the data is very big & almost duplicated across consecutive frames
○ Important to use cosine annealing for the learning rate schedule
● The Rasterizer (drawing images) is the bottleneck
○ CPU-intensive task; GPU utilization is not 100%.
Findings
38. ● https://www.kaggle.com/c/lyft-motion-prediction-autonomous-vehicles/discussion/201493
● Optimized the Rasterizer implementation
→ 8 GPUs * 2 days for 1 epoch
● Hyperparameters with “heavy” training
○ Semantic + satellite images
○ Bigger images: 448 * 224 (changed from 224 * 224)
○ num_history: 30 (changed from 10)
○ min_future: 5 (changed from 10)
○ Modified the agent filter threshold
○ batch_size: 64
etc...
● Pre-train on small images for 4 epochs → fine-tune on big images for 1 epoch
○ It was very effective
[1st place solution] : L5kit Speedup
39. ● The 10th place solution used a GNN-based method called VectorNet
○ Faster training & inference
■ They did not use rasterized images at all
■ 11 GPU-hours for 1 epoch (our CNN needs about 960 GPU-hours)
○ Comparable performance to CNN-based methods
Other interesting approaches: VectorNet
[Figure: VectorNet architecture (Gao+, CVPR 2020) vs the CNN approach]
41. ● How different are the 3 trajectories generated by the CNN models?
● Case 1: different directions
○ The CNN can predict the different possible ways/directions that agents may move in the
future.
The diversity of the 3 trajectories
42. ● How different are the 3 trajectories generated by the CNN models?
● Case 2: different speed or start time
○ Even when the direction is straight, the CNN can predict the different possible
speeds/accelerations with which agents move in the future.
The diversity of the 3 trajectories
44. ● raster_size (Image size)
○ Tried 224x224 & 128x128.
○ Default 224x224 was better
● pixel_size
○ Tried 0.5, 0.25, 0.15.
○ Default 0.5 was better.
● num_history-specific models
○ Short-history model:
■ Tried to train a 0-history model
→ the performance was not better than the original model
○ Long-history model:
■ Tried 10, 14, 20
■ Default 10 was better in our experiments
(but the 1st place solution used num_history=30)
Hyperparameter changes
45. ● Added a velocity arrow to the BoxRasterizer
Custom Rasterizer: 1. VelocityBoxRasterizer
46. ● Original SemanticRasterizer: the semantic map is drawn as an RGB image
Custom Rasterizer: 2. ChannelSemanticRasterizer
● ChannelSemanticRasterizer:
○ Separate channels for road, lane, green/yellow/red signals & crosswalks
Somehow, the training performance was worse than the original SemanticRasterizer...
47. ● We thought that the red signal duration is important for predicting when a stopped
agent starts moving in the future.
● This semantic rasterizer changes its pixel values depending on how long the signal has
continued in the history.
Custom Rasterizer: 3. TLSemanticRasterizer
48. ● Draw each agent type in a different color/channel
○ CAR = Blue
○ CYCLIST = Yellow
○ PEDESTRIAN = Red
○ UNKNOWN = Gray
● Unknown-type agents are also drawn
Custom Rasterizer: 4. AgentTypeBoxRasterizer
49. ● Predict all agents' future coordinates at once, from 1 image.
● Using semantic segmentation models (segmentation-models-pytorch)
● Stopped this investigation because agents sometimes exist very far from the host car.
Multi-agent prediction model
https://self-driving.lyft.com/level5/data/
50. ● What kind of data causes seriously big errors?
● When the “yaw” annotation is wrong, the predicted & actual directions become different!
● Does fixing the data’s yaw field improve the total score?
○ YES! for the validation dataset (see below).
○ NO!! for the test dataset; the yaw annotation seems wrong only for stopped cars.
● In a real application, I guess this is a very important problem to consider...
Yaw correction
[Figure: example predictions with Loss=43988, Loss=30962, Loss=10818]
51. ● Kaggle page: Lyft Motion Prediction for Autonomous Vehicles
● Data home page: https://self-driving.lyft.com/level5/data/
● Solution Discussion: Lyft Motion Prediction for Autonomous Vehicles
● Solution Code: https://github.com/pfnet-research/kaggle-lyft-motion-prediction-4th-place-solution
References