This is the official repository for the CVPR 2022 (Oral) paper Deep Visual Geo-localization Benchmark. It can be used to reproduce results from the paper, and to compute a wide range of experiments, by changing the components of a Visual Geo-localization pipeline. Check out our website!
Before experimenting with this toolbox, organize your datasets in a directory tree like this:
.
├── benchmarking_vg
└── datasets_vg
    └── datasets
        └── pitts30k
            └── images
                ├── train
                │   ├── database
                │   └── queries
                ├── val
                │   ├── database
                │   └── queries
                └── test
                    ├── database
                    └── queries
The datasets_vg repo can be used to download a number of datasets. Detailed instructions on how to download datasets are in the repo. Note that many datasets are available, and pitts30k is just an example.
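As a quick sanity check before training, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the toolbox: the helper name `missing_dataset_dirs` and the relative path `../datasets_vg/datasets` are assumptions based on the directory tree shown above.

```python
from pathlib import Path

# Split and subfolder names taken from the directory tree above.
SPLITS = ("train", "val", "test")
SUBFOLDERS = ("database", "queries")


def missing_dataset_dirs(datasets_folder, dataset_name):
    """Return the expected image folders that do not exist (hypothetical helper)."""
    images_dir = Path(datasets_folder) / dataset_name / "images"
    expected = [images_dir / split / sub for split in SPLITS for sub in SUBFOLDERS]
    return [path for path in expected if not path.is_dir()]


if __name__ == "__main__":
    # Assumed location of the datasets when running from inside benchmarking_vg.
    missing = missing_dataset_dirs("../datasets_vg/datasets", "pitts30k")
    for path in missing:
        print(f"Missing folder: {path}")
    if not missing:
        print("Dataset layout looks complete.")
```

An empty list means all six image folders (database and queries for train, val and test) are in place.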
For a basic experiment, run
$ python3 train.py --dataset_name=pitts30k
This trains a ResNet-18 + NetVLAD on Pitts30k.
The experiment creates a folder named ./logs/default/YYYY-MM-DD_HH-mm-ss, where checkpoints are saved, along with an info.log file containing training logs and other information, such as model size, FLOPs and descriptor dimensionality.
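Because each run gets its own timestamped folder, it can be handy to locate the most recent one programmatically, for example to pass its checkpoint to eval.py. The sketch below is an illustration only: `latest_run_dir` is a hypothetical helper, and it assumes the folder names follow the YYYY-MM-DD_HH-mm-ss pattern exactly.

```python
from datetime import datetime
from pathlib import Path

# strptime equivalent of the YYYY-MM-DD_HH-mm-ss folder name pattern.
RUN_NAME_FORMAT = "%Y-%m-%d_%H-%M-%S"


def latest_run_dir(logs_dir="logs/default"):
    """Return the most recent run folder, or None if there is none (hypothetical helper)."""
    root = Path(logs_dir)
    if not root.is_dir():
        return None
    runs = []
    for child in root.iterdir():
        if not child.is_dir():
            continue
        try:
            started = datetime.strptime(child.name, RUN_NAME_FORMAT)
        except ValueError:
            continue  # skip folders that do not follow the timestamp pattern
        runs.append((started, child))
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]


if __name__ == "__main__":
    run = latest_run_dir()
    if run is not None:
        # best_model.pth is the checkpoint name used in the eval.py examples.
        print(run / "best_model.pth")
```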
You can change the backbone and the aggregation method like this:
$ python3 train.py --dataset_name=pitts30k --backbone=resnet50conv4 --aggregation=gem
You can easily use ResNets cropped at conv4 or conv5.
To add a fully connected layer of dimension 2048 to GeM pooling:
$ python3 train.py --dataset_name=pitts30k --backbone=resnet50conv4 --aggregation=gem --fc_output_dim=2048
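For readers unfamiliar with GeM (generalized-mean) pooling, the operation selected by --aggregation=gem reduces each channel of a feature map to a single value, (mean(x^p))^(1/p). The pure-Python sketch below illustrates the formula only; it is not the toolbox's implementation. The default p=3 and the clamping constant eps are assumptions, and in practice p is usually a learnable parameter.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling of one channel: (mean(x^p))^(1/p).

    feature_map is a 2D list (height x width) of activations.
    Values are clamped to eps so that the power is well defined.
    """
    values = [max(value, eps) for row in feature_map for value in row]
    return (sum(value ** p for value in values) / len(values)) ** (1.0 / p)


def gem_descriptor(feature_maps, p=3.0):
    """Pool every channel, giving one descriptor value per channel."""
    return [gem_pool(channel, p=p) for channel in feature_maps]


if __name__ == "__main__":
    channel = [[1.0, 2.0], [3.0, 4.0]]
    # p = 1 reduces to average pooling; larger p moves the result toward the maximum.
    print(gem_pool(channel, p=1.0))
    print(gem_pool(channel, p=3.0))
```

With this formulation the descriptor has one value per backbone channel, which is why a fully connected layer (as with --fc_output_dim above) is a natural way to set a different output dimension.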
To add PCA to a NetVLAD layer, run:
$ python3 eval.py --dataset_name=pitts30k --backbone=resnet50conv4 --aggregation=netvlad --pca_dim=2048 --pca_dataset_folder=pitts30k/images/train
where pca_dataset_folder points to the folder with the images used to compute PCA. In the paper we compute the principal components on the train set, as this gave the best results. PCA is used only at test time.
To evaluate the trained model on other datasets (this example is with the St Lucia dataset), simply run
$ python3 eval.py --backbone=resnet50conv4 --aggregation=gem --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --dataset_name=st_lucia
Finally, to reproduce our results, use the appropriate mining method (full for pitts30k and partial for msls), for example:
$ python3 train.py --dataset_name=pitts30k --mining=full
In this way you can replicate all results from Tables 3, 4 and 5 of the main paper, as well as Tables 2, 3 and 4 of the supplementary material.
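The mining options are only named here, so the following sketch shows one common reading of them and should be treated as an assumption rather than the toolbox's code: random picks negatives at random, full searches the whole database for the negatives closest to the query in descriptor space, and partial does the same search on a random subset of the database. The function name, its arguments and the pool_size default are all hypothetical.

```python
import math
import random


def mine_negatives(query_desc, db_descs, positives, num_negatives,
                   method="full", pool_size=1000, seed=None):
    """Pick negative database indices for one query (illustrative sketch).

    positives lists the database indices that must not be used as negatives.
    """
    rng = random.Random(seed)
    excluded = set(positives)
    candidates = [i for i in range(len(db_descs)) if i not in excluded]
    if method == "random":
        return rng.sample(candidates, num_negatives)
    if method == "partial":
        # Search only a random subset of the database.
        candidates = rng.sample(candidates, min(pool_size, len(candidates)))
    elif method != "full":
        raise ValueError(f"Unknown mining method: {method}")
    # Hardest negatives are the ones closest to the query in descriptor space.
    candidates.sort(key=lambda i: math.dist(query_desc, db_descs[i]))
    return candidates[:num_negatives]


if __name__ == "__main__":
    database = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
    print(mine_negatives((0.0,), database, positives=[0], num_negatives=2))
```

Under this reading, partial mining trades some hardness of the selected negatives for a much cheaper search, which matters on large datasets.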
To resize the images, pass the resize parameter with the target resolution. For example, since the full pitts30k images are 480x640, resizing them to 80% of their resolution means passing 384 512:
$ python3 train.py --dataset_name=pitts30k --resize=384 512
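The target resolution for other scale factors follows the same arithmetic. A small sketch, where `scaled_resolution` is a hypothetical helper and not part of the toolbox:

```python
def scaled_resolution(height, width, scale):
    """Scale an image resolution and round to whole pixels."""
    return round(height * scale), round(width * scale)


if __name__ == "__main__":
    height, width = scaled_resolution(480, 640, 0.8)
    print(height, width)  # 384 512
    # The values can then be passed on the command line.
    print(f"--resize={height} {width}")
```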
The methods for handling images of varying resolution at test time are gathered under the test_method parameter. The available methods are hard_resize, single_query, central_crop, five_crops_mean, nearest_crop and majority_voting. Although hard_resize is the default, on most datasets it applies no transformation at all, because all images have the same resolution (see the paper for more information).
$ python3 eval.py --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --dataset_name=tokyo247 --test_method=nearest_crop
You can reproduce all data augmentation techniques from the paper with simple commands, for example:
$ python3 train.py --dataset_name=pitts30k --horizontal_flipping --saturation 2 --brightness 1
The code can automatically download and use models trained on Landmark Recognition datasets from popular repositories: radenovic and naver. These repos offer ResNet-50/101 models with GeM and a 2048-dimensional FC layer trained on such datasets, which can be used like this:
$ python eval.py --off_the_shelf=radenovic_gldv1 --l2=after_pool --backbone=r101l4 --aggregation=gem --fc_output_dim=2048
$ python eval.py --dataset_name=pitts30k --off_the_shelf=naver --l2=none --backbone=r101l4 --aggregation=gem --fc_output_dim=2048
Check out our pretrain_vg repo, which we use to train such models. You can automatically download those models and start training from them like this:
$ python train.py --dataset_name=pitts30k --pretrained=places
To use a distance threshold other than the default 25 meters (for example, 100 meters), run:
$ python3 eval.py --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --val_positive_dist_threshold=100
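To make the role of the threshold concrete: a database image counts as a correct match (a positive) for a query when it lies within the threshold distance of it. The sketch below illustrates that idea under the assumption that positions are given as planar coordinates in meters; `positives_within` is a hypothetical helper, not the toolbox's code.

```python
import math


def positives_within(query_xy, database_xy, threshold_m=25.0):
    """Return indices of database positions within threshold_m meters of the query.

    Positions are assumed to be (x, y) coordinates in meters.
    """
    return [
        index
        for index, position in enumerate(database_xy)
        if math.dist(query_xy, position) <= threshold_m
    ]


if __name__ == "__main__":
    database = [(0.0, 0.0), (10.0, 0.0), (30.0, 40.0), (0.0, 100.0)]
    print(positives_within((0.0, 0.0), database))         # [0, 1]
    print(positives_within((0.0, 0.0), database, 100.0))  # [0, 1, 2, 3]
```

A larger threshold makes more database images count as correct, so recall values computed with different thresholds are not directly comparable.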
By default the toolbox computes recall@1, 5, 10 and 20, but you can compute other recall values like this:
$ python3 eval.py --resume=logs/default/YYYY-MM-DD_HH-mm-ss/best_model.pth --recall_values 1 5 10 15 20 50 100
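As a reminder of what these numbers mean, recall@N is the percentage of queries for which at least one of the top N retrieved database images is a correct match. The sketch below shows that computation in plain Python; the function name and the input format are assumptions made for illustration, and it is not the toolbox's evaluation code.

```python
def recall_at_n(predictions, positives, recall_values=(1, 5, 10, 20)):
    """Compute recall@N as a percentage for each N in recall_values.

    predictions[q] lists database indices ranked from most to least similar
    for query q; positives[q] lists the indices counted as correct for it.
    """
    recalls = {}
    for n in recall_values:
        hits = sum(
            1
            for ranked, correct in zip(predictions, positives)
            if set(ranked[:n]) & set(correct)
        )
        recalls[n] = 100.0 * hits / len(predictions)
    return recalls


if __name__ == "__main__":
    predictions = [[3, 1, 2], [0, 2, 1]]
    positives = [[1], [5]]
    print(recall_at_n(predictions, positives, recall_values=(1, 2)))
    # {1: 0.0, 2: 50.0}
```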
We are currently exploring hosting options, so this is a partial list of models. More models will be added soon!
Pretrained models with different backbones
Pretrained networks employing different backbones.
Model | Trained on Pitts30k: Pitts30k (R@1) | Trained on Pitts30k: MSLS (R@1) | Trained on Pitts30k: Download | Trained on MSLS: Pitts30k (R@1) | Trained on MSLS: MSLS (R@1) | Trained on MSLS: Download |
---|---|---|---|---|---|---|
vgg16-gem | 78.5 | 43.4 | [Link] | 70.2 | 66.7 | [Link] |
resnet18-gem | 77.8 | 35.3 | [Link] | 71.6 | 65.3 | [Link] |
resnet50-gem | 82.0 | 38.0 | [Link] | 77.4 | 72.0 | [Link] |
resnet101-gem | 82.4 | 39.6 | [Link] | 77.2 | 72.5 | [Link] |
vgg16-netvlad | 83.2 | 50.9 | [Link] | 79.0 | 74.6 | [Link] |
resnet18-netvlad | 86.4 | 47.4 | [Link] | 81.6 | 75.8 | [Link] |
resnet50-netvlad | 86.0 | 50.7 | [Link] | 80.9 | 76.9 | [Link] |
resnet101-netvlad | 86.5 | 51.8 | [Link] | 80.8 | 77.7 | [Link] |
cct384-netvlad | 85.0 | 52.5 | [Link] | 80.3 | 85.1 | [Link] |
Pretrained models with different aggregation methods
Pretrained networks trained using different aggregation methods.
Model | Trained on Pitts30k: Pitts30k (R@1) | Trained on Pitts30k: MSLS (R@1) | Trained on Pitts30k: Download | Trained on MSLS: Pitts30k (R@1) | Trained on MSLS: MSLS (R@1) | Trained on MSLS: Download |
---|---|---|---|---|---|---|
resnet50-gem | 82.0 | 38.0 | [Link] | 77.4 | 72.0 | [Link] |
resnet50-gem-fc2048 | 80.1 | 33.7 | [Link] | 79.2 | 73.5 | [Link] |
resnet50-gem-fc65536 | 80.8 | 35.8 | [Link] | 79.0 | 74.4 | [Link] |
resnet50-netvlad | 86.0 | 50.7 | [Link] | 80.9 | 76.9 | [Link] |
resnet50-crn | 85.8 | 54.0 | [Link] | 80.8 | 77.8 | [Link] |
Pretrained models with different mining methods
Pretrained networks trained using three different mining methods (random, full database mining and partial database mining):
Model | Trained on Pitts30k: Pitts30k (R@1) | Trained on Pitts30k: MSLS (R@1) | Trained on Pitts30k: Download | Trained on MSLS: Pitts30k (R@1) | Trained on MSLS: MSLS (R@1) | Trained on MSLS: Download |
---|---|---|---|---|---|---|
resnet18-gem-random | 73.7 | 30.5 | [Link] | 62.2 | 50.6 | [Link] |
resnet18-gem-full | 77.8 | 35.3 | [Link] | 70.1 | 61.8 | [Link] |
resnet18-gem-partial | 76.5 | 34.2 | [Link] | 71.6 | 65.3 | [Link] |
resnet18-netvlad-random | 83.9 | 43.6 | [Link] | 73.3 | 61.5 | [Link] |
resnet18-netvlad-full | 86.4 | 47.4 | [Link] | - | - | - |
resnet18-netvlad-partial | 86.2 | 47.3 | [Link] | 81.6 | 75.8 | [Link] |
resnet50-gem-random | 77.9 | 34.3 | [Link] | 69.5 | 57.4 | [Link] |
resnet50-gem-full | 82.0 | 38.0 | [Link] | 77.3 | 69.7 | [Link] |
resnet50-gem-partial | 82.3 | 39.0 | [Link] | 77.4 | 72.0 | [Link] |
resnet50-netvlad-random | 83.4 | 45.0 | [Link] | 74.9 | 63.6 | [Link] |
resnet50-netvlad-full | 86.0 | 50.7 | [Link] | - | - | - |
resnet50-netvlad-partial | 85.5 | 48.6 | [Link] | 80.9 | 76.9 | [Link] |
If you find our work useful in your research please consider citing our paper:
@inProceedings{Berton_CVPR_2022_benchmark,
author = {Berton, Gabriele and Mereu, Riccardo and Trivigno, Gabriele and Masone, Carlo and
Csurka, Gabriela and Sattler, Torsten and Caputo, Barbara},
title = {Deep Visual Geo-localization Benchmark},
booktitle = {CVPR},
month = {June},
year = {2022},
}
Parts of this repo are inspired by the following great repositories:
- NetVLAD's original code (in MATLAB)
- NetVLAD layer in PyTorch
- NetVLAD training in PyTorch
- GeM layer
- Deep Image Retrieval
- Mapillary Street-level Sequences
- Compact Convolutional Transformers
Also check out our other repo CosPlace, from the CVPR 2022 paper "Rethinking Visual Geo-localization for Large-Scale Applications", which provides a new SOTA in visual geo-localization / visual place recognition.