Awesome-forests is a curated list of ground-truth/validation/in situ forest datasets for the forest-interested machine learning community. The list targets data-driven analysis of biodiversity, carbon, wildfire, ecosystem services, and more. The list does NOT contain data products such as algorithm-generated global maps.
Getting started with data science in forests is TOUGH. The lack of organized datasets is one reason why. So, this list of datasets intends to get you started with building machine learning models for analysing your forests.
This is a wide open and inclusive community. We would very much appreciate it if you added your favorite datasets via a pull request or by emailing lutjens at mit [dot] edu.
Photo of a dog in a forest, by [**Jamie Street**](https://unsplash.com/@jamie452) on [Unsplash](https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

- Tree species classification
- Tree detection
- Tree damage and health classification
- Navigation in forests
- Biodiversity flora
- Tree crown segmentation
- Aboveground carbon quantification
- Belowground carbon quantification
- Forest type and land cover classification
- Change detection and deforestation
- Wildfire
- Wildlife
- Bioacoustics
- Raw geospatial imagery
- Awesome-awesome
- Excluded data products
- The Auto Arborist Dataset: A Large-Scale Benchmark for Multiview Urban Forest Monitoring Under Domain Shift (Beery et al., 2022)
A tree genus classification dataset from 23 cities in the US and Canada with Google Street View imagery, 2M trees, and >300 classes.
- TreeSatAI: Benchmark Archive for Deep Learning in Forest Applications (Schulz et al., 2022)
A tree species classification dataset from Lower Saxony, Germany, with 50k georeferenced, time-referenced aerial, Sentinel-1, and Sentinel-2 images labeled with species, age, genus, forest type, and land cover.
- IDtrees NIST NEON (Weecology, University of Florida, NEON, 2020)
A tree species classification dataset from ≈3 National Forest sites, USA, with ≈400 labeled trees of ≈20 species with airborne RGB, hyperspectral, and lidar imagery.
- Kaggle Forest Cover Type (USFS, 2013?)
A tree species classification dataset from Roosevelt National Forest, USA, with ≈15k labeled and ≈565k unlabeled trees with cartographic variables.
- Pasadena Urban Trees (Caltech, 2016)
A tree species classification dataset from urban Pasadena, USA, with ≈80k labeled trees of 18 species with airborne and ground RGB imagery.
- Open AI Challenge: Aerial Imagery of South Pacific Islands (WeRobotics, Worldbank, 2018)
A tree species classification dataset from the Kingdom of Tonga with 50 km² of airborne RGB imagery covering 4 species.
- Raw urban street tree inventory data (USFS, 2006-2013)
A raw dataset from 49 cities in California, USA, with ≈930k trees with forest structure variables (e.g., tree species, height, DBH, crown).
- New York City Street Tree Map (NYC Parks, ?-2021)
A raw dataset from urban New York City, USA, with >680k trees of >230 species.
- Raw data for urban trees in California communities (USFS, 2007-2012)
A raw dataset from urban California, USA, with ≈4k trees with forest structure variables (e.g., tree species, height, DBH, crown).
- NEON Woody Plant Vegetation Structure (NEON)
A raw dataset from 49 US national forests with forest structure variables (e.g., tree species, height, DBH, low-res. GPS).
- DeepForest WeEcology NEON (Weecology, NEON, UofFlorida, 2018)
A tree detection dataset from ≈22 National Forest sites, USA, with >15k labeled and >400k unlabeled trees with airborne RGB, hyperspectral, and lidar imagery.
- Kaggle Aerial Cactus Identification (CONACYT, 2019)
A cactus detection dataset from Mexico with 17k cacti with airborne RGB imagery.
- Swedish National Forest Data Lab: Forest Damages – Larch Casebearer 1.0 (Swedish Forest Agency, 2021)
A tree detection and classification dataset from 10 sites with RGB drone imagery. In total ~102k annotated bounding boxes labeled "Lark" or "other", of which ~44.5k also carry a tree damage label in one of four categories.
- Norlab – PercepTree (Northern laboratory, 2022)
This repository contains two datasets: a synthetic dataset of 43k forest images and a real dataset of 100 images. Both include high-definition RGB images with depth information, bounding boxes, instance segmentation masks, and keypoint annotations.
- see Tree species
- Forest Damages – Larch Casebearer (Swedish Forest Agency, 2021)
A tree damage classification dataset from 5 areas in Sweden with 1.5k images with >100k labeled trees with airborne RGB imagery.
- FinnWoodlands Dataset (Tampere University, Finland, 2023)
A dataset for autonomous navigation inside forests with ~5k RGB stereo images, point clouds, and sparse depth maps, as well as 300 annotated frames for semantic, instance, or panoptic segmentation of tree trunks, paths, and more.
- Kaggle iNaturalist (iNaturalist, FGVC8, 2021)
A flora and fauna species classification dataset from global sites with 2.7M labeled images of 10k species with smartphone imagery.
- Kaggle GeoLifeCLEF 2021 (ImageCLEF, 2021)
A flora and fauna location-based species recommendation dataset from France with 1.9M labeled images of 31k species with satellite imagery and cartographic variables.
- ReforesTree: A Dataset for Estimating Tropical Forest Carbon Stock with Deep Learning and Aerial Imagery (Reiersen et al., 2022)
An ML-processed dataset from six reforestation sites in Ecuador for estimating aboveground biomass with RGB drone imagery and individual tree location, bounding box, DBH, species, and biomass. Heavily biased towards banana.
- Forest Canopy Height in Mexican Ecosystems (Requena-Mullor JM and Caughlin TT, 2018)
A forest height quantification dataset from Mexico with lidar-derived canopy height values, Landsat-derived vegetation indices, and 1105 aerial images.
- Carbon Stocks of Individual Trees in African Drylands: Allometry and Output Data (Tucker et al., 2023)
Contains raw field measurements of destructive harvests of individual trees in African drylands to derive allometry equations. I could not find the ground-truth tree crown data of the associated Tucker et al., 2023 paper.
- Tallo: a global tree allometry and crown architecture database (Jucker et al., 2022)
A dataset with 500k georeferenced records of individual trees from >62k globally distributed sites, covering >5k tree species from >180 plant families, with >100 data contributors from >40 countries.
- NASA G-LiHT: Goddard's LiDAR, Hyperspectral & Thermal Imager (Cook et al., 2013)
A raw dataset from US forests with high-resolution (<1m) LiDAR, hyperspectral, and thermal imagery.
- BAAD: a Biomass And Allometry Database for woody plants (Falster et al., 2015)
A raw aboveground biomass dataset from global sites with 260k measurements collected in 176 different studies, from 21k individual trees across ~700 species.
- FOS: The Forest Observation System, building a global reference dataset for remote sensing of forest biomass (Schepaschenko et al., 2019)
A raw aboveground biomass dataset from global sites with 1.6k sub-plots (0.25 ha) containing geolocation (±1 km), canopy height, aboveground biomass estimates, and more.
- TRY: a Global Database of Plant Traits (Kattge et al., 2011)
A raw plant trait dataset from global sites with 3.0M individual entries across 69k out of 300k existing plant species. The data focuses on 52 trait groups that characterize the vegetative and regeneration stages of the plant life cycle, including growth, dispersal, establishment, and persistence.
- GWDD: Global Wood Density Dataset (Zanne et al., 2009)
A raw wood density dataset containing the wood density of 8.4k species.
- GlobAllomeTree: Assessing volume, biomass and carbon stocks of trees and forests (FAO et al., 2013)
A large raw dataset of allometric equations, wood densities, raw biomass and volume data, and biomass extension factors.
- See Table S4 in The global forest above-ground biomass pool for 2010 estimated from high-resolution satellite observations (Santoro et al., 2021) for more.
- todo: add ground-truth datasets on belowground carbon inventories
- todo. To get started, see Tree Detection for rectangular bounding boxes of tree crowns.
- An Unexpectedly Large Count of Trees in the West African Sahara and Sahel (Brandt et al., 2020)
A raw dataset from the West African Sahara and Sahel with ≈3k geolocated tree crown segmentations.
- Individual tree point clouds and tree measurements from multi-platform laser scanning in German forests (Weiser et al., 2022)
Spatially overlapping 3D laser scanning point clouds acquired from three different acquisition platforms (airplane, UAV, and terrestrial tripod) in 12 forest plots in Germany, including individually segmented single-tree point clouds and field-measured as well as point-cloud-derived tree metrics. Also available for download from the pytreedb demo website, including various filtering options.
- coastTrain (Murray et al., 2022)
A dataset with over 190k point observations of coastal ecosystem classes (tidal flat, mangrove, coral reef, saltmarsh, seagrass, intertidal, kelp, ...) including geolocation and relevant metadata, but no satellite imagery.
- BigEarthNet: large-scale Sentinel-2 benchmark (TU Berlin, 2019)
A multi-label land cover classification dataset from 10 European countries with ≈600k labeled images with CORINE land cover labels and Sentinel-2 L2A (10m res.) satellite imagery.
- Chesapeake land cover (Chesapeake Conservancy, Microsoft, NAIP, USGS, 2013-2017)
A land cover classification dataset from the Chesapeake Bay, USA, of a 6x7km² area with high- and low-resolution (NLCD) land cover labels and high- (NAIP, RGB-NIR) and low-resolution (Landsat 8, 13-band) satellite imagery.
- Kaggle Planet: Understanding the Amazon from Space (SCCON, Planet, 2017)
A land cover classification dataset from the Amazon with deforestation, mining, and cloud labels and RGB-NIR (5m res.) satellite imagery.
- WiDS Datathon 2019: detection of oil palm plantations (Global WiDS Team & West Big Data Innovation Hub, 2019)
A binary oil palm plantation classification dataset with 20k images of Planet RGB (3m res.) satellite imagery.
- UC Merced land use dataset (UC Merced, 2010)
A small land cover classification dataset with 2100 images and 21 balanced classes with airborne (0.3m res.) imagery.
- See Awesome satellite imagery datasets for more satellite imagery datasets.
- See SustainBench for more UN SDG-related satellite imagery datasets.
- Dynamic EarthNet challenge (Planet, DLR, TUM, 2021)
A time-series prediction and multi-class change detection dataset of Europe over 2 years with 75 image time-series with 7 land cover labels and weekly Planet RGB (3m res.) imagery.
- Semantic change detection dataset (SECOND) (Yang et al., 2020)
A land cover change detection dataset over cities and suburbs in China with ≈5k image pairs with 6 land cover classes and airborne imagery.
- ForestNet deforestation driver (Jeremy Irvin, Hao Sheng et al., 2020)
A dataset that consists of 2,756 Landsat-8 satellite images of forest loss events with deforestation driver annotations. The driver annotations were grouped into Plantation, Smallholder Agriculture, Grassland/shrubland, and Other.
- Global Forest Change (University of Maryland, 2013)
Different layers of global forest loss, extracted from Landsat satellite imagery. todo: this is a data product; find ground-truth data.
- Awesome remote sensing change detection
A list with more change detection datasets.
- todo: add datasets for fire detection, fuel moisture quantification, wildfire spread prediction, etc.
- todo: Add https://mlhub.earth/data/su_sar_moisture_content_main, https://www.sciencedirect.com/science/article/pii/S003442572030167X?via%3Dihub
- iWildCam
A species classification dataset from 414 global locations with >200k labeled images with wildlife camera trap imagery, Landsat-8 multispectral imagery, and GPS coordinates.
- iNaturalist
Multiple species classification datasets from global imagery of animals and plants with >2.7M images from 10k species.
- See LILA.science for more processed conservation datasets.
- See Awesome-deep-ecology for more ecology datasets.
- todo: add bioacoustics datasets
- Global ecosystem dynamics investigation (GEDI) (NASA, University of Maryland, 2021)
A satellite lidar dataset of the globe with topography and lidar pointcloud (100m res.).
- Norway's international climate and forests initiative imagery program (NICFI) (NICFI, Ksat, Airbus, Planet, 2020)
A satellite imagery dataset of tropical rainforests with monthly mosaics of RGB (5m res.) satellite imagery.
- National agriculture imagery program (NAIP) (FSA USDA, 2003-2021)
An airborne imagery dataset of CONUS with RGB-NIR (0.5m res.) imagery.
- see awesome-gis
- Awesome satellite imagery datasets
A list of more satellite imagery datasets with annotations for deep learning and computer vision.
- Awesome GIS
A list of GIS resources.
- OpenForest
A list of over 88 in-situ datasets and data products in forestry that are open-access and focused on understanding the composition of forests at the tree level.
- todo: add link to dataset list on conservationtech.directory
These datasets were excluded because we could not find a source for the validation dataset. If you know the source, please create an issue or pull request.
- Canopy Height map by WRI and Meta
Used GEDI and NEON datasets for training and/or validation.
- Awesome-forests contains individual entries from Awesome satellite imagery datasets and Awesome remote sensing change detection