The complete list of datasets available by default inside the library is:
Main datasets
ADULT_SAMPLE: A small sample of the Adult dataset, used to predict whether income exceeds $50K/yr based on census data.
BIWI_SAMPLE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
CIFAR: The famous cifar-10 dataset which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
COCO_SAMPLE: A sample of the coco dataset for object detection.
COCO_TINY: A tiny version of the coco dataset for object detection.
HUMAN_NUMBERS: A synthetic dataset of numbers written out in words, such as one, two, three, four. Useful for experimenting with language models.
IMAGENETTE: A smaller version of the imagenet dataset pronounced just like ‘Imagenet’, except with a corny inauthentic French accent.
IMAGENETTE_160: The 160px version of the Imagenette dataset.
IMAGENETTE_320: The 320px version of the Imagenette dataset.
IMAGEWOOF: Imagewoof is a subset of 10 classes from Imagenet that aren’t so easy to classify, since they’re all dog breeds.
IMAGEWOOF_160: 160px version of the ImageWoof dataset.
IMAGEWOOF_320: 320px version of the ImageWoof dataset.
IMAGEWANG: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem.
CALTECH_101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato.
CARS: The Cars dataset contains 16,185 images of 196 classes of cars.
CIFAR_100: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.
CUB_200_2011: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations.
FLOWERS: A 17-category flower dataset built by gathering images from various websites.
FOOD:
MNIST: MNIST dataset consisting of handwritten digits.
PETS: A 37 category pet dataset with roughly 200 images for each class.
NLP datasets
AG_NEWS: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
AMAZON_REVIEWS: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
AMAZON_REVIEWS_POLARITY: Amazon reviews dataset for sentiment analysis.
DBPEDIA: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples in total, spread evenly over 14 non-overlapping classes from DBpedia (40,000 training and 5,000 testing samples per class).
MT_ENG_FRA: Machine translation dataset from English to French.
SOGOU_NEWS: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks.
WIKITEXT: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
WIKITEXT_TINY: A tiny version of the WIKITEXT dataset.
YAHOO_ANSWERS: YAHOO’s question answers dataset.
YELP_REVIEWS: The Yelp dataset is a subset of Yelp businesses, reviews, and user data for use in personal, educational, and academic purposes.
YELP_REVIEWS_POLARITY: The Yelp reviews dataset prepared for sentiment classification.
Image localization datasets
BIWI_HEAD_POSE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
CAMVID: A labelled dataset of driving scenes for segmentation models.
CAMVID_TINY: A tiny camvid dataset for segmentation type models.
SIIM_SMALL: A smaller version of the SIIM dataset where the objective is to classify pneumothorax from a set of chest radiographic images.
TCGA_SMALL: A smaller version of the TCGA-OV dataset with subcutaneous and visceral fat segmentations. Citations:
Holback, C., Jarosz, R., Prior, F., Mutch, D. G., Bhosale, P., Garcia, K., … Erickson, B. J. (2016). Radiology Data from The Cancer Genome Atlas Ovarian Cancer [TCGA-OV] collection. The Cancer Imaging Archive.
Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057.
Pretrained models
OPENAI_TRANSFORMER: The GPT2 Transformer pretrained weights.
WT103_FWD: The WikiText-103 forward language model weights.
WT103_BWD: The WikiText-103 backward language model weights.
def fastai_cfg()->Config: # Config that contains default download paths for `data`, `model`, `storage` and `archive`
Config object for fastai’s config.ini
This is a basic Config file with four keys: data, model, storage and archive. All future downloads go to the path defined in the config file for that type of download. For example, fastai datasets are downloaded to the data path and pretrained model weights are downloaded to the model path, unless the default download location is changed. The config file directory is given by the environment variable FASTAI_HOME if it exists; otherwise it is set to ~/.fastai.
The default local path is ~/.fastai/archive/, but this can be changed by passing a different c_key. Note: c_key should be one of 'archive', 'data', 'model', 'storage'.
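The lookup rule described above can be sketched in plain Python. This is only an illustration of the stated behavior, not fastai's implementation: the names `config_home`, `check_c_key` and `VALID_KEYS` are hypothetical, and the sketch deliberately does not guess which subdirectory each key maps to.

```python
import os
from pathlib import Path

# The four keys named in the text above; the tuple name is hypothetical.
VALID_KEYS = ("archive", "data", "model", "storage")

def config_home(env=None) -> Path:
    """Return the directory that would hold config.ini (hypothetical helper).

    Uses FASTAI_HOME when it is set and non-empty, otherwise ~/.fastai.
    """
    env = os.environ if env is None else env
    home = env.get("FASTAI_HOME")
    return Path(home).expanduser() if home else Path.home() / ".fastai"

def check_c_key(c_key: str) -> str:
    """Reject any c_key that is not one of the four documented keys."""
    if c_key not in VALID_KEYS:
        raise ValueError(f"c_key must be one of {VALID_KEYS}, got {c_key!r}")
    return c_key

# An explicit mapping stands in for the real environment to keep the demo repeatable.
print(config_home({"FASTAI_HOME": "/tmp/fastai_home"}))
print(config_home({}))
print(check_c_key("data"))
```

Passing the environment as an argument is a choice made for this sketch so that the rule can be exercised without changing real environment variables.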
def untar_data(
    url:str, # File to download
    archive:Path=None, # Optional override for `Config`'s `archive` key
    data:Path=None, # Optional override for `Config`'s `data` key
    c_key:str='data', # Key in `Config` where to extract file
    force_download:bool=False, # Setting to `True` will overwrite any existing copy of data
    base:str=None, # Directory containing config file and base of relative paths
)->Path: # Path to extracted file(s)
Download url using FastDownload.get
untar_data is a thin wrapper for FastDownload.get. It downloads and extracts url, by default to subdirectories of ~/.fastai (see fastai_cfg for details), and returns the path to the extracted data. Setting the force_download flag to True overwrites any copy of the data that is already present. For an explanation of the c_key parameter, see URLs.
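The "reuse the existing copy unless forced" behavior can be illustrated with the standard library alone. The sketch below is not FastDownload's code: `extract_once` and `make_toy_archive` are hypothetical helpers, the download step is left out entirely, and the archive is assumed to contain a single top-level folder named after the archive file.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

def extract_once(archive: Path, dest_dir: Path, force: bool = False) -> Path:
    """Extract `archive` under `dest_dir`, reusing an existing copy unless `force`.

    Hypothetical helper. Assumes the archive holds one top-level folder whose
    name matches the archive file name up to the first dot.
    """
    target = dest_dir / archive.name.split(".")[0]
    if force and target.exists():
        shutil.rmtree(target)  # discard the old copy so it is rebuilt
    if not target.exists():
        dest_dir.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive) as tf:
            tf.extractall(dest_dir)
    return target

def make_toy_archive(root: Path) -> Path:
    """Build a tiny local .tgz so the demo needs no network access."""
    src = root / "src" / "toy"
    src.mkdir(parents=True)
    (src / "labels.txt").write_text("cat\ndog\n")
    archive = root / "toy.tgz"
    with tarfile.open(archive, "w:gz") as tf:
        tf.add(src, arcname="toy")
    return archive

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    archive = make_toy_archive(root)
    data_dir = root / "data"

    path = extract_once(archive, data_dir)
    (path / "labels.txt").write_text("edited\n")  # simulate a local change

    # Without force the existing (edited) copy is returned untouched.
    kept = (extract_once(archive, data_dir) / "labels.txt").read_text()
    # With force the copy is replaced by the archive contents.
    fresh = (extract_once(archive, data_dir, force=True) / "labels.txt").read_text()

print(repr(kept), repr(fresh))
```

In the real function, force_download also covers fetching the archive again; this sketch only shows the overwrite-on-extract half of that behavior.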