External data

Helper functions to download the fastai datasets

To download any of the datasets or pretrained weights, run untar_data with one of the dataset names listed below:

path = untar_data(URLs.PETS)
path.ls()

>> (#7393) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'),...]

To download pretrained model weights:

path = untar_data(URLs.WT103_BWD)
path.ls()

>> (#2) [Path('/home/ubuntu/.fastai/data/wt103-bwd/itos_wt103.pkl'),Path('/home/ubuntu/.fastai/data/wt103-bwd/lstm_bwd.pth')]

Datasets

The complete list of datasets available by default in the library is:

Main datasets

  • ADULT_SAMPLE: A small sample of the adults dataset to predict whether income exceeds $50K/yr based on census data.
  • BIWI_SAMPLE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding RGB image (both 640x480 pixels), and the annotation are provided. The head pose range covers about ±75 degrees yaw and ±60 degrees pitch.
  • CIFAR: The famous CIFAR-10 dataset, which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
  • COCO_SAMPLE: A sample of the COCO dataset for object detection.
  • COCO_TINY: A tiny version of the COCO dataset for object detection.
  • HUMAN_NUMBERS: A synthetic dataset consisting of human number counts in text, such as one, two, three, four, ... Useful for experimenting with language models.

  • IMDB: The full IMDB sentiment analysis dataset.

  • IMDB_SAMPLE: A sample of the full IMDB sentiment analysis dataset.

  • ML_SAMPLE: A movielens sample dataset for recommendation engines to recommend movies to users.

  • ML_100k: The movielens 100k dataset for recommendation engines to recommend movies to users.

  • MNIST_SAMPLE: A sample of the famous MNIST dataset consisting of handwritten digits.

  • MNIST_TINY: A tiny version of the famous MNIST dataset consisting of handwritten digits.

  • MNIST_VAR_SIZE_TINY:

  • PLANET_SAMPLE: A sample of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space.

  • PLANET_TINY: A tiny version of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space for faster experimentation and prototyping.

  • IMAGENETTE: A smaller version of the imagenet dataset pronounced just like ‘Imagenet’, except with a corny inauthentic French accent.

  • IMAGENETTE_160: The 160px version of the Imagenette dataset.

  • IMAGENETTE_320: The 320px version of the Imagenette dataset.

  • IMAGEWOOF: Imagewoof is a subset of 10 classes from Imagenet that aren’t so easy to classify, since they’re all dog breeds.

  • IMAGEWOOF_160: 160px version of the ImageWoof dataset.

  • IMAGEWOOF_320: 320px version of the ImageWoof dataset.

  • IMAGEWANG: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem.

  • IMAGEWANG_160: 160px version of Imagewang.

  • IMAGEWANG_320: 320px version of Imagewang.

Kaggle competition datasets

  1. DOGS: Image dataset consisting of images of dogs and cats from the Dogs vs. Cats Kaggle competition.

Image Classification datasets

  1. CALTECH_101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato.
  2. CARS: The Cars dataset contains 16,185 images of 196 classes of cars.
  3. CIFAR_100: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.
  4. CUB_200_2011: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations.
  5. FLOWERS: A 17-category flower dataset assembled by gathering images from various websites.
  6. FOOD:
  7. MNIST: MNIST dataset consisting of handwritten digits.
  8. PETS: A 37 category pet dataset with roughly 200 images for each class.

NLP datasets

  1. AG_NEWS: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
  2. AMAZON_REVIEWS: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
  3. AMAZON_REVIEWS_POLARITY: Amazon reviews dataset for sentiment analysis.
  4. DBPEDIA: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples in total, spread evenly across 14 non-overlapping classes from DBpedia.
  5. MT_ENG_FRA: Machine translation dataset from English to French.
  6. SOGOU_NEWS: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks.
  7. WIKITEXT: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
  8. WIKITEXT_TINY: A tiny version of the WIKITEXT dataset.
  9. YAHOO_ANSWERS: YAHOO’s question answers dataset.
  10. YELP_REVIEWS: The Yelp dataset is a subset of Yelp businesses, reviews, and user data for use in personal, educational, and academic purposes.
  11. YELP_REVIEWS_POLARITY: For sentiment classification on YELP reviews.

Image localization datasets

  1. BIWI_HEAD_POSE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding RGB image (both 640x480 pixels), and the annotation are provided. The head pose range covers about ±75 degrees yaw and ±60 degrees pitch.
  2. CAMVID: A labelled dataset of driving scenes for segmentation models.
  3. CAMVID_TINY: A tiny camvid dataset for segmentation type models.
  4. LSUN_BEDROOMS: Large-scale Image Dataset using Deep Learning with Humans in the Loop
  5. PASCAL_2007: Pascal 2007 dataset to recognize objects from a number of visual object classes in realistic scenes.
  6. PASCAL_2012: Pascal 2012 dataset to recognize objects from a number of visual object classes in realistic scenes.

Audio classification

  1. MACAQUES: 7285 macaque coo calls across 8 individuals from Distributed acoustic cues for caller identity in macaque vocalization.
  2. ZEBRA_FINCH: 3405 zebra finch calls classified across 11 call types. Additional labels include name of individual making the vocalization and its age.

Medical imaging datasets

  1. SIIM_SMALL: A smaller version of the SIIM dataset where the objective is to classify pneumothorax from a set of chest radiographic images.

  2. TCGA_SMALL: A smaller version of the TCGA-OV dataset with subcutaneous and visceral fat segmentations. Citations:

    Holback, C., Jarosz, R., Prior, F., Mutch, D. G., Bhosale, P., Garcia, K., … Erickson, B. J. (2016). Radiology Data from The Cancer Genome Atlas Ovarian Cancer [TCGA-OV] collection. The Cancer Imaging Archive.

    Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057.

Pretrained models

  1. OPENAI_TRANSFORMER: The GPT2 Transformer pretrained weights.
  2. WT103_FWD: The WikiText-103 forward language model weights.
  3. WT103_BWD: The WikiText-103 backward language model weights.

Config


fastai_cfg

 fastai_cfg ()

Config object for fastai’s config.ini

This is a basic Config file that consists of the keys data, model, storage and archive. All future downloads go to the paths defined in the config file, based on the type of download. For example, all future fastai datasets are downloaded to the data path, while all pretrained model weights are downloaded to the model path, unless the default download location is updated. The config file directory is defined by the environment variable FASTAI_HOME if it exists; otherwise it is set to ~/.fastai.

cfg = fastai_cfg()
cfg.data,cfg.path('data')
('data', Path('/home/jhoward/.fastai/data'))
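
The lookup rule described above can be sketched with the standard library alone. This is a minimal illustration, not fastai's implementation: resolve_config_dir, resolve_folder and the folders mapping are hypothetical names, and the folder values are only those visible in the outputs in this section.

```python
import os
from pathlib import Path

def resolve_config_dir(env=None):
    """Directory that would hold config.ini (hypothetical helper, not part of fastai)."""
    env = os.environ if env is None else env
    # FASTAI_HOME wins if it is set; otherwise fall back to ~/.fastai
    return Path(env.get("FASTAI_HOME", "~/.fastai")).expanduser()

def resolve_folder(key, folders, env=None):
    """Join the config directory with the folder name stored under `key`."""
    return resolve_config_dir(env) / folders[key]

# Assumed key -> folder mapping, taken from the outputs shown in this section
folders = {"data": "data", "archive": "archive", "model": "models"}

# With FASTAI_HOME set, paths are rooted there
print(resolve_folder("data", folders, env={"FASTAI_HOME": "/tmp/fastai_home"}))
# Without it, paths are rooted in ~/.fastai
print(resolve_folder("data", folders, env={}))
```

Passing env explicitly keeps the sketch easy to try without changing the real environment.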

fastai_path

 fastai_path (folder:str)

Local path to folder in Config

fastai_path('archive')
Path('/home/jhoward/.fastai/archive')

URLs

 URLs ()

Global constants for dataset and model URLs.

The default local path is ~/.fastai/archive/, but this can be changed by passing a different c_key. Note: c_key should be one of 'archive', 'data', 'model', 'storage'.

url = URLs.PETS
local_path = URLs.path(url)
test_eq(local_path.parent, fastai_path('archive'))
local_path
Path('/home/jhoward/.fastai/archive/oxford-iiit-pet.tgz')
local_path = URLs.path(url, c_key='model')
test_eq(local_path.parent, fastai_path('model'))
local_path
Path('/home/jhoward/.fastai/models/oxford-iiit-pet.tgz')
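
The mapping from a URL and a c_key to a local path can be approximated as follows. This is a simplified sketch and not the code behind URLs.path: local_path_for and FOLDERS are hypothetical, the URL is a placeholder, and the folder name used for 'storage' is an assumption (only 'archive', 'data' and 'models' appear in the outputs in this section).

```python
from pathlib import Path

# Assumed key -> folder mapping; 'models' matches the output shown above,
# the 'storage' value is a guess for illustration only.
FOLDERS = {"archive": "archive", "data": "data", "model": "models", "storage": "tmp"}

def local_path_for(url, c_key="archive", base="~/.fastai"):
    """Hypothetical helper: place the archive file name under the folder for c_key."""
    if c_key not in FOLDERS:
        raise ValueError(f"c_key must be one of {sorted(FOLDERS)}, got {c_key!r}")
    fname = url.split("/")[-1]  # last URL component is the archive file name
    return Path(base).expanduser() / FOLDERS[c_key] / fname

url = "https://example.com/datasets/oxford-iiit-pet.tgz"  # placeholder URL
print(local_path_for(url))                 # under the archive folder
print(local_path_for(url, c_key="model"))  # under the models folder
```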

untar_data

 untar_data (url:str, archive:pathlib.Path=None, data:pathlib.Path=None,
             c_key:str='data', force_download:bool=False,
             base:str='~/.fastai')

Download url using FastDownload.get

Parameter       Type  Default    Details
url             str              File to download
archive         Path  None       Optional override for Config’s archive key
data            Path  None       Optional override for Config’s data key
c_key           str   data       Key in Config where to extract file
force_download  bool  False      Setting to True will overwrite any existing copy of data
base            str   ~/.fastai  Directory containing config file and base of relative paths
Returns         Path             Path to extracted file(s)

untar_data is a thin wrapper for FastDownload.get. It downloads and extracts url, by default to subdirectories of ~/.fastai (see fastai_cfg for details), and returns the path to the extracted data. Setting force_download to True overwrites any existing copy of the data. For an explanation of the c_key parameter, see URLs.

untar_data(URLs.MNIST_SAMPLE)
Path('/home/jhoward/.fastai/data/mnist_sample')
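
The reuse-or-overwrite behaviour controlled by force_download can be illustrated with a small standard-library sketch. It covers only the extraction step on a local archive, not the download, and it is not how untar_data or FastDownload.get is implemented: extract_cached is a hypothetical helper, and it assumes the archive contains a single top-level folder named after the archive file.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

def extract_cached(archive, data_dir, force=False):
    """Hypothetical helper: extract `archive` into `data_dir`, reusing an existing copy unless force is set."""
    archive, data_dir = Path(archive), Path(data_dir)
    dest = data_dir / archive.name.split(".")[0]  # e.g. demo.tgz -> demo
    if dest.exists() and not force:
        return dest  # an extracted copy is already present, so reuse it
    if dest.exists():
        shutil.rmtree(dest)  # force=True: discard the existing copy first
    data_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tf:
        tf.extractall(data_dir)
    return dest

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a tiny archive with one top-level folder called "demo"
    src = tmp / "src" / "demo"
    src.mkdir(parents=True)
    (src / "a.txt").write_text("original")
    archive = tmp / "demo.tgz"
    with tarfile.open(archive, "w:gz") as tf:
        tf.add(src, arcname="demo")

    out = extract_cached(archive, tmp / "data")
    (out / "a.txt").write_text("edited locally")
    extract_cached(archive, tmp / "data")              # existing copy is kept
    kept = (out / "a.txt").read_text()
    extract_cached(archive, tmp / "data", force=True)  # fresh copy replaces it
    refreshed = (out / "a.txt").read_text()
    print(kept, "|", refreshed)
```

The second call leaves the local edit in place, while the call with force=True restores the archive's contents.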