Helper functions to download the fastai datasets

The complete list of datasets available by default inside the library is:

Main datasets:

  • ADULT_SAMPLE: A small sample of the adults dataset to predict whether income exceeds $50K/yr based on census data.
  • BIWI_SAMPLE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
  • CIFAR: The famous CIFAR-10 dataset which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
  • COCO_SAMPLE: A sample of the COCO dataset for object detection.
  • COCO_TINY: A tiny version of the COCO dataset for object detection.
  • HUMAN_NUMBERS: A synthetic dataset consisting of human number counts in text, such as one, two, three, four, and so on. Useful for experimenting with language models.
  • IMDB: The full IMDB sentiment analysis dataset.
  • IMDB_SAMPLE: A sample of the full IMDB sentiment analysis dataset.
  • ML_SAMPLE: A movielens sample dataset for recommendation engines to recommend movies to users.
  • ML_100k: The movielens 100k dataset for recommendation engines to recommend movies to users.
  • MNIST_SAMPLE: A sample of the famous MNIST dataset consisting of handwritten digits.
  • MNIST_TINY: A tiny version of the famous MNIST dataset consisting of handwritten digits.
  • PLANET_SAMPLE: A sample of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space.
  • PLANET_TINY: A tiny version of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space for faster experimentation and prototyping.
  • IMAGENETTE: A smaller version of the imagenet dataset pronounced just like 'Imagenet', except with a corny inauthentic French accent.
  • IMAGENETTE_160: The 160px version of the Imagenette dataset.
  • IMAGENETTE_320: The 320px version of the Imagenette dataset.
  • IMAGEWOOF: Imagewoof is a subset of 10 classes from Imagenet that aren't so easy to classify, since they're all dog breeds.
  • IMAGEWOOF_160: 160px version of the ImageWoof dataset.
  • IMAGEWOOF_320: 320px version of the ImageWoof dataset.
  • IMAGEWANG: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem.
  • IMAGEWANG_160: 160px version of Imagewang.
  • IMAGEWANG_320: 320px version of Imagewang.

Kaggle competition datasets:

  1. DOGS: Image dataset consisting of images of dogs and cats from the Dogs vs Cats Kaggle competition.

Image Classification datasets:

  1. CALTECH_101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato.
  2. CARS: The Cars dataset contains 16,185 images of 196 classes of cars.
  3. CIFAR_100: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.
  4. CUB_200_2011: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations.
  5. FLOWERS: A 17-category flower dataset built by gathering images from various websites.
  6. FOOD:
  7. MNIST: MNIST dataset consisting of handwritten digits.
  8. PETS: A 37 category pet dataset with roughly 200 images for each class.

NLP datasets:

  1. AG_NEWS: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
  2. AMAZON_REVIEWS: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
  3. AMAZON_REVIEWS_POLARITY: Amazon reviews dataset for sentiment analysis.
  4. DBPEDIA: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples in total (40,000 training and 5,000 testing samples for each of 14 non-overlapping classes from DBpedia).
  5. MT_ENG_FRA: Machine translation dataset from English to French.
  6. SOGOU_NEWS: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks.
  7. WIKITEXT: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
  8. WIKITEXT_TINY: A tiny version of the WIKITEXT dataset.
  9. YAHOO_ANSWERS: YAHOO's question answers dataset.
  10. YELP_REVIEWS: The Yelp dataset is a subset of YELP businesses, reviews, and user data for use in personal, educational, and academic purposes.
  11. YELP_REVIEWS_POLARITY: For sentiment classification on YELP reviews.

Image localization datasets:

  1. BIWI_HEAD_POSE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
  2. CAMVID: A labelled driving dataset for segmentation models.
  3. CAMVID_TINY: A tiny camvid dataset for segmentation type models.
  4. LSUN_BEDROOMS: Large-scale Image Dataset using Deep Learning with Humans in the Loop
  5. PASCAL_2007: Pascal 2007 dataset to recognize objects from a number of visual object classes in realistic scenes.
  6. PASCAL_2012: Pascal 2012 dataset to recognize objects from a number of visual object classes in realistic scenes.

Audio classification:

  1. MACAQUES: 7285 macaque coo calls across 8 individuals from Distributed acoustic cues for caller identity in macaque vocalization.
  2. ZEBRA_FINCH: 3405 zebra finch calls classified across 11 call types. Additional labels include name of individual making the vocalization and its age.

Medical imaging datasets:

  1. SIIM_SMALL: A smaller version of the SIIM dataset where the objective is to classify pneumothorax from a set of chest radiographic images.
  2. TCGA_SMALL: A smaller version of the TCGA-OV dataset with subcutaneous and visceral fat segmentations. Citations:

    Holback, C., Jarosz, R., Prior, F., Mutch, D. G., Bhosale, P., Garcia, K., … Erickson, B. J. (2016). Radiology Data from The Cancer Genome Atlas Ovarian Cancer [TCGA-OV] collection. The Cancer Imaging Archive.

    Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045-1057.

Pretrained models:

  1. OPENAI_TRANSFORMER: The GPT2 Transformer pretrained weights.
  2. WT103_FWD: The WikiText-103 forward language model weights.
  3. WT103_BWD: The WikiText-103 backward language model weights.

To download any of the datasets or pretrained weights, simply run untar_data, passing any of the names listed above, like so:

path = untar_data(URLs.PETS)
(path/'images').ls()
>> (#7393) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'),...]

To download pretrained model weights:

path = untar_data(URLs.WT103_BWD)
path.ls()
>> (#2) [Path('/home/ubuntu/.fastai/data/wt103-bwd/itos_wt103.pkl'),Path('/home/ubuntu/.fastai/data/wt103-bwd/lstm_bwd.pth')]



Reading and writing settings.ini

If a config file doesn't already exist, one is created at ~/.fastai/config.yml by default whenever an instance of the Config class is created. Here is a quick example:

config_file = Path("~/.fastai/config.yml").expanduser()
if config_file.exists(): os.remove(config_file)
assert not config_file.exists()

config = Config()
assert config_file.exists()

The config is now available as config.d:

{'archive_path': '/home/jhoward/.fastai/archive',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}

As can be seen, this is a basic config file that consists of data_path, model_path, storage_path and archive_path. All future downloads go to the paths defined in the config file, based on the type of download. For example, fastai datasets are downloaded to data_path, while pretrained model weights are downloaded to model_path, unless the default download location is updated.

Note that the default path locations in the config file can be updated. Let's first back up the config file, then update the config to show the change, and finally restore the config from the backup.

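config_bak = Path("~/.fastai/config.yml.bak").expanduser()  # backup location, assumed for this example
if config_file.exists(): shutil.move(config_file, config_bak)
config['archive_path'] = Path(".")
config.save()  # persist the change so the next Config() picks it up
config = Config()
config.d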
{'archive_path': '.',
 'data_archive_path': '/home/jhoward/.fastai/data',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}

The archive_path has been updated to ".". Now let's undo the changes we made to the config file for the purpose of this example.

if config_bak.exists(): shutil.move(config_bak, config_file)
config = Config()
config.d
{'archive_path': '/home/jhoward/.fastai/archive',
 'data_archive_path': '/home/jhoward/.fastai/data',
 'data_path': '/home/jhoward/.fastai/data',
 'model_path': '/home/jhoward/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}

class URLs


Global constants for dataset and model URLs.

The default local path is at ~/.fastai/archive/ but this can be updated by passing a different c_key. Note: c_key should be one of 'archive_path', 'data_archive_path', 'data_path', 'model_path', 'storage_path'.

url = URLs.PETS
local_path = URLs.path(url)
test_eq(local_path.parent, Config()['archive'])
local_path = URLs.path(url, c_key='model')
test_eq(local_path.parent, Config()['model'])
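
As a rough illustration of the idea behind the check above, the local path can be thought of as the folder registered under c_key joined with the last segment of the url. The sketch below is not the library's implementation: SKETCH_CONFIG, sketch_local_path, the example URL, and the choice to accept keys with or without the `_path` suffix are all assumptions made for this example.

```python
from pathlib import Path

# Hypothetical stand-in for the folders stored in the config file.
SKETCH_CONFIG = {
    "archive": Path.home() / ".fastai" / "archive",
    "data": Path.home() / ".fastai" / "data",
    "model": Path.home() / ".fastai" / "models",
}

def sketch_local_path(url: str, c_key: str = "archive") -> Path:
    """Join the folder registered under c_key with the last segment of url."""
    key = c_key.replace("_path", "")  # assumption: treat 'archive' and 'archive_path' alike
    if key not in SKETCH_CONFIG:
        raise KeyError(f"unknown c_key: {c_key!r}")
    return SKETCH_CONFIG[key] / url.split("/")[-1]

url = "https://example.com/datasets/pets.tgz"  # placeholder URL, not a real dataset
print(sketch_local_path(url))
print(sketch_local_path(url, c_key="model"))
```

Changing c_key only changes the parent folder; the file name always comes from the url.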



download_url(url, dest, overwrite=False, pbar=None, show_progress=True, chunk_size=1048576, timeout=4, retries=5)

Download url to dest unless it exists and not overwrite

download_url is a handy function inside fastai. It downloads any file from the internet to the location passed as the dest argument, and it is not limited to fastai files. As an example, let's download a file from its source URL and save it as dog.jpg:

fname = Path("./dog.jpg")
if fname.exists(): os.remove(fname)
url = ""
download_url(url, fname)
assert fname.exists()

Let's confirm that the file was indeed downloaded correctly.

from PIL import Image
im = Image.open(fname)

As can be seen, the file has been downloaded to the local path provided in dest argument. Calling the function again doesn't trigger a download since the file is already there. This can be confirmed by checking that the last modified time of the file that is downloaded doesn't get updated.

if fname.exists(): last_modified_time = os.path.getmtime(fname)
download_url(url, fname)
test_eq(os.path.getmtime(fname), last_modified_time)
if fname.exists(): os.remove(fname)
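
The skip-if-present behaviour tested above can be sketched with the standard library alone. This is a minimal illustration, not fastai's download_url: the helper name download_if_missing and its boolean return value are assumptions made for this example, and a local file:// URL is used so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path

def download_if_missing(url: str, dest: Path, overwrite: bool = False) -> bool:
    """Copy url to dest; return True if bytes were fetched, False if skipped."""
    dest = Path(dest)
    if dest.exists() and not overwrite:
        return False  # file already present: nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())
    return True

# Use a local file:// URL so the example needs no network access.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"example payload")
    target = Path(tmp) / "copy.bin"

    first = download_if_missing(source.as_uri(), target)
    second = download_if_missing(source.as_uri(), target)
    third = download_if_missing(source.as_uri(), target, overwrite=True)
    payload = target.read_bytes()

print(first, second, third)  # True False True
```

Returning a flag instead of comparing modification times makes the skip explicit, which is convenient in tests.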

We can also use the download_url function to download the pets dataset straight from the source by simply passing in its url.


download_data(url, fname=None, c_key='archive', force_download=False, timeout=4)

Download url to fname.

download_data is a convenience function and a wrapper around download_url that downloads fastai files to the appropriate local path based on the c_key.

If fname is None, it defaults to the archive folder you have in your config file (or data, model if you specify a different c_key) followed by the last part of the url: for instance, for URLs.MNIST_SAMPLE the default value of fname is ~/.fastai/archive/mnist_sample.tgz.

If force_download=True, the file is always downloaded. Otherwise, the download is triggered only when the file doesn't exist.



file_extract(fname, dest=None)

Extract fname to dest using tarfile or zipfile.

file_extract is used by default in untar_data to decompress the downloaded file.
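
To make the tarfile/zipfile dispatch concrete, here is a minimal sketch of an extractor that picks the right module from the file name. It is an illustration under assumptions, not the library's code: the name extract_archive, the suffix checks, and the ValueError are choices made for this example.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path

def extract_archive(fname, dest=None):
    """Extract a gzip-compressed tar or a zip archive into dest."""
    fname = Path(fname)
    dest = Path(dest) if dest is not None else fname.parent
    if fname.name.endswith("gz"):
        with tarfile.open(fname, "r:gz") as tar:
            tar.extractall(dest)
    elif fname.name.endswith("zip"):
        with zipfile.ZipFile(fname) as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"Unrecognized archive: {fname}")

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a tiny .tgz containing one folder with one file.
    src = tmp / "src" / "sample"
    src.mkdir(parents=True)
    (src / "a.txt").write_text("hello")
    archive = tmp / "sample.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="sample")

    out = tmp / "out"
    out.mkdir()
    extract_archive(archive, out)
    extracted_text = (out / "sample" / "a.txt").read_text()

print(extracted_text)  # hello
```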



Return newest folder on path
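
One way to find the newest folder is to compare timestamps of the sub-folders directly under path. The sketch below uses modification time; the helper name newest_subfolder and the choice of timestamp are assumptions for this example, and the library may use a different criterion.

```python
import os
import tempfile
from pathlib import Path

def newest_subfolder(path) -> Path:
    """Return the directory directly under path with the latest modification time."""
    folders = (p for p in Path(path).iterdir() if p.is_dir())
    return max(folders, key=lambda p: p.stat().st_mtime)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    old, new = tmp / "old", tmp / "new"
    old.mkdir()
    new.mkdir()
    # Set explicit timestamps so the result does not depend on timing.
    os.utime(old, (1_000_000_000, 1_000_000_000))
    os.utime(new, (2_000_000_000, 2_000_000_000))
    newest_name = newest_subfolder(tmp).name

print(newest_name)  # new
```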



Rename file if different from dest

Let's rename the untarred/unzipped data if the dest name is different from fname.
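
The renaming step can be sketched as follows. This is a simplified illustration: the helper rename_if_different is hypothetical and takes the extracted folder explicitly, and the folder names are borrowed from the examples on this page.

```python
import tempfile
from pathlib import Path

def rename_if_different(extracted: Path, dest: Path) -> Path:
    """Rename the extracted folder to dest when the two names differ."""
    if extracted.name != dest.name:
        extracted.rename(dest)
    return dest

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    extracted = tmp / "mnist_tiny"  # folder name found inside the archive
    extracted.mkdir()
    dest = tmp / "nims_tini"        # folder name expected by the caller
    rename_if_different(extracted, dest)
    renamed_ok = dest.is_dir() and not extracted.exists()

print(renamed_ok)  # True
```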


untar_data(url, fname=None, dest=None, c_key='data', force_download=False, extract_func=file_extract, timeout=4)

Download url to fname if dest doesn't exist, and un-tgz or unzip to folder dest.

untar_data is a convenience function that downloads files from url to dest. The url can be a default url from the URLs class or a custom url. If dest is not passed, files are downloaded to default_dest, which defaults to ~/.fastai/data/.

This convenience function extracts the downloaded files to dest by default. To simply download the files without extracting them, pass the noop function as extract_func.

Note that it is also possible to pass a custom extract_func to untar_data if the file name doesn't end with .tgz or .zip. Gzipped tar and zip files are supported by default, so there is no need to pass a custom extract_func for these types of files.
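
As an illustration of what a custom extract function might look like, the sketch below handles bzip2-compressed tar archives. It assumes the extract function is called as extract_func(fname, dest), mirroring the file_extract(fname, dest=None) signature shown earlier; extract_tar_bz2 is a hypothetical name, and the demo runs the function on its own rather than through untar_data.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_tar_bz2(fname, dest=None):
    """Hypothetical custom extractor for bzip2-compressed tar archives."""
    fname = Path(fname)
    dest = Path(dest) if dest is not None else fname.parent
    with tarfile.open(fname, "r:bz2") as tar:
        tar.extractall(dest)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a small .tar.bz2 archive to extract.
    src = tmp / "src" / "corpus"
    src.mkdir(parents=True)
    (src / "readme.txt").write_text("bz2 works")
    archive = tmp / "corpus.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        tar.add(src, arcname="corpus")

    out = tmp / "out"
    out.mkdir()
    extract_tar_bz2(archive, out)
    bz2_text = (out / "corpus" / "readme.txt").read_text()

print(bz2_text)  # bz2 works
```

Under the assumption above, such a function could then be supplied as extract_func=extract_tar_bz2 when calling untar_data on a .tar.bz2 url.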

Internally, if the files are not already available at the fname location, which defaults to ~/.fastai/archive/, they are downloaded to ~/.fastai/archive and then extracted to the dest location. If no dest is passed, the default_dest for the files is ~/.fastai/data. If the files are already available at the fname location but not at dest, a symbolic link is created for each file from the fname location to dest.

Also, if force_download is set to True, files are re-downloaded even if they exist.

from tempfile import TemporaryDirectory

with TemporaryDirectory() as d:
    d = Path(d)
    dest = untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    assert Path('mnist_tiny.tgz').exists()
    assert (d/'mnist_tiny').exists()

#Test c_key
tst_model = config.model/'mnist_sample'
test_eq(untar_data(URLs.MNIST_SAMPLE, c_key='model'), tst_model)
assert not tst_model.with_suffix('.tgz').exists() #Archive wasn't downloaded in the models path
assert (config.archive/'mnist_sample.tgz').exists() #Archive was downloaded there

Sometimes the extracted folder does not have the same name as the downloaded file.

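with TemporaryDirectory() as d:
    d = Path(d)
    untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', dest=d, force_download=True)
    # Rename the archive so its name no longer matches the folder inside it
    Path('mnist_tiny.tgz').rename('nims_tini.tgz')
    p = Path('nims_tini.tgz')
    dest = Path('nims_tini')
    assert p.exists()
    file_extract(p, dest.parent)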
url = URLs.CALTECH_101
_add_check(url, URLs.path(url))