Storing Data

Where to Put Your Data

On Turing, there are three places where you can put your data:

Home Directories: Used for general project data storage
Scratch: Used for short-term, temporary file storage
Archive: Used for bulk file storage

When to use Home Directories

Home directories are located at /home/${USER}/, and are the default directory you are placed in when you log in to Turing. If you aren't sure where your data should go, put it here! Home directories are frequently backed up, and snapshots are taken daily.

When to use Scratch

Scratch directories are located at /scratch/${USER}/, and should be used for storing large amounts of data or files that don't need to be backed up. For example, a simulation job might write regular checkpoints so that it can resume after an error. Another common use case is storing large, publicly available datasets that can be re-downloaded if needed. If these files are created in your home directory, they will be snapshotted and backed up, potentially for years. If you know your data doesn't need that level of protection, use scratch. Because scratch is meant for transient storage, we may either delete or archive data on scratch if it hasn't been used in a while.
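The checkpoint use case described above could look something like the following. This is only a minimal sketch: the function names, the `job_name` label, and the pickle-based file format are assumptions made for illustration, not something Turing provides or requires.

```python
import os
import pickle
from pathlib import Path


def scratch_dir(job_name, root="/scratch"):
    """Return (and create) a per-job directory under the user's scratch space."""
    # /scratch/${USER}/ is the documented location; job_name is a hypothetical label.
    user = os.environ.get("USER", "unknown")
    path = Path(root) / user / job_name
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_checkpoint(state, step, directory):
    """Write a checkpoint via a temporary file so a crash never leaves a partial one."""
    final = Path(directory) / f"checkpoint_{step:06d}.pkl"
    tmp = final.with_suffix(".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, final)  # atomic rename within the same filesystem
    return final


def load_latest_checkpoint(directory):
    """Return the state from the highest-numbered checkpoint, or None if there is none."""
    checkpoints = sorted(Path(directory).glob("checkpoint_*.pkl"))
    if not checkpoints:
        return None
    with open(checkpoints[-1], "rb") as f:
        return pickle.load(f)
```

A job would typically call `load_latest_checkpoint(scratch_dir("my-simulation"))` at startup and `save_checkpoint(...)` every so often while running. Since data on scratch may be deleted, copy anything you want to keep to your home directory when the job finishes.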

When to use Archive

Archive is located at /archive/${USER}, and is used for long-term bulk storage. It is slower than the home directories, but has much more available space. Data here is snapshotted just like home directories. It is a good place to put old projects or datasets you aren't actively working on, but might need in the future. However, data stored here should not be used directly for running jobs: archive is only available on the login nodes, not the compute or GPU nodes. If you need to use archived data in a job, move it to your home directory or copy it to scratch first. Because archive is meant for long-term storage, we may save space by grouping files that haven't been used in a while into a compressed zip file.
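Copying archived data to scratch before a job can be done with ordinary shell tools, or from Python as sketched below. The function name and the example paths are hypothetical, and the script has to be run on a login node, since that is the only place archive is available.

```python
import shutil
from pathlib import Path


def stage_from_archive(src, dst):
    """Copy a file or a directory tree from archive into a working location."""
    src, dst = Path(src), Path(dst)
    if not src.exists():
        raise FileNotFoundError(
            f"{src} not found (archive is only available on the login nodes)"
        )
    if src.is_dir():
        # dirs_exist_ok (Python 3.8+) lets an interrupted copy be re-run.
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    return dst
```

For example, `stage_from_archive("/archive/alice/old-project", "/scratch/alice/old-project")` would copy a hypothetical user's project into scratch, where compute and GPU nodes can read it.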

Common pitfalls

Keep Data in Zip Files

Often, datasets contain millions of small files, zipped together in an archive. It is tempting to extract these archives to the filesystem directly, but doing so can both put extra strain on the file server and reduce the performance of your code. Most programming languages have easy-to-use utilities for reading data directly from a zip file. For example, to read all data files in an archive using Python:

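from zipfile import ZipFile

# Open the archive once and read each member directly, without extracting to disk.
with ZipFile('training-data.zip') as myzip:
    for file in myzip.namelist():
        with myzip.open(file) as myfile:
            print(myfile.read())  # raw bytes of this member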
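The loop above reads every member of the archive fully into memory as bytes. For large text datasets it may be preferable to pick out only the members you need and stream them line by line. The following is a sketch under those assumptions; the function name `iter_lines` and the `.csv` default are illustrative only.

```python
import io
from zipfile import ZipFile


def iter_lines(zip_path, suffix=".csv", encoding="utf-8"):
    """Yield (member_name, line) pairs for matching members, without extracting."""
    with ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and members with a different extension.
            if info.is_dir() or not info.filename.endswith(suffix):
                continue
            with archive.open(info) as raw:
                # Decode the compressed stream lazily, one line at a time.
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    yield info.filename, line.rstrip("\n")
```

Because this is a generator, only one line is held in memory at a time, and the archive itself remains a single file on the file server.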