Storing Data
Where to Put Your Data
In Turing, there are three places where you can put your data:
- Home Directories: Used for general project data storage
- Scratch: Used for short-term, temporary file storage
- Archive: Used for bulk file storage
When to use Home Directories
Home directories are located at /home/${USER}/ and are the default directory
you are placed in when you log in to Turing. If you aren't sure where your data
should go, put it here! Home directories are frequently backed up, and snapshots
are taken daily.
When to use Scratch
Scratch directories are located at /scratch/${USER}/ and should be used for
storing large amounts of data or files that don't need to be backed up. For
example, a simulation job might produce regular checkpoints in order to resume
in case of an error. Another common use case is for storing large datasets which
are publicly available, and can be re-downloaded if needed. If these files are
created in the home directories, they will be snapshotted and backed up,
potentially for years. If you know your data doesn't need that level of
protection, use scratch. Because scratch is used for transient storage, we may
either delete or archive data on scratch if it hasn't been used in a while.
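The checkpoint use case above could be sketched in Python roughly as follows. This is only an illustration: the `checkpoints/<job_name>` layout, the file naming scheme, and the helper names are assumptions, not a Turing convention. The only detail taken from this page is the /scratch/${USER}/ location.

```python
import os
from pathlib import Path


def checkpoint_dir(job_name, scratch_root="/scratch"):
    """Return (and create) a per-job checkpoint directory under scratch.

    The checkpoints/<job_name> layout is a hypothetical convention.
    """
    user = os.environ.get("USER", "unknown")
    path = Path(scratch_root) / user / "checkpoints" / job_name
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_checkpoint(directory, step, payload):
    """Write one checkpoint via a temporary file and an atomic rename,
    so an interrupted job does not leave a half-written checkpoint."""
    final = Path(directory) / f"step_{step:08d}.ckpt"
    tmp = final.with_suffix(".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, final)
    return final


def latest_checkpoint(directory):
    """Return the newest checkpoint path, or None if there is none yet."""
    checkpoints = sorted(Path(directory).glob("step_*.ckpt"))
    return checkpoints[-1] if checkpoints else None
```

A job would call `latest_checkpoint` at startup to decide whether to resume, then call `save_checkpoint` at regular intervals. Since scratch data may be deleted after a period of disuse, copy any checkpoint you need to keep into your home directory.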
When to use Archive
Archive is located at /archive/${USER} and is used for long-term bulk
storage. It has slower performance than home directories, but much more
available space. Data here is snapshotted just like home directories. It is a
perfect place to put old projects or datasets you aren't actively working on,
but might need in the future. However, data stored here should not be used
directly for running jobs. As such, archive is only available on the login
nodes, not the compute or GPU nodes. If you need to use archived data in a job,
please move it to your home directory or copy it to scratch first. Because
archive is meant for long-term storage, we may save space by grouping files
that haven't been used in a while into a compressed zip file.
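Because archive is only mounted on the login nodes, archived data has to be staged before a job is submitted. A minimal Python sketch of that staging step is shown here. The helper name `stage_to_scratch` and the assumption that a dataset sits directly under /archive/${USER} are illustrative only, and `dirs_exist_ok` requires Python 3.8 or newer.

```python
import os
import shutil
from pathlib import Path


def stage_to_scratch(dataset, archive_root="/archive", scratch_root="/scratch"):
    """Copy a file or directory from archive to scratch and return the copy.

    Run this on a login node: archive is not visible from compute or GPU nodes.
    """
    user = os.environ.get("USER", "unknown")
    src = Path(archive_root) / user / dataset
    dst = Path(scratch_root) / user / dataset
    if not src.exists():
        raise FileNotFoundError(src)
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        # Copy the whole tree; the archived original is left untouched.
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dst)
    return dst
```

The job script can then read from the returned scratch path instead of from archive.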
Common pitfalls
Keep Data in Zip Files
Often, datasets contain millions of small files, zipped together in an archive. It is tempting to extract these archives directly to the filesystem, but doing so can both put extra strain on the file server and reduce the performance of your code. Most programming languages have easy-to-use utilities for reading data directly from a zip file. For example, to read all data files in an archive:
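A minimal sketch in Python, using the standard-library `zipfile` module. The helper name `iter_zip_members` and the `.csv` suffix filter are assumptions for illustration; adjust the suffix to match your data files.

```python
import zipfile


def iter_zip_members(archive_path, suffix=".csv"):
    """Yield (name, bytes) for each matching file in the archive.

    Members are read straight from the zip file; nothing is extracted
    to the filesystem.
    """
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            # Skip directory entries and files with a different suffix.
            if info.is_dir() or not info.filename.endswith(suffix):
                continue
            with zf.open(info) as member:
                yield info.filename, member.read()
```

Looping over `iter_zip_members("dataset.zip")` (a hypothetical file name) yields one file's contents at a time, so the file server sees a single large file rather than millions of small ones.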