Saving data means storing research materials so that they can be accessed and used – by yourself or by others – at a later date.
According to Stanford's Research Policy Handbook, research data must be archived for a minimum of three years after the final project close-out, with original data retained wherever possible.
The following circumstances may justify longer periods of retention:
Beyond the period of retention specified in the handbook, the destruction of the research record is at the discretion of the PI and his or her department or laboratory.
When saving your data, be aware of its risk classification. At Stanford, most research data is considered "low risk", meaning it can be saved on a wide array of services and platforms. The exception is "regulated data", which includes datasets containing protected health information (PHI), social security numbers, and financial information. Regulated data has special requirements and should only be saved on approved services.
This page from University IT provides a breakdown of how Stanford classifies risk as well as which services can be used to store low, medium, and high-risk data. If you are starting a project that you believe could involve the collection, storage, and/or use of high-risk data, please complete a data risk assessment.
Whenever possible, maintain at least two backup copies of your data. One of these copies should be in a different geographic location (such as in the cloud). To prevent your data from being lost, it is important to schedule regular backups and make sure you are also backing up your documentation.
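As a rough illustration of the backup advice above, the following Python sketch copies a project folder to one or more backup locations and confirms each copy with a checksum. The function names and folder layout are hypothetical, and the sketch assumes the backup locations (for example, an external drive and a synced cloud folder) are already mounted as ordinary directories. It is not a replacement for a managed backup service.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source_dir: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy every file under source_dir into each backup directory.

    Data files and documentation (README files, codebooks) are treated
    alike. Each copy is verified against the checksum of the original;
    an IOError is raised on the first mismatch.
    """
    verified = []
    source_files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    for source_file in source_files:
        relative = source_file.relative_to(source_dir)
        expected = sha256_of(source_file)
        for backup_dir in backup_dirs:
            target = backup_dir / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source_file, target)  # copy2 also keeps timestamps
            if sha256_of(target) != expected:
                raise IOError(f"Checksum mismatch for {target}")
            verified.append(target)
    return verified
```

A script like this could be run on a schedule (for instance with cron or Task Scheduler) so that backups happen regularly rather than only when you remember to make them.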
Understand the difference between working storage and preservation. How and where you save data while you are working on it differs from how and where you save it to preserve it long-term. When deciding whether to preserve your data in "raw" or "processed" form, consider which form will be most useful to future researchers, even if that future researcher is just yourself.
Stanford University classifies its information assets into risk-based categories for the purpose of determining who is allowed to access the information and what security precautions must be taken to protect it against unauthorized access.
Except for regulated data such as Protected Health Information (PHI), Social Security Numbers (SSNs), and financial account numbers, research data and systems predominately fall into the Low Risk classification.
Personally-Identifying Information (PII) - Information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual.
Protected Health Information (PHI) - Protected health information is a subset of personally-identifying information. Under US law, PHI is information in the medical record or designated record set that can be used to identify an individual and that was created, used, or disclosed in the course of providing a health care service such as diagnosis or treatment.
Under the US Health Insurance Portability and Accountability Act (HIPAA), PHI that is based on the following list of 18 identifiers must be treated with special care:
For additional information about working with PHI, see this page from the Technology and Digital Solutions (TDS) team.
When saving data, it is important to consider not only where you are saving it but also how you are saving it.
To illustrate this, see the GIF below of gene names being entered into cells of an Excel spreadsheet. Note how Excel automatically converts the information in these cells to dates, which could have significant effects on any analyses based on this dataset. This issue is so common that, in 2020, the HUGO Gene Nomenclature Committee drafted new guidelines that, among other things, change the names of affected genes.
This example is not meant as a criticism of a particular software tool, but to demonstrate the importance of understanding what the software you are using to save your data is doing. Just as you might do quality assurance checks when collecting your data, be sure to do similar checks while saving.
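One way to build such a check into your workflow is sketched below in Python. It compares the values you intended to save against the values read back from the saved file, and it flags entries that look like spreadsheet-converted dates. The function names are hypothetical, and the date pattern is an assumption: it covers day-month displays such as "1-Mar" (a commonly cited result of entering the gene symbol MARCH1), but the form a converted value takes depends on the software and its locale settings.

```python
import re

# Assumed pattern for values such as "1-Mar" or "2-Sep"; adjust it to
# match how your own software displays converted dates.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_date_like(values):
    """Return (index, value) pairs for entries that look like dates."""
    return [(i, v) for i, v in enumerate(values) if DATE_LIKE.match(v.strip())]


def changed_on_round_trip(original, reloaded):
    """Return (index, before, after) for every value altered by saving."""
    if len(original) != len(reloaded):
        raise ValueError("Original and reloaded data differ in length")
    return [
        (i, before, after)
        for i, (before, after) in enumerate(zip(original, reloaded))
        if before != after
    ]


intended = ["TP53", "MARCH1", "BRCA1", "SEPT2"]
as_saved = ["TP53", "1-Mar", "BRCA1", "2-Sep"]  # what a spreadsheet might store

print(find_date_like(as_saved))
# [(1, '1-Mar'), (3, '2-Sep')]
print(changed_on_round_trip(intended, as_saved))
# [(1, 'MARCH1', '1-Mar'), (3, 'SEPT2', '2-Sep')]
```

The round-trip comparison is the more general of the two checks, since it catches any silent change rather than only the ones you thought to look for.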
As you shift from saving your data as you work on it to saving it for long-term preservation, consider your file formats. When saving data over the long term, we recommend saving in formats that are as open, lossless, and unencrypted as possible.
Open (non-proprietary) formats are those that can be used and implemented by anyone. In practice, this means that files stored in open formats can be opened and used by a variety of proprietary, free, and open-source software tools rather than just a single piece of software. Open, non-proprietary formats are far more likely to remain usable over the long term, even if the software that created them is no longer available or functional.
The Library of Congress maintains a list of recommended file formats for long-term preservation which has been adapted in the table below. The following is not meant to be an exhaustive list, but to highlight especially common file formats and data types. Note that "open" file formats are not necessarily lossless.
|Category||LOC Recommendation||Open File Formats|
|Text||XML-based markup formats (EPUB, BITS, etc.), PDF, XML-based document formats (DOCX, ODF)||Plain text (TXT), HTML, Markdown, ePub, LaTeX, Open Office XML, PDF|
|Still Images||TIFF, PNG, JPEG2000||PNG, SVG, JPEG2000, GIF|
|Moving Images||MOV, MPEG-2||Matroska (MKV)|
|Audio||PCM, WAVE||WAVE, FLAC, MP3, OGG|
|Datasets||Formats using well-known schemas with a public validation tool available, line-oriented formats (TSV, CSV, fixed-width), any proprietary format that is a de facto standard for a profession or supported by multiple tools (XLS, XLSX)|
In practice, you might not be able to save your data in these formats in every situation. But this table is a good starting point for thinking about how to prepare your data for long-term preservation.
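As an example of preparing tabular data for preservation, the sketch below writes records to a line-oriented, open format (CSV or TSV) using only the Python standard library, then reads the file back so the saved copy can be checked. The function names and sample records are hypothetical. Note that CSV stores every value as text, so column types and units should be recorded in your documentation.

```python
import csv
from pathlib import Path


def save_as_csv(rows, path, delimiter=","):
    """Write a list of dictionaries to a UTF-8 delimited text file.

    Column names are taken from the keys of the first row.
    Pass delimiter="\t" to produce TSV instead of CSV.
    """
    fieldnames = list(rows[0].keys())
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path, delimiter=","):
    """Read a delimited text file back into a list of dictionaries."""
    with Path(path).open("r", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


# Hypothetical sample records; values are kept as strings so that the
# reloaded data can be compared directly with the original.
records = [
    {"sample_id": "S1", "gene": "MARCH1", "expression": "2.5"},
    {"sample_id": "S2", "gene": "SEPT2", "expression": "0.8"},
]
save_as_csv(records, "expression_data.csv")
assert load_csv("expression_data.csv") == records
```

Because the file is plain text, it can be opened by spreadsheet software, statistical packages, and text editors alike, which is the practical benefit of an open format.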