Documentation refers to information that is needed to make use of your data. In practical terms, you should consider any documentation that relates to your data as part of your data.
The video below, which was developed as part of NIH's training modules for rigor and reproducibility, demonstrates what can happen when study procedures are not properly documented.
In general, you should maintain documentation at both the project level and the file level. The following examples are not meant to be an exhaustive list, but to illustrate the type of information you should try to document. Project level documentation includes information about the processes used throughout the project, including how you and your collaborators are collecting, organizing, and analyzing your data. File level documentation includes details related to individual files.
A good rule of thumb is to always document more than you think is necessary.
|Project Level Documentation|File Level Documentation|
|---|---|
|Context of data collection, including tools used.|The meaning of variable names and descriptions associated with a file.|
|Details of the data collection methodology.|The definitions of codes and classification schemes used in a file.|
|How data files are structured and organized.|How missing data is coded in a file (and why it is missing).|
|How data is validated/How quality assurance checks are completed.|The meaning of terms/acronyms in a file.|
|How data is manipulated or transformed from raw data through the data analysis process.|Details of the algorithms/processes used to transform specific files.|
|The software tools used (including versions).|Details about when files were changed and by whom.|
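One project-level item, the software tools used (including versions), can be recorded automatically rather than from memory. The following is a minimal sketch for projects analyzed in Python. It uses only the standard library, and the function name `software_versions` and the package list passed to it are illustrative assumptions, not part of any prescribed workflow.

```python
import platform
from importlib import metadata


def software_versions(packages):
    """Return the Python version, the platform string, and package versions.

    Packages that are not installed are recorded as None, so the gap is
    visible in the documentation instead of silently skipped.
    """
    versions = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical package list: replace it with the tools your project uses.
for tool, version in software_versions(["numpy", "pandas"]).items():
    print(f"{tool}: {version}")
```

Saving this output alongside your analysis files, for example in a Readme, gives later users a record of the environment in which the analysis was run.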
Documentation can be maintained in a variety of forms. Some common forms of documentation are:
Readme - A Readme file is a text file located in a project-related folder that describes the contents and structure of the folder and/or a dataset so that a researcher can locate the information they need.
Data Dictionary - Also known as a codebook, a data dictionary defines and describes the elements of a dataset so that it can be understood and used at a later date.
Protocol - A protocol describes the procedure(s) or method(s) used in the implementation of a research project or experiment.
Lab Notebook - For research groups that use them, lab notebooks are often the primary record of the research process. They are used to document hypotheses, experiments, analyses, and interpretations of experiments. For information about keeping a lab notebook, see this page from Stanford's Office of Technology Licensing.
Metadata - Metadata is data about data. There are different types of metadata, including descriptive metadata (information about the content of your data), structural metadata (information about the physical structure of your data, including file format), and administrative metadata (information about how and when your data was created). Metadata often conforms to a specific scheme, which is a set of standardized rules about how the metadata is organized and used.
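To make the three types of metadata concrete, the sketch below builds a small metadata record for a single file. It is an illustration only: the field names are assumptions rather than a published metadata scheme, the file `example.csv` and its contents are placeholders, and the file's last-modified time stands in for administrative information because the creation time is not available consistently across operating systems.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_metadata(path, title, description):
    """Group basic metadata for one file into the three types described above."""
    path = Path(path)
    stat = path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        # Descriptive: what the data is about (supplied by the researcher).
        "descriptive": {"title": title, "description": description},
        # Structural: how the file is physically organized.
        "structural": {
            "file_name": path.name,
            "file_format": path.suffix.lstrip("."),
            "size_bytes": stat.st_size,
        },
        # Administrative: when the file was last changed (a stand-in for
        # "how and when the data was created").
        "administrative": {"last_modified_utc": modified.isoformat()},
    }


# Placeholder file created in a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    example = Path(folder) / "example.csv"
    example.write_text("column_a,column_b\n")
    record = build_metadata(
        example,
        title="Placeholder title",
        description="Placeholder description of the file contents.",
    )
    print(json.dumps(record, indent=2))
```

If your discipline has a standard metadata scheme, its field names and rules should take precedence over the ad hoc ones used here.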
There are a variety of ways to maintain documentation related to your research. It can be as simple as developing a regular practice of documenting your process in a Google Document or as formal as maintaining a lab notebook.
If you need to maintain protocols, we strongly recommend a tool like protocols.io. Protocols.io allows you to create detailed, step-by-step, interactive, and dynamic protocols that can be run on mobile or web. The premium version, which allows you to create both private and public protocols, is available to Stanford researchers free of charge (simply follow these instructions).
Using protocols.io, you can:
In the organization section, we discussed giving our data unique and descriptive names. The variable names do indeed give important information about what is in each column. However, additional information may still be necessary to understand the contents of the data and how it was collected and analyzed.
Below is a simple data dictionary for the file examplestudy_participant01_version01.csv. It includes the name of each variable in the file (which contains no spaces or special characters), the variable's name written out in plain language, and information about the attributes of each variable (including units).
|Variable name|Plain-language name|Attributes|
|---|---|---|
|day|day|The day (out of 5) the measure was collected. Days are consecutive.|
|temp_f|body temperature (Fahrenheit)|The body temperature of the participant, measured in degrees Fahrenheit. Body temperature was taken using a non-contact forehead thermometer.|
|hr_rest|heart rate (resting)|The resting heart rate of the participant, measured in beats per minute. Heart rate was taken using a fingertip pulse oximeter.|
|spo2|oxygen saturation|Pulsatile oxygen saturation, measured as a percentage. SpO2 was taken using a fingertip pulse oximeter.|
This data dictionary does not contain information about the steps used to collect the data in this file, the software tools used to analyze the data, or other details that would be necessary to understand or build upon this data. Much of this information would be recorded in a protocol.
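A data dictionary like the one above can also be kept in a machine-readable form, which makes it possible to check that every column in a file is documented. The sketch below is one way to do this in Python, not a required approach. The `label` and `unit` field names, the `check_header` function, and the extra `notes` column in the sample header are all assumptions made for illustration.

```python
import csv
import io

# The data dictionary from the table above, keyed by variable name.
# The "label" and "unit" field names are illustrative assumptions.
DATA_DICTIONARY = {
    "day": {"label": "day", "unit": "day of collection (out of 5)"},
    "temp_f": {"label": "body temperature (Fahrenheit)", "unit": "degrees Fahrenheit"},
    "hr_rest": {"label": "heart rate (resting)", "unit": "beats per minute"},
    "spo2": {"label": "oxygen saturation", "unit": "percent"},
}


def check_header(csv_text, dictionary):
    """Compare a CSV header row against a data dictionary.

    Returns (undocumented, absent): columns in the file that have no
    dictionary entry, and dictionary entries with no matching column.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    undocumented = [name for name in header if name not in dictionary]
    absent = [name for name in dictionary if name not in header]
    return undocumented, absent


# Hypothetical header row with one column ("notes") that is not documented.
sample_header = "day,temp_f,hr_rest,spo2,notes\n"
undocumented, absent = check_header(sample_header, DATA_DICTIONARY)
print(undocumented)  # ['notes']
print(absent)        # []
```

A check like this only confirms that each column has an entry. It cannot confirm that the descriptions are accurate or complete, so it supplements the written data dictionary rather than replacing it.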
Again, a good rule of thumb is to document more than you think is necessary. Even if you think you'll remember what your variables represent, what procedures you applied, what software you used, and other details, document it all.