Data Management and Sharing

Organizing Data

One of the most important aspects of data management is organizing your data. This involves thinking through how your files are named, how they are structured, and how they relate to one another.

To demonstrate the importance of these elements, let's return to our example of the file data.csv. If you were given access to a folder with the contents below and told to work in the most up-to-date version of that file, could you do so? The folder is not especially structured, but could you figure anything out from the names of the files? Without other information, could you understand the relationships between the different files in this folder?

Naming Files

Ideally, your file names should include whatever information you would need to easily locate a specific file. For files associated with a specific research project, it is often a good idea to maintain a standardized file naming convention: a system of naming files that is consistent across related files.

If you are maintaining different versions of the same file, you should document how the versions differ. Whatever convention you end up using, make sure your file names are unique, descriptive, and meaningful.

Even if you are not working with the command line, it is not advisable to include spaces or special characters in your file names. Down the line, these might cause problems for you or other researchers who may want to access or use your files on other systems.

Things to include in a file name:
  1. Project, experiment, or investigator name
  2. Date or version number of file
  3. Description of file contents
Things to avoid:
  1. Generic or uninformative file names
  2. Spaces
  3. Special characters (e.g. ~, !, @, #, $, %)
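
The checklist above can be turned into a quick automated check. The following is a minimal sketch in Python, not a definitive tool: the function name check_file_name and the set of "generic" names are assumptions made for illustration, and the special characters tested are only the examples listed above.

```python
import re

# Hypothetical list of names considered too generic to be useful.
GENERIC_STEMS = {"data", "file", "final", "new", "untitled"}

# Only the example characters from the list above; extend as needed.
SPECIAL_CHARACTERS = re.compile(r"[~!@#$%]")


def check_file_name(name: str) -> list[str]:
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if SPECIAL_CHARACTERS.search(name):
        problems.append("contains special characters")
    # Compare the part before the extension against the generic names.
    stem = name.rsplit(".", 1)[0]
    if stem.lower() in GENERIC_STEMS:
        problems.append("generic or uninformative name")
    return problems


for candidate in ["data.csv", "my data!.csv",
                  "examplestudy_participant01_version01.csv"]:
    print(candidate, "->", check_file_name(candidate) or "ok")
```

A check like this cannot tell whether a name is meaningful to your project. It only flags the patterns that are easy to detect mechanically.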

With this in mind, we can rename our file data.csv to something a bit less generic.

The new name of our file, examplestudy_participant01_version01.csv, includes the name of the study (Example Study), a participant ID number (participant01), and a version number (version01). Presumably, this file would also be accompanied by some documentation that describes the contents of this file and any changes between this and subsequent versions.

If we were to compile the data in this file with data collected from other participants, the resulting file could be named something like examplestudy_combined_version01.csv. Again, this file would be accompanied by documentation describing any changes made between this and any subsequent versions.
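
One way to keep names like these consistent is to generate them rather than type them by hand. The sketch below assumes the study_label_versionNN pattern used in the examples above. The function name build_file_name and its parameters are hypothetical, and your own convention may call for different components.

```python
def build_file_name(study: str, label: str, version: int, ext: str = "csv") -> str:
    """Assemble a file name of the form study_label_versionNN.ext.

    Each component is lowercased and stripped of spaces and special
    characters so the result is safe to use across systems.
    """
    parts = [study, label, f"version{version:02d}"]
    cleaned = ["".join(ch for ch in part.lower() if ch.isalnum()) for part in parts]
    return "_".join(cleaned) + "." + ext


print(build_file_name("Example Study", "participant01", 1))
print(build_file_name("Example Study", "combined", 1))
```

Zero-padding the version number (01, 02, and so on) keeps the files in order when they are sorted by name.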

Exactly what information you should include in your file names will depend on your data, your project, and how you plan to structure your files.

Structuring Files

In addition to giving your files descriptive and meaningful names, another way to make sure you can find the files you need when you need them is to use a consistent folder (or directory) structure.

The file structure below, which is adapted from the TIER Protocol, shows one way you can consider organizing files associated with a given project.

Your own structure may look very different from this, but this structure includes several useful features:

  • The project folder includes subfolders, each of which has a descriptive and meaningful name.
  • Each subfolder contains only the files that belong there.
  • A readme file sits at the top of the file structure. This is a text file that outlines the contents of each subfolder and other information a researcher would need to navigate the structure.
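
A structure with these features can be set up in a few lines of Python. This is a sketch under stated assumptions: the subfolder names and descriptions below are hypothetical placeholders, not the layout defined by the TIER Protocol, and the function name create_project_skeleton is made up for this example.

```python
from pathlib import Path

# Hypothetical subfolders; replace with the structure your project needs.
SUBFOLDERS = {
    "original_data": "Data files exactly as they were collected or received.",
    "analysis_data": "Cleaned or combined files used for analysis.",
    "scripts": "Code that processes or analyzes the data.",
    "documents": "Reports, notes, and other documentation.",
}


def create_project_skeleton(root: Path) -> Path:
    """Create the project folder, its subfolders, and a top-level readme.

    Returns the path to the readme file.
    """
    root.mkdir(parents=True, exist_ok=True)
    lines = [f"Contents of {root.name}", ""]
    for name, description in SUBFOLDERS.items():
        (root / name).mkdir(exist_ok=True)
        lines.append(f"{name}/: {description}")
    readme = root / "readme.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling create_project_skeleton(Path("examplestudy")) would create the folder in the current working directory. Because the readme is generated from the same list as the folders, the two start out in agreement, but the readme still needs to be updated by hand as the project grows.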

Organizing Individual Files

The principles of file naming and organization can also be applied within files.
Let's revisit the contents of our file, formerly named data.csv and now named examplestudy_participant01_version01.csv. Changing the variable names into something more descriptive and meaningful provides a lot more information about what's in our dataset.
The contents of a file named examplestudy_participant01_version01.csv
day temp_f hr_rest spo2
1 97.5 55 97
2 97.6 52 98
3 97.5 49 97
4 97.5 58 98
5 97.4 56 98

We can now infer that our data contains information about temperature, resting heart rate, and oxygen saturation collected over the course of five days. However, to fully understand, replicate, and build upon work related to this data, additional documentation is required.
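
One common form of additional documentation is a data dictionary that lists each variable alongside a description. The sketch below shows how a starting point for one could be drafted from the file itself. It assumes the file is comma-separated (the table above shows the values without delimiters), and the descriptions are limited to what can be inferred from the variable names. Units and collection methods are deliberately left out because they are not stated here and would need to come from whoever collected the data. The function name summarize_columns is hypothetical.

```python
import csv
import io

# The file contents shown above, assumed here to be comma-separated.
CSV_TEXT = """day,temp_f,hr_rest,spo2
1,97.5,55,97
2,97.6,52,98
3,97.5,49,97
4,97.5,58,98
5,97.4,56,98
"""

# Descriptions inferred from the variable names; confirm with the data collector.
DESCRIPTIONS = {
    "day": "Day of data collection",
    "temp_f": "Temperature",
    "hr_rest": "Resting heart rate",
    "spo2": "Oxygen saturation",
}


def summarize_columns(csv_text: str, descriptions: dict[str, str]) -> list[dict]:
    """Return one entry per column with its description and observed range."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = []
    for column in reader.fieldnames or []:
        values = [float(row[column]) for row in rows]
        summary.append({
            "variable": column,
            # Flag any column that has no description yet.
            "description": descriptions.get(column, "UNDOCUMENTED"),
            "min": min(values),
            "max": max(values),
        })
    return summary


for entry in summarize_columns(CSV_TEXT, DESCRIPTIONS):
    print(entry)
```

The observed minimum and maximum are included because they help a reader spot values that fall outside the expected range. They do not replace a written description of how each variable was measured.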