One of the most important aspects of data management is organizing your data. This involves thinking through how your files are named, how they are structured, and how they relate to one another.
To demonstrate the importance of these elements, let's return to our example of the file data.csv. If you were given access to a folder with the contents below and told to work in the most up-to-date version of that file, could you do so? The folder is not especially structured, but could you figure anything out from the names of the files? Without other information, could you understand the relationships between the different files in this folder?
Ideally, your file names should include whatever information you would need to easily locate a specific file. For files associated with a specific research project, it is often a good idea to maintain a standardized file naming convention: a system of naming files that is consistent across related files.
If you are maintaining different versions of the same file, you should consider documenting the differences between the versions. Whatever convention you end up using, make sure your file names are unique, descriptive, and meaningful.
Even if you are not working with the command line, it is not advisable to include spaces or special characters in your file names. Down the line, these might cause problems for you or other researchers who may want to access or use your files on other systems.
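As a rough illustration of the point about spaces and special characters, the Python sketch below checks whether a file name sticks to a conservative character set and suggests a cleaned-up alternative. The allowed set (ASCII letters, digits, underscores, hyphens, and periods) and the function names `is_safe_filename` and `suggest_safe_filename` are assumptions made for this example, not a formal standard.

```python
import re

# Assumption: a "safe" name uses only ASCII letters, digits,
# underscores, hyphens, and periods.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")
UNSAFE_RUN = re.compile(r"[^A-Za-z0-9._-]+")


def is_safe_filename(name: str) -> bool:
    """Return True if the whole name uses only the allowed characters."""
    return SAFE_NAME.fullmatch(name) is not None


def suggest_safe_filename(name: str) -> str:
    """Replace each run of disallowed characters with one underscore."""
    return UNSAFE_RUN.sub("_", name.strip())


if __name__ == "__main__":
    for candidate in ["data.csv", "my data (final).csv"]:
        print(candidate, is_safe_filename(candidate), suggest_safe_filename(candidate))
```

A check like this only flags characters that tend to cause trouble across systems; it says nothing about whether a name is descriptive or meaningful.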
With this in mind, we can rename our file data.csv to something a bit less generic.
The new name of our file, examplestudy_participant01_version01.csv, includes the name of the study (Example Study), a participant ID number (participant01), and a version number (version01). Presumably, this file would also be accompanied by some documentation that describes the contents of this file and any changes between this and subsequent versions.
If we were to compile the data in this file with data collected from other participants, the resulting file could be named something like examplestudy_combined_version01.csv. This file, too, would be accompanied by documentation describing any changes made between this and any subsequent versions.
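One way to keep a convention like this consistent is to generate and check file names with a small script rather than typing them by hand. The Python sketch below follows the study_label_versionNN pattern used in the examples above. The helper names (`build_filename`, `participant_label`, `parse_filename`) and the two-digit zero padding are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Assumption: names follow <study>_<label>_version<NN>.<ext>, all lowercase.
NAME_PATTERN = re.compile(
    r"(?P<study>[a-z0-9]+)_(?P<label>[a-z0-9]+)_version(?P<version>\d+)\.(?P<ext>[a-z0-9]+)"
)


def participant_label(participant_id: int) -> str:
    """Format a participant number as a zero-padded label."""
    return f"participant{participant_id:02d}"


def build_filename(study: str, label: str, version: int, ext: str = "csv") -> str:
    """Assemble a file name from its parts, zero-padding the version."""
    return f"{study}_{label}_version{version:02d}.{ext}"


def parse_filename(name: str) -> Optional[Dict[str, str]]:
    """Split a conforming name into its parts, or return None if it does not conform."""
    match = NAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(build_filename("examplestudy", participant_label(1), 1))
    print(build_filename("examplestudy", "combined", 1))
    print(parse_filename("data.csv"))
```

Zero-padded numbers keep files in the expected order when sorted alphabetically, which is why participant01 is used here rather than participant1.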
Exactly what information you should include in your file names will depend on your data, your project, and how you plan to structure your files.
In addition to giving your files descriptive and meaningful names, another way to make sure you can find the files you need when you need them is to use a consistent folder (or directory) structure.
The file structure below, which is adapted from the TIER Protocol, shows one way you can consider organizing files associated with a given project.
Your own structure may look very different from this, but this structure includes several useful features:
| day | temp_f | hr_rest | spo2 |
|-----|--------|---------|------|
| 1   | 97.5   | 55      | 97   |
| 2   | 97.6   | 52      | 98   |
| 3   | 97.5   | 49      | 97   |
| 4   | 97.5   | 58      | 98   |
| 5   | 97.4   | 56      | 98   |
We can now infer that our data contains information about temperature, resting heart rate, and oxygen saturation collected over the course of five days. However, to fully understand, replicate, and build upon work related to this data, additional documentation is required.
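The kind of inference described here can also be done programmatically. The Python sketch below reads the values from the table and reports the column names and the number of rows. It assumes the file is stored as comma-separated text with a header row; the `CSV_TEXT` constant and the `summarize` function are hypothetical stand-ins for reading the real file.

```python
import csv
import io

# Assumption: the file contents match the table, stored as comma-separated text.
CSV_TEXT = """day,temp_f,hr_rest,spo2
1,97.5,55,97
2,97.6,52,98
3,97.5,49,97
4,97.5,58,98
5,97.4,56,98
"""


def summarize(csv_text: str):
    """Return the column names and the number of data rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return list(reader.fieldnames or []), len(rows)


if __name__ == "__main__":
    columns, row_count = summarize(CSV_TEXT)
    print("columns:", columns)
    print("rows:", row_count)
```

Column names such as temp_f or hr_rest hint at what was measured, but they do not record how or with what instrument, which is part of why separate documentation is still needed.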