This document describes the Center for Advancing Research in Transportation Emissions, Energy and Health (CARTEEH) policy on data handling, storage, and sharing. CARTEEH is a Tier 1 University Transportation Center (UTC) with an explicit goal of encouraging collaboration between health and transportation researchers by integrating data sources from the two research fields. The integration of transportation and health data presents novel challenges for effective and safe data handling and sharing. For this reason, CARTEEH’s Data Management Plan (DMP) will be a “living document” that is updated as necessary over the life of the Center. One of the initial projects undertaken by CARTEEH is the establishment of a data hub to facilitate sharing of and access to data in the interdisciplinary area of health and transportation. It is anticipated that this data hub initiative will inform subsequent updates to the DMP. CARTEEH will develop the data hub into a repository conformant with the U.S. Department of Transportation (DOT) Public Access Plan.
CARTEEH’s DMP ensures that data collected during research activities involving the Center comply with DOT policy and retain the maximum value for future research. The approved and most recent DMP will be available to collaborators via the CARTEEH website and data hub.
Data Formats and Metadata Standards
Data will be collected in a variety of formats by each CARTEEH research team and stored in a single repository prior to becoming publicly accessible. Data formats anticipated from the various research activities include: 1) Tabular data – CSV, TSV, Tab, and Microsoft Excel; 2) Media – Pictures; 3) GIS – SHP, KML, GeoJSON; 4) GPS – GPX.
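As an illustration only, the anticipated formats could be checked at upload time with a simple extension lookup. The sketch below is not part of the CARTEEH Data Hub; the function name `classify_upload`, the category labels, the `.tab` extension, and the picture extensions are assumptions made for the example, since the plan does not name specific image formats.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the four format groups
# named in this plan. The picture extensions and ".tab" are assumed.
FORMAT_CATEGORIES = {
    ".csv": "tabular", ".tsv": "tabular", ".tab": "tabular",
    ".xls": "tabular", ".xlsx": "tabular",
    ".jpg": "media", ".jpeg": "media", ".png": "media",
    ".shp": "gis", ".kml": "gis", ".geojson": "gis",
    ".gpx": "gps",
}


def classify_upload(filename: str) -> str:
    """Return the format group for a file name, or "unsupported"."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return FORMAT_CATEGORIES.get(suffix, "unsupported")
```

A curator-side script could call `classify_upload("route.gpx")` before accepting a file and flag anything reported as "unsupported" for manual review.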
The metadata and data generated from the CARTEEH research projects will be uploaded and archived into the CARTEEH Data Hub.
The Data Library, a component of the Data Hub, is used to store metadata, publish data stories, and make models available to the general public. For metadata, a custom derivative of the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) is used to address the geographical and temporal nature of transportation and health datasets. The challenge with the full CSDGM is that it requires the researcher to provide a significant amount of information, a task that researchers find daunting and that discourages them from entering the metadata at all.
The custom metadata derivative of the CSDGM implemented in the Data Library includes the following:
- Identification: Title, Description, Author(s), Subject Areas, Study Type, Temporal Granularity, Spatial Granularity, and Search Tags
- Time Coverage: Status, Frequency of Publication, Temporal Coverage, Publication Date
- Spatial Coverage: Spatial/Geographical Coverage Location and Area (GeoJSON or Latitude-Longitude)
- Data Quality: Completeness, Validation Report, Logical Consistency Report, Originality, and Data Collection/Quality Control Steps
- Legal Constraints: Use Constraints, Access Constraints, and License
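The five metadata groups listed above could be represented as a nested record and checked for completeness before publication. The following is a minimal sketch under stated assumptions: the key names are hypothetical snake_case renderings of the field names in the list, and treating every field as required is an assumption, not a documented rule of the Data Library.

```python
# Hypothetical field names derived from the metadata groups in this plan.
REQUIRED_FIELDS = {
    "identification": [
        "title", "description", "authors", "subject_areas", "study_type",
        "temporal_granularity", "spatial_granularity", "search_tags",
    ],
    "time_coverage": [
        "status", "publication_frequency", "temporal_coverage",
        "publication_date",
    ],
    "spatial_coverage": ["location", "area"],
    "data_quality": [
        "completeness", "validation_report", "logical_consistency_report",
        "originality", "quality_control_steps",
    ],
    "legal_constraints": ["use_constraints", "access_constraints", "license"],
}


def missing_fields(record: dict) -> list:
    """List "section.field" names that are absent or empty in a record."""
    missing = []
    for section, names in REQUIRED_FIELDS.items():
        values = record.get(section, {})
        for name in names:
            # Treat None, empty strings, and empty lists as not provided.
            if values.get(name) in (None, "", []):
                missing.append(f"{section}.{name}")
    return missing
```

A curator could run `missing_fields(record)` on a submitted record and return the resulting list to the research team instead of rejecting the upload outright.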
Access to the Data Library is provided through a web browser interface and is controlled by username and password. Datasets published within the Data Library are associated with a consistent metadata format that documents the characteristics of the data.
To ensure that versioning is handled consistently, only data repository curators will be able to publish new versions of existing datasets, in collaboration and coordination with the CARTEEH research teams. After datasets are uploaded to the data hub, curators will verify compliance of each dataset with this CARTEEH DMP. Prior to publishing, each PI will be required to verify that the public dataset produced matches their expectations and is an accurate representation of what they provided. CARTEEH researchers will ensure that archived data (and the associated metadata) is understandable and usable by other researchers.
Access and Sharing Policies
All data collected within CARTEEH is made accessible via the Data Library and Data Storage components of the Data Hub. The metadata is publicly accessible; access to the raw data, however, is controlled through security/privilege levels. Raw data made available to the public will not contain private or confidential information.
Data with private or confidential information must be de-identified before being made publicly available. Some sensitive data may be made available only as brief descriptions in the metadata, allowing the PI who collected the data to make case-by-case decisions on data sharing. Data that raises any concerns regarding privacy, ethics, or confidentiality will not be made available to the public.
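This plan does not prescribe a de-identification method. As one possible illustration for tabular data, the sketch below drops direct-identifier columns and replaces a participant identifier with a salted hash. The column names, the `deidentify` function, and the `pseudo_id` field are all hypothetical, and a step like this would typically be only one part of a de-identification review, since remaining columns can still identify individuals in combination.

```python
import hashlib

# Hypothetical names of columns that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def deidentify(rows, id_column, salt):
    """Return copies of rows without direct identifiers.

    The original identifier in `id_column` is replaced by a truncated
    salted SHA-256 digest, so records from the same participant can
    still be linked without exposing the original value.
    """
    cleaned = []
    for row in rows:
        out = {
            key: value
            for key, value in row.items()
            if key not in DIRECT_IDENTIFIERS and key != id_column
        }
        raw = (salt + str(row[id_column])).encode("utf-8")
        out["pseudo_id"] = hashlib.sha256(raw).hexdigest()[:16]
        cleaned.append(out)
    return cleaned
```

The salt would need to be kept private by the research team; anyone who knows it can recompute the mapping from known identifiers.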
Re-Use, Redistribution, and Derivative Products Policies
By default, metadata from the CARTEEH data hub is available for open sharing under the Creative Commons Zero (CC0) universal public domain dedication. Under CC0, data and derivative products will be available for reuse and redistribution without restriction. More information on the CC0 waiver can be found on the Creative Commons website (http://creativecommons.org/about/cc0). Researchers uploading data can, however, opt out of using the CC0 waiver for their datasets if needed.
Credit to the source of the data is nevertheless required for any materials (books, articles, conference papers, theses, dissertations, reports, and other such publications) that employ, reference, or otherwise utilize the data (in whole or in part) generated by CARTEEH researchers and uploaded to the Data Library.
Archiving and Preservation Plans
Daily backups of the data library and datasets are performed automatically. All original raw data files and data source processing programs will be versioned over time and maintained in a date-stamped file structure with text files documenting the provenance.
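The date-stamped file structure with provenance text files could look something like the sketch below. It is an assumption about layout, not a description of the CARTEEH implementation: the `archive_raw_file` function, the ISO-date folder names, the `PROVENANCE.txt` file name, and the recorded SHA-256 checksum are all choices made for the example.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_raw_file(source: Path, archive_root: Path, note: str,
                     stamp: Optional[date] = None) -> Path:
    """Copy a raw file into <archive_root>/<YYYY-MM-DD>/ and log provenance."""
    stamp = stamp or date.today()
    target_dir = archive_root / stamp.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)

    # copy2 preserves file timestamps along with the contents.
    target = target_dir / source.name
    shutil.copy2(source, target)

    # Append one tab-separated line per archived file: name, checksum, note.
    checksum = hashlib.sha256(target.read_bytes()).hexdigest()
    with (target_dir / "PROVENANCE.txt").open("a", encoding="utf-8") as log:
        log.write(f"{source.name}\tsha256={checksum}\t{note}\n")
    return target
```

Because each day's files land in their own folder, an earlier version of a raw file is never overwritten by a later upload with the same name on a different date.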