This presentation discusses managing research data through the data life cycle. It begins with an overview of the research life cycle and embedding the data life cycle within it. Key aspects of data management are then covered, including why manage data, ethical and legal issues, requirements for data sharing and retention, and creating a data management plan. The rest of the presentation delves into each stage of the data life cycle, providing best practices for data collection, organization, security, storage, documentation, processing, analysis, and long-term preservation or sharing. File formats, metadata, repositories, and bibliographic resources are also addressed.
Managing the research life cycle
1. Managing the Research Data Life Cycle
Presented by Sherry Lake
ShLake@virginia.edu
July 31, 2012 University of Florida Data Management Workshop
2. Research Life Cycle
[Figure: Research life cycle stages: Proposal Planning & Writing, Project Start Up, Data Collection, Data Analysis, Data Sharing, End of Project. Data life cycle labels: Data Discovery, Re-Use, Re-Purpose, Data Deposit, Archive.]
3. Why Manage Data?
Saves time
Others can understand your data
Makes sharing/preserving data easier
Reinforces open scientific inquiry and replication of results
Increases the visibility of your research
Facilitates new discoveries
Reduces costs by avoiding duplication
Required by funding agencies
[Research life cycle stage: Proposal Planning & Writing]
4. Ethical and Legal Issues
Confidentiality
Evaluate the sensitivity of your data
Comply with institution’s research guidelines
Comply with regulations for health research
May need to enable a restricted view of your data
Intellectual Property
Copyright
Patents
[Research life cycle stage: Proposal Planning & Writing]
5. Data Sharing and Retention Requirements
Be Aware of Funding Requirements
Informal sharing statement
Separate Data Management Plan
Know What Your Institution Requires
Know What Your Department Requires
Publisher’s Requirement
Nature Magazine
[Research life cycle stage: Proposal Planning & Writing]
6. Create a Data Management Plan
Appoint Data Manager Contact
Describe data to be collected and methodology
Include guidelines on data documentation
Plan quality assurance and backup procedures
Plan sharing of data for public use
Include preservation plans
Document copyright and intellectual property rights
[Research life cycle stage: Project Start Up]
7. Data Life Cycle within Context of the Research Life Cycle
[Figure: Research life cycle stages: Proposal Planning & Writing, Project Start Up, Data Collection, Data Analysis, Data Sharing, End of Project. Data life cycle labels: Data Discovery, Re-Use, Re-Purpose, Data Deposit, Archive.]
8. Managing Data in the Data Life Cycle
Data Collection and Organization
Data Control & Security
Backup & Storage
Documentation and Metadata
Processing and Analysis
Preparing Data to Share
9. What is Data?
Observational – data captured in real-time
Examples: sensor readings, telemetry, survey results, images
Usually irreplaceable
Experimental – data from lab equipment
Examples: gene sequences, chromatograms, magnetic field readings
Often reproducible, but can be expensive
10. What is Data?
Simulation – data generated from test models
Examples: climate models, economic models
Models & metadata (inputs) more important than output data
Derived or compiled – data
Examples: text and data mining, compiled database, 3D models
Reproducible (but very expensive)
11. Types and Formats of Data
Type                 Examples
Text                 ASCII, Word, PDF
Numerical            ASCII, SPSS, STATA, Excel, Access, MySQL
Multimedia           JPEG, TIFF, MPEG, QuickTime
Models               3D, statistical
Software             Java, C, Fortran
Domain-specific      FITS in astronomy, CIF in chemistry
Instrument-specific  Olympus Confocal Microscope Data Format
12. Organizing Your Files
File Version Control
Directory Structure/File Naming Conventions
File Naming Conventions for Specific Disciplines
File Structure
Use Same Structure for Backups
13. Data Security & Access Control
Protection of data from unauthorized access, use, change, disclosure and destruction
• Network Security
• Physical Security
• Computer Systems & Files
14. Data Security & Access Control
Network security
Keep confidential data off internet servers (or behind firewalls)
Put sensitive materials on computers not connected to the internet
Physical security
Access to buildings and rooms
Computer systems & files
Use passwords on files/systems
Virus protection
15. Data Storage
Things to consider when deciding on where and how to store your data
File Format
Media Life and Format
Disaster Recovery Plan
Environmental Conditions
Security
16. Backup Your Data
Reduce the risk of damage or loss
Use multiple locations (one off-site)
Validate using checksums
Create a backup schedule
Use reliable backup medium
Test your backup system (i.e., test file recovery)
17. Backup & Storage Options
Personal Computer
Departmental or University Server
Tape Backups
Subject archive
CDs or DVDs – NOT Recommended
External Hard Drives
Cloud Storage
18. Documentation
Start at beginning of research and continue throughout
Data documentation enables you to understand the data in detail
Enables others to find it, use it and properly cite it
19. Data Documentation
Data documentation includes information on:
+ The Project
+ Data Collection Methods
+ Structure of the data files
+ Data sources used
+ Transformations of the data
At the data-level, information on:
+ Labels and descriptions for variables & records
+ Codes and classifications
+ Derived data algorithms
+ File format and software used
21. Data Processing & Analysis
Software tools to create, process and visualize the data
+ Programming languages (Fortran, PHP, Ruby, Python, C++, etc)
+ Data collection software (LabView)
+ Analysis (SPSS, SAS, Matlab, Mathematica, R, etc)
[Research life cycle stage: Data Analysis]
22. Recording Processes
Record every change to a file, no matter how small
+ Document changes to files
+ Use file naming conventions
+ Headers inside the file
+ Log files (automatic)
+ Version Control Software (e.g. SVN)
+ File sharing software (Google Drive, Dropbox, and others)
[Research life cycle stage: Data Analysis]
23. Prepare to Share
Preparing data to share makes publishing data easier
• Archive Submission Policies/Guidelines
• File Format Conversion
• Documentation & Metadata
• Programming Code
• Citations to existing datasets
• Creation of un-restricted dataset
[Research life cycle stage: Data Sharing]
24. Choosing File Formats
Accessible in the future
• Non-proprietary
• Open, documented standard
• Common, used by the research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed
[Research life cycle stage: Data Sharing]
25. Preferred Format Choices
PDF, not Word
ASCII, not Excel
MPEG-4, not Quicktime
TIFF or JPEG2000, not GIF or JPG
XML or RDF, not RDBMS
Not software specific
[Research life cycle stage: Data Sharing]
26. Documentation & Metadata
What is Metadata?
Who created the data?
What is the content of the data set?
When was it created?
Where was it collected?
How was it developed?
Why was it developed?
[Research life cycle stage: Data Sharing]
27. Metadata Formats & Standards
Provides structure to describe data
Common terms
Definitions
Language
Structure
Many different standards (based on discipline)
DDI
FGDC
EML
Tools for creating metadata files
Nesstar (DDI)
Metavist (FGDC)
Morpho (EML)
[Research life cycle stage: Data Sharing]
28. Archiving Your Data
Informally on a peer-to-peer basis
Make accessible on online project web page
Make accessible on institutional web site
Submitting to a journal
Deposit in discipline specific repository
Deposit in Institutional Repository
29. Advantages of Repositories
Secure Environment
Quality of Data
Access Control to Data
Long-term Preservation
Licensing Arrangements
Backups
Promotion of Data
Easy Dissemination
Online Resource Discovery
30. Data Repositories
Examples of discipline-specific repositories:
+ SIMBAD (Astronomy)
+ Protein Data Bank (Biology)
+ PubChem (Chemistry)
+ GEON (Earth Science)
+ Long Term Ecological Research (Ecology)
+ ICPSR (Social Sciences)
Databib is a tool for helping people identify
and locate online repositories of research data.
http://databib.org
31. Data Management Bibliography
Graham, A., McNeill, K., Stout, A., & Sweeney, L. (2010). Data Management and Publishing. Retrieved 05/31/2012, from http://libraries.mit.edu/guides/subjects/data-management/
Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to social science data preparation and archiving: Best practices throughout the data cycle (5th ed.). Ann Arbor, MI. Retrieved 05/31/2012, from http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
Van den Eynden, V., Corti, L., Woollard, M., & Bishop, L. (2011). Managing and Sharing Data: A Best Practice Guide for Researchers (3rd ed.). Retrieved 05/31/2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
This class is aimed at those engaged in the life cycle of research, from applying for a research grant, through data collection, and ultimately to preparation of the data for deposit in a public archive. Some projects generate such enormous amounts of data that managing it takes up much of the scientists' time. Data management primarily occurs within the life cycle of a research project. Data sharing plans should be developed in conjunction with an archive to maximize the utility of the data to research and to ensure the availability of the data in the future.
Steps in the Research Life Cycle:
Proposal Planning & Writing
+ Conduct a review of existing data sets
+ Determine if the project will produce a new dataset (or combine existing ones)
+ Investigate archiving challenges, consent and confidentiality
+ Identify potential users of your data
+ Determine costs related to archiving
+ Contact archives for advice (look for archives)
Project Start Up
+ Create a data management plan
+ Make decisions about document form and content
+ Conduct pretests and tests of materials and methods
Data Collection
+ Follow best practice
+ Organize files, backups & storage, QA for data collection
+ Access control and security
Data Analysis
+ Manage file versions
+ Document analysis and file manipulations
Data Sharing
+ Determine file formats
+ Contact archive for advice
+ More documenting and cleaning up data
End of Project
+ Write paper
+ Submit report findings
+ Deposit data in data archive (repository)
Remember: managing data in a research project is a process that runs throughout the project. Good data management is the foundation for good research, especially if you are going to share your data. Good management is essential to ensure that data can be preserved and remain accessible in the long term, so it can be re-used and understood by other researchers. When managed and preserved properly, research data can be successfully used for future scientific purposes.
Planning the management of your data before you begin your research AND throughout its life cycle is essential to ensure its current usability and its long-term preservation and access.
+ You can focus on research, not user requests. With a repository keeping your data, you can focus on your research rather than fielding requests or worrying about data on a web page.
+ Your project may have lots of people working on it, and you will need to know what each is doing and has done. The project may last years.
+ Funding agencies now require a data management plan.
+ You can understand your data at a later time. Having your data documented will allow future users to understand your data and be able to use it.
+ It takes less time to get data ready to share. If you follow the plan, the data should be ready for archiving, because documenting the data throughout ensures that a proper description of the data is maintained.
Will the data contain direct or indirect identifiers that could be used to identify research participants? This is a challenge for archiving data: you need to think about consent. Links on UVa compliance in research are on the handout, and health research links are on the handout too. The HIPAA Privacy Rule (Health Insurance Portability and Accountability Act Privacy Rule) is the first comprehensive Federal protection for the privacy of personal health information. Your discipline may have other policies, e.g., the National Academy of Engineering (link on handout). Intellectual property: determine copyright and ownership of research data. If you've gathered the data from multiple sources, you need to obtain permission to publish it.
Regarding sharing and retention of research data generated from a proposal/project: before you start your plan, check the mandates, policies, and procedures of your grant funder and of UVa. Examples from UVa: UVa's policy on recordkeeping in research, and UVa's Health System Office of Research.
NIH: The NIH Data Sharing Policy & Implementation Guidance (2003) suggests including the following in proposals:
+ Schedule for data sharing
+ Format of final dataset
+ Documentation to be provided
+ Analytical tools to be provided, if any
+ Need for data sharing agreement
+ Mode of data sharing
NIH generally requires that files resulting from research awards be retained for at least three years after the final financial report has been filed. However, Commonwealth of Virginia record retention regulations are stricter (see below) and require that such records be retained five years after filing of the final financial report of a funding period.
NSF: Develop and submit specific plans to share materials collected with NSF support, except where this is inappropriate or impossible. These plans should cover how and where these materials will be stored at reasonable cost, and how access will be provided to other researchers, generally at their cost.
UVa: Data and notebooks resulting from sponsored research are the property of the University of Virginia. It is the responsibility of the principal investigator to retain all raw data in laboratory notebooks (or other appropriate format) for at least five years after completion of the research project (i.e., publication of a paper describing the work, or termination of the supporting research grant, whichever comes first) unless required to be retained longer by contract, law, regulation, or by some reasonable continuing need to refer to them.
UVa Health System: Covers data management (protection, sharing, retention times) as part of responsible conduct of research.
So how do I get started managing data? The handout has a link to Managing & Sharing Data with more details, and also a link to a Data Management Plan form. The plan should be written down, sort of like an instruction book.
Life cycle of a research project with respect to the data it creates:
+ Data Collection: data collection, entry, checking & cleaning
+ Data Analysis: analyze data, derive "new" data, data documentation
+ Data Sharing: prepare data for submission
Managing the data in the data life cycle includes backup & storage, version control, file conversions, and security & access control. Document all data details.
Here are the details of what we are going to manage in the Data Life Cycle.
National Science Board. (2005). Long-lived digital data collections: Enabling research and education in the 21st century. Retrieved from http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf
Observational data cannot be recollected, remeasured, or verified, and are archived indefinitely. The data are typically time and/or location dependent. This context is set by the fact that much of the value of observational data is in its secondary analysis. Experimental data can often be reproduced, although there are cases where experimental conditions or variables are unknown. Experimental data may be associated with a particular methodology or instrument.
These are sometimes lumped together as computational data. Data that is the result of computer models or simulations can be reproduced if adequate information is provided about the computer hardware, software, and inputs. Statistical data, computational models, and simulations can also be recreated and verified, as long as sufficient disciplines ...
Can you think of anything else as "data"? Most of the time we are managing the "digital" data. What about the non-digital: lab notebooks, notes?
This slide shows the many differing types of data and the many different formats for each one. Things to consider when choosing file formats: the collection/analysis format does not have to be the same as the preservation format, but if it is not, the data will need to be converted to an interchangeable format for archiving (we will talk about this later). You can choose one format for analysis because it may be faster to work in a proprietary format, but you will need to change to a non-proprietary format later for archiving (Prepare to Share). Migrate data into a format with these characteristics. Also keep a copy in the original software format.
Keep track of versions of documentation and data. Use directory structure and file naming conventions to help, or use version control software. Always record every change to a file, no matter how small. Record relationships between files.
Directory structure: the top-level folder should include the project name and date. Each subsequent level should have its naming convention documented, e.g., categorize by people, experiment, or dataset version.
File naming conventions: reserve the 3-letter file extension for application-specific codes, identify the project in the file name, and use dates in filenames. Some disciplines have their own recommendations for file naming.
File structure: flat files vs. (relational) database.
Keep the directory structure the same for backups.
I'll go over more detail with examples in the next presentation on best practices.
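As one possible illustration of the naming advice above, here is a minimal Python sketch that builds file names which identify the project, carry a date, and reserve the extension for the application format. The pattern project_description_YYYYMMDD.ext and the function name build_filename are assumptions made for this example, not a convention prescribed in the presentation; your discipline may recommend something different.

```python
import re
from datetime import date


def build_filename(project: str, description: str, day: date, extension: str) -> str:
    """Return a name such as 'soilstudy_site-a-moisture_20120731.csv'.

    Hypothetical convention, for illustration only.
    """
    def clean(text: str) -> str:
        # Lowercase, then replace runs of non-alphanumeric characters with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    ext = extension.lower().lstrip(".")
    return f"{clean(project)}_{clean(description)}_{day:%Y%m%d}.{ext}"


print(build_filename("SoilStudy", "Site A moisture", date(2012, 7, 31), "CSV"))
```

Putting the date in year-month-day order means that an alphabetical listing of the folder is also a chronological one.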
Keep the master copy with an assigned team member. Restrict write access to specific members. Record changes with version control.
Network: keep confidential data off internet servers (or behind firewalls), and put sensitive materials on computers not connected to the internet.
Physical security: who has access to your office? Think about allowing repairs by an outside company.
Computer: keep virus protection up to date; make sure your computer has a login password; do not send personal or confidential data via e-mail or FTP, but transmit it as encrypted data; impose confidentiality agreements on data users.
The linked Managing and Sharing Data document has an in-depth section on Ethics, Consent and Confidentiality.
Data storage applies to collected data and to backups; consider storage and backup options the same way. Use formats that will be usable in the long term and are not dependent on a software version. The media life of CDs and DVDs is not reliable: you may have to replace old media and maintain devices that can still read the proprietary formats or media types. Copy or migrate data files to new media between 2 and 5 years after they were created. Appropriate environmental conditions will increase the life span of media, so check the recommended environmental conditions for your particular media. Make sure the storage location is free from risk of fire and flood. Store "paper" data properly. Be aware of thefts, file changes and "loss" (data only on paper?).
Why back up data? Keeping reliable backups is an integral part of data management. Regular backups protect against data loss due to hardware failure, software or media faults, virus infection or hacking, power failure, and human error.
Recommendation: 3 backup copies (original, external/local, external/remote). Full backups and incremental backups.
Check the integrity of the files to ensure they were transmitted without error (checksum and file size): calculate a "value" for a block of data, perform the calculation on both files, and if the "number" is the same, the copy is OK.
If using a departmental server, check on its backup/restore procedures (how quickly can you get files restored?). You may want to have the backup procedures controlled by you. Test your backup system, test restoring files, and don't over-reuse backup media.
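The checksum-and-file-size check described above could be sketched in Python roughly as follows. The notes do not name a checksum algorithm; SHA-256 from the standard hashlib module is an assumption here, and the function names sha256_of and backup_matches are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """Return True when the file sizes and the checksums of both copies agree."""
    if original.stat().st_size != backup.stat().st_size:
        return False  # Cheap check first: different sizes cannot be identical copies.
    return sha256_of(original) == sha256_of(backup)
```

A call such as backup_matches(Path("data/readings.csv"), Path("/backup/data/readings.csv")) (both paths are placeholders) would return False if the backup copy had been truncated or altered.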
Use some options for "storage" and others for backups. Cloud storage examples: Google Docs, Dropbox, Windows Live SkyDrive, SpiderOak.
Documentation should start with the Data Management Plan. Starting at the beginning and continuing throughout reduces the likelihood that you will forget aspects of your data later. Document data collection, lab notebooks, and digitization info. Think about non-digital materials (papers, photos, reports, lab notebooks): they should be digitized and stored with the digital data. In order for the data to be used properly once it's been archived, the data must be documented. Data documentation (otherwise known as metadata) enables you to understand the data in detail, and enables others to find it, use it and properly cite it. Use versioning software for the documentation files too.
Conform to community standards for recording data and metadata that adequately describe the context and quality of the data and help others find and use it. Also document data validation and other quality assurance procedures, and modifications of the data. Information should include: Title, Creator, Subject, Funders, Rights, Dates, Location, Methodology, Data Processing, Sources, File Formats, Variable Lists, Code Lists. You may need to put this info in a metadata standard such as DDI, MODS, FGDC, Darwin Core, or EML.
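As a rough illustration of keeping the fields listed above together with the data, the following Python sketch stores them in a plain dictionary, prints it as JSON, and reports which fields are still undocumented. This is not DDI, MODS, FGDC, Darwin Core, or EML; the names REQUIRED_FIELDS and missing_fields, the decision to treat every listed field as required, and all the values shown are assumptions for the example.

```python
import json

# Field names follow the list in the notes; treating all of them as required is an assumption.
REQUIRED_FIELDS = [
    "Title", "Creator", "Subject", "Funders", "Rights", "Dates",
    "Location", "Methodology", "Data Processing", "Sources",
    "File Formats", "Variable Lists", "Code Lists",
]


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Placeholder values only; a real record would describe an actual dataset.
record = {
    "Title": "Example dataset (placeholder)",
    "Creator": "Example researcher (placeholder)",
    "Dates": "2012-07-31",
    "File Formats": ["CSV"],
}

print(json.dumps(record, indent=2))
print("Still to document:", missing_fields(record))
```

A record like this can later be mapped onto whichever discipline standard the chosen archive expects.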
Keep a copy of the data in its original form. Maintain it and the final version as read-only. With detailed documentation, someone could replicate your findings from the original set to the final one. As you analyze your data, there will be various changes, additions and deletions to the dataset. Recording them enables reproducibility (validating findings) and executability (others can re-run or re-use the analysis).
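One way to act on the two points above (a read-only original, and a record of every change) is sketched below in Python. The tab-separated log layout and the function names make_read_only and log_change are assumptions for this example; version control software would record the same information automatically.

```python
import stat
from datetime import datetime, timezone
from pathlib import Path


def make_read_only(path: Path) -> None:
    """Clear the write permission bits so the file is not changed by accident."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def log_change(log_file: Path, data_file: str, who: str, what: str) -> str:
    """Append one tab-separated line describing a change, and return that line."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = f"{stamp}\t{data_file}\t{who}\t{what}"
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
    return entry
```

For instance, after calling make_read_only on the original file, a call such as log_change(Path("CHANGELOG.txt"), "survey_v2.csv", "SL", "removed duplicate rows") (file names and initials are placeholders) adds one timestamped line to the log.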
1st version: original data collection
2nd version: "cleaned" dataset
3rd version: combining variables & analysis
Filenames include the version number and "who".
There are lots of ways to share your data without depositing it in a repository: e-mailing it to requestors, or posting it to a web site, Google, or another "cloud" sharing site. But then you have to maintain it, and it makes "finding" your data harder. Depositing it in an archive makes it easier to discover and preserve. If it's documented well, then it is easy to use. Make sure the confidentiality of respondent data is preserved: you will need to create a version of the dataset without personal info.
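Creating a version of the dataset without personal info might look like the following Python sketch, which drops named columns from CSV text. The column names in DIRECT_IDENTIFIERS, the sample row, and the function name strip_identifiers are all made up for illustration. Removing direct identifiers is only a first step: indirect identifiers may still allow participants to be identified and need separate review.

```python
import csv
import io

# Columns assumed to be direct identifiers in this made-up example.
DIRECT_IDENTIFIERS = {"name", "email"}


def strip_identifiers(source_csv: str, identifiers=DIRECT_IDENTIFIERS) -> str:
    """Return CSV text with the identifier columns removed."""
    reader = csv.DictReader(io.StringIO(source_csv))
    kept = [column for column in reader.fieldnames if column not in identifiers]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the columns that are not in `kept`.
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


restricted = "name,email,age_group,score\nA. Person,a@example.org,30-39,7\n"
print(strip_identifiers(restricted))
```

The restricted original stays untouched; only the returned text would be written to the file intended for sharing.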
The safest option to guarantee long-term data access is to convert data to standard formats. This applies to you, the researcher, even if you are not planning on sharing (publishing). These are formats more likely to be accessible in the future. The format of the file is a major factor in the ability to use the data in the future. As technology changes, plan for software and hardware obsolescence. System files (SAS, SPSS) are compact and efficient, but not very portable. Use software to "export" data to a portable (or transport) file, an "interchangeable format". Convert proprietary formats to non-proprietary ones. Check for data errors in conversion.
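The advice to check for data errors in conversion can be illustrated with a small Python sketch: write rows out as CSV (one example of a plain-text, non-proprietary format), read the file back, and compare every value. The function names export_to_csv and conversion_errors and the sample rows are assumptions for this example. Values are compared as text, which is a simplification; a real check would also consider data types, missing values, and numeric precision.

```python
import csv
import io


def export_to_csv(rows: list, columns: list) -> str:
    """Write rows (a list of dicts) as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def conversion_errors(rows: list, csv_text: str) -> list:
    """Compare the original rows with the converted file, value by value (as text)."""
    converted = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(converted) != len(rows):
        problems.append(f"row count: {len(rows)} != {len(converted)}")
    for index, (before, after) in enumerate(zip(rows, converted)):
        for column, value in before.items():
            if str(value) != after.get(column):
                problems.append(
                    f"row {index}, column {column!r}: {value!r} != {after.get(column)!r}"
                )
    return problems


rows = [{"site": "A", "reading": 1.5}, {"site": "B", "reading": 2.25}]
text = export_to_csv(rows, ["site", "reading"])
print(conversion_errors(rows, text))
```

An empty list means no differences were found; each entry otherwise names the row and column where the converted file disagrees with the original.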
Examples of preferred format choices: formats for long-term digital preservation (open). Don't expect that you (you won't have time) or the archive will be able to convert older formats to new ones. There is a good chart in the UK document on Managing and Sharing Data (page 9).
Let's stop and make sure everyone knows or can define "metadata": what you use to describe your data, the pieces of information that will allow someone to understand your data and how it was collected, thus enabling another person to replicate your results. In order for the data to be used properly once it's been archived, the data must be documented. If you have been documenting your data and files all along, this step should be easy.
In order for the data to be used properly once it's been archived, the data must be documented. The metadata accompanying a file should be written for a user 20 years into the future, or for someone who does not know about you or your work.
Where you archive your data has an impact on who can find your data. Are you looking for long-term preservation (how long would your data be useful)? Each option has advantages and disadvantages. Data centers may not be able to accept all data. Start looking at where you want to archive while doing your project. Base your data management plan on the expectations and criteria for archiving.
Data repositories may have criteria to evaluate and select datasets for preservation.