Data Management Basics
3/01/13
3. 1. Funders Require It
• National Institutes of Health: Data Sharing Policy (2003)
• All grants funded at $500K or above must include a Data Sharing Plan
• National Science Foundation: Data Management Plan Requirement (2011)
• All proposals must submit a two-page supplementary “Data Management Plan” to describe how projects will comply with NSF data sharing policy
• National Endowment for the Humanities: Sustainability and Data
Management Plans Requirement (2012)
• Digital Humanities Implementation Grants must include a plan to discuss how
data will be managed, disseminated, and preserved
• OSTP Directive to Funding Agencies (2013)
• Federal agencies with more than $100M in R&D expenditures must ensure that published results of federally funded research are freely available to the public within one year of publication, including data
4. National Science Foundation
• Data Management Plan Requirement
• How projects will conform to NSF data sharing policy
• Flexible
• “The plan should reflect best practices in your area of research, and
should be appropriate to the data you generate.”
• Directorate for Social, Behavioral and Economic Sciences
• Discipline-specific guidelines
• Archeology (Digital Archaeological Record)
• Economics (American Economic Association)
• Universals (for the NSF Universe)
• What data are generated by your research?
• What is your plan for managing the data?
5. 2. It Makes Life Easier
• For you…
• Increases efficiency
• Easier to understand the data collected throughout the life cycle of the
project
• Easier to find the data that you need throughout the life cycle of the
project
• Satisfies applicable legal obligations
• Addresses preservation, documentation, verification issues
• Helps reviewers understand the characteristics of your data
• Increases citation rates for articles
• For others…
• Provides continuity – other researchers can build on your data
• Enhances longevity and usability
• Facilitates new discoveries
• Supports open access
6. 3. It’s the Right Thing To Do
Responsible Conduct of Research/Research Ethics
• Data Acquisition, Management, Sharing and Ownership
• Using the appropriate research method
• Providing attention to detail
• Obtaining appropriate permissions
• Recording data accurately and securely
• Maintaining data so that it can be used to confirm research findings, establish priority, and be reanalyzed by other researchers.
• Storing data so that confidentiality is protected, the data are secure from physical and electronic damage, destruction, or theft, and the data are maintained for the time frame dictated by sponsor and University policies.
Compliance
• Research using Human Subjects (Institutional Review Board)
7. SMART DATA PRACTICES
• Naming Your Files
• Organizing Your Data
• Backup and Storage
• Post-Project Considerations
8. Organizing Your Data
• Getting Started
• Consider your goals
• What do you want to get out of managing your data?
• What is the most efficient way to organize your data?
• Figure out your criteria for keeping data
• Think about where you want your data to end up
10. File naming and labeling
• Organization
• Consistency
• Context
11. Some potential components for
your file naming strategy
• Version number
• Date of creation
• Name of creator
• Description of content
• Name of individual/research team/department
• Publication date
• Project number
12. Organizing Your Data
W. E. B. Du Bois, Niagara delegate meeting, Boston, 1907. W. E. B. Du Bois Papers (MS 312). Special
Collections and University Archives, University Libraries, University of Massachusetts Amherst
13. Organizing Your Data
• Let’s Clean Up Those File Names
• abcdefghijklmnopqrstuvwxyz.jpg
• doesn’t make much sense, does it?
• How about:
• 20120925_credo_du_bois_rrz_001.jpg
• And I put it in a directory called:
• credo_du_bois
14. Organizing Your Data
• Why this structure?
• Oh, I just made it up! But I’m going to be consistent
• 20120925 = date I found the image
• credo = database/collection where I found the image
• du_bois = image subject
• rrz = my initials (I am working in a group!)
• 001 = an accession number (I made that up, too, but I’ll continue to
use that schema)
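A naming scheme like this is easiest to keep consistent when the filename is assembled by a small helper rather than typed by hand. The following is a minimal sketch, not part of the original slides; the function name `build_filename` and its parameters are hypothetical, and it assumes the component order shown above (date, collection, subject, initials, accession number).

```python
from datetime import date


def build_filename(found_on, collection, subject, initials, accession, ext):
    """Join the naming components with underscores, always in the same order.

    All names here are illustrative assumptions, not an established tool.
    """
    parts = [
        found_on.strftime("%Y%m%d"),        # date the item was found, YYYYMMDD
        collection.lower(),                 # database/collection of origin
        subject.lower().replace(" ", "_"),  # subject, spaces turned into underscores
        initials.lower(),                   # who created the file
        f"{accession:03d}",                 # zero-padded so names sort correctly
    ]
    return "_".join(parts) + "." + ext.lstrip(".")


print(build_filename(date(2012, 9, 25), "credo", "du bois", "rrz", 1, "jpg"))
# prints 20120925_credo_du_bois_rrz_001.jpg
```

Zero-padding the accession number to three digits is itself a scale decision: it allows 999 items before the width has to change.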
15. BAD naming practices
• Using generic data file names that may conflict when moved
from one location to another
• Failing to think about scale
• Using special characters in a filename such as:
&*%$£]{!@
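One way to catch special characters before they cause trouble is to check names against an allow-list. The sketch below is an illustration only: the rule it enforces (letters, digits, underscores, and hyphens, plus one extension) is an assumption chosen for the example, and `is_safe_filename` and `sanitize` are hypothetical helper names.

```python
import re

# Assumed rule: letters, digits, underscore, hyphen, and an optional extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_safe_filename(name):
    """Return True when the whole name matches the assumed allow-list."""
    return SAFE_NAME.fullmatch(name) is not None


def sanitize(name):
    """Replace each run of disallowed characters in the stem with one underscore."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    ext = re.sub(r"[^A-Za-z0-9]+", "", ext)
    return stem + ("." + ext if ext else "")


print(is_safe_filename("20120925_credo_du_bois_rrz_001.jpg"))  # True
print(is_safe_filename("data&*%$.csv"))                        # False
print(sanitize("notes & draft (final)!.docx"))                 # notes_draft_final.docx
```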
16. Versioning
• Use ordinal numbers (1,2,3) for major version changes and the
decimal for minor changes: v1, v1.1, v2.6
• Beware of using confusing labels: revision, final, final2,
definitive_copy
• Discard or delete obsolete versions
• Use an auto-backup facility (if available) rather than saving or
archiving multiple versions
• Turn on versioning or tracking in collaborative documents or
storage utilities such as Wikis, GoogleDocs, etc.
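The major/minor labeling rule can be written down as a short function, which also makes it easy to reject labels such as "final2". This is a sketch under the assumption that labels look like `v1` or `v1.1`; the helper name `bump_version` is hypothetical.

```python
import re


def bump_version(label, major=False):
    """Return the next label: v1 -> v1.1 (minor change) or v1.1 -> v2 (major change)."""
    match = re.fullmatch(r"v(\d+)(?:\.(\d+))?", label)
    if match is None:
        raise ValueError(f"not a version label: {label!r}")
    whole = int(match.group(1))
    minor = int(match.group(2) or 0)
    if major:
        return f"v{whole + 1}"       # a major change resets the minor part
    return f"v{whole}.{minor + 1}"   # a minor change increments the decimal part


print(bump_version("v1"))                # v1.1
print(bump_version("v1.1"))              # v1.2
print(bump_version("v2.6", major=True))  # v3
```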
17. Quiz! File naming by date
What is the best filename?
A. 2012-09-25_Attachment
B. 25 September 2012 Attachment
C. 25092012attch
18. Quiz! File naming by description
What is the best filename?
A. dubois_great_barrington_recent_20120925_old
version.docx
B. 2012-09-25_dubois_great_barrington_V1.docx
C. FFTX_2365498_old.docx
19. Organizing Your Data
• Organizational methods
• Hierarchical
• Tag-based
• Retrieval
• Location-based
• Search-based
“Very little skill is needed to actually be organized and efficient…. just the consciousness to put this file or folder in the right place.”
20. Organizing Your Data
Use folders!
DuBois
DuBois_Images
DuBois_Images/1868-1898/
DuBois_Images/1898-1928/
DuBois_Letters
DuBois_Letters/1868-1898/
DuBois_Letters/1898-1928/
DuBois_Newspapers/
etc.
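A folder layout like the one above can be created in one pass, so that every category gets the same subfolders. The sketch below assumes that `DuBois` is the top-level folder and that each category is split by the same date ranges; `make_tree` is a hypothetical helper, and the demo writes only into a temporary directory.

```python
import tempfile
from pathlib import Path


def make_tree(root, prefix, categories, periods):
    """Create root/<prefix>_<category>/<period>/ for every combination."""
    created = []
    for category in categories:
        for period in periods:
            folder = Path(root) / f"{prefix}_{category}" / period
            folder.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
            created.append(folder)
    return created


# Demo in a throwaway directory so nothing on disk is disturbed.
with tempfile.TemporaryDirectory() as tmp:
    folders = make_tree(Path(tmp) / "DuBois", "DuBois",
                        ["Images", "Letters"], ["1868-1898", "1898-1928"])
    for folder in folders:
        print(folder.relative_to(tmp))
```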
21. Archive what you don’t or won’t
need
• Decide what your final data sets are
• Once your project is over, weed out obsolete data and decide
what you want to keep for the long-term
• Move files and folders to an ‘Archive’ or ‘Old files’ folder
• z_archive
22. Backup and Storage
January 2011: “Stolen laptop contains cancer cure data”
23. Backup and Storage
• Backup is an essential component of data management
• Protect against accidental or malicious data loss
• Restore original data
• Keep 3 copies: original, external local, external remote
• Consider
• How much?
• How frequently?
• Which media?
• Synchronization
• Test your system
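Keeping three copies and testing the system can be combined: copy the original to each backup location, then confirm that every copy is byte-for-byte identical. This is a minimal sketch using only the Python standard library; `back_up` and `sha256_of` are hypothetical helper names, and the destination folders stand in for whatever local and remote locations you actually use.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original, destinations):
    """Copy the original into each destination folder and verify every copy."""
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        copy = folder / original.name
        shutil.copy2(original, copy)       # copy2 also preserves timestamps
        if sha256_of(copy) != expected:    # the "test your system" step
            raise OSError(f"backup verification failed for {copy}")
        copies.append(copy)
    return copies
```

Calling `back_up("survey.csv", ["/mnt/external", "/mnt/remote"])` (hypothetical paths) would leave the original plus two verified copies.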
24. Backup and Storage
• Accessibility of data depends on storage media and file format
• Vulnerable to deterioration
• Become obsolete over time
• Plan for disruption
• Consider
• Non-proprietary file formats
• Different media types in storage strategy
• Migrate data
• Unencrypted, uncompressed
(Diagram: original, external local, external remote)
25. Backup and Storage
• Security
• Encryption can be used for safely moving or storing files:
• Encrypting files on storage devices (flash drives)
• Encryption during file transfer (e.g., WinSCP)
• Encrypted storage services
• Deleting Data
• Weed out obsolete data and decide what you want to keep for the long-term
• Deleting a file does not necessarily erase its data from the storage device
• Other things to Consider
• How will the data be used?
• Who pays for storage?
27. Data Management is About Planning
Data management will:
• Prevent bad things from happening to your data;
• Make you a more efficient researcher; and
• Prepare you for grant management.
(Diagram: Collection, Description, Storage, Access, Backup)
28. Data Management Plans
NSF
• The types of data;
• The standards to be used for data and metadata format and content;
• The policies for access and sharing;
• The policies and provisions for re-use, re-distribution, and the
production of derivatives; and
• The plans for archiving and for preservation of access.
30. Planning
• Data Working Group (email datamanagement@library.umass.edu)
• Digital projects
• Long-term preservation
• Assessment
• Web resources
• UMass Amherst Libraries: General Resources
(http://guides.library.umass.edu/datamanagement)
• Discipline-specific
• Your faculty
• Your mentors
• Your professional associations
• Industry partners
• Public engagement
31. Backup and Storage
• Storage
• Udrive (http://www.oit.umass.edu/udrive)
• Departmental servers
• CDs/DVDs/external hard drives
• Filesharing (see http://chronicle.com/blogs/profhacker/protecting-your-data/37350)
• Dropbox
• Google Docs
• Cloud Storage
• Amazon Web Services
• Rackspace
• Microsoft Azure
• Sugar Sync
• Additional Information
• MIT on Backups and Security
http://libraries.mit.edu/guides/subjects/data-management/backups.html
• UK Data Archive on Data Storage
http://www.data-archive.ac.uk/create-manage/storage
• UK Preservation Office “Caring for CDs and DVDs”
http://www.bl.uk/blpac/pdf/cd.pdf
33. Sources
• MIT Data Management
(http://libraries.mit.edu/guides/subjects/data-management/)
• UK Data Archive
(http://www.data-archive.ac.uk/)
• MANTRA
(http://datalib.edina.ac.uk/mantra/organisingdata.html)
• Creating Order from Chaos: 9 Great Ideas for Managing Your Computer Files
(http://www.makeuseof.com/tag/creating-order-chaos-9-great-ideas-managing-computer-files/)
• Research Information Management: Tools for the Humanities
(http://sudamih.oucs.ox.ac.uk/docs/Generic%20Courses/Tools%20for%20the%20Humanities%20course%20book.docx)
34. Questions/contact
datamanagement@library.umass.edu
Editor's notes
Starting in January 2011, NSF requires that grant proposals include a Data Management Plan (DMP). The DMP is described as no more than two pages, specifying the types of data, the standards to be used for data and metadata format and content, and the policies for accessing and sharing the data. NSF does state that a valid plan may include only the statement that no detailed plan is needed, but you have to justify that statement. The DMP will be reviewed as an integral part of the proposal, coming under the Intellectual Merit or Broader Impacts sections, or both (Grant Proposal Guide (GPG), Chapter II.C.2.j). NSF directorates and programs have additional requirements: http://www.nsf.gov/bfa/dias/policy/dmp.jsp. The Biological Sciences; Engineering; Geosciences; and Social, Behavioral and Economic Sciences Directorates are examples of directorates with additional requirements for their DMPs. The National Institutes of Health expects researchers to include data sharing plans in their proposals as well. This appears to be a trend for other funding agencies. NIH data sharing policy: Data should be made as widely and freely available as possible while safeguarding the privacy of participants and protecting confidential and proprietary data. NSF data sharing policy: Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. NEH: All proposals will be required to include both a sustainability plan that discusses long-term support for the project and a data management plan that discusses how research data will be preserved.
The National Science Foundation recognizes the need for flexibility. Different types of data require different plans. The NSF documentation points researchers to some specific sites for more specific protocols. There are numerous others for the rest of the social sciences. You can contact your scholarly associations for more details. You must demonstrate to funding agencies that you know what your data are and how you will manage them.
For you: Since a number of grants are multi-year, renewable propositions, building good data management practices into proposals is crucial. By doing so, problems associated with lab turnover can be addressed (one professor noted that 3-4 more papers would have come out of his lab if not for this type of issue). It also helps you remember relevant details and procedures relating to your data and data collection over the long haul. Developing a good data archiving plan safeguards your investment of time and money and makes recovery from disaster possible and, hopefully, faster and more complete. It addresses your documentation and verification issues: the type of data to be produced, a description of the methodology, and the standards that will be applied. It satisfies many legal obligations, such as security measures to protect confidentiality or IP considerations. A good DMP helps reviewers understand your work and increases its visibility by making it easily accessible and clearly understood. It preserves your unique contribution to your field. For others: Although provisions are made for restrictions or embargoes on data, particularly those having commercial implications, there is an underlying assumption that data should be shared, distributed, and built upon. A data management plan gets you to think about and plan for how that will happen, promoting new discoveries and minimizing duplication of effort. The open access movement (Science Commons, PubChem, et al.) fosters the development of knowledge. Science Commons is an organization that promotes legal and technical mechanisms to remove barriers to sharing scientific information. One way they are looking at that is through the Open Knowledge Definition, which sets out to define openness in relation to content and data.
RCR covers a range of topics that speak to the conduct of investigators and the integrity of the research university (where an investigator is defined in UMass COI policy as the principal investigator and any other person who is responsible for the design, conduct, or reporting of funded research). It is a philosophy of creating an environment for research that encourages quality and ethical principles. Topics include Mentor/Trainee Responsibilities; Publication Practices and Responsible Authorship; Peer Review; Collaborative Science; Communication and Difficult Conversations; and Data Acquisition, Management, Sharing and Ownership. Many of the practices and constraints will be dictated by the discipline, the lab, and the funding conditions, but there are generally accepted standards that investigators should be aware of and adhere to relative to data ownership, data collection, data protection, and data sharing. By following good data practices (or RCR), an investigator can avoid the risk of misconduct and comply with policies and regulations regarding intellectual property and animal or human research subjects. Examples of Compliance include protocols for doing research with animals, for biological and environmental safety, and for export control. Research using Human Subjects involves having the project reviewed by the University’s IRB (a federally mandated body which reviews all sponsored research involving human subjects), obtaining consent, and maintaining the confidentiality of the data collected. Research with Human Subjects is the domain where privacy (for sensitive data), confidentiality, and security will be major concerns when managing data. Examples of Ethical concerns include Conflicts of Interest (where financial or intellectual property concerns influence the design, conduct, or reporting of research), Faculty Consulting, and Whistleblowing. Research Misconduct is also in this category.
Misconduct means fabrication, falsification, or plagiarism in proposing, performing, reporting, or reviewing research, not including honest error or difference of opinion; it also covers misrepresentation of the procedures and outcomes of research to gain some advantage. Policies to investigate and determine misconduct include fact finding (which means examination of data).
Data = research
Organizing your data is about keeping good records, namely planning file naming conventions and organizing file directories to your advantage. What are your goals? Based on those goals, how should you organize your data? Are there key themes, categories, people, dates, formats, etc.? You might document, store, or organize data differently for different outcomes, such as sharing, preserving, or sharing a small subset. What is important to save? If you plan well, you can put your research anywhere.
The most basic part of organizing your data is to consider your filenames. Most computers use filenames to index content (for example, Windows search). Clear names will help in retrieving files and should fit with your overall organizational approach for your project.
There are three things to consider when naming files: organization, context, and consistency. Organization is important for future access and retrieval. Context could include content-specific or descriptive information. Consistency means choosing a naming convention and ensuring that the rules are followed systematically by always including the same information (such as date and time) in the same order (YYYYMMDD).
How would we name this image file – found in the University Archives?
File naming conventions.
Consistency is key. Use underscores instead of full stops or spaces because, like special characters, these are parsed differently on different systems. The filename should include as much descriptive information as will assist identification, independent of where the file is stored. If including dates, format them consistently.
Scale: if you want to include a project number, don’t limit your project number to 2 digits, or you can only have ninety-nine projects. Special characters: these are often used for specific tasks in a digital environment.
It is important to identify and distinguish versions of research data files consistently. This ensures that a clear audit trail exists for tracking the development of a data file and identifying earlier versions when needed. Thus you will need to establish a method that makes sense to you that will indicate the version of your data files. http://datalib.edina.ac.uk/mantra/organisingdata.html
A: Correct. Files using this naming convention are easy to distinguish from one another, and easier to browse and locate chronologically. B: Incorrect. File not easy to browse and locate chronologically. C: Incorrect. File not easy to browse and locate chronologically. Filename not immediately intuitive. Tip! If using a date, use the format year-month-day: YYYY-MM-DD or YYYY-MM or YYYY-YYYY. This will maintain the chronological order of your files.
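The tip can be checked directly: an ordinary alphabetical sort puts year-first names in chronological order, while day-first names come out scrambled. The dates below are made up for the illustration.

```python
from datetime import date

# Three hypothetical attachment dates, deliberately not in order.
dates = [date(2012, 9, 25), date(2011, 12, 1), date(2012, 10, 3)]

iso_names = [d.strftime("%Y-%m-%d") + "_Attachment" for d in dates]  # like option A
dmy_names = [d.strftime("%d%m%Y") + "attch" for d in dates]          # like option C

iso_sorted = sorted(iso_names)  # what a file browser's name sort would show
dmy_sorted = sorted(dmy_names)

print(iso_sorted)
# ['2011-12-01_Attachment', '2012-09-25_Attachment', '2012-10-03_Attachment']
print(dmy_sorted)
# ['01122011attch', '03102012attch', '25092012attch']
```

In the second list the 3 October 2012 file sorts ahead of the 25 September 2012 file, because the sort compares the day digits first.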
A: Incorrect. Date is ambiguous, and there could be several ‘old’ versions. B: Correct. Date is in a uniform format and easy to distinguish and sort from files using the same date convention. The filename represents the content more accurately. Using a version number convention also makes it easier to distinguish from other versions of the same file. C: Incorrect. This is an application-generated filename lacking descriptive or context-specific information.
Hierarchical: most common operating systems default to this way of organizing files.
• An item can only go into one place or folder (unless there are duplicates)
• Must choose a system for categorizing files
• Well-adapted to location-based finding
Tag-based: electronic labels or keywords applied to files; a flat system.
• An item can have many tags, giving more flexibility in how a file is categorized
• Tags must be applied consistently
Plan and then follow the plan. Implement. File things immediately; put things in the right place according to your plan as they are created.
Example of a well-organized file structure with consistent naming conventions: a major heading with logical subheadings, and individual files under the subheadings distinguished by date of analysis, collection, etc. Whatever you choose, be consistent. Organize by category. For example, if you are studying multiple individuals and are collecting many types of documents about them, you could organize first by the individual, then by type of coverage (image, letter, newspaper), then by date. One place for everything: you need a single place where you know that you can access your files and folders. The My Documents folder is the logical place for this. It is a home for your folders, which contain your files. Think of it in the sense that you wouldn’t put your folders in the yard, nor would you put your filing cabinet in the yard; you put both of them in the house. Your My Documents folder is your “house” of sorts. Plan and implement. File things immediately; put things in the right place according to your plan as they are created.
Personally I recommend still having it in the My Documents folder to keep things easy to remember and consistent. With a name like “Archive” it’ll likely be near the top of whatever folder you decide to put it in. To change this, you can add a “z” and a period to the beginning of the name, so the folder could look something like “z.Archive”. This will put it at the bottom of the list so you won’t have to worry about it being in the way all the time. http://www.makeuseof.com/tag/creating-order-chaos-9-great-ideas-managing-computer-files/
Backup is a very important component of data management. A University of Oklahoma researcher lost years of research due to theft. A PC Advisor poll from November 2010 indicates that 1 in 13 do not back up important data! [30% back up important data daily; 25% weekly; 21% monthly; 16% rarely back up data; 8% never] http://www.pcadvisor.co.uk/news/security/3248400/poll-30-percent-back-up-data-every-day/
Backup ensures that the most recent data will always be accessible, and concerns the procedures for saving and synchronizing data. Accidental or malicious data loss can be due to:
• hardware faults or failure
• software or media faults
• virus infection or malicious hacking
• power failure
• human errors by changing or deleting files
Recommended practice is to keep 3 copies of your data.
How much: What will you need to restore in the event of data loss? Are there backup policies already established for the institutional or network computers you are using, and will they be sufficient for your project?
How frequently: How critical are the changes being made or the new data being generated? Back up after every change, or at regular intervals. Use automated backup processes.
Which media: Depends on quantity, file type, and project needs. Options include removable media (hard or flash drives), recordable CD/DVD, or a network drive.
Synchronization: Ensures consistency between backup copies. Use the same or compatible naming conventions as for the original project files, and label removable media!
Storage concerns the location and media for housing data and is important because digital media are inherently unstable and change rapidly. Media currently available for storing data files are optical media (CDs and DVDs) and magnetic media (hard drives and tapes). Both are vulnerable to physical degradation. A storage strategy, even for short-term projects, should include two different forms of media.
Non-proprietary file types (follow an open, documented standard; ASCII or Unicode; community-supported; unencrypted; uncompressed):
• PDF/A, not Word
• ASCII, not Excel
• MPEG-4, not Quicktime
• TIFF or JPEG2000, not GIF or JPG
• XML or RDF, not RDBMS
Which media: Portable HD? Cloud? Department server? Subject data repository? The UK Data Archive recommends using at least two different media types in your storage strategy (optical/magnetic) in addition to local and remote backup copies. Unencrypted is ideal for storing your data because it will make the data most easily read by you and others in the future (MIT). Uncompressed is also ideal for storage, but if you need to compress to conserve space, limit compression to your 3rd backup copy (MIT).
Secure data storage will prevent unauthorized access, changes, disclosure, or destruction of data and includes physical as well as network security. It covers computer and network security (passwords, firewalls, anti-virus and anti-malware software) as well as security when sharing or moving files. Encryption is the easiest and most practical method of protecting data stored or transmitted electronically and is particularly essential with sensitive data (ECU). It applies when moving or storing files, such as backups or storage on mobile devices. Individual files can be encrypted, as well as entire storage devices or spaces. http://www.ecu.edu/cs-itcs/itsecurity/DataEncryption.cfm
Weeding: Determined by project requirements. How will the data be used? In-house? Outside users? Restricted? Is it live or “archived”?
These may be things that you will get to toward the end of a project, but they are good to think about. Traditional outcomes of research are published papers (much of what tenure and promotion is based on). There is a growing practice of submitting supplemental data files along with manuscripts at the point of publication. Know what your intellectual property is, what your copyrights are, and how they apply to data and databases. Much of what is created is considered an “exempt scholarly work”: the university automatically waives ownership of this class of IP. UMass policy: the creator owns IP that is created or discovered here. Copyright provides legal protection for “original works of authorship”. Facts and ideas cannot be copyrighted, but their expression can. Data sets and databases can be protected under copyright as literary works, which includes “tables” and “compilations”. Expectations of sharing have also created an environment where datasets are being shared within communities. It has been recognized, by Creative Commons specifically, that the nature of sharing data sets is fundamentally different from sharing textual documents, and that the benefits of data sharing outweigh the constraints of applying copyright. They have endorsed a Database Protocol which encourages the unfettered sharing of data through the use of a CC0 license; this essentially puts data into the public domain. Venues for data sharing include Institutional and Disciplinary Repositories. Data citation means providing a reference to data in the same way as researchers routinely provide a bibliographic reference to printed resources. It is an important part of validating datasets as a primary research output rather than a by-product of research.
University resources: university funds, time, and facilities; not use of the library, facilities available to the public, or occasional use of office equipment.
Exempted scholarly works: students sign a participation agreement (prior to hire as research assistants, for example).
Who owns copyright of data? The creator of the data. Under the UMass IP Policy, the creator owns IP that is made, discovered, or created here unless it involves:
• Significant use of University resources
• University-commissioned work
• IP subject to contractual obligations (e.g., sponsored research)
• Student work (except “exempt scholarly work”)
“Exempt Scholarly Work” includes:
• Instruction materials, including textbooks and class notes
• Research articles, monographs, proposals
• Theses and dissertations, dramatic works and performances, drawings, sculpture, musical compositions and performances, poetry, fiction and non-fiction
http://www.umass.edu/research/system/files/Intellectual_Propery_Policy_UMA.pdf
Stop for questions.
These are the elements of data management: things that you should think about. Data management will have positive benefits.
You will need somewhere to store your data as you are working. Udrive: you get 1 GB and can share files with anyone through the Udrive. Third party: there are many cloud storage providers. Amazon gives you 5 GB, Dropbox gives you 2 GB, and Google Docs gives you 1 GB, but you can purchase more space (400 GB for $100/year, 1 TB for $256/year). Cloud options provide a nearly infinitely scalable tier of storage for archiving very large datasets. Prices can range from $0.14/GB to $0.55/GB. The OIT security pages have links and instructions for downloading anti-virus and anti-malware software, as well as tips for protecting your personal computer from unauthorized access.