Responsible conduct of research: Data Management

A presentation for the Food and Nutrition Science Responsible conduct of research class on data management best practices. Covers material in the context of writing a data management plan.

Published in: Data & Analytics

Responsible conduct of research: Data Management

  1. 1. Responsible Conduct of Research: Managing Data. Tobin Magle, Data Management Specialist; Nicole Kaplan, Information Manager; Daniel Draper, Digital Repositories Unit Coordinator
  2. 2. Responsible Conduct of Research: The data management firehose! C. Tobin Magle, PhD Please ask me for help with data management!
  3. 3. My Background: molecular microbiology (1) Magle CT et al. Infect Immun. 2014 Feb;82(2):618-25. doi: 10.1128/IAI.00444-13. Epub 2013 Nov 25. (2) Sun W, Tanaka TQ, Magle CT, et al. Sci Rep. 2014 Jan 17;4:3743. doi: 10.1038/srep03743.
  4. 4. Data Workshops
  5. 5. Individual help for ANY data topic How do I write a DMP? How do I organize my data? How do I clean and format my data? How do I use R? How do I get my data ready to share? How do I comply with funder mandates? What DM tools are there for collaboration?
  6. 6. Data Management Services https://lib.colostate.edu/services/data-management
  7. 7. What is data management? The policies, practices and procedures needed to manage the storage, access and preservation of data produced from a research project
  8. 8. data management != data sharing • but the same principles apply to both
  9. 9. *ok not everything, but most things
  10. 10. More researchers https://www.nsf.gov/statistics/2016/nsf16300/digest/nsf16300.pdf
  11. 11. See arXiv:1402.4578 for details
  12. 12. [Figure panels: working email; response (if email working); status of data (if response); data are extant (if status known)] doi:10.1016/j.cub.2013.11.014
  13. 13. We are losing vast amounts of data. Who is responsible?
  14. 14. CSU data policy General points: • The university owns, and is therefore ultimately responsible for, research data • Researchers are the data managers • The university promotes openness http://policylibrary.colostate.edu/policy.aspx?id=737
  15. 15. You’re a Data Manager http://www.phdcomics.com/comics/archive.php?comicid=382
  16. 16. CSU data policy Research Data Associated with Theses and Dissertations To preserve the complete scholarly record of the author, data sets must be incorporated. Therefore, a student depositing their thesis or dissertation is required to make discoverable, accessible and available their associated data sets in accordance with this policy and provisions of the University’s Digital Repository. Access and rights management (embargo period, access limited to specific IP addresses) shall be the same for the associated data sets as it is for the thesis or the dissertation. http://policylibrary.colostate.edu/policy.aspx?id=737
  17. 17. When should data management happen? Throughout the whole research cycle
  18. 18. Hypothesis The research cycle
  19. 19. Hypothesis Experimental design The research cycle
  20. 20. Hypothesis Data Experimental design The research cycle
  21. 21. Hypothesis Data Experimental design Results The research cycle
  22. 22. Hypothesis Data Experimental design Results Article The research cycle
  23. 23. Hypothesis Data Experimental design Results Article The research cycle
  24. 24. Hypothesis Data Experimental design Results Article Data Management Plans The research cycle
  25. 25. Hypothesis Raw data Experimental design Tidy Data Results Article Data Management Plans Cleaning Analysis The research cycle
  26. 26. Hypothesis Raw data Experimental design Tidy Data Results Article Data Management Plans Cleaning Sharing Analysis Open Data The research cycle
  27. 27. Hypothesis Raw data Experimental design Tidy Data Results Article Data Management Plans Cleaning Sharing Analysis Open Data Code Reproducible Research The research cycle
  28. 28. Hypothesis Raw data Experimental design Tidy Data Results Article Data Management Plans Cleaning Sharing Analysis Open Data Code Reproducible Research Reuse The research cycle
  29. 29. Hypothesis Raw data Experimental design Tidy Data Results Article Data Management Plans Cleaning Sharing Analysis Open Data Code Reproducible Research Reuse The research cycle
  30. 30. What is research data? • “The recorded factual material commonly accepted in the scientific community as necessary to validate research findings” - White House Office of Management and Budget • Reality: Applies to any research product
  31. 31. Hypothesis Raw data Experimental design Tidy Data Results Article Data Management Plans Cleaning Sharing Analysis Open Data Code Reproducible Research Reuse Working data vs. archived data Working Archived
  32. 32. What is a data management plan? A description of how you plan to describe, preserve and share your research data. Often required by funding agencies
  33. 33. Successful DMPs include • A data inventory, including type(s) and size • A strategy for describing the data • A plan for preserving the data • A method for access to the data Always make sure to follow funder requirements
  34. 34. Tool: DMPTool • Review requirements from different agencies • https://dmptool.org/guidance • Create new DMPs based on funding agency templates • Search public DMPs
  35. 35. Data inventory • What type of data are you going to collect? • What file type will be produced? • What size will these files be? How many files? • How will you organize the data? • What other research outputs will be produced? • Code/Software? • Templates/protocols?
  36. 36. Data inventory • What type of data are you going to collect? miRNA sequences • What file type will be produced? FASTQ files • What size will these files be? How many files? 1 GB per file x 64 strains x 3 replicates = ~200 GB • What other research outputs will be produced? • Code/Software? R scripts for analysis and visualization • Templates/protocols? Data use tutorials
  37. 37. Data formats • Avoid proprietary formats • Know what software can read your data • Proprietary format → open format: Excel (.xls, .xlsx) → Comma Separated Values (.csv); Word (.doc, .docx) → plain text (.txt); PowerPoint (.ppt, .pptx) → PDF/A (.pdf); Photoshop (.psd) → TIFF (.tif, .tiff); Quicktime (.mov) → MPEG-4 (.mp4); MPEG 4 Protected audio (.m4p) → MP3 (.mp3)
  38. 38. Q’s: Data Inventory What kind of data are you going to collect? What file type will be produced? What size will these files be? How many files? What other research outputs will be produced?
  39. 39. Folder systems • Identify ways to divide your data into categories (Attributes) • Top level organization is the most important attribute • Provide documentation
  40. 40. Hierarchical Organization: my_thesis > chapter1, chapter2, chapter3, chapter4 > raw_data (replicate1, replicate2), processed_data, code (Processing, cleaning), results (tables, figures)
  41. 41. Q’s: Data Organization • What kinds of files are there? (See data inventory) • How could you group them? • Project? • Time? • Location? • File type? • What are the most important attributes?
  42. 42. Tool: Open Science Framework • Components • Add-ons • Contributors • Wiki http://help.osf.io/m/collaborating/l/524109-using-the-wiki http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share
  43. 43. Organization rules • Be consistent • One directory per project • Separate subdirectories for • Raw data • Processed data • Code (processing and analysis) • Output • Make raw data read-only • Make README files http://help.osf.io/m/60347/l/611391-organizing-files
  44. 44. Example: Temperature data
  45. 45. A strategy for describing the data • Metadata: Relevant information for re-creation and re-use • Contact info • How data was collected • Details about collection • Date, location of collection • Units • Can be as simple as a text file
  46. 46. Metadata standards • Dublin Core: http://dublincore.org/documents/dcmi-terms/ • Can be applied to anything • Many discipline specific metadata standards • EML: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html • MIAME: http://fged.org/projects/miame/ • Search for other standards: • http://www.dcc.ac.uk/resources/metadata-standards • https://biosharing.org/standards/
  47. 47. Genomics example (NCBI template)
  48. 48. Q’s: Describe your data What do people need to know to reuse your data? Are there any discipline-specific metadata standards? What format will you describe your data in (text, XML, tabular)? What fields will you include (author, date, format, identifier?)
  49. 49. A plan for preserving the data • Where will it be stored? - Backups • Necessary metadata and other products • Who is responsible? • How long?
  50. 50. Backup. Ellin, A. Rutgers Student Offers $1,000 for Data on Stolen Laptop. ABC News via Good Morning America. April 26, 2013. http://abcnews.go.com/blogs/business/2013/04/rutgers-student-offers-1000-for-data-on-stolen-laptop/
  51. 51. Backup recommendations • Store in geographically distinct locations • How often? • Automation: Will you remember to do it manually? • Security: Are you working with PHI?
  52. 52. Q’s: Preservation plan What will you store? Who will be responsible for the data (person or position)? How long will you store it? Where will you store it? How will you back it up? *Differentiate between working vs. archived
  53. 53. A method to access the data • Important to funding agencies • Reproduce existing research • Promote further research • Must be easily available: • No “by request only” • Embargoes are “ok” • Data security: consider privacy and IP issues before sharing
  54. 54. Data access and sharing best practices • Non-proprietary formats • Include metadata • As open as possible • Follow CSU research data policy
  55. 55. Trusted Repositories: store and share • Discipline specific • Search: http://service.re3data.org/browse/by-subject/ • Generic • Figshare - https://figshare.com/ • Dryad - http://datadryad.org/ • CSU Digital Repository • http://lib.colostate.edu/digital-collections/ http://67.media.tumblr.com/6228cbe58a9652f1a85e8ab1ed08d715/tumblr_inline_n6oukhNlZW1qf11bs.png
  56. 56. Tool: CSU digital repository • Over 100 Datasets • Satisfy requirements for manuscripts and grants • At no cost <1 TB • $150/TB for 5 years • $300/TB for >5 years
  57. 57. Theses and Dissertation Data 1. Submit to ProQuest with thesis or dissertation • Supplemental data file • Only discoverable through thesis or dissertation 2. Submit to CSU Library separately • Requires distinctive descriptive metadata • Linked with thesis or dissertation • Data discoverable globally
  58. 58. Q’s: Access methods Where will people be able to access the data? Does your discipline have a repository? Are you complying with CSU’s data policy? How will you format the data for CSU digital repository?
  59. 59. Need help? • General: library_data@colostate.edu • Direct: tobin.magle@colostate.edu • DMPTool: http://dmptool.org/ • Data Management Services website: http://lib.colostate.edu/services/data-management
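
The data inventory example on slide 36 sizes a project by multiplying file size, number of strains, and number of replicates. A minimal sketch of that arithmetic follows. It assumes one FASTQ file per strain per replicate, which the slide implies but does not state; the function name `estimate_storage_gb` is illustrative and not from the slides.

```python
def estimate_storage_gb(gb_per_file: float, n_strains: int, n_replicates: int) -> float:
    """Estimate total storage, assuming one file per strain per replicate."""
    n_files = n_strains * n_replicates
    return gb_per_file * n_files


# Figures from slide 36: 1 GB per file, 64 strains, 3 replicates.
total_gb = estimate_storage_gb(1.0, 64, 3)
print(f"{64 * 3} files, about {total_gb:.0f} GB")
```

The exact product is 192 GB, which the slide rounds up to about 200 GB. Rounding up is a reasonable habit for a data management plan, since working copies and processed files add to the raw total.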
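
The format table on slide 37 pairs each proprietary format with an open alternative. One way to apply it is to scan a list of file names and flag those with a proprietary extension. The sketch below copies the mapping from the slide; the function name `flag_proprietary` and the sample file names are hypothetical.

```python
from pathlib import Path

# Proprietary extension -> open alternative, as listed on slide 37.
OPEN_ALTERNATIVES = {
    ".xls": ".csv", ".xlsx": ".csv",
    ".doc": ".txt", ".docx": ".txt",
    ".ppt": ".pdf (PDF/A)", ".pptx": ".pdf (PDF/A)",
    ".psd": ".tif",
    ".mov": ".mp4",
    ".m4p": ".mp3",
}


def flag_proprietary(filenames):
    """Return (filename, suggested open format) for each proprietary file."""
    flagged = []
    for name in filenames:
        suffix = Path(name).suffix.lower()  # compare case-insensitively
        if suffix in OPEN_ALTERNATIVES:
            flagged.append((name, OPEN_ALTERNATIVES[suffix]))
    return flagged


# Hypothetical file names, for illustration only.
for name, suggestion in flag_proprietary(["counts.xlsx", "notes.txt", "figure1.PSD"]):
    print(f"{name}: consider {suggestion}")
```

The sketch only reports candidates. Converting the files is a separate step that depends on the software that produced them.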
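
The organization rules on slide 43 (one directory per project, separate subdirectories, read-only raw data, a README) can be set up with a short script. This is a sketch under stated assumptions, not a prescribed layout: the subdirectory names, the README template, and the helper names `create_project` and `make_read_only` are illustrative choices.

```python
import stat
import tempfile
from pathlib import Path

# Assumed subdirectory names, one per category on slide 43.
SUBDIRS = ["raw_data", "processed_data", "code", "output"]


def create_project(root: Path) -> Path:
    """Create one project directory with the standard subdirectories and a README."""
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():
        readme.write_text("Project: \nContact: \nFolders: " + ", ".join(SUBDIRS) + "\n")
    return root


def make_read_only(raw_dir: Path) -> None:
    """Remove write permission from every file under raw_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in raw_dir.rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "my_project")
    sample = project / "raw_data" / "plate1.csv"  # hypothetical raw file
    sample.write_text("well,od600\nA1,0.42\n")
    make_read_only(project / "raw_data")
    print("writable by owner:", bool(sample.stat().st_mode & stat.S_IWUSR))
```

Clearing the write bits protects against accidental edits, not deliberate ones: the file owner can restore write permission at any time, so backups are still needed.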
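
Slide 45 notes that metadata can be as simple as a text file. The sketch below renders the fields listed on that slide as `Field: value` lines and marks any field that was left out. The field labels paraphrase the slide's bullets; the function name `render_metadata` and all sample values are placeholders.

```python
# Field labels paraphrase the bullet points on slide 45.
FIELDS = [
    "Contact",
    "How data was collected",
    "Collection details",
    "Date of collection",
    "Location of collection",
    "Units",
]


def render_metadata(values: dict) -> str:
    """Render one 'Field: value' line per field; missing fields are marked."""
    lines = [f"{field}: {values.get(field, 'NOT RECORDED')}" for field in FIELDS]
    return "\n".join(lines) + "\n"


# Placeholder values, for illustration only.
example = {
    "Contact": "name@example.org",
    "How data was collected": "hourly temperature logger",
    "Date of collection": "2016-06-01 to 2016-06-30",
    "Units": "degrees Celsius",
}
text = render_metadata(example)
print(text)  # save this text as README.txt next to the data files
```

Marking missing fields explicitly, rather than omitting them, makes gaps visible to whoever reuses the data later.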
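
Slide 51 asks whether backups will happen if they depend on someone remembering to do them. A scripted backup that also checks the copy is one way to automate the step. The sketch below copies a directory tree and compares SHA-256 checksums of the source and the copy. It assumes Python 3.8 or later (for `dirs_exist_ok`), and the destination is simply a path: choosing a geographically distinct location and scheduling the run are outside this example. The helper names are illustrative.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, destination: Path) -> list:
    """Copy source into destination; return relative paths whose checksums differ."""
    shutil.copytree(source, destination, dirs_exist_ok=True)
    mismatches = []
    for src_file in source.rglob("*"):
        if src_file.is_file():
            rel = src_file.relative_to(source)
            if sha256_of(src_file) != sha256_of(destination / rel):
                mismatches.append(str(rel))
    return mismatches


# Demonstration with temporary directories standing in for real locations.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "project"
    (src / "raw_data").mkdir(parents=True)
    (src / "raw_data" / "run1.txt").write_text("example contents\n")
    problems = backup_and_verify(src, Path(tmp) / "backup")
    print("files with checksum mismatches:", problems)
```

An empty list means every source file has an identical copy. For data containing PHI, the security question on the same slide still applies to the backup location.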
