Talk for first-year PhD students at the CRG. The goal of the talk was to present scenarios that students are likely to face and that can compromise reproducibility and efficiency in the analysis of life-science data. Importantly, asking the questions is probably more important than the answers given.
Good practices (and challenges) for reproducibility
1. Good practices (and challenges) for reproducibility
“Give your samples a decent life”
Javier Quilez
2. Outline
● Make groups of 3 (ideally 2 wet-lab + 1 dry-lab)
● I will present several scenarios/challenges sequentially
● You will have a few minutes to think about how you would tackle them
● I will propose approaches that worked for me
4. The life of your sample
[Diagram: Experiment (wet-lab domain) → Data (digital domain; files) → Results (digital domain; more files)]
5. What is your sample?
[Diagram: Experiment (wet-lab domain) → Data (digital domain; files) → Results (digital domain; more files)]
6. What is your sample?
[Diagram: Experiment (wet-lab domain) → Data (digital domain; files) → Results (digital domain; more files)]
This is NOT enough
7. ● Initial processing of the data
● Quality control
● Downstream analysis
● Reproducibility
● Data sharing and publication
Is all the information needed available?
8. ● What information (a.k.a. metadata) will describe your experiment?
● How will you collect metadata?
● Who will have access to metadata?
● Will metadata be future-proof?
Think
9. Collect the metadata of your experiments systematically
● Do it before processing the data
● Keep it short and easy to complete
● Instantly accessible by authorized members of the team
● Easy to parse for humans and computers
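As a sketch of what "easy to parse for humans and computers" can look like: a minimal, tab-separated metadata sheet (the field names and values here are purely illustrative assumptions) and the few lines of Python needed to read it.

```python
import csv
import io

# Hypothetical metadata sheet: one row per sample, fixed columns,
# tab-separated so both humans and scripts can read it easily.
METADATA_TSV = """\
sample_id\tdate\ttreatment\treplicate\toperator
sample001\t2017-03-01\tnone\t1\tjq
sample002\t2017-03-01\tdrug_60min\t1\tjq
"""

def load_metadata(text):
    """Parse the TSV into a dict keyed by sample_id."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["sample_id"]: row for row in reader}

metadata = load_metadata(METADATA_TSV)
print(metadata["sample002"]["treatment"])  # drug_60min
```

A plain-text sheet like this can live in a shared, access-controlled repository, satisfying the "instantly accessible by authorized members" point without any special tooling.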
11. Experiments will happen over time
[Diagram, over time:
Exp. 1: Untreated (ctrl.txt), Treated (t60.txt)
Exp. 2: Treated (T60.txt)]
12. Which is your sample (and other issues)?
Untreated (ctrl.txt), Treated (t60.txt), Treated (T60.txt)
● Which “*60.txt” file corresponds to which treated experiment?
● What “*60” and “ctrl” mean may not be obvious and requires human interpretation
● Are both treated samples to be used with the same untreated sample?
● The inconsistent use of lower/upper case complicates computer searches
13. ● How will you name your samples?
● Will it be really unique?
● Will it provide any information about the sample and/or group similar samples?
● Is it future-proof (i.e. does it consider that more samples will come)?
● What will you label with the sample name (e.g. tubes, files)?
Think
14. Establish a system: each sample gets a unique identifier
● Simplest way: an auto-incremental identifier (ID) (e.g. sample001, sample002, …)
● More complex options: sample IDs based on metadata
● Whichever you choose…
○ Unique
○ Computer-friendly (fixed length and pattern, all upper or all lower case)
○ Anticipate the number of samples that may be reached
● Trace your sample with its ID through its life: from the tube to the files
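The auto-incremental scheme can be sketched in a few lines; the prefix and the 3-digit width are assumptions (3 digits anticipate up to 999 samples, following the sample001 example):

```python
def next_sample_id(existing_ids, prefix="sample", width=3):
    """Return the next auto-incremental ID: fixed prefix plus a
    zero-padded number, so IDs have a fixed length and pattern."""
    numbers = [int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)]
    return f"{prefix}{max(numbers, default=0) + 1:0{width}d}"

print(next_sample_id([]))                          # sample001
print(next_sample_id(["sample001", "sample002"]))  # sample003
```

Zero-padding is what makes the IDs computer-friendly: sample010 sorts after sample002 both alphabetically and numerically.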
19. ● How will you organize your raw data?
● How will you organize your processed data?
● How will you organize your analysis results?
● Will human and computer searches be easy?
Think
20. The life of your sample
[Diagram: Experiment (wet-lab domain) → Data (digital domain; files) → Results (digital domain; more files)]
21. The life of your sample
[Diagram: Experiment (wet-lab domain) → (1) Raw data → (2) Processed data → (3) Analysis results]
22. (1) Raw data - 1 directory per instrument run
● Files as spat out by the instrument
● Do not store modified, subsetted or merged files
● Quality control of raw files
23. (2) Processed data - 1 directory per sample
● Several subdirectories
○ Steps of the analysis pipeline
○ Logs of the programs used
○ File integrity verifications
● Subdirectories accommodate variations in the analysis pipelines
○ sample1/step1/program_a/sample1.txt
○ sample1/step1/program_b/sample1.txt
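The per-sample layout described above can be sketched as follows; the directory names mirror the sample1/step1/program_a example, and the helper function is purely illustrative:

```python
import tempfile
from pathlib import Path

# Illustrative layout (not a standard):
#   <root>/processed/<sample>/<step>/<program>/
# so that variations of a pipeline step live side by side.
def make_sample_dirs(root, sample, steps):
    """Create one subdirectory per (step, program) for a sample."""
    for step, programs in steps.items():
        for program in programs:
            (Path(root) / "processed" / sample / step / program).mkdir(
                parents=True, exist_ok=True)
    return sorted(p.relative_to(root).as_posix()
                  for p in Path(root).rglob("*") if p.is_dir())

root = tempfile.mkdtemp()
print(make_sample_dirs(root, "sample1", {"step1": ["program_a", "program_b"]}))
```

Because each program gets its own subdirectory, running an alternative tool for a step never overwrites previous results.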
26. Data analysis is hardly ever a one-time task
[Diagram: Experiment (wet-lab domain) → Data (digital domain; files) → Results (digital domain; more files)]
27. Can you seamlessly process multiple samples?
[Diagram: over time, data from many samples must each be turned into results, again and again]
28. ● Imagine you write code to process/analyze 1 sample:
○ How will it handle 100 samples?
○ Will 100 samples be processed in a reasonable time?
○ Will you have to manually configure sample-specific parameters?
○ Will you be able to run specific parts of your code?
Think
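One way to approach these questions is a config-driven driver. Everything below (the config fields, the step names, the placeholder step function) is hypothetical, but it illustrates automatic per-sample configuration and the ability to run only specific parts of the code:

```python
# Sample-specific parameters live in one config (a file in practice,
# inline here), so nothing is hard-coded per sample.
CONFIG = {
    "sample001": {"trim": 10},
    "sample002": {"trim": 5},
}

def run_step(sample, step, params):
    # Placeholder for a real pipeline step (alignment, filtering, ...).
    return f"{sample}:{step}(trim={params['trim']})"

def process(samples, steps=("align", "filter")):
    log = []
    for sample in samples:      # samples are independent: parallelizable
        for step in steps:      # modular: pass a subset to rerun one part
            log.append(run_step(sample, step, CONFIG[sample]))
    return log

print(process(CONFIG, steps=("align",)))
# ['sample001:align(trim=10)', 'sample002:align(trim=5)']
```

The same driver handles 1 or 100 samples; with independent samples, the outer loop can be handed to a job scheduler or a multiprocessing pool.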
42. Data go through many procedures to generate results
[Diagram: over time, data from many samples are repeatedly turned into results]
43. Can you or anybody else reproduce your results?
[Diagram: results whose generating procedures are unknown]
Little understanding, irreproducibility, and harder identification of errors
44. ● How will you document your procedures?
● How will you store your code?
● How will others get access to your documentation?
Think
45. Document, document and document
● Write in README files how and when software and accessory files were obtained (e.g. genome reference sequence, annotation)
● Allocate a directory for any task (even one as simple as sharing files)
● Code the core analysis pipeline to log the output of the programs and to verify file integrity
● Document procedures using Markdown, Jupyter Notebooks, RStudio or similar
● Specify non-default variable values
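File integrity verification is commonly done with checksums; this is a generic sketch using MD5 (any hash function would do):

```python
import hashlib
import os
import tempfile

def file_md5(path):
    """MD5 of a file, read in chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a throwaway file; in practice, store the checksum next to each
# raw/processed file and re-check it before reusing the file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
print(file_md5(tmp.name))  # 5d41402abc4b2a76b9719d911017c592
os.unlink(tmp.name)
```

Recording the checksum at creation time makes silent corruption or accidental modification detectable later, which is exactly the kind of human-error safeguard the slide advocates.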
47. Take home message
What is your sample?
Which is your sample?
Where are data and results?
Can you seamlessly process multiple samples?
Can you or anybody else reproduce your results?
52. Take home message
What is your sample? → Collect the metadata of your experiments systematically
Which is your sample? → Establish a system: each sample gets a unique identifier
Where are data and results? → Structured and hierarchical organization of the data
Can you seamlessly process multiple samples? → Scalability, parallelization, automatic configuration and modularity of the code
Can you or anybody else reproduce your results? → Document, document and document!
53. In case you forget the take home message…
The human factor is the greatest hurdle for reproducibility
Limit or control human intervention by automating every step of
the data analysis as much as possible
It’s not you, it’s the lab culture
55. Your involvement in the data analysis is a choice
The data analysis itself is not
56. Your involvement in the data analysis is a choice
The data analysis itself is not
[Chart: as your involvement in the data analysis grows, your autonomy increases and your dependence on bioinformaticians decreases]