Big Datasets and Highly Sensitive Data
Bennet McComish
31 July 2017
Computational Genomics
Study of the structure, function, evolution, and mapping of genomes
Genes control our basic biology, how the body works, how we respond to
drugs
Changes in your genome make you who you are
They can also cause disease (such as cancer) or mean your cancer therapy
doesn’t work (or works really well)
We study those changes to understand and improve your health
2/14
What is the human genome?
The genome is basically a string of letters (A T C G)
1 human genome = 3.2 billion letters or ‘bases’ spread across 23
chromosomes
About 3% of the genome (~96 million bases) ‘codes’ for ~25,000 genes
Print version of one genome at the “Wellcome Collection”
120 books, 1000 pages each at 4.5 point text
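As a rough check on these figures, the arithmetic can be sketched in a few lines. The 2-bits-per-base packing is an assumption made here for illustration; it is not a storage format mentioned in the talk.

```python
# Back-of-the-envelope sizes for one human genome.
GENOME_BASES = 3_200_000_000   # 3.2 billion bases (from the slide)
BOOKS = 120                    # printed volumes (from the slide)
PAGES_PER_BOOK = 1000          # pages per volume (from the slide)

# Four letters (A, T, C, G) fit in 2 bits each: an illustrative assumption.
BITS_PER_BASE = 2
packed_bytes = GENOME_BASES * BITS_PER_BASE // 8

# Number of letters that must fit on each printed page.
bases_per_page = GENOME_BASES / (BOOKS * PAGES_PER_BOOK)

print(f"2-bit packed size: {packed_bytes / 1e6:.0f} MB")
print(f"bases per printed page: {bases_per_page:,.0f}")
```

Under these assumptions a single genome is well under a gigabyte as bare letters; the much larger volumes on later slides come from sequencing each position many times and storing quality scores alongside.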
3/14
Genome sequencing
Technology now allows us to read the code of our genomes
We have a human ‘reference’ genome – made of the most common (3.2
billion base) sequence
We compare a person’s genome with the reference to find all the ‘different’
sites (~3 million per person or 0.1%)
Then only focus on the places where there are differences
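The comparison step can be sketched as below. This is a toy illustration only: the function name `variant_sites` and both sequences are made up, the two sequences are assumed to be already aligned and of equal length, and only single-base substitutions are detected. Real analyses align short reads to the reference and use dedicated variant-calling software.

```python
from __future__ import annotations


def variant_sites(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, ref_base, sample_base) for every mismatching site.

    Positions are 1-based. Both sequences must already be aligned to the
    same length, so insertions and deletions are out of scope here.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


# Made-up example sequences, 11 bases each.
reference = "ATCGGATTACA"
sample = "ATCGCATTACG"
variants = variant_sites(reference, sample)
print(variants)
```

Keeping only the differing sites is what shrinks the data from billions of bases to a few million records per person.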
4/14
Genome variation
5/14
Cost of sequencing
Approaching the "$1000 genome"
Exponential increase in the number of genomes being sequenced
Bottleneck has moved from data generation to data analysis
6/14
"Big data"
HiSeq 200G run

Data type                            Size        Notes
Image data                           32 TB       discarded
Intensity data                       2 TB        usually discarded
Raw sequence and quality score data  250 GB      backed up
Aligned sequence                     100 GB      aligned to ref. genome
Variation data                       1-10 GB     used in most analysis
Filtered variants of interest        50-500 MB   depends on study
7/14
Data overload?
Don't try to drink from the fire hydrant!
Use smart study design
One study: 254 samples from 5 large families
Filter the data:
· changes that alter proteins
· changes that run in families
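The two filters above might look something like the following sketch. Everything here is hypothetical: the `Variant` record, the `alters_protein` flag, the site labels and the family members are invented for illustration, and the rule used (carried by every affected relative and by no unaffected relative) is a deliberately strict simplification of "runs in families".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    site: str                  # made-up label such as "chr1:1000"
    alters_protein: bool       # hypothetical annotation flag
    carriers: frozenset[str]   # IDs of family members carrying the variant


def filter_variants(variants, affected, unaffected):
    """Keep protein-altering variants carried by all affected family
    members and by none of the unaffected ones."""
    return [
        v for v in variants
        if v.alters_protein
        and affected <= v.carriers
        and not (unaffected & v.carriers)
    ]


affected = {"mother", "daughter"}
unaffected = {"father"}
candidates = [
    Variant("chr1:1000", True, frozenset({"mother", "daughter"})),
    Variant("chr2:2000", False, frozenset({"mother", "daughter"})),
    Variant("chr3:3000", True, frozenset({"mother", "father"})),
    Variant("chr4:4000", True, frozenset({"mother", "daughter", "father"})),
]
kept = filter_variants(candidates, affected, unaffected)
print([v.site for v in kept])
```

A real study would usually relax the rule to allow for incomplete penetrance and genotyping error, but even this crude version shows how quickly the candidate list shrinks.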
8/14
Pipelines
Use fast parallelised analysis pipelines where possible
Even a parallelised pipeline takes several weeks to align 30 samples and call
variants
This makes it difficult to use standard HPC queuing systems
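Sample-level parallelism, one common way to parallelise such a pipeline, can be sketched as below. This is not the pipeline used in the study: `process_sample` is a placeholder that only builds an output name, the sample IDs are invented, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_id: str) -> str:
    """Stand-in for one sample's align-and-call step.

    A real pipeline would launch external alignment and variant-calling
    tools here; this placeholder only returns an output file name.
    """
    return f"{sample_id}.variants"


def run_pipeline(sample_ids, max_workers=4):
    """Process samples concurrently, returning outputs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, sample_ids))


samples = [f"sample{i:02d}" for i in range(1, 31)]  # 30 hypothetical samples
outputs = run_pipeline(samples)
print(len(outputs), outputs[0], outputs[-1])
```

Threads are adequate when each worker mostly waits on an external program; CPU-bound work done in Python itself would call for a process pool instead.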
9/14
Menzies Computational Genomics Cluster
Sunnydale
· 4 compute nodes
· 250 CPUs
· 2 TB RAM
· 214 TB working data
· 200 TB secure archive storage
10/14
Data storage requirements
The Australian Code for the Responsible Conduct of Research requires us to keep
research data and primary materials
All raw sequence data and final filtered data must be kept
Can discard some intermediate files, but need a large amount of fast
working storage
Data generation is now much cheaper and faster than data analysis
Data storage, transfer and analysis now critical
11/14
Indigenous genomes
High incidence of vulvar cancer in the East Arnhem Indigenous population
Ten years' work securing appropriate consent
Consent strictly limited to the vulvar cancer study - Indigenous communities are
often wary of genetic research
Risk management - damage to public perception and trust is often the biggest risk
identified - far worse than losing data
12/14
Family studies
We infer family relationships from genetic data
These sometimes differ from those reported by the families
We can also infer information about family members not involved in the
study
Full pedigrees can't always be published or shared
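To illustrate how relatedness shows up in genetic data, here is a minimal sketch based on identity-by-state (IBS) sharing. It is not the method used in the study: the function name and all genotype values are made up, and real relationship inference uses many thousands of markers and methods that account for allele frequencies.

```python
def ibs_similarity(genotypes_a, genotypes_b):
    """Mean identity-by-state sharing between two individuals.

    Genotypes are counts of the non-reference allele (0, 1 or 2) at the
    same sites. Each site scores 1 - |a - b| / 2, so identical genotypes
    score 1.0 and opposite homozygotes score 0.0.
    """
    if len(genotypes_a) != len(genotypes_b) or not genotypes_a:
        raise ValueError("need two non-empty genotype lists of equal length")
    shared = sum(1 - abs(a - b) / 2 for a, b in zip(genotypes_a, genotypes_b))
    return shared / len(genotypes_a)


# Made-up genotypes at six sites.
parent = [0, 1, 2, 1, 0, 2]
child = [0, 1, 1, 1, 1, 2]
stranger = [2, 0, 0, 2, 2, 0]

print(ibs_similarity(parent, child))
print(ibs_similarity(parent, stranger))
```

Because the score depends only on the genotypes, it is computed regardless of what a family has reported, which is why inferred and reported pedigrees can disagree.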
13/14
Genomes technically identifiable
Privacy Act 1988 - information is "personal" if identity "can reasonably be
ascertained" from it
Identifying someone from their genome sequence is feasible and getting
easier
Gymrek et al. (2013) Science 339:321
Shared/cloud resources more challenging to use in terms of data privacy
14/14
