The next terminal – Jupyter
With examples from Bioinformatics
@lynnlangit
How often do you use the terminal?
Terminal Customizations
Prompt Output Aesthetics Code Comments Graphics
Terminal improved
What does this Code do?
But it’s not good enough
Why not?
Machine Learning
Too much data to process? Or too much code? Can you ‘see’ what is happening?
What does this Code do?
Which algorithm?
Visualizing Data Processing ML Code
Which algorithm?
Now – more data, much more…
IoT increases data volume and complexity exponentially
Inspired by Mathematica
Thanks, Stephen Wolfram
If you can SEE it (your data and code), you can work with it better
Next terminal -> a better Python REPL
• Fernando Pérez started IPython (Interactive Python) in 2001
• Modeled on Mathematica notebooks
• 2011: IPython Notebook, the notebook runs in a browser
• 2014: IPython Notebook -> Project Jupyter (Jupyter Notebook)
Enter Jupyter Notebooks
Jupyter Notebooks support the ML lifecycle
1. Collect Data
• Retrieve Files
• Query SQL Databases
• Call Web Services
• “Scrape” Web Pages
2. Prepare Data
• Explore Data
• Validate Data
• Clean Data
• Features / Data
3. Train Model
• Prepare Training Set
• Experiment
• Test Model
• Visualize
4. Evaluate Model
• Test Performance
• Compare Models
• Validate Model
• Visualize
5. Deploy Model
• Export Model File
• Prepare Job
• Deploy Container
• Re-package Model
Execute code blocks:
- Python, R… code
- SQL queries
- Shell commands
Write Documentation:
- Markdown language
Visualize Data
- Viz tools…
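The first four lifecycle steps map naturally onto a handful of notebook cells. The following is a minimal sketch using only the Python standard library; the table name, the values, and the mean-value "model" are hypothetical stand-ins chosen for illustration, not something shown in the talk.

```python
import sqlite3
import statistics

# 1. Collect: query a SQL database. An in-memory SQLite table stands in
#    for a real data source; the table and its values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sample_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 2.0), (2, None), (3, 4.0), (4, 6.0), (5, 8.0)],
)
rows = conn.execute(
    "SELECT sample_id, value FROM readings ORDER BY sample_id"
).fetchall()
conn.close()

# 2. Prepare: drop rows with missing values.
clean = [value for _, value in rows if value is not None]

# 3. Train: "fit" a trivial baseline model that predicts the training mean.
train, test = clean[:3], clean[3:]
model_mean = statistics.mean(train)

# 4. Evaluate: mean absolute error on the held-out values.
mae = statistics.mean(abs(v - model_mean) for v in test)
print(f"baseline prediction={model_mean}, MAE={mae}")
```

In a notebook each numbered comment would typically be its own cell, so intermediate results such as `rows` or `clean` can be inspected before moving on.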
Jupyter Visualizations –
so many possibilities
Notebook Customizations
Multiple
Runtimes
Languages
Share output
Code or
Equations
LaTeX
Math
Comments
Markdown
Wiki-like
Graphics
Visualizations
Charting
Results
LIVE
DOCUMENTATION
Reproducible
Research
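What makes a notebook "live documentation" is that Markdown (including LaTeX math) and code sit side by side in one file, and that file is plain JSON. The sketch below builds such a file by hand with the standard library, following the nbformat 4 layout; the file name and cell contents are illustrative assumptions.

```python
import json
import os
import tempfile

# A minimal notebook in the nbformat 4 layout: one Markdown cell with
# LaTeX math and one code cell. Cell contents are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Sample variance\n"
                      "$s^2 = \\frac{1}{n-1}\\sum_i (x_i - \\bar{x})^2$",
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "import statistics\nstatistics.variance([1, 2, 3, 4])",
        },
    ],
}

# Write the notebook to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "live_documentation.ipynb")
with open(path, "w", encoding="utf-8") as f:
    json.dump(notebook, f, indent=1)

# Reading it back shows that the file is ordinary JSON.
with open(path, encoding="utf-8") as f:
    cell_types = [cell["cell_type"] for cell in json.load(f)["cells"]]
print(cell_types)
```

Because the format is text, notebooks can be versioned and shared like any other source file, which is what makes them useful for reproducible research.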
Example
Jupyter locally
Mathematica evolved…
Jupyter Notebook
• Market leader
• Started for single use
• Academic community
• GitHub integration
• Added JupyterHub for collaboration
Zeppelin Notebook
• Started for collaboration
• Enterprise
• Security
Vendor Notebook
• Databricks for Apache Spark
• Jupyter-like, but proprietary format
Running Notebooks
Desktop
• Install and run
Local Server
• Can use JupyterHub for groups
Cloud
• Large number of options
Extending, Refactoring Open Notebooks
• Write functions in one notebook
• Link to another notebook
• Write extensions (nbextensions.com)
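One way to picture "write functions in one notebook, link to another" is to load the code cells of a helper notebook into a namespace. The sketch below does this with the standard library only; it is an illustration under stated assumptions, not an official Jupyter mechanism, and `load_notebook_functions`, `helpers.ipynb`, and `gc_content` are hypothetical names.

```python
import json
import os
import tempfile


def load_notebook_functions(path):
    """Execute the code cells of a notebook file and return the namespace."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    namespace = {}
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        source = cell["source"]
        # nbformat allows the source to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        exec(source, namespace)  # only do this with notebooks you trust
    return namespace


# Hypothetical "library" notebook holding one helper function.
library_nb = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "def gc_content(seq):\n",
                "    return (seq.count('G') + seq.count('C')) / len(seq)\n",
            ],
        }
    ],
}
path = os.path.join(tempfile.mkdtemp(), "helpers.ipynb")
with open(path, "w", encoding="utf-8") as f:
    json.dump(library_nb, f)

helpers = load_notebook_functions(path)
print(helpers["gc_content"]("GATTACA"))
```

Cells that contain IPython-only syntax (lines starting with `%` or `!`) would not run under plain `exec`, so a helper notebook used this way should stick to ordinary Python.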
Raise the bar
Personalized medicine via genomic analysis
Reproducible Research – Experiments as Code
Bioinformatics | Denis C. Bauer | @allPowerde
Genomic Research Tools: Two Examples
• GT-Scan2: How can genome engineering be made more effective?
• VariantSpark: How to find disease genes in population-size cohorts?
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Machine learning…
on 1.7 Trillion data points
https://www.projectmine.com/about/
VariantSpark - Parallelize Random Forest for scalability
• Spark ML’s RF was designed for ‘Big’ low-dimensional data.
• The full genome-wide profile does NOT fit into an executor’s memory.
“Cursed” Big Data: e.g. genomics
• Moderate number of samples with many features
• Feature set too large to be handled by a single executor
VariantSpark - Parallelize RF to scale with features
Flip the matrix: partition by column
Firas Abuzaid (Spark Summit 2016), “Yggdrasil: Faster Decision Trees Using Column Partitioning in Spark”
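The "flip the matrix" idea can be shown on a toy example: store the data feature by feature, hand each worker a subset of features plus the labels, let each worker find its best local split, and keep the global best. This is a plain-Python sketch of the principle only; it is not VariantSpark's or Yggdrasil's implementation, and the Gini criterion, the data, and all names are assumptions made for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_in_partition(partition, labels):
    """Return the best (gain, feature, threshold) among one partition's columns."""
    n = len(labels)
    parent = gini(labels)
    best = (0.0, None, None)
    for name, column in partition:
        # Candidate thresholds: every distinct value except the largest.
        for threshold in sorted(set(column))[:-1]:
            left = [y for x, y in zip(column, labels) if x <= threshold]
            right = [y for x, y in zip(column, labels) if x > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - weighted
            if gain > best[0]:
                best = (gain, name, threshold)
    return best


# Rows are samples, columns are variants (toy genotype counts 0/1/2).
samples = [
    [0, 2, 1, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [1, 0, 0, 2],
]
labels = [1, 1, 0, 0]  # case/control label per sample
names = ["v1", "v2", "v3", "v4"]

# Flip the matrix: one entry per feature, holding its value for every sample.
columns = list(zip(names, map(list, zip(*samples))))

# Partition by column: each "executor" gets a subset of the features.
num_partitions = 2
partitions = [columns[i::num_partitions] for i in range(num_partitions)]

# Each partition reports its local best split; the driver keeps the global best.
local_best = [best_split_in_partition(p, labels) for p in partitions]
gain, feature, threshold = max(local_best, key=lambda b: b[0])
print(feature, threshold, round(gain, 3))
```

The point of the layout is that a worker never needs the full feature set: it only holds its own columns and the label vector, which stays small when samples are moderate and features are many.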
Wide RF scalable with features and samples
New -- Python API for VariantSpark
# imports are not shown on the slide; these module paths are assumed
from pyspark.sql import SparkSession
from varspark import VariantsContext

# set up context and input parameters
# (sc is the SparkContext that the notebook environment provides)
spark = SparkSession(sc)
vc = VariantsContext(spark)
label = vc.load_label('dius/data/chr22-labels.csv', 'col_name')
features = vc.import_vcf('dius/data/chr22_1000.vcf')

# instantiate analysis (parameters are type-checked)
imp_analysis = features.importance_analysis(label)

# get significant factors as both a tuple list and a dataframe
imp_vars = imp_analysis.important_variables(20)
most_imp_var = imp_vars[0][0]
imp_df = imp_analysis.variable_importance()
oob_error = imp_analysis.oob_error()

# convert to work with common Python tools
pandas_imp_df = imp_df.toPandas()
Demo VariantSpark
Jupyter for Genomics Research
Cloud-based Jupyter
PaaS
• AWS SageMaker
• Azure Notebooks
• Others…
Example - GT-Scan2
Jupyter for Genomics Research
Tools for Jupyter
• Binder for GitHub
• Point to your GitHub repo containing:
  - Jupyter notebooks
  - requirements.txt
• It builds a Docker image
• You can run your notebooks
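Pointing Binder at a repository comes down to opening a launch URL. The helper below is a small sketch of how such a URL could be assembled; `binder_url` and the repository names are hypothetical, and the `v2/gh/...` path layout and the `filepath` query parameter are written as I recall them from Binder's documentation, so treat them as assumptions to verify.

```python
from urllib.parse import quote


def binder_url(owner, repo, ref="HEAD", notebook=None):
    """Build a mybinder.org launch URL for a public GitHub repository."""
    url = (
        "https://mybinder.org/v2/gh/"
        f"{quote(owner)}/{quote(repo)}/{quote(ref, safe='')}"
    )
    if notebook:
        # `filepath` asks Binder to open a specific notebook after launch.
        url += f"?filepath={quote(notebook)}"
    return url


# Hypothetical repository and notebook names, for illustration only.
print(
    binder_url(
        "example-user",
        "example-repo",
        ref="master",
        notebook="notebooks/demo.ipynb",
    )
)
```

The repository still needs a `requirements.txt` (or a similar environment file) at its root so that Binder knows what to install when it builds the Docker image.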
Example
Binder
Future of Jupyter for Research
Academic institutions and research labs:
• UC Berkeley, Davis, San Diego
• Cal Poly San Luis Obispo
• Clemson University
• CU Boulder
• U of Illinois, Minnesota, Missouri, Rochester, Texas
• MIT
• Michigan State U
• Texas A&M


Understanding Jupyter notebooks using bioinformatics examples


Editor's notes

  1. http://www.omgubuntu.co.uk/2017/06/terminus-modern-highly-configurable-terminal-app-windows-mac-linux
  2. telnet towel.blinkenlights.nl
  3. Left-skewed, negative distribution
  4. History talk from Cristian Prieto (NDC Oslo 2016) -- https://vimeo.com/223984769 http://blog.fperez.org/2012/01/ipython-notebook-historical.html
  5. Local install: pip install "ipython[all]" or pip install jupyter[all]; alternatively use Anaconda, which installs Jupyter notebooks by default. R support can be installed as well. You can use Docker (the 2.1 GB image contains all libraries), or you can use Azure Notebooks or AWS SageMaker notebooks. Only Python 2 is installed by default; you can install other runtimes. Start and run in a local browser (no database, uses local .json files): ipython notebook -> localhost:8888/tree. Uses GitHub-flavored Markdown by default. https://dwhsys.com/2017/03/25/apache-zeppelin-vs-jupyter-notebook/
  6. https://github.com/ipython-contrib/jupyter_contrib_nbextensions Install with pip install jupyter_contrib_nbextensions or conda install -c conda-forge jupyter_contrib_nbextensions
  7. https://github.com/Microsoft/Elevation/blob/master/notebooks/aggregation.ipynb https://www.microsoft.com/en-us/research/project/crispr/
  8. Using this instead?
  9. Less conclusion, more implementation
  10. https://www.gt-scan.net/ and an AMA with Dr. Bauer: https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
  11. https://medium.com/@lynnlangit/aws-sagemaker-for-bioinformatics-b8e8a96479d8 Jupyter on GCE VM -- https://towardsdatascience.com/running-jupyter-notebook-in-google-cloud-platform-in-15-min-61e16da34d52
  12. https://mybinder.org/ allows you to run notebooks stored in GitHub; also https://nbviewer.jupyter.org/, which renders them as static pages
  13. http://jupyterhub-tutorial.readthedocs.io/en/latest/ https://github.com/jupyterhub/jupyterhub-tutorial/blob/master/JupyterHub.pdf http://jupyterhub.readthedocs.io/en/latest/gallery-jhub-deployments.html