Leveraging Big Data to Accelerate
Biomedical Research:
Overlaying Computational Knowledge, Natural Language Processing, Artificial
Intelligence, and Data Standardization onto Pubmed and Affiliated Databases to
shift from a Google Paradigm to a Wolfram Alpha Paradigm

Steven Koval, Milap Thaker, and Ping Zhu
Information Systems 350.620.72
The Johns Hopkins University
William Agresti, Professor
12 December 2013
How is Data Used in the Biomedical
Field?
Types of Data (Data Diversity)
Challenge: Data is both quantitative and visual
Cell Images
Gene Amplifications
Blots/Gels
Sequences

Adrenal_ALP_L311 D   5379   TGAGATGAAGCACTGTAGCTCT    22
Adrenal_ALP_L311 D   5167   TGGAAGACTAGTGATTTTGTTGT   23
Adrenal_ALP_L311 D   5167   TGGAAGACTAGTGATTTTGTTGT   23
Adrenal_ALP_L311 D   5167   TGGAAGACTAGTGATTTTGTTGT   23
Adrenal_ALP_L311 D   4750   TCAGTGCACTACAGAACTTTGT    22
Big Data vs. Biology - Challenges

Open science
Biological research and discovery call for access to all kinds of
data sources.

Too many data types to standardize
Data ranges from small-scale Western blot gel pictures to the large
data sets of next-generation sequencing.

Utility of data is low
There is too much data to analyze and use effectively.

Requires strong analytic skills
With so many variables, a gene regulatory network, for example, is
far more complicated than a network of Facebook accounts.
Big Data and Biology: Recommendations

New search engine
Go beyond Google search and allow scientists to ask precise
academic questions and get accurate snapshots of specific
biological topics, e.g. “What is the statistical confidence of the
relationship between p53 and breast cancer, based on scientific
publications with an impact factor (IF) of 4 and above during the
past 5 years?”

Tap into massive existing data first
By analyzing massive clinical data, can we answer questions such
as “What is the best treatment plan for a woman aged 60-80 with
breast cancer and high blood pressure?”

A standardized operation model
Unify and standardize data of all kinds from biological experiments,
such as images, graphs, curves, numbers, binary factors, and
descriptions in words.

New analytic tools
New ways of modeling and simulation, e.g. use computer-generated
models and simulations instead of live animal experiments to simulate
the regulatory state of a gene network. Map the network with all the
data sources to set a reliable research foundation.
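
The "precise academic question" in the first recommendation can be read as a structured filter over publication records. The following is a minimal sketch under stated assumptions, not the proposed system: the `Publication` record, the `confidence` function, and all sample records are hypothetical, and "statistical confidence" is simplified here to the share of selected papers that report a positive relation.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """Hypothetical, already-standardized record for one paper."""
    gene: str
    disease: str
    year: int
    impact_factor: float
    supports_link: bool  # True if the paper reports a positive relation


def confidence(pubs, gene, disease, min_if, since_year):
    """Return (share of supporting papers, number of papers considered).

    Assumption: "confidence" is just the supporting share among papers
    that pass the impact-factor and year filters.
    """
    selected = [
        p for p in pubs
        if p.gene == gene
        and p.disease == disease
        and p.impact_factor >= min_if
        and p.year >= since_year
    ]
    if not selected:
        return None, 0
    supporting = sum(p.supports_link for p in selected)
    return supporting / len(selected), len(selected)


# Made-up records for illustration only; they are not real findings.
pubs = [
    Publication("p53", "breast cancer", 2012, 6.1, True),
    Publication("p53", "breast cancer", 2011, 4.5, True),
    Publication("p53", "breast cancer", 2010, 9.0, False),
    Publication("p53", "breast cancer", 2013, 2.0, True),   # IF below 4
    Publication("p53", "breast cancer", 2005, 7.0, True),   # too old
    Publication("BRCA1", "breast cancer", 2012, 5.0, True),  # other gene
]

score, n = confidence(pubs, "p53", "breast cancer", min_if=4.0, since_year=2009)
print(f"{score:.0%} of {n} publications report a positive relation")
```

The point of the sketch is the shape of the answer: one number plus the size of the evidence base, rather than a list of papers.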
Big Data – “Computational Knowledge
Engine” (Health/Medicine) – IPSO Model
1. System Boundary – Data bits/queries move across the system
   boundary.
2. Input – Keyword or question typed into the search field (what you
   want answered). Example – What is p53 in cancer?
3. Process – Digitize the keyword(s) or question. Run an algorithmic
   keyword match and/or query by sending bits of data to the data mart
   for retrieval.
4. Storage – The storage warehouse includes all indexed “Human Expert
   Knowledge” material related to health/medicine. Example – Genetic
   analysis and gene studies.
5. Output – Bits of data travel from the storage warehouse to the
   “inquirer” with answers and directly related material. No noise,
   just pertinent results (graphs, definitions, etc.).
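
The input, process, storage, and output steps above can be sketched as a very small pipeline. This is an illustrative toy under stated assumptions: `WAREHOUSE`, `process`, `retrieve`, and `answer` are hypothetical names, the warehouse is an in-memory dictionary standing in for the indexed storage, and the stored text is placeholder content rather than real expert knowledge.

```python
import re

# Storage: in-memory stand-in for the indexed knowledge warehouse.
# Keys and texts are hypothetical placeholders.
WAREHOUSE = {
    "p53": "placeholder definition for p53",
    "brca1": "placeholder definition for brca1",
}


def process(question):
    """Process: normalize the question into a set of lowercase keywords."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def retrieve(keywords):
    """Storage lookup: keep only warehouse entries matching a keyword."""
    return {term: WAREHOUSE[term] for term in keywords if term in WAREHOUSE}


def answer(question):
    """Input -> Process -> Storage -> Output, returning pertinent entries only."""
    hits = retrieve(process(question))
    return [f"{term}: {text}" for term, text in sorted(hits.items())]


print(answer("What is P53 in cancer?"))
```

A question with no matching keywords returns an empty list, which is the toy equivalent of "no noise, just pertinent results".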
Big Data – “Computational Knowledge
Engine” (Health/Medicine)





Expert System – Software that uses a knowledge
base of human expertise for problem solving
Different from a search engine: this is a
computational knowledge engine!
Leverage Big Data and expert knowledge from PubMed, PLoS, CDC, NIH,
etc.
What Does Our Software Do?
Full Corpus of
Knowledge

• Everything that can be
known about Biology
• Most of this will not be
relevant to practical studies
• A lot of research time is wasted
re-proving what is already known

What We
Need to Know

• This is where our artificial
intelligence will lead the
researcher – from what they
know → what they need to
know through practical
knowledge expansion

What We
Know

• Information we already have
about our topic
• Subject knowledge on a
subset of biology
How Will It Do This?








Uses natural language processing to determine the meanings of
phrases in scientific publications and converts them to logical
statements that are then aggregated (semantic search applied
to PubMed).
Inputs such as “p53 in cancer X” no longer lead to endless
lists of papers. Instead, the output is “p53 is found to be
downregulated in cancer X.” This is a YES/NO system
that allows subsequent conditional logic to inform
future NIH funding decisions.
Professors use their PhD postdocs as reading machines. This
system will instead provide a quality-controlled data-scouring
tool that probabilistically aggregates data on their subject as
bivalent statements (p53 causes cell death) as opposed to
abstract observations (p53 may have a role in cell death).
Quantitation will allow the AI to say, “there is a 95% correlation
between p53 upregulation and cell death in breast cancer.”
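
The idea of turning sentences into logical statements and then aggregating them can be sketched with a deliberately simple pattern matcher. This is not real natural language processing: the regular expression, the function names `extract` and `aggregate`, and the example sentences are all assumptions made for illustration, and a production system would need a proper biomedical NLP pipeline.

```python
import re
from collections import Counter

# Toy pattern: "<gene> is/was [found to be] upregulated|downregulated in <disease>"
PATTERN = re.compile(
    r"(?P<gene>\w+) (?:is|was) (?:found to be )?"
    r"(?P<direction>upregulated|downregulated) in (?P<disease>[\w ]+)",
    re.IGNORECASE,
)


def extract(sentence):
    """Turn one sentence into a (gene, direction, disease) statement, or None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None  # vague sentences ("may have a role") yield no statement
    return (m["gene"].lower(), m["direction"].lower(), m["disease"].strip().lower())


def aggregate(sentences, gene, disease):
    """Return (majority direction, its share, statement count) or None."""
    counts = Counter()
    for sentence in sentences:
        statement = extract(sentence)
        if statement and statement[0] == gene and statement[2] == disease:
            counts[statement[1]] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    direction, n = counts.most_common(1)[0]
    return direction, n / total, total


# Hypothetical sentences, not quotations from real papers.
sentences = [
    "p53 is found to be downregulated in cancer X.",
    "We report that P53 was downregulated in cancer X",
    "p53 is upregulated in cancer X",
    "p53 may have a role in cancer X",
    "p53 is downregulated in cancer Y",
]

direction, share, total = aggregate(sentences, "p53", "cancer x")
print(f"p53 is {direction} in cancer X ({share:.0%} of {total} statements)")
```

Here the share of agreeing statements stands in for the quantitation step; it is a simple vote count, not a correlation coefficient.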
Conclusion









Every single piece of public data found by a biological
researcher (PubMed/ExPASy/Human Genome/NIH) is
converted to data that can go into a database cell.
Gene sequences, western blots, gel images, chemical
interactions – literally everything is reduced to data that
can be housed in a single super database.
Outputs can be simple Wolfram-Alpha style bivalent
responses.
At the same time, simple artificial intelligence leverages the
big data and produces new areas of tactical research –
expanding from what we know to identifying what we
should know.
In essence, we are applying the conditional logic of
mathematics to biology.
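
As a rough illustration of "everything reduced to data that can be housed in a single database" with bivalent outputs, here is a minimal sketch using Python's built-in `sqlite3` module. The `observation` table, its columns, the `is_reported` helper, and every row are hypothetical choices made for this example; a real schema would need far more structure.

```python
import sqlite3

# One table holds heterogeneous findings in a common row shape.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE observation (
           source    TEXT,
           data_type TEXT,
           subject   TEXT,
           attribute TEXT,
           value     TEXT
       )"""
)

# Hypothetical rows: a sequence read, a blot measurement, an extracted statement.
rows = [
    ("hypothetical_paper_1", "sequence", "sample_A", "read", "TGAGATGAAGCACTGTAGCTCT"),
    ("hypothetical_paper_1", "blot", "p53", "band_intensity", "0.42"),
    ("hypothetical_paper_2", "statement", "p53", "downregulated_in", "cancer X"),
]
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?)", rows)


def is_reported(subject, attribute, value):
    """Bivalent answer: is there at least one matching observation?"""
    cur = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM observation "
        "WHERE subject = ? AND attribute = ? AND value = ?)",
        (subject, attribute, value),
    )
    return bool(cur.fetchone()[0])


print("YES" if is_reported("p53", "downregulated_in", "cancer X") else "NO")
```

Storing every value as text keeps the sketch short; typed columns or per-type tables would be the more realistic design.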

Milap Thaker et al.: Leveraging big data in biological science. Presented at Johns Hopkins University, 2013.


Editor's Notes

  1. Today, life science depends more and more on data. Here is a flow chart from BioVance, a full-service CRO (contract research organization) that provides research services for pharmaceutical companies. Fundamentally, the drug discovery process follows a repeating pattern: form a hypothesis, test it, expand the sample size, and test again, until all trial phases are complete. From preclinical work through Phases 1, 2, and 3, the sample size grows at each step, and the quantity of data grows with it.
  2. In fact, long before preclinical trials on animals, many trials and tests have already been performed at the molecular level, on cells and tissues, to gain knowledge of biomarkers. This research is performed in university labs and in institutions such as the NIH and NCI. The discoveries are published in journals such as Nature and Science and presented at international academic conferences such as AACR or Neuroscience. Some of them are collected in public databases, such as miRBase for microRNAs, by the Sanger Institute in the UK. Throughout this process, more and more data is generated every day. With new technologies such as next-generation sequencing, the amount of data an everyday biologist has to handle is growing at an unexpected speed.
  3. Open science
     For biologists, the sheer amount of data and the variety of sources make it a challenge to get the full benefit. A few examples of data sources: NCBI RefSeq, UCSC Known Gene 6.0, and GENCODE v13, plus hundreds and thousands of publications, from SCI journals to school publications and from keynote speeches to conference posters. All of these are constantly updated and evolving. Currently, reaching, comparing, and summarizing these resources still depends heavily on diligent human work. Because of the difficulty and extent of that work, many open sources are never actually reached and used; they are left in unattended corners as if they did not exist.
     Too many types to be standardized
     Biological data is very difficult to organize because of its different types and forms, from a Western blot gel picture to hundreds of millions of reads from next-generation sequencing. Adding clinical symptoms, chemical or other exposures, and demographics makes it a very complicated analysis problem.
     Utility of data is low
     The utility of data in biology is low not only because of the data's complexity but also because of its sheer amount. Simply put, for most biologists it is too costly to compute on the data available or generated. The tools for this level of computing are still very primitive and uncoordinated.
     Requires high analytic skills
     In a social network, the analyst understands exactly what the collected data means: for example, each node in the network represents a Facebook account. Biologists, however, do not know exactly what they are looking at when they look at biological data. The data used to construct the networks is noisy and imprecise, and there is not yet a good understanding of how the many different variables interact in these gene networks. Although a gene regulatory network is smaller than a social network, it is harder to define which gene controls the expression of the other genes.
  4. New search engine: A Google search only returns a pile of results containing the keywords you type. To benefit from the many databases available in the field, we need a computational capability that allows scientists to ask precise academic questions and get accurate snapshots of specific biological topics, e.g., what is the statistical confidence that p53 is related to breast cancer, based on SCI publications with an impact factor (IF) of 4 and above during the past 5 years? Instead of offering a stack of papers or records containing the keyword “p53”, the system would be able to tell you, for example, that the statistical confidence of a positive relation between p53 and breast cancer is 71%, based on 205 SCI publications with an IF of 4 and above during the past 5 years. Tap into massive existing data first: Another untapped resource is the massive body of clinical data. We need a system that can analyze it and answer questions such as: what is the best treatment plan for a woman aged 60-80 with breast cancer and high blood pressure? Traditionally, doctors have had to remember all of their patients' histories in order to accumulate experience and sharpen their skills. Such a system would offer physicians and researchers a new perspective, available instantly. A standardized operation model: Unifying and standardizing the many kinds of data in biological experiments, such as images, graphs, curves, numbers, binary factors, and verbal descriptions, is the first step toward using the massive data in the field. The traditionally descriptive nature of the biomedical field is easy to see in our experience with ultrasound: the doctor writes a descriptive paragraph under the ultrasound picture, and the numbers are derived from observation. Even though genomic data is purely numerical, with A, T, G, and C counted in sequences, the interpretation of gene expression levels in terms of gene network function becomes descriptive again. We cannot simply say that gene A definitely regulates gene B; sometimes gene A regulates gene B indirectly through other mechanisms, such as certain epigenetic factors, or only under certain disease states. New analytic tools: New ways of modeling and simulation have become popular in other fields. Computers can simulate many real-world scenarios, from car crashes to robotic assembly lines. Using computer-generated models and simulations, rather than live animal experiments, to simulate the regulatory state of a gene network is possible. Mapping the network with all available data sources would give biomedical researchers a reliable research foundation and save them tremendous time otherwise spent fishing in oceans of information and guessing at relationships.
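The kind of query described above (a confidence figure computed over a filtered set of publications) might be sketched as follows. This is only an illustration under stated assumptions: the `Publication` record, its fields, and the sample records are hypothetical, and "confidence" is taken here simply as the share of matching papers that report a positive association, which is far cruder than a real statistical treatment.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """Hypothetical, already-extracted summary of one paper."""
    gene: str
    disease: str
    impact_factor: float
    year: int
    positive: bool  # True if the paper reports a positive association


def association_confidence(pubs, gene, disease, min_if, since_year):
    """Return (share of positive papers, number of papers considered)."""
    matches = [
        p for p in pubs
        if p.gene == gene
        and p.disease == disease
        and p.impact_factor >= min_if
        and p.year >= since_year
    ]
    if not matches:
        return None, 0
    share = sum(p.positive for p in matches) / len(matches)
    return share, len(matches)


# Made-up records, for illustration only.
sample = [
    Publication("p53", "breast cancer", 5.2, 2012, True),
    Publication("p53", "breast cancer", 9.1, 2010, True),
    Publication("p53", "breast cancer", 4.4, 2011, False),
    Publication("p53", "breast cancer", 2.0, 2013, True),   # below the IF threshold
    Publication("p53", "breast cancer", 6.0, 2005, True),   # too old
    Publication("BRCA1", "breast cancer", 7.0, 2012, True),  # different gene
]

share, n = association_confidence(sample, "p53", "breast cancer", 4.0, 2009)
print(f"{share:.0%} positive, based on {n} publications")
```

The hard part of the proposal is producing the `positive` field from free text in the first place; the filter and the ratio are the easy part.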
  5. I wanted to use the IPSO diagram/model to portray how our computational knowledge engine would function. System boundary: data bits and queries move across the system boundary. Input: a keyword or question entered into the search field (what you want answered). Example: What is p53 in cancer? Process: digitize the keyword(s) or question, then run an algorithmic keyword match and/or queries by sending bits of data to the data mart for retrieval. Storage: the storage warehouse includes all indexed “human expert knowledge” material related to health/medicine. Example: genetic analysis and gene studies. Each indexed item will be attached to graphs, images, definitions, and directly relevant material within the storage warehouse. Our intent is to deliver comprehensive “answers” to our inquirers. Output: bits of data travel from the storage warehouse to the inquirer with answers and directly related material. No noise, just pertinent results (graphs, definitions, etc.).
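The input, process, storage, and output stages above could be sketched roughly like this. It is a toy under stated assumptions: the `STORAGE` dictionary, its two entries, the attachment names, and the function names are all hypothetical stand-ins for the indexed storage warehouse, and the "process" step is plain keyword matching.

```python
import re

# Storage: hypothetical indexed entries, each attached to related material.
STORAGE = {
    "p53": {
        "definition": "Tumor suppressor protein encoded by the TP53 gene.",
        "attachments": ["graph:placeholder", "image:placeholder"],
    },
    "brca1": {
        "definition": "Gene associated with hereditary breast cancer risk.",
        "attachments": ["graph:placeholder"],
    },
}


def process(query):
    """Process: normalize the query and keep only tokens that are indexed."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t in STORAGE]


def answer(query):
    """Output: return only the matched entries, with no unrelated results."""
    return {key: STORAGE[key] for key in process(query)}


# Input: the example question from the slide.
result = answer("What is P53 in cancer?")
print(result)
```

A query with no indexed terms returns an empty dictionary rather than a list of loosely related documents, which is the "no noise" behavior the slide asks for.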
  6. In class we briefly went over the topic of expert systems, and I believe our desired product relates directly to it. We want to use approved human expertise within the health/medicine field. Accredited material found on health care sites such as PubMed, CDC, NIH, and WebMD will make up our human expertise storage warehouse. An existing “computational knowledge engine” called Wolfram Alpha is an excellent example of what we are looking to create. I have included a video in our PowerPoint presentation that explains what a computational knowledge engine does and how it differs from a standard search engine. Wolfram Alpha covers many fields and does not have a very extensive health/medicine database. Our product will focus solely on health/medicine materials.
  7. The purpose of education is to test the boundaries of what we know, and then to expand those boundaries into what we do not know. The problem we run into is learning that drifts into areas that are irrelevant (science) as opposed to applied science (technology). The vision of our software is to convert qualitative and other data into quantitative data that can be housed in a table, so that the data can be manipulated by other, higher-order software to reveal trends. The ideal software will take us from what we know to what we need to know in a way that is logical and does not waste limited scientific resources or time on endless quests or irrelevant research. The software will tell scientists what to study next to drive scientific research forward.
  8. The purpose of our project is to create a tool that rapidly improves the productivity of scientists by arming them with better data. Our tool is a meta-level scientific publication crawler that uses natural language processing to determine the meanings of phrases in scientific publications and converts them into logical statements, which are then aggregated. Once the statements are aggregated, the tool will apply basic logic functions and statistics to determine probable scientific truths. These truths will be the output of our web tool. The program is intended to apply computational knowledge and organization to scientific publications, much in the way that Wolfram Alpha applies computational knowledge to mathematics. When given the query "square root of 2," Wolfram Alpha does not simply respond with all kinds of data, including irrelevant data such as papers on "a computational method for the fact checking of the square root of two" or images of two squares. Rather, it gives a very specific output, because it treats the words "square root" and "two" as specific sub-parameters of a function that it is to evaluate.
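The idea of treating a query as a function plus its parameters, rather than as a bag of keywords, can be illustrated with a very small sketch. It says nothing about how Wolfram Alpha actually works internally; the `FUNCTIONS` table, the query pattern, and the `evaluate` name are assumptions made for this example.

```python
import math
import re

# Hypothetical table mapping phrases to the functions they name.
FUNCTIONS = {
    "square root": math.sqrt,
    "absolute value": abs,
}


def evaluate(query):
    """Parse '<function name> of <number>' and evaluate it; None if no match."""
    q = query.lower().strip()
    for name, fn in FUNCTIONS.items():
        m = re.fullmatch(rf"{re.escape(name)} of (\d+(?:\.\d+)?)", q)
        if m:
            return fn(float(m.group(1)))
    return None


print(evaluate("square root of 2"))
print(evaluate("papers about the square root of two"))  # not a function call
```

The first query is evaluated to a single number; the second does not parse as a function call, so nothing is returned instead of a list of loosely related documents.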
  9. Our software, when given an input such as "role of protein p53 in cancer amongst women over age 50," will not simply return endless lists of articles on various p53 studies. Rather, it will give a very specific output, such as "high likelihood (ideally quantified as a percentage) that p53 has an impact in cancer for this age group." It will do this by crawling the endless lists mentioned earlier and using basic natural language processing to convert papers such as "p53 in cancer - qPCR data reveals gene upregulation in metastasizing cells" from University X and "p53 increases in densitometric analyses of western blots of metastasizing cancers" from University Y into logical statements that can be aggregated.
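A heavily simplified version of this conversion step, applied to the two example titles, might look like the sketch below. Real natural language processing would be far more involved; here the "logical statement" is just a hypothetical `(gene, direction)` pair, and the cue-word lists are assumptions chosen to fit these two titles.

```python
import re
from collections import Counter

# Hypothetical cue words signalling the direction of a reported change.
UP = {"upregulation", "upregulated", "increases", "increased"}
DOWN = {"downregulation", "downregulated", "decreases", "decreased"}


def extract_statement(title, gene="p53"):
    """Turn a title into a (gene, direction) statement, or None if unclear."""
    text = title.lower()
    if gene.lower() not in text:
        return None
    words = set(re.findall(r"[a-z0-9]+", text))
    if words & UP:
        return (gene, "up")
    if words & DOWN:
        return (gene, "down")
    return None


def aggregate(titles):
    """Count how many titles support each extracted statement."""
    statements = (extract_statement(t) for t in titles)
    return Counter(s for s in statements if s is not None)


titles = [
    "p53 in cancer - qPCR data reveals gene upregulation in metastasizing cells",
    "p53 increases in densitometric analyses of western blots of metastasizing cancers",
]
print(aggregate(titles))
```

Both titles reduce to the same statement even though they describe different assays (qPCR and western blot), which is the kind of agreement across sources that the aggregation step is meant to surface.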