Design Patterns for Big Data Architecture: Best Strategies for Streamlined [Simple, Powerful] Design

Allen Day, PhD
Data Scientist, MapR Technologies
October 2013
©MapR Technologies - Confidential
Me, Us
• Allen Day, Principal Data Scientist, MapR
– R contributor (10 yr), Hadoop (6 yr)
– Human Genetics (UCLA Medicine), Machine Learning

• MapR
– Distributes open source components for Hadoop
– Adds major enhancements for performance, high availability, and ease of use

• See Also
– “allenday” most places (twitter, github, etc.)
– aday@maprtech.com, allenday@allenday.com
– @mapR
Three Business Use Cases

Personalized Search (personal data)
• Public web index + personal search history
• Custom ranking of results

Personalized Medicine (personal data)
• Patient medical history
• Genomic info.
• Match against database of therapies

Market Segmentation (marketing)
• Group similar customers
• Target with cross-sell / upsell campaign

Which ones are similar? How can you tell?

But First…

WHAT IS A DESIGN PATTERN?

“a design pattern is a general reusable
solution to a commonly occurring
problem within a given context in software
design. A design pattern is not a finished
design that can be transformed directly
into source or machine code. It is a
description or template for how to solve a
problem that can be used in many
different situations”
http://en.wikipedia.org/wiki/Software_design_pattern

Design Pattern Idea
• a general reusable solution to a commonly
occurring problem
• not a finished design
• not code
• can be used in many different situations

History of SW Design Patterns
• 1977: Architecture & Civil Engineering
• 1994: OO Software Architecture
• 2012: Parallelization Software
• ?: Application Parallelization

Not Just Software Designs

http://en.wikipedia.org/wiki/A-line
Choose a Pattern: Volume & Velocity
1. How big is your target data?
• <10 GB
• mid
• >200 GB
2. How big is your query data?
• Single element at a time
• One pass over 100%
• Multiple passes over big chunks
3. How fast do you need a result?
• Throughput > response
• < 100s (human scale)
Patterns: A, B (Big storage), C (Streaming), D (Nearline Analytics), E (Exploratory Analysis)

Twitter Zeitgeist as a Composite of Design Patterns
• Live data source, e.g. Twitter Firehose
• B: Big storage
• C: Streaming
• D: Nearline Analytics
• Downstream applications

Big Data Application Shapes
1. How big is your input record?
2. How big is the data that is relevant to processing the
input record?
3. How big is the total data that could be relevant to
processing the input?
4. How fast do inputs flow in?
5. How fast do outputs need to flow out?
6. How complex (unstructured) are 1-5?
7. How predictable are 1-6? (spikiness, variance)
8. Is accuracy more important than speed?
9. Does the processing contain cycles (feedback loops)?

Dimensions covered: Volume, Velocity, Variety, Intents & Methods

Percolation in Classic Form
• Real-time data source feeds real-time insertion into the data store
• Offline percolation of recent data runs against the data store

Large-scale Incremental Processing Using Distributed Transactions and Notifications
http://research.google.com/pubs/pub36726.html
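
A minimal sketch of the classic form, assuming a toy in-memory store. The names `PercolatingStore`, `insert`, `percolate`, and `word_count` are hypothetical and are not taken from the cited paper or from any MapR API. Records become readable as soon as they are inserted, and a later offline pass touches only the recently inserted rows.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PercolatingStore:
    """Toy in-memory stand-in for a data store that supports percolation."""
    rows: Dict[str, dict] = field(default_factory=dict)
    pending: set = field(default_factory=set)  # keys inserted since the last pass

    def insert(self, key: str, record: dict) -> None:
        """Real-time insertion: the record is stored and readable right away."""
        self.rows[key] = dict(record)
        self.pending.add(key)

    def percolate(self, observer: Callable[[dict], dict]) -> List[str]:
        """Offline pass over recent data only: update each new row in place."""
        processed = sorted(self.pending)
        for key in processed:
            self.rows[key].update(observer(self.rows[key]))
        self.pending.clear()
        return processed


def word_count(record: dict) -> dict:
    # Hypothetical derived field, used only for illustration.
    return {"n_words": len(record["text"].split())}


store = PercolatingStore()
store.insert("t1", {"text": "big data design patterns"})
store.insert("t2", {"text": "percolation"})
derived_before = "n_words" in store.rows["t1"]  # raw row exists, derived field not yet
touched = store.percolate(word_count)
```

A second call to `percolate` finds nothing to do, because only rows inserted since the previous pass are revisited.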
Percolation in Classic Form
• Percolation: real-time insertion into the data store, then offline percolation of recent data
• With a queue: real-time insertion into the queue, delayed insertion into the data store
• Queued data are unavailable for action – not percolation

Percolation of a Composite Store
• Real-time data source feeds real-time insertion into the data store
• Offline percolation from the data store into the index
• Both parts visible

Market Segmentation
• Divide customers into subsets with common
needs
• Design specific strategies for each subset
• Major emphasis on “fresh” data

Market Segmentation
Diagram: Real-time transactions, Feature Extraction, Customer history, Clustering, Market Segments, Assign Segment (search; db, query)

What does this have to do with percolation?

Percolator 1
Diagram: Real-time transactions, Feature Extraction, Customer history

Feature extraction is percolation because it is triggered by the arrival of a new record and because it updates that new record.

Percolator 2
Diagram: Real-time transactions, Customer history, Assign Segment (search; db, query), Market Segments

Market segment assignment is percolation because it is triggered by the arrival of a new record and because only that record's segment is updated.

What about the clustering?

Scheduled Update - Not Percolation
Diagram: Customer history, Clustering, Market Segments

The clustering loop is not percolation since it runs at fixed intervals instead of incrementally as updates are received. It also doesn't update just a single customer record.

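
The split between the two percolators and the scheduled clustering job can be sketched as follows. This is an illustration under stated assumptions: the two-number feature vector, the nearest-centroid rule, and every function and segment name are hypothetical choices, not the method used in the talk.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, float]


def extract_features(transactions: List[float]) -> Vector:
    """Percolator 1: turn one customer's raw transactions into a feature vector."""
    total = sum(transactions)
    return (float(len(transactions)), total / len(transactions))


def assign_segment(features: Vector, centroids: Dict[str, Vector]) -> str:
    """Percolator 2: nearest-centroid search; only this record's segment changes."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda name: sq_dist(centroids[name]))


def recluster(history: Dict[str, dict], centroids: Dict[str, Vector]) -> Dict[str, Vector]:
    """Scheduled job (not percolation): recompute every centroid from all customers."""
    updated = dict(centroids)
    for name in centroids:
        members = [row["features"] for row in history.values() if row["segment"] == name]
        if members:
            updated[name] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated


centroids = {"occasional": (1.0, 10.0), "frequent": (10.0, 50.0)}
history: Dict[str, dict] = {}


def on_new_record(customer: str, transactions: List[float]) -> None:
    """Triggered by the arrival of one record; touches only that record."""
    features = extract_features(transactions)
    history[customer] = {"features": features,
                         "segment": assign_segment(features, centroids)}


on_new_record("alice", [12.0, 8.0])
on_new_record("bob", [40.0] * 9)
centroids = recluster(history, centroids)  # would run at a fixed interval
```

`on_new_record` reads and writes a single customer row, while `recluster` scans the whole history, which is the distinction the slide draws.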
Personalized Search
• Observe web users’ activity over an extended
period
• Understand individual user interests
• Customize search results for each user
• …as fast as possible

Personal Search History and Web Index
Diagram components:
• Persona Activity, Persona Histories, Persona Index (update, trigger)
• Web Crawl, feature extraction, Doc Store, Doc Index (update, trigger)
• Search (db, query)

Percolator 1
Expensive feature extraction does not block document ingest
Diagram: Web Crawl, feature extraction, Doc Store

Percolators 2 and 3
Diagram: Persona Activity, Persona Histories, update, Persona Index; Web Crawl, Doc Store, update, Doc Index

Percolator 4
Updates to personas trigger updates in related personas
Diagram: Persona Activity, Persona Histories, update, Persona Index, Search (db, query)

Percolator 5?
Persona and doc index updates trigger a personalization refresh
Diagram: Persona Histories, Persona Index, Doc Index, trigger, Search (db, query)

Pattern Context
Diagram: Persona Activity, Web Crawl, Encapsulated Process

Cyclic Dependency Graph

Percolator Thoughts
• M7 tables are great as the first persistence point
in percolation
• In-memory flag column family works great for
triggering updates
– Efficient - eliminates need for queuing
– Fast triggering with row & column Bloom filters

• Percolation is best supported by dedicated
column families
– Percolators' I/O characteristics differ
– M7 works especially well because it supports lots of
column families
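
A rough sketch of flag-driven triggering with one column family per percolator. Plain Python dicts stand in for the table, so this is not M7 or HBase client code. The `flags` family, the `put` and `run_percolator` helpers, and the two example percolators are all hypothetical, and Bloom filters are not modeled.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, dict]          # column family name -> columns
table: Dict[str, Row] = {}     # row key -> row; plain dicts, not a real table client


def put(key: str, raw: dict) -> None:
    """First persistence point: store the raw record and flag it for work."""
    table[key] = {"raw": dict(raw), "flags": {"extract": True}}


def run_percolator(name: str, family: str,
                   work: Callable[[Row], dict],
                   next_flag: Optional[str] = None) -> List[str]:
    """Process rows whose flag `name` is set, writing into a dedicated family."""
    done = []
    for key, row in table.items():
        if not row["flags"].get(name):
            continue                        # the flag check replaces a work queue
        row[family] = work(row)             # each percolator owns its own family
        del row["flags"][name]
        if next_flag:
            row["flags"][next_flag] = True  # trigger the downstream percolator
        done.append(key)
    return done


put("doc1", {"body": "hadoop hbase hadoop"})
first = run_percolator("extract", "features",
                       lambda r: {"terms": sorted(set(r["raw"]["body"].split()))},
                       next_flag="index")
second = run_percolator("index", "index",
                        lambda r: {"n_terms": len(r["features"]["terms"])})
```

Each percolator writes only to its own family and clears only its own flag, so percolators with different I/O profiles do not contend for the same columns.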

Cyclic Dependency Graph, M7 Schema

Personalized Medicine
1. Select Tests
2. Draw Biosample
3. Genome Sequencing & Analysis
4. Reporting
5. Interpretation & Follow-up

Personalized Medicine Applications
• Pre-conception screening
• Clinical research & trials
– Drug re-targeting

• Therapeutics
– Companion diagnostics
– Therapy selection
Personalized Medicine
Diagram: Patient history (EHR), Insert (eventually), EHR archive, Genome Sample, Sequence extraction, Patient health context, query, Search (db), Ranked therapies

Here we do not see real-time data pushed to a persistence layer and processed offline. This pattern does

Personalized Medicine
Diagram: Patient history (EHR), Insert (eventually), EHR archive, Genome Sample, Sequence extraction, Patient health context, query, Search (db), Ranked therapies

User-based recommendation pattern

Recommendation in Classic Form
Diagram: Queue, History Archive, Recent history, query, User, Search (db), Ranked similar histories

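
One way to read this slide is as a user-based recommender: a recent history is compared online against archived histories and the most similar ones are returned. The sketch below assumes histories are sets of item IDs and uses Jaccard similarity. Both choices, and the names `jaccard` and `rank_similar_histories`, are illustrative assumptions rather than anything stated in the talk.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two histories: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar_histories(recent: Set[str],
                           archive: Dict[str, Set[str]],
                           top_n: int = 2) -> List[Tuple[str, float]]:
    """Online step: compare the recent history against every archived history."""
    scored = [(user, jaccard(recent, items)) for user, items in archive.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # best score first, ties by name
    return scored[:top_n]


archive = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "d"},
    "u3": {"e"},
}
ranked = rank_similar_histories({"a", "b"}, archive)
```

All comparisons happen at query time, so the cost grows with the size of the archive.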
Item-Based Recommendation in Classic Form
• Off-line analysis: Queue, History archive, Cooccurrence analysis, Item linkage
• Interactive recommendation: Recent history, query, Search (db), Ranked items

Recommendation Thoughts
• Item-based recommendation is for efficiency
– expensive step in computing co-occurrence can be
done offline and cached prior to a user query

• User-based recommendation is for accuracy
– user comparisons are done online to find the current
best recommendation

• MapR is great for recommendation
– M7 tables are high I/O performance, can eliminate
queues
– Faster archive updates with optimized MapReduce
– High-availability for mission LIFE critical applications
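
The item-based trade-off above can be sketched as an offline co-occurrence count followed by a cheap online lookup. Histories are assumed to be sets of item IDs and raw counts are used as scores. The names `build_item_linkage` and `recommend` are hypothetical, and a production system would likely use a weighted co-occurrence measure instead.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, Iterable, List, Set


def build_item_linkage(histories: Iterable[Set[str]]) -> Dict[str, Counter]:
    """Offline step: count how often each ordered pair of items co-occurs."""
    linkage: Dict[str, Counter] = defaultdict(Counter)
    for items in histories:
        for a, b in permutations(sorted(items), 2):
            linkage[a][b] += 1
    return linkage


def recommend(recent: Set[str], linkage: Dict[str, Counter], top_n: int = 2) -> List[str]:
    """Online step: sum the cached co-occurrence counts for the recent items."""
    scores: Counter = Counter()
    for item in recent:
        for other, count in linkage.get(item, {}).items():
            if other not in recent:
                scores[other] += count
    ordered = sorted(scores.items(), key=lambda p: (-p[1], p[0]))
    return [item for item, _ in ordered[:top_n]]


archive = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "d"}]
linkage = build_item_linkage(archive)   # cached ahead of any user query
ranked_items = recommend({"a"}, linkage)
```

The expensive pass over the archive happens once in `build_item_linkage`, and each query only reads the cached counts for the items in the recent history.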

Business Use Cases & Design Patterns
• Recommender – Personalized Medicine
• Percolator – Personalized Search
• Percolator – Personalized Medicine
• Percolator – Other Industry
• Pattern X – Health data
• Pattern X – Other Industry

Summary: Best Practices
• Look at the big picture
– Find recurring patterns

• Design systems at a high level
– Solve problems once and reuse components
– Increase R&D productivity
– Decrease operational and maintenance overhead

Thank You!

Allen Day, PhD
Principal Data Scientist, MapR Technologies
aday@maprtech.com, allenday@allenday.com
@allenday, @mapr

Weitere ähnliche Inhalte

Was ist angesagt?

20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopHortonworks
 
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelMoving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelDataWorks Summit
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalHortonworks
 
7 Ingredients to Create Real Value From Hadoop
7 Ingredients to Create Real Value From Hadoop7 Ingredients to Create Real Value From Hadoop
7 Ingredients to Create Real Value From HadoopActian Corporation
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopHortonworks
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
 
Pouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryPouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryDataWorks Summit
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at WalgreensDataWorks Summit
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark DataWorks Summit/Hadoop Summit
 
Making Bank Predictive and Real-Time
Making Bank Predictive and Real-TimeMaking Bank Predictive and Real-Time
Making Bank Predictive and Real-TimeDataWorks Summit
 
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad AnsersonUsing Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad AnsersonMapR Technologies
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopBigData Research
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...VMware Tanzu
 
The Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnThe Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnCloudera, Inc.
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...Revolution Analytics
 
Hortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks
 
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...Hortonworks
 

Was ist angesagt? (20)

20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache Hadoop
 
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelMoving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
7 Ingredients to Create Real Value From Hadoop
7 Ingredients to Create Real Value From Hadoop7 Ingredients to Create Real Value From Hadoop
7 Ingredients to Create Real Value From Hadoop
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
 
Pouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryPouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy Industry
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Making Bank Predictive and Real-Time
Making Bank Predictive and Real-TimeMaking Bank Predictive and Real-Time
Making Bank Predictive and Real-Time
 
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad AnsersonUsing Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoop
 
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
Pivotal HAWQ and Hortonworks Data Platform: Modern Data Architecture for IT T...
 
The Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnThe Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in Churn
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
The Modern Data Architecture for Predictive Analytics with Hortonworks and Re...
 
Hortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks and HP Vertica Webinar
Hortonworks and HP Vertica Webinar
 
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
 

Ähnlich wie 20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns

Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -MapR Technologies
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...MapR Technologies
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareMapR Technologies
 
Integrating Hadoop into your enterprise IT environment
Integrating Hadoop into your enterprise IT environmentIntegrating Hadoop into your enterprise IT environment
Integrating Hadoop into your enterprise IT environmentMapR Technologies
 
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus WebinarBuild and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus WebinarImpetus Technologies
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Precisely
 
Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise WeAreEsynergy
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big DataInfochimps, a CSC Big Data Business
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Precisely
 
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014MapR Technologies
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersRevolution Analytics
 
Meruvian - Introduction to MapR
Meruvian - Introduction to MapRMeruvian - Introduction to MapR
Meruvian - Introduction to MapRThe World Bank
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...DataStax Academy
 

Ähnlich wie 20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns (20)

Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShare
 
Integrating Hadoop into your enterprise IT environment
Integrating Hadoop into your enterprise IT environmentIntegrating Hadoop into your enterprise IT environment
Integrating Hadoop into your enterprise IT environment
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus WebinarBuild and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
 
Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Meruvian - Introduction to MapR
Meruvian - Introduction to MapRMeruvian - Introduction to MapR
Meruvian - Introduction to MapR
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
SciDB
SciDBSciDB
SciDB
 

20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns

  • 1. Design Patterns for Big Data Architecture: Best Strategies for Streamlined [Simple, Powerful] Design Allen Day, PhD Data Scientist, MapR Technologies October 2013 ©MapR Technologies - Confidential
  • 2. Me, Us. Allen Day, Principal Data Scientist, MapR: R contributor (10 yr), Hadoop (6 yr); Human Genetics (UCLA Medicine), Machine Learning. MapR: distributes open source components for Hadoop; adds major enhancements for performance, high availability, and ease of use. See also: “allenday” most places (Twitter, GitHub, etc.); aday@maprtech.com, allenday@allenday.com; @mapR.
  • 3. Three Business Use Cases: Personalized Search; Personalized Medicine; Market Segmentation.
  • 4. Three Business Use Cases. Personalized Search: public web index + personal search history; custom ranking of results. Personalized Medicine: patient medical history; genomic info; match against database of therapies. Market Segmentation: group similar customers; target with cross-sell / upsell campaign.
  • 5. Three Business Use Cases. Personalized Search: public web index + personal search history; custom ranking of results. Personalized Medicine: patient medical history; genomic info; match against database of therapies. Market Segmentation: group similar customers; target with cross-sell / upsell campaign. Labels: Personal data; Personal data; Marketing. Which ones are similar?
  • 7. Three Business Use Cases. Personalized Search: public web index + personal search history; custom ranking of results. Personalized Medicine: patient medical history; genomic info; match against database of therapies. Market Segmentation: group similar customers; target with cross-sell / upsell campaign. Labels: Personal data; Personal data; Marketing. How can you tell?
  • 8. But First… WHAT IS A DESIGN PATTERN?
  • 9. “a design pattern is a general reusable solution to a commonly occurring problem within a given context in software design. A design pattern is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations” http://en.wikipedia.org/wiki/Software_design_pattern
  • 10. Design Pattern Idea: a general reusable solution to a commonly occurring problem; not a finished design; not code; can be used in many different situations.
  • 11. History of SW Design Patterns: 1977 Architecture & Civil Engineering; 1994 OO Software Architecture; 2012 Parallelization Software; ? Application Parallelization.
  • 12. Not Just Software Designs. http://en.wikipedia.org/wiki/A-line
  • 13. Choose a Pattern: Volume & Velocity. Question 1: How big is your target data? (<10 GB; mid; >200 GB.) Question 2: How big is your query data? (Single element at a time; one pass over 100%; multiple passes over big chunks.) Question 3: How fast do you need a result? (Throughput > response; < 100 s, human scale.) Patterns: A ?; B Big storage; C Streaming; D Nearline Analytics; E Exploratory Analysis.
  • 14. Twitter Zeitgeist as a Composite of Design Patterns: live data source (e.g. Twitter Firehose); B Big storage; C Streaming; D Nearline Analytics; downstream applications.
  • 15. Big Data Application Shapes. 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)? Categories: Volume; Velocity; Variety; Intents & Methods.
  • 16. Percolation in Classic Form: real-time data source; real-time insertion; data store; offline percolation of recent data. Reference: “Large-scale Incremental Processing Using Distributed Transactions and Notifications”, http://research.google.com/pubs/pub36726.html
  • 17. Percolation in Classic Form: real-time data source; real-time insertion; data store; offline percolation of recent data. Contrast: real-time insertion; queue; delayed insertion; data store. Queued data are unavailable for action – not percolation.
  • 18. Percolation in Classic Form: real-time data source; real-time insertion; data store; offline percolation of recent data.
  • 19. Percolation of a Composite Store: real-time data source; real-time insertion; data store; offline percolation; index. Both parts visible.
  • 20. Market Segmentation: divide customers into subsets with common needs; design specific strategies for each subset; major emphasis on “fresh” data.
  • 21. Market Segmentation: real-time transactions; customer history; feature extraction; assign segment (search); db query; market segments; clustering. What does this have to do with percolation?
  • 22. Percolator 1: real-time transactions; customer history; feature extraction. Feature extraction is percolation because it is triggered by the arrival of a new record and because it updates that new record.
  • 23. Percolator 2: real-time transactions; customer history; assign segment (search); db query; market segments. Market segment assignment is percolation because it is triggered by the arrival of a new record and because only that record's segment is updated. What about the clustering?
  • 24. Scheduled Update - Not Percolation: customer history; clustering; market segments. The clustering loop is not percolation since it runs at fixed intervals instead of incrementally as updates are received. It also doesn't update just a single customer record.
  • 25. Personalized Search: observe web users’ activity over an extended period; understand individual user interests; customize search results for each user; …as fast as possible.
  • 26. Personal Search History and Web Index. Components: Search; Persona Activity; Persona Histories; Persona Index; Web Crawl; Doc Store; Doc Index. Edge labels: db query; update; trigger; query; feature extraction.
  • 27. Percolator 1: Web Crawl; feature extraction; Doc Store. Expensive feature extraction does not block document ingest.
  • 28. Percolators 2 and 3: Persona Activity; Persona Histories; update; Persona Index. Web Crawl; Doc Store; update; Doc Index.
  • 29. Percolator 4: Search; Persona Activity; db query; Persona Histories; update; Persona Index. Updates to personas trigger updates in related personas.
  • 30. Percolator 5?: Persona Index; Persona Histories; Doc Index; trigger; query; Search; db. Persona and doc index updates trigger a personalization refresh.
  • 32. Cyclic Dependency Graph
  • 33. Percolator Thoughts. M7 tables are great as the first persistence point in percolation. An in-memory flag column family works great for triggering updates: efficient (eliminates the need for queuing); fast triggering with row & column Bloom filters. Percolation is best supported by dedicated column families: percolators' I/O characteristics differ; M7 works especially well because it supports lots of column families.
  • 34. Cyclic Dependency Graph, M7 Schema
  • 35. Personalized Medicine: 1. Select Tests; 2. Draw Biosample; 3. Genome Sequencing & Analysis; 4. Reporting; 5. Interpretation & Follow-up.
  • 36. Personalized Medicine Applications: pre-conception screening; clinical research & trials (drug re-targeting); therapeutics (companion diagnostics, therapy selection).
  • 37. Personalized Medicine: patient history (EHR); EHR archive; insert (eventually); db; genome sample; sequence extraction; patient health context; query; Search; ranked therapies. Here we do not see real-time data pushed to a persistence layer and processed offline. This pattern does
  • 38. Personalized Medicine: patient history (EHR); EHR archive; insert (eventually); db; genome sample; sequence extraction; patient health context; query; Search; ranked therapies. User-based recommendation pattern.
  • 39. Recommendation in Classic Form: queue; history archive; db; recent history; query; User Search; ranked similar histories.
  • 40. Item-Based Recommendation in Classic Form: queue; history archive; co-occurrence analysis; off-line analysis; recent history; query; item linkage; db; Search; interactive recommendation; ranked items.
  • 41. Recommendation Thoughts. Item-based recommendation is for efficiency: the expensive step in computing co-occurrence can be done offline and cached prior to a user query. User-based recommendation is for accuracy: user comparisons are done online to find the current best recommendation. MapR is great for recommendation: M7 tables are high I/O performance, can eliminate queues; faster archive updates with optimized MapReduce; high availability for mission LIFE critical applications.
  • 42. Business Use Cases & Design Patterns: Recommender – Personalized Medicine; Pattern X – Health data; Percolator – Personalized Search; Percolator – Other Industry; Percolator – Personalized Medicine; Pattern X – Other Industry.
  • 43. Summary: Best Practices. Look at the big picture: find recurring patterns. Design systems at a high level: solve problems once and reuse components; increase R&D productivity; decrease operational and maintenance overhead.
  • 44. Thank You! Allen Day, PhD, Principal Data Scientist, MapR Technologies. aday@maprtech.com, allenday@allenday.com; @allenday, @mapr
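
The split described in slides 40 and 41 (an expensive co-occurrence step done offline, then a cheap lookup at query time) can be sketched roughly as below. This is only an illustration in plain Python, not the deck's implementation: the function names `cooccurrence` and `recommend`, the in-memory dictionaries standing in for the "item linkage" db, and the toy histories are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Offline step: count how often each pair of items shares a history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories:
        # De-duplicate within a history so repeats do not inflate the counts.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, recent, k=3):
    """Online step: rank unseen items by summed co-occurrence with recent items."""
    scores = defaultdict(int)
    for item in recent:
        for other, n in counts.get(item, {}).items():
            if other not in recent:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


# Hypothetical history archive; the linkage table would be cached before queries arrive.
histories = [["a", "b", "c"], ["a", "b"], ["b", "c", "d"]]
linkage = cooccurrence(histories)
print(recommend(linkage, ["a"]))
```

Only `recommend` runs at query time, which is the efficiency argument the slide makes for the item-based form; a user-based variant would instead compare whole histories online.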

Editor's Notes

  1–5. Shapes too big; they overwhelm. I would describe three projects by short name; then add three distinct shapes, making two hearts since both are healthcare; start with all line drawings; too distracting to be in color.
  6–9. Talk track: Both genotyping and market segmentation solutions have a useful design component known as percolation. The key idea is that there is a fast push to store data and an offline processing step that modifies data. The modified data could go back to the same data store or…. Speaker: you might note that we show real-time steps in red and non-real-time steps in black.
  10. Talk track: In market segmentation, you want to identify useful segments of your customer base to target for a marketing campaign, for retention, for specific product offerings, etc. What makes “good” segments depends on what you want to do and how the environment changes. You may not know ahead of time which categories make useful segments. One way to find them is to capture customer histories and run a clustering step to discover and define the market segments. This market segment database is then queried and updated in response to new real-time data insertion or new rounds of clustering. Extracting specific features from the customer history persistence layer may also be a useful step.
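As a rough sketch of the clustering step, the following uses plain k-means on two made-up customer features. The talk does not say which clustering algorithm or features were used, so k-means, the function names `kmeans` and `assign_segment`, and the sample data are all assumptions for illustration. Centroids are seeded from the first `k` points to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


def assign_segment(features, centroids):
    """Look up the nearest existing segment for a newly inserted customer."""
    return min(range(len(centroids)), key=lambda c: math.dist(features, centroids[c]))


# Hypothetical features: (orders per year, average order value)
customers = [(1, 20.0), (2, 25.0), (30, 400.0), (28, 380.0), (1, 22.0)]
centroids, assignment = kmeans(customers, k=2)
new_customer_segment = assign_segment((27, 350.0), centroids)
```

Here `kmeans` stands in for the scheduled discovery step, while `assign_segment` is the cheap lookup that can run whenever a new customer history arrives.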
  11. Talk track: the feature extraction step could be triggered by real-time data insertion…
  12. Talk track: a second percolator processes new customer histories relative to the market segments.
  13. Talk track: the clustering step is not triggered by the real-time insertion; it is a scheduled step and thus not an example of percolation. What about the other use case we said was similar, genotyping?
  14. Here, we trigger updates to the persona index based on EITHER updates to the persona history OR updates to the document index. The idea is that if enough docs have changed, or personas are finding “unusual” stuff, the persona is stale and we should recompute it.
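The either/or trigger above can be written as a simple staleness check. This is a minimal sketch: the notes do not define what counts as "enough" changes, so the counters, the `PersonaState` class, and both threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PersonaState:
    history_updates: int = 0  # persona history events since the last recompute
    docs_changed: int = 0     # document index changes since the last recompute


def is_stale(state, history_threshold=5, docs_threshold=100):
    # Either trigger alone is enough to schedule a recompute.
    # Both thresholds are illustrative values, not taken from the talk.
    return (state.history_updates >= history_threshold
            or state.docs_changed >= docs_threshold)


fresh = PersonaState(history_updates=1, docs_changed=10)
stale_by_history = PersonaState(history_updates=6, docs_changed=0)
stale_by_docs = PersonaState(history_updates=0, docs_changed=250)
```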
  15. Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point, or even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): (a) less risk of the delays, I/O storms, etc. that can happen with HBase, which is VERY important when pushing real-time data to a data store; (b) the strategic advantage of using in-memory flags on column families – very efficient in M7, where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
  16. Best practice: use one column family per percolator so that each percolator's independent I/O characteristics can be managed separately. This helps prevent I/O storms.
  17. Talk track: Now let’s consider the other health data example, genome sequencing for personalized medicine. This is an approach that can be used to get the particular genomic characteristics of a cancerous tumor and compare to known patient histories in order to select the best option for a customized therapy.
  18. Talk track: While percolation is not used in this example, it does represent a specialized form of recommendation: user-based recommendation. In this genome sequencing / personalized medicine example, a very high bar is set for the accuracy of the recommendation, so a user-based pattern is best. Let’s look at the generalized form…
  19. Talk track: here is the basic pattern for user-based recommendation, as used in the real use case of personalized medicine. In contrast, in consumer recommendation for shopping, movies, or music, rapid response is key and accuracy is slightly less important. There, item-based recommendation is generally best, because the expensive step of computing co-occurrence can be done offline, prior to a user query.
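To make the contrast concrete, here is a small sketch of both patterns over made-up histories. It is illustrative only: the data, the function names, and the choice of Jaccard similarity for the user-based variant are assumptions, not details from the talk. The item-based path precomputes co-occurrence counts once, while the user-based path compares the query against every stored history at query time.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical histories: user -> set of items (purchases, or observed variants)
histories = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "d"},
}


def cooccurrence(histories):
    """Offline step: count how often each pair of items appears together."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for x, y in combinations(sorted(items), 2):
            counts[x][y] += 1
            counts[y][x] += 1
    return counts


def recommend_item_based(user_items, counts, n=2):
    """Query-time step: sum precomputed counts, which is cheap per request."""
    scores = defaultdict(int)
    for item in user_items:
        for other, c in counts[item].items():
            if other not in user_items:
                scores[other] += c
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(user_items, histories, n=2):
    """Query-time step: compare against every stored history (more expensive)."""
    scores = defaultdict(float)
    for other_items in histories.values():
        sim = jaccard(user_items, other_items)
        if sim == 0:
            continue  # no overlap, so this history contributes nothing
        for item in other_items - user_items:
            scores[item] += sim
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


counts = cooccurrence(histories)  # done ahead of time in the item-based pattern
item_based = recommend_item_based({"a"}, counts)
user_based = recommend_user_based({"a"}, histories)
```

On this tiny data set both approaches rank the same items; the difference the notes point to is where the expensive work happens, not the output on a toy example.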
  20. Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point, or even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): (a) less risk of the delays, I/O storms, etc. that can happen with HBase, which is VERY important when pushing real-time data to a data store; (b) the strategic advantage of using in-memory flags on column families – very efficient in M7, where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.