Copyright 2017 Trend Micro Inc.
Data Lake:
centralize in on-prem vs. decentralize on cloud
Jeff Hung / Trend Micro
Sep 30, 2017
@jeffhung
• Smart Protection Network (SPN)
• Big-data Platform & Solution
• Hadoop PROD since 2009
• Work on AWS/EMR since 2014
Malware Explosion
500,000,000 total malware in 2016
From data to solution
• Data sources: Web Crawler, Trend Micro Endpoint Protection, Trend Micro Mail Protection, Trend Micro Web Protection, Honeypot, CDN / xSP, Researcher Intelligence
• Solution: Threat Protections
[Figure: Regional distribution of ransomware threats from January 2016 to March 2017]
2009
~40 Hadoop nodes
~15 Service/user accounts
3 Teams
<50 TB storage
<100 Jobs per day
2014
~250 Hadoop nodes
~140 Service/user accounts
13 Teams
~1,500 TB storage
>16,000 Jobs per day
The SPN Journey
[Timeline: 2009, 2014, 2017; from on-prem (Data Center) to on-cloud (AWS)]
Why cloud?
• Scalability
• Agility
• Cost
Big-data on the Cloud
EMR
Analytic Engine / Data Lake
Data Lake
“If you think of a datamart as a store of bottled water – cleansed and
packaged and structured for easy consumption – the data lake is a large
body of water in a more natural state. The contents of the data lake
stream in from a source to fill the lake, and various users of the lake can
come to examine, dive in, or take samples.”
– James Dixon, CTO of Pentaho
Data Lake
Examine / Dive in / Take samples
Stream in / Dump in
Data Lake fixes 2 problems
• Unknown Questions
• Information Silos
Unknown Questions
• You don’t know what you don’t know
• Premature optimization is the root of all evil
Data Warehouse/Mart vs. Data Lake
            Data Warehouse/Mart                Data Lake
DATA        structured and preprocessed data   structured / semi-structured / unstructured / raw data
PROCESSING  schema-on-write                    schema-on-read
STORAGE     expensive for large data volume    designed for low-cost storage
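The schema-on-read row above can be illustrated with a minimal Python sketch. It assumes the raw data is stored as one JSON object per line; the sample records and the `read_with_schema` helper are hypothetical and only show the idea that each consumer applies its own schema when reading, while the stored data stays untouched.

```python
import json

# Raw events kept as-is in the lake (hypothetical sample data).
RAW_LINES = [
    '{"ts": "2017-09-30T10:00:00", "sha1": "ab12", "verdict": "malware", "region": "TW"}',
    '{"ts": "2017-09-30T10:00:05", "sha1": "cd34", "verdict": "clean"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, None when absent."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two consumers ask different questions of the same raw data.
by_verdict = list(read_with_schema(RAW_LINES, ["sha1", "verdict"]))
by_region = list(read_with_schema(RAW_LINES, ["sha1", "region"]))
print(by_verdict)
print(by_region)
```

Because nothing is dropped at write time, a question nobody anticipated (here, the regional view) can still be answered later from the same files.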
Information Silos
• Incompatible data systems built by different teams
– Conway’s Law & NIH Syndrome
– Technology limitations
• Hadoop comes to the rescue
– Free technology
– Free computing
– Free storage
But...
– Cost justification
– Development agility
Data Lake fixes 2 problems – How?
• Unknown Questions → Preserve low-level visibility
• Information Silos → Make accessing data easy
We have done both well in our on-premises datacenter.
Now we need to make them work on the cloud, too.
On-Prem vs. On Cloud
On-Prem                          On Cloud
Centralized Infrastructure       Decentralized on Cloud Services
One Big Hadoop Cluster           Many EMR Clusters / S3 Buckets
Fine-grain Resource Sharing      Go Dutch & Pay-as-you-Go
Single Big-data Stack            Diverse Technology
UNIX Style Permission Control    Cloud Compatible ACL
Things that apply to both scenarios: File Format & Schema Design
Centralized Infrastructure
• Have only one datacenter
• PROD/STG in same place
• Jobs & data in same cluster
• Do everything on our own
Decentralized on Cloud Services
• Could use multiple AWS regions
• PROD/STG in different regions
• Jobs & data in different places
• Could leverage cloud services
Access data in one place vs. Access data from anywhere
One Big Hadoop Cluster
• Runs all applications
in same big cluster
• Bigger cluster = better flexibility
• Resource managed by YARN
Many EMR Clusters / S3 Buckets
• Each application runs
in its own EMR cluster
• Specific cluster = better flexibility
• Resource managed by AWS
Work in shared place vs. Work in my place
Fine-grain Resource Sharing
• Finer granularity by CPU/Mem
• Centralized budget account
• Do usage statistics on our own
Go Dutch & Pay-as-you-Go
• Granularity at the EMR cluster level
• Budget in each application team
• AWS provides billing reports
Finer usage management vs. Manage my own works
Single Big-data Stack
• Tend to select best practices
that are well tested
• Other technology brings trouble
• Easy to optimize from infra-level
• Access through HDFS interface
Diverse Technology
• Tend to allow using different
AWS services
• Technologies covered by AWS
• Infra optimization relies on AWS
• Different access mechanism
Use verified best practice vs. Use tools I like
UNIX Style Permission Control
• UNIX style “file” permissions
• Tend to make data lake
read-only to everybody
• No widely adopted encryption
Cloud Compatible ACL
• IAM-based ACL policies
• Allow granting access rights
at the dataset level
• S3 provides native encryption
Simple that just works vs. Complex but rich
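As an illustration of dataset-level access rights with IAM, the following Python sketch builds a read-only policy document for one dataset stored under an S3 prefix. The bucket name, the prefix, and the `dataset_read_policy` helper are hypothetical; only the policy grammar and the `s3:ListBucket` / `s3:GetObject` actions are standard IAM. It is a sketch of the idea, not the policy layout used in the talk.

```python
import json


def dataset_read_policy(bucket, dataset_prefix):
    """Build an IAM policy document granting read-only access to one dataset prefix."""
    prefix = dataset_prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed by prefix.
                "Sid": "ListDataset",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading objects is limited to keys under the dataset prefix.
                "Sid": "ReadDataset",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Hypothetical bucket and dataset names.
policy = dataset_read_policy("example-data-lake", "feedback-logs/")
print(json.dumps(policy, indent=2))
```

Compared with UNIX-style permissions, one such document per dataset and team is more to manage, which is the "complex but rich" trade-off named above.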
Make Accessing Data Easy
On-prem in datacenter
• Raw data is read-only to
everybody
• Canonical software stack with
best practices
• Infrastructure planned ahead by a
strong team
On-cloud in AWS
• Simplify permission granting
process
• Encourage leveraging AWS
services
• Pay-as-you-Go by every team
involved
File Format & Schema Design Considerations
• Size reduction, to lower storage consumption
• Read/write performance, to speed up computing
• Data characteristics, to hold complex data structures
• Schema evolution, as the schema changes from time to time
• Tool interoperability, for different kinds of access
Data Characteristics
• The characteristics of the data schema
– What will not work?
– The mitigations
Fundamental characteristics and principles for “logs”:
• Records shall be independent – do not reference other records
• Records shall be self-contained – repeat info from the path inside the record
• The existence of a record means something; its absence may mean something else
These can only be documented → need a schema portal
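The "self-contained" principle can be sketched in Python as follows. The sketch assumes, purely for illustration, that files are laid out under `key=value` path segments; the path, the field names, and the `make_self_contained` helper are hypothetical. The point is that a record copied out of its directory still carries the information its path used to imply.

```python
def make_self_contained(record, path):
    """Copy key=value path segments into the record so it can be read without its path."""
    enriched = dict(record)
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            # Never overwrite what the record already states about itself.
            enriched.setdefault(key, value)
    return enriched


# Hypothetical storage path and record.
path = "logs/product=endpoint/date=2017-09-30/part-00000"
record = {"sha1": "ab12", "verdict": "malware"}
print(make_self_contained(record, path))
```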
Data Characteristics – Recursion
• Avoid recursive schema
• Recursive schema?
Bad Design:
ProcessInfo {
    process_id: long,
    image_path: chararray,
    image_sha1: bytearray,
    is_detected: int,
    action: long,
    action_result: {
        terminate: chararray
    },
    children_processes: [
        ProcessInfo
    ]
}
root_process: ProcessInfo;
Data Characteristics – Recursion
• Mitigation: array and index/id
Example process_infos array (other fields omitted):
index  process_id  parent_process_id
0      652         360
1      696         652
2      952         696
3      828         696
Good Design:
ProcessInfo {
    process_id: long,
    image_path: chararray,
    image_sha1: bytearray,
    is_detected: int,
    action: long,
    action_result: {
        terminate: chararray
    },
    parent_process_id: long
}
process_infos: [ ProcessInfo ];
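The array-and-id mitigation can be sketched in Python: a nested process tree (the recursive shape to avoid) is flattened into a list of records that refer to their parent by id. The `flatten_processes` helper and the dict layout are assumptions made for illustration; the process ids are the ones shown on the slide.

```python
def flatten_processes(root, parent_id=None, out=None):
    """Flatten a nested process tree into a list of records linked by parent_process_id."""
    if out is None:
        out = []
    # Copy every field except the recursive one.
    record = {k: v for k, v in root.items() if k != "children_processes"}
    record["parent_process_id"] = parent_id
    out.append(record)
    for child in root.get("children_processes", []):
        flatten_processes(child, root["process_id"], out)
    return out


# Nested (recursive) form of the slide's example.
tree = {
    "process_id": 652,
    "children_processes": [
        {
            "process_id": 696,
            "children_processes": [
                {"process_id": 952, "children_processes": []},
                {"process_id": 828, "children_processes": []},
            ],
        },
    ],
}

process_infos = flatten_processes(tree, parent_id=360)
for index, info in enumerate(process_infos):
    print(index, info["process_id"], info["parent_process_id"])
```

The flat list has a fixed, non-recursive schema, so the tree can still be rebuilt by a consumer when needed while the stored data stays loadable by tools that reject recursive types.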
Data Characteristics – Tabular or Nested?
• Depends on data nature
– Raw data like feedback logs are often nested
– Intermediate data are often tabular
• File Format Capability
• Strategy
– Parquet for nested data
– ORC for tabular data
[Figure: file formats on a spectrum from "good at nested data" to "good at tabular data": Protobuf, Parquet, ORC, key/value pairs]
Data Characteristics – Custom Key Name
• Avoid custom key names for KV pairs
– Runtime-determined key names are hard to process
– Hard to parallelize key enumeration (tool constraint)
Avoid:
{ "key1": "value1", "key2": "value2", ... }
• Mitigation
[
{ "name": "key1", "value": "value1" },
{ "name": "key2", "value": "value2" },
...
]
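A minimal Python sketch of this mitigation: a map with runtime-determined keys is rewritten as a list of records whose field names (`name`, `value`) are fixed, so the schema no longer depends on the data. The two helper functions are hypothetical names introduced for the example.

```python
def to_name_value_pairs(mapping):
    """Turn {"key1": "value1", ...} into [{"name": "key1", "value": "value1"}, ...]."""
    return [{"name": key, "value": value} for key, value in mapping.items()]


def from_name_value_pairs(pairs):
    """Inverse conversion, for consumers that want the map back."""
    return {pair["name"]: pair["value"] for pair in pairs}


custom = {"key1": "value1", "key2": "value2"}
pairs = to_name_value_pairs(custom)
print(pairs)
```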
Data Characteristics – Not All JSON Works
• In summary, not all data representable in JSON is easy to
process using big-data tools & other formats
• Design the schema carefully if you choose JSON
Characteristic      Parquet  ORC      JSON
Recursion           No       Yes      Yes
Tabular vs. Nested  Nested   Tabular  Yes
Custom Key Name     No       No       Yes
Schema Evolution
• Schema evolves from time to time
– Backward compatibility is not as simple as imagined
– In reality, each role will have multiple versions
• Roles: Data Provider → Storer → Data in Storage (HDFS or S3) → Loader → Data Consumer
Year 1 – Only one version in the universe
Year 2 – Second version appears while v1 still exists
Year 3 – Collect v3 data in advance due to schedule or to accumulate data
Year 4 – Data consumer catches up and starts using v3 data
Year  Data Provider  Storer        Data     Loader     Data Consumer
1     Product/v1     Validator/v1  Data/v1  Loader/v1  Service1/v1
2     Product/v1     Validator/v2  Data/v1  Loader/v2  Service1/v1
2     Product/v2     Validator/v2  Data/v2  Loader/v2  Service2/v2
3     Product/v1     Validator/v3  Data/v1  Loader/v2  Service1/v1
3     Product/v2     Validator/v3  Data/v2  Loader/v2  Service2/v2
3     Product/v3     Validator/v3  Data/v3  Loader/v2  -
4     Product/v1     Validator/v3  Data/v1  Loader/v3  Service1/v1
4     Product/v2     Validator/v3  Data/v2  Loader/v3  -
4     Product/v3     Validator/v3  Data/v3  Loader/v3  Service2/v3
Schema Evolution – Requirements
• File format requirements to enable schema evolution
– Mostly a “Loader” requirement, for Pig and Spark
[Figure: a processing program loads data files written with the previous, current, or future schema into an internal structure that follows the current schema, processes it, and saves a data file with the current schema. On load: missing fields = default value or NULL; unknown fields = invisible or ignored.]
• Requirements
– Can load current version
– Can load previous version
– Can load future version
– Store as original version when collecting data
Schema Evolution – Requirements
• The internal structure of loaded data in Pig and Spark cannot preserve the entire original structure
• Write data ingestion tools in Java whenever possible
Support by language and file format (v = supported, x = not supported):
Language       Java                   Pig                    Spark
Requirement    SF+PB  Parquet  ORC    SF+PB  Parquet  ORC    Parquet  ORC
Load Previous  v      v        v      v      v        v      v        v
Load Current   v      v        v      v      v        v      v        v
Load Future    v      v        v      v      v        v      v        v
Save Original  v      v        v      x      x        x      x        x
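The load rules (missing fields become a default or NULL, unknown fields are ignored) and the reason "Save Original" is hard can be sketched in Python. This is a toy loader under stated assumptions, not how Pig, Spark, or any file format implements it; `CURRENT_SCHEMA`, its field names, and the `load` helper are hypothetical.

```python
# Hypothetical current schema: field name -> default used when the field is missing.
CURRENT_SCHEMA = {"process_id": 0, "image_path": "", "is_detected": None}


def load(record, schema=CURRENT_SCHEMA):
    """Project a record of any version onto the current schema.

    Missing fields take the default (or None); unknown fields are dropped.
    """
    return {name: record.get(name, default) for name, default in schema.items()}


previous = {"process_id": 652}                                   # older writer
future = {"process_id": 696, "image_path": "a.exe",
          "is_detected": 1, "new_field": "x"}                   # newer writer

loaded_previous = load(previous)
loaded_future = load(future)
print(loaded_previous)
print(loaded_future)
```

Loading works for all three versions, but `new_field` is gone from `loaded_future`. Writing the loaded structure back out therefore cannot reproduce the original record, which is why an ingestion tool that must store data in its original version has to keep the record as received instead of round-tripping it through the loader's internal structure.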
Preserve low-level visibility
• Choose file format based on usage scenario
• Review schema to avoid bad design
Wrap-up
• Preserve low-level visibility
– Choose file format wisely
– Design schema carefully
• Make accessing data easy
– On-prem & on-cloud are different
– Do something to lower the barrier
Any Questions?

 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud

  • 1. Copyright 2017 Trend Micro Inc. Data Lake: centralize in on-prem vs. decentralize on cloud. Jeff Hung / Trend Micro, Sep 30, 2017
  • 2. @jeffhung • Smart Protection Network (SPN) • Big-data Platform & Solution • Hadoop PROD since 2009 • Work on AWS/EMR since 2014
  • 3. Malware Explosion: 500,000,000 total malware in 2016
  • 4. From data to solution • Web Crawler • Trend Micro Endpoint Protection • Trend Micro Mail Protection • Trend Micro Web Protection • Honeypot • CDN / xSP • Researcher Intelligence • Threat Protections • [Figure: Regional distribution of ransomware threats from January 2016 to March 2017]
  • 5. 2009: ~40 Hadoop nodes • ~15 Service/user accounts • 3 Teams • <50 TB storage • <100 Jobs per day
  • 6. 2014: ~250 Hadoop nodes • ~140 Service/user accounts • 13 Teams • ~1,500 TB storage • >16,000 Jobs per day
  • 7. The SPN Journey (2009, 2014, 2017): on-prem (Data Center) and on-cloud (AWS) • Why cloud? Scalability, Agility, Cost
  • 8. Big-data on the Cloud: EMR
  • 9. Analytic Engine • Data Lake
  • 10. Data Lake
  • 11. Data Lake • “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” – James Dixon, CTO of Pentaho • Stream in / Dump in • Examine / Dive in / Take samples
  • 12. Data Lake fixes 2 problems • Unknown Questions • Information Silos
  • 13. Unknown Questions • You don’t know what you don’t know • Premature optimization is the root of all evil
       Data Warehouse/Mart vs. Data Lake:
       DATA: structured and preprocessed data vs. structured / semi-structured / unstructured / raw data
       PROCESSING: schema-on-write vs. schema-on-read
       STORAGE: expensive for large data volume vs. designed for low-cost storage
  • 14. Information Silos • Incompatible data systems built by different teams – Conway’s Law & NIH Syndrome – Technology limitations • Hadoop comes to the rescue – Free technology – Free computing – Free storage • But... – Cost justification – Development agility
  • 15. Data Lake fixes 2 problems – How? • Unknown Questions → Preserve low-level visibility • Information Silos → Make accessing data easy • We’ve done these right in our on-premises datacenter. Now we need to make them work on the cloud, too.
  • 16. On-Prem vs. On Cloud
       Centralized Infrastructure: One Big Hadoop Cluster • Fine-grain Resource Sharing • Single Big-data Stack • UNIX Style Permission Control
       Decentralized on Cloud Services: Many EMR Clusters / S3 Buckets • Go Dutch & Pay-as-you-Go • Diverse Technology • Cloud Compatible ACL
       Things that apply to both scenarios: File Format & Schema Design
  • 17. Access data in one place vs. access data from anywhere
       Centralized Infrastructure: Have only one datacenter • PROD/STG in same place • Jobs & data in same cluster • Do everything on our own
       Decentralized on Cloud Services: Could use multiple AWS regions • PROD/STG in different regions • Jobs & data in different places • Could leverage cloud services
  • 18. Work in shared place vs. work in my place
       One Big Hadoop Cluster: Runs all applications in the same big cluster • Bigger cluster = better flexibility • Resources managed by YARN
       Many EMR Clusters / S3 Buckets: Each application runs in its own EMR cluster • Specific cluster = better flexibility • Resources managed by AWS
  • 19. Finer usage management vs. manage my own work
       Fine-grain Resource Sharing: Finer granularity by CPU/Mem • Centralized budget account • Do usage statistics on our own
       Go Dutch & Pay-as-you-Go: Granularity at the EMR level • Budget in each application team • AWS provides billing reports
  • 20. Use verified best practice vs. use tools I like
       Single Big-data Stack: Tend to select best practices that are well tested • Other technology brings trouble • Easy to optimize from infra level • Access through HDFS interface
       Diverse Technology: Tend to allow using different AWS services • Technologies covered by AWS • Infra optimization relies on AWS • Different access mechanisms
  • 21. Simple and just works vs. complex but rich
       UNIX Style Permission Control: UNIX style “file” permissions • Tend to make data lake read-only to everybody • No widely adopted encryption
       Cloud Compatible ACL: IAM-based ACL policies • Allow granting access rights at the dataset level • S3 provides native encryption
  • 22. Make Accessing Data Easy
       On-prem in datacenter: Raw data is read-only to everybody • Canonical software stack with best practices • Plan the infra ahead with a strong team
       On-cloud in AWS: Simplify permission granting process • Encourage leveraging AWS services • Pay-as-you-Go by every involved team
  • 23. File Format & Schema Design Considerations • Size reduction to lower storage consumption • Read/write performance to speed up computing • Data characteristics to hold complex data structures • Schema evolution, since schemas change from time to time • Tool interoperability for different kinds of access
  • 24. Data Characteristics • The characteristics of the data schema – What will not work? – The mitigations • Fundamental characteristics and principles for “logs”: – Records shall be independent: don’t reference other records – Records shall be self-contained: repeat info in path – A record’s existence means something; its absence may mean something else • These can only be documented → need a schema portal
  • 25. Data Characteristics – Recursion • Avoid recursive schema • Recursive schema? Bad design:
         ProcessInfo {
           process_id: long,
           image_path: chararray,
           image_sha1: bytearray,
           is_detected: int,
           action: long,
           action_result: { terminate: chararray },
           children_processes: [ ProcessInfo ]
         }
         root_process: ProcessInfo;
  • 26. Data Characteristics – Recursion • Mitigation: array and index/id. Good design:
         ProcessInfo {
           process_id: long,
           image_path: chararray,
           image_sha1: bytearray,
           is_detected: int,
           action: long,
           action_result: { terminate: chararray },
           parent_process_id: long
         }
         process_infos: [ ProcessInfo ];
       Example entries of process_infos:
         index 0: process_id 652, parent_process_id 360
         index 1: process_id 696, parent_process_id 652
         index 2: process_id 952, parent_process_id 696
         index 3: process_id 828, parent_process_id 696
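
The "array and index/id" mitigation on slide 26 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `flatten_process_tree` and its `root_parent_id` parameter are hypothetical, records are plain dictionaries, and only the `process_id`, `children_processes` and `parent_process_id` fields from the slides are used.

```python
from typing import Any, Dict, List, Optional


def flatten_process_tree(
    root: Dict[str, Any], root_parent_id: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Turn a recursive ProcessInfo tree into a flat list of records.

    Each output record drops `children_processes` and instead carries a
    `parent_process_id` that points at its parent (hypothetical helper).
    """
    flat: List[Dict[str, Any]] = []
    stack = [(root, root_parent_id)]
    while stack:
        node, parent_id = stack.pop()
        record = {k: v for k, v in node.items() if k != "children_processes"}
        record["parent_process_id"] = parent_id
        flat.append(record)
        # Push children in reverse so they are emitted in their original order.
        for child in reversed(node.get("children_processes", [])):
            stack.append((child, node["process_id"]))
    return flat


# Recursive (bad design) input, using the process ids shown on the slide.
tree = {
    "process_id": 652,
    "children_processes": [
        {"process_id": 696, "children_processes": [
            {"process_id": 952, "children_processes": []},
            {"process_id": 828, "children_processes": []},
        ]},
    ],
}

process_infos = flatten_process_tree(tree, root_parent_id=360)
for index, info in enumerate(process_infos):
    print(index, info["process_id"], info["parent_process_id"])
```

The flat list has a fixed, non-recursive schema, and the tree can still be rebuilt later by grouping records on `parent_process_id`.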
  • 27. Data Characteristics – Tabular or Nested? • Depends on data nature – Raw data like feedback logs are often nested – Intermediate data are often tabular • File format capability – Protobuf: key/value pairs – Parquet: good at nested data – ORC: good at tabular data • Strategy – Parquet for nested data – ORC for tabular data
  • 28. Data Characteristics – Custom Key Name • Avoid custom key names for KV pairs – Runtime-determined key names are hard to process – Hard to parallelize key enumeration (tool constraint) • Mitigation: instead of
         { "key1": "value1", "key2": "value2", ... }
       store
         [ { "name": "key1", "value": "value1" }, { "name": "key2", "value": "value2" }, ... ]
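
As a rough sketch of the mitigation on slide 28, the conversion between the two shapes is mechanical. The helper names `to_pairs` and `from_pairs` are hypothetical and not from the talk; the field names `name` and `value` are the ones shown on the slide.

```python
from typing import Any, Dict, List


def to_pairs(mapping: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert an object with runtime-determined keys into a list of
    fixed-schema {"name", "value"} entries (hypothetical helper)."""
    return [{"name": key, "value": value} for key, value in mapping.items()]


def from_pairs(pairs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Inverse conversion, for consumers that want a dictionary back."""
    return {pair["name"]: pair["value"] for pair in pairs}


custom_keys = {"key1": "value1", "key2": "value2"}
pairs = to_pairs(custom_keys)
print(pairs)
```

With the list form, every entry has the same two field names, so the schema no longer depends on keys that are only known at runtime.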
  • 29. Data Characteristics – Not All JSON Works • In summary, not all data representable in JSON is easy to process with big-data tools and other formats • Design the schema carefully if you choose JSON
       Characteristic       Parquet   ORC       JSON
       Recursion            No        Yes       Yes
       Tabular vs. Nested   Nested    Tabular   Yes
       Custom Key Name      No        No        Yes
  • 30. Schema Evolution • Schema evolves from time to time – Backward compatibility is not as simple as imagined – In reality each role will have multiple versions • Roles: Data Provider, Storer, Storage (HDFS or S3), Loader, Data Consumer
  • 31. Diagram labels: Data Provider, Storer, Data, Loader, Data Consumer
       Year 1 – Only one version in the universe
         Product/v1, Validator/v1, Data/v1, Loader/v1, Service1/v1
       Year 2 – Second version appeared while v1 still exists
         Product/v1, Validator/v2, Data/v1, Loader/v2, Service1/v1
         Product/v2, Validator/v2, Data/v2, Loader/v2, Service2/v2
       Year 3 – Collect v3 data in advance due to schedule or to accumulate data
         Product/v1, Validator/v3, Data/v1, Loader/v2, Service1/v1
         Product/v2, Validator/v3, Data/v2, Loader/v2, Service2/v2
         Product/v3, Validator/v3, Data/v3, Loader/v2
       Year 4 – Data consumers catch up and start using v3 data
         Product/v1, Validator/v3, Data/v1, Loader/v3, Service1/v1
         Product/v2, Validator/v3, Data/v2, Loader/v3
         Product/v3, Validator/v3, Data/v3, Loader/v3, Service2/v3
  • 32. Schema Evolution – Requirements • File format requirements to enable schema evolution – Mostly a “Loader” requirement, for Pig and Spark • Requirements – Can load current version – Can load previous version – Can load future version • Diagram: the processing program loads data files in the previous, current, or future schema into an internal structure in the current schema, processes it, and saves a data file in the current schema • Missing fields = default value or NULL • Unknown fields = invisible or ignored
  • 33. Schema Evolution – Requirements • File format requirements to enable schema evolution – Mostly a “Loader” requirement, for Pig and Spark • Requirements – Can load current version – Can load previous version – Can load future version – Store as original version when collecting data • Diagram: the processing program loads data files in the previous, current, or future schema into an internal structure in the current schema, processes it, and saves data files in the previous, current, or future schema • Missing fields = default value or NULL • Unknown fields = invisible or ignored
  • 34. Schema Evolution – Requirements • File format requirements to enable schema evolution – Mostly a “Loader” requirement, for Pig and Spark • Requirements – Can load current version – Can load previous version – Can load future version – Store as original version • The internal structure of loaded data in Pig and Spark cannot preserve the entire original structure • Write data ingestion tools in Java whenever possible
       (SF+PB = Sequence File + Protobuf; v = supported, x = not supported)
       Language        Java    Java      Java   Pig     Pig       Pig   Spark     Spark
       Format          SF+PB   Parquet   ORC    SF+PB   Parquet   ORC   Parquet   ORC
       Load Previous   v       v         v      v       v         v     v         v
       Load Current    v       v         v      v       v         v     v         v
       Load Future     v       v         v      v       v         v     v         v
       Save Original   v       v         v      x       x         x     x         x
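
The loading rules on slides 32 to 34 (missing fields become a default value or NULL, unknown fields are ignored) can be illustrated with plain Python dictionaries. This is a minimal sketch under assumptions, not how Pig, Spark, Parquet or ORC implement it: `CURRENT_SCHEMA`, `load_record`, the sample file paths and the chosen defaults are all hypothetical, and the field names are borrowed from the ProcessInfo example.

```python
from typing import Any, Dict

# Hypothetical "current" schema: field name -> default used when the field is missing.
CURRENT_SCHEMA: Dict[str, Any] = {
    "process_id": None,
    "image_path": None,
    "is_detected": 0,
}


def load_record(raw: Dict[str, Any], schema: Dict[str, Any] = CURRENT_SCHEMA) -> Dict[str, Any]:
    """Project a stored record onto the current schema.

    Missing fields take the schema default (or None, i.e. NULL);
    fields unknown to the current schema are silently dropped.
    """
    return {field: raw.get(field, default) for field, default in schema.items()}


# A record written with a previous schema (no `is_detected` yet).
previous = {"process_id": 652, "image_path": "a.exe"}
# A record written with a future schema (extra `action` field).
future = {"process_id": 696, "image_path": "b.exe", "is_detected": 1, "action": 3}

loaded_previous = load_record(previous)
loaded_future = load_record(future)
print(loaded_previous)
print(loaded_future)
```

In this sketch the loaded future record no longer contains `action`, so writing it back out would not reproduce the original file. That is the same kind of loss the "Save Original" row describes for loaders whose internal structure cannot keep the entire original record.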
  • 35. Preserve low-level visibility • Choose file format based on usage scenario • Review schema to avoid bad design
  • 36. Wrap-up • Preserve low-level visibility – Choose file format wisely – Design schema carefully • Make accessing data easy – On-prem & on-cloud are different – Do something to lower the barrier
  • 37. Any Questions?

Editor's notes

  1. Conway’s law: architecture structure tends to mirror organization structure
  2. Read/write performance – Read scenarios: input to EMR; ingest by Kinesis – Write scenarios: accumulated from volume process; dump DB snapshot
     Size reduction – Baseline: Sequence File + Protobuf – Explicitly sub-schema
     Data characteristics – tabular vs. nested – recursive data structure – different types in array/tuple – key/value pairs with custom keys – JSON subset
     Schema evolution – Principle: backward compatible – data x validator x loader
     Tool interoperability – Pig vs. Spark – Batch vs. Streaming (in Spark) – Write to Firehose – Read from S3 to EMR
  3. Recursion: Parquet (x), ORC (?)
  4. Recursion: Parquet (x), ORC (?)
  5. Custom Key Name: Parquet (?), ORC (?)