SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
巨量資料處理工具的過去、現在與未來
三種處理工具與未來的挑戰

Big Data Tools : PAST, NOW and FUTURE
- Three Types of Processing Tools and the Next Big Thing

National Center for High-performance Computing

Jazz Yao-Tsung Wang
<0303109@narlabs.org.tw>
WHO AM I ? JAZZ ?
• About Speaker :
– Jazz Yao-Tsung Wang / Associate Researcher
– Co-Founder and Evangelist of Hadoop.TW
– 0303109@narlabs.org.tw

• All slides and training materials could be found at
– http://trac.nchc.org.tw/cloud
– Be Green ! Try not to print the slides. It changes frequently.

FOSS User
Debian/Ubutnu
Access Grid
Motion/VLC
Red5
Debian Router
DRBL/Clonezilla
Hadoop

FOSS Developer
TRTC WSU/
Haduzilla /
Hadop4Win / Ezilla
FOSS Evangelist
DRBL/Clonezilla
Partclone/Tuxboot
Hadoop Ecosystem
2
演講大綱 Agenda

Linux is everywhere

開放無所不在

What is Big Data ?

何謂巨量資料

Big Data in Motion !

即時巨資應用

The Next Big Thing ? 下半場的重點
Conclusion

三大結論回顧
3
3 Buzzwords in 2013
三大年度熱門關鍵字

物聯網
Internet of Things
雲端運算
Cloud Computing
巨量資料
Big Data
4
巨量資料的奇幻漂流
     Life of Big Data

5
Linux Adoption in Enterprise
increase in last 3 years

Source: "2013 Enterprise End User Report",
Linux Foundation, March 2013
http://www.linuxfoundation.org/publications/linux-foundation/linux-adoption-trends-end-user-report-2013

6
Linux in the Cloud
← Amazon
Yahoo ↓

If you have 4000+
server, which OS will
you choose?
http://www.datacenterknowledge.com/archives/2010/09/20/inside-the-yahoo-computing-coop/

http://bits.blogs.nytimes.com/2013/01/08/amazons-unknown-unknowns/?_r=0

7
Linux in the Devices !!
Linux have dominated Embedded, Mobile .....
maybe Internet of Things in near future ….

http://crackberry.com/sites/crackberry.com/files/styles/large/public/topic_images/2013/ANDROID.png
http://boxysystems.com/myblog/wp-content/uploads/2011/01/google-chrome-OS-logo.jpg

Source: The most popular end-user Linux distributions are ...
http://www.zdnet.com/the-most-popular-end-user-linux-distributions-are-7000017223/

8
Why Linux ?
Total Cost of Ownership !

Source: How Red Hat Enterprise Linux trims total cost of ownership (TCO) in comparison to Windows Server, October 7, 2013
http://www.redhat.com/about/news/archive/2013/10/how-red-hat-enterprise-linux-trims-total-cost-of-ownership-in-comparison-to-windows-server

9
演講大綱 Agenda

Linux is everywhere

開放無所不在

What is Big Data ?

何謂巨量資料

Big Data in Motion !

即時巨資應用

The Next Big Thing ? 下半場的重點
Conclusion

三大結論回顧
10
巨量資料的三大挑戰
Challenges - 3 Vs of Big Data
Volume 資料數量
(amount of data)
EB

參考來源:
[1] Laney, Douglas. "3D Data Management: Controlling
Data Volume, Velocity and Variety" (6 February 2001)
[2] Gartner Says Solving 'Big Data' Challenge Involves
More Than Just Managing Volumes of Data, June 2011

Structured
結構化資料

Batch ( 批次作業 )

Semi-structured
半結構化資料

PB

Unstructured
非結構化資料
Variety 資料多樣性
(data types, sources)

Realtime ( 即時資料 )
TB

Velocity 資料增加率
(speed of data in/out)

巨量資料的挑戰在於如何管理「數量」、「增加率」與「多樣性」

11 11
知識源自彙整過去,智慧在能預測未來
Knowledge is from the PAST, Wisdom is for the FUTURE.
資料多寡不是重點,重點是
我們想要產生什麼價值呢?
時效合理嘛?成本合理嘛?

It does not
matter how big is
your data. The
goal is to create
VALUE within
reasonable time
period and total
cost of
ownership.
http://www.pursuantgroup.com/blog/tag/dikw-model/

12 12
PAST: Big Data at Rest

http://nextbigsound.com/

http://www.openpdc.com/
Source: 10 ways big data changes everything,
http://gigaom.com/2012/03/11/10-ways-big-data-is-changing-everything

13
處理巨量資料的三類技術 (1)
Data at Rest – MapReduce Framework

e

Volume

e
uc
ed
R
ap
M
Batch

Hadoop
HPCC
Unstructured
Variety

EB

PB

TB

Structured
Petabyte File System

m
ra
F

k
or
w

Realtime

Velocity

14 14
巨量資料處理的資訊架構
The SMAQ stack for big data
LAMP is for Web Services.

未來處理海量資料的人必需知道
You will need a big data stack called
SMAQ ( Storage, MapReduce and Query )

參考來源: The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,   
      http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
圖片來源: http://smashingweb.ge6.org/wp-content/uploads/2011/10/apache-php-mysql-ubuntu.png

15
高資料通量處理平台 Hadoop
Key Concept : Data Locality

Hadoop is a framework for developer
to wrote and execute massive data
processing applications easily.
Hadoop includes two parts: HDFS and MapReduce.
Warehouse
for data source and
output results.
HDFS stores
unstructure data
and structure data

Processing

Grouping

Map
One in
One out

Reduce
Multiple in, One out
16
批次作業的運算時間
Processing Time of Batch Jobs

17
演講大綱 Agenda

Linux is everywhere

開放無所不在

What is Big Data ?

何謂巨量資料

Big Data in Motion !

即時巨資應用

The Next Big Thing ? 下半場的重點
Conclusion

三大結論回顧
18
NOW: Big Data in Motion

[ 金融 ] Trading Robot

[ 災防 ] 海嘯、土石流
Disaster Prevention
Tsunami Forecast
[ 資訊 ] 機房即時用電資訊監控、警訊

Realtime Data Center Power Usage
and related notifications

http://www.newmobilelife.com/2013/04/21/facebook-pue-real-time-charts/

19
處理巨量資料的三類技術 (2)
Data in Motion – In-Memory Processing

Volume
EB

Batch

HBase / Drill
/ Impala

PB

Unstructured
Variety

Structured

Realtime
TB

Velocity

20 20
Google 的技術演進 vs
Apache 專案
Big Query
(JSON, SQL-like)
Incremental Index Update
(Caffeine)
Graph Database

Dremel
(2010)

Apache Drill
(2012)

Percolator
(2010)
Pregel
(2009)

Apache Giraph
(2011)

BigTable
(2006)

Apache HBase
(2007)

MapReduce
(2004)

Hadoop MapReduce
(2006)

Google File System
(2003)

HDFS
(2006)

21
令人眼花撩亂的多樣化資料庫選擇
NoSQL vs NewSQL

http://www.infoq.com/news/2011/04/newsql

22 22
In-Memory Processing 的運算時間
以 HBase 為例

23
處理巨量資料的三類技術 (3)
Streaming Data Collection
Volume
EB

Structured

Batch
PB

Unstructured
Variety

Realtime
TB

Message Queue
Storm / Kafka

Velocity

24 24
Twitter Storm
+ Apache Kafka

http://blog.infochimps.com/2012/10/30/next-gen-real-time-streaming-storm-kafka-integration/

25
混合模式的巨量資料處理架構
Lambda Architecture for Big Data after 2013
HBase
Storm

ElephantDB
Or
Voldemort

Source: Lambda Architecture, 8. March 2013
http://www.ymc.ch/en/lambda-architecture-part-1
26
結論二: Golden Ratio for Big Data
1 core : 2+ GB RAM : 1 SSD Disk

FLOPS=~IOPS

27
演講大綱 Agenda

Linux is everywhere

開放無所不在

What is Big Data ?

何謂巨量資料

Big Data in Motion !

即時巨資應用

The Next Big Thing ? 下半場的重點
Conclusion

三大結論回顧
28
市場現況: Gartner Hype Cycle 2013
萌芽期

夢幻期

幻滅期

平原期

高原期

29
NEXT: Big Data Security
品質管控

當我們緊密相連 .....

世界政經:歐盟想分 Tweeter
找出經濟、政治的脈動

國家安全:美國 PRISM 計劃
( 網軍 ! 終極警探 4.0 )

權限管控
組織如何因應 APT ?
Big Data 平台本身的安全性 ?

有太多安全的問題等待解決!

數量管控

Source: Gartner (March 2011), 'Big
Data' Is Only the Beginning of
Extreme Information Management,
April 2011,
http://www.gartner.com/id=1622715
30
結論三: For Big Data Security,
buy hardware with encryption support
Power Hardware

PowerVM

PowerSC

Systems

Virtualization

Security &
Compliance

Power Hardware
with security built
in & hardware
assists for
encryption

Secure Virtualization
Platform ensuring
isolation integrity

Virtualization Centric
Security Extensions
to protect the Cloud
and Virtual Data
Center

AIX Operating
System

Systems Software

Secure Operating
Systems– defense in
depth: role based
access, trusted
execution, encrypted
file system
31
演講大綱 Agenda

Linux is everywhere

開放無所不在

What is Big Data ?

何謂巨量資料

Big Data in Motion !

即時巨資應用

The Next Big Thing ? 下半場的重點
Conclusion

三大結論回顧
32
Conclusion:
1.Switch to Linux, it helps to reduce TCO (total cost of
ownership) about 34% !
2.Hadoop only solve the problem for Big Data at Rest. You will
need In-Memory Processing tools such as HBase, Spark,
Impala for Big Data in Motion. To cleaning Big Data in realtime, you could try Message Queue, Storm and Kafka.
3.Consider to buy servers with SSD disk, it helps to improve
processing time for Big Data at Rest. For Big Data in Motion,
you will need tools like In-Memory Processing. Remember to
buy more RAM for your cluster.
4.To protect your big data, please survey hardwares and
operation system (OS) which have hardware encryption
support !
33
問題與討論
Questions?

34

Weitere ähnliche Inhalte

Was ist angesagt?

The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
Romeo Kienzler
 

Was ist angesagt? (20)

Büyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi GörmekBüyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi Görmek
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Thinking BIG
Thinking BIGThinking BIG
Thinking BIG
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
社群、協會、國際連結
社群、協會、國際連結社群、協會、國際連結
社群、協會、國際連結
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
 

Andere mochten auch

Hadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TWHadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TW
Jazz Yao-Tsung Wang
 
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
Jazz Yao-Tsung Wang
 
13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening
Jazz Yao-Tsung Wang
 
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
Jazz Yao-Tsung Wang
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
Jazz Yao-Tsung Wang
 

Andere mochten auch (11)

Hadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TWHadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TW
 
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
 
13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening
 
2006-11-16 RFID and OSS for Agriculture
2006-11-16 RFID and OSS for Agriculture2006-11-16 RFID and OSS for Agriculture
2006-11-16 RFID and OSS for Agriculture
 
Build Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and HaduzillaBuild Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and Haduzilla
 
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
 
2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security
 
Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)
 
Hadoop.TW : Now and Future
Hadoop.TW : Now and FutureHadoop.TW : Now and Future
Hadoop.TW : Now and Future
 
Hadoop 生態系十年回顧與未來展望
Hadoop 生態系十年回顧與未來展望Hadoop 生態系十年回顧與未來展望
Hadoop 生態系十年回顧與未來展望
 

Ähnlich wie Big Data Tools : PAST, NOW and FUTURE

Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
Rohit Dubey
 

Ähnlich wie Big Data Tools : PAST, NOW and FUTURE (20)

Bigdata notes
Bigdata notesBigdata notes
Bigdata notes
 
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
Drinking from the Fire Hose: Practical Approaches to Big Data Preparation and...
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
20150630 kca big-data-with-cloud_output
20150630 kca big-data-with-cloud_output20150630 kca big-data-with-cloud_output
20150630 kca big-data-with-cloud_output
 
Big data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You WantBig data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You Want
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
Big data
Big dataBig data
Big data
 
Data analysis trend 2015 2016 v071
Data analysis trend 2015 2016 v071Data analysis trend 2015 2016 v071
Data analysis trend 2015 2016 v071
 
Python's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPython's Role in the Future of Data Analysis
Python's Role in the Future of Data Analysis
 
On Big Data
On Big DataOn Big Data
On Big Data
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
Big Data 2.0
Big Data 2.0Big Data 2.0
Big Data 2.0
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Scott Edmunds slides from #IDCC13 Data Science session
Scott Edmunds slides from #IDCC13 Data Science sessionScott Edmunds slides from #IDCC13 Data Science session
Scott Edmunds slides from #IDCC13 Data Science session
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life Revolution
 
Unit 1
Unit 1Unit 1
Unit 1
 
An Overview of BigData
An Overview of BigDataAn Overview of BigData
An Overview of BigData
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Big Data Tools : PAST, NOW and FUTURE

  • 1. 巨量資料處理工具的過去、現在與未來 三種處理工具與未來的挑戰 Big Data Tools : PAST, NOW and FUTURE - Three Types of Processing Tools and the Next Big Thing National Center for High-performance Computing Jazz Yao-Tsung Wang <0303109@narlabs.org.tw>
  • 2. WHO AM I ? JAZZ ? • About Speaker : – Jazz Yao-Tsung Wang / Associate Researcher – Co-Founder and Evangelist of Hadoop.TW – 0303109@narlabs.org.tw • All slides and training materials could be found at – http://trac.nchc.org.tw/cloud – Be Green ! Try not to print the slides. It changes frequently. FOSS User Debian/Ubutnu Access Grid Motion/VLC Red5 Debian Router DRBL/Clonezilla Hadoop FOSS Developer TRTC WSU/ Haduzilla / Hadop4Win / Ezilla FOSS Evangelist DRBL/Clonezilla Partclone/Tuxboot Hadoop Ecosystem 2
  • 3. 演講大綱 Agenda Linux is everywhere 開放無所不在 What is Big Data ? 何謂巨量資料 Big Data in Motion ! 即時巨資應用 The Next Big Thing ? 下半場的重點 Conclusion 三大結論回顧 3
  • 4. 3 Buzzwords in 2013 三大年度熱門關鍵字 物聯網 Internet of Things 雲端運算 Cloud Computing 巨量資料 Big Data 4
  • 6. Linux Adoption in Enterprise increase in last 3 years Source: "2013 Enterprise End User Report", Linux Foundation, March 2013 http://www.linuxfoundation.org/publications/linux-foundation/linux-adoption-trends-end-user-report-2013 6
  • 7. Linux in the Cloud ← Amazon Yahoo ↓ If you have 4000+ server, which OS will you choose? http://www.datacenterknowledge.com/archives/2010/09/20/inside-the-yahoo-computing-coop/ http://bits.blogs.nytimes.com/2013/01/08/amazons-unknown-unknowns/?_r=0 7
  • 8. Linux in the Devices !! Linux have dominated Embedded, Mobile ..... maybe Internet of Things in near future …. http://crackberry.com/sites/crackberry.com/files/styles/large/public/topic_images/2013/ANDROID.png http://boxysystems.com/myblog/wp-content/uploads/2011/01/google-chrome-OS-logo.jpg Source: The most popular end-user Linux distributions are ... http://www.zdnet.com/the-most-popular-end-user-linux-distributions-are-7000017223/ 8
  • 9. Why Linux ? Total Cost of Ownership ! Source: How Red Hat Enterprise Linux trims total cost of ownership (TCO) in comparison to Windows Server, October 7, 2013 http://www.redhat.com/about/news/archive/2013/10/how-red-hat-enterprise-linux-trims-total-cost-of-ownership-in-comparison-to-windows-server 9
  • 10. 演講大綱 Agenda Linux is everywhere 開放無所不在 What is Big Data ? 何謂巨量資料 Big Data in Motion ! 即時巨資應用 The Next Big Thing ? 下半場的重點 Conclusion 三大結論回顧 10
  • 11. 巨量資料的三大挑戰 Challenges - 3 Vs of Big Data Volume 資料數量 (amount of data) EB 參考來源: [1] Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety" (6 February 2001) [2] Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data, June 2011 Structured 結構化資料 Batch ( 批次作業 ) Semi-structured 半結構化資料 PB Unstructured 非結構化資料 Variety 資料多樣性 (data types, sources) Realtime ( 即時資料 ) TB Velocity 資料增加率 (speed of data in/out) 巨量資料的挑戰在於如何管理「數量」、「增加率」與「多樣性」 11 11
  • 12. 知識源自彙整過去,智慧在能預測未來 Knowledge is from the PAST, Wisdom is for the FUTURE. 資料多寡不是重點,重點是 我們想要產生什麼價值呢? 時效合理嘛?成本合理嘛? It does not matter how big is your data. The goal is to create VALUE within reasonable time period and total cost of ownership. http://www.pursuantgroup.com/blog/tag/dikw-model/ 12 12
  • 13. PAST: Big Data at Rest http://nextbigsound.com/ http://www.openpdc.com/ Source: 10 ways big data changes everything, http://gigaom.com/2012/03/11/10-ways-big-data-is-changing-everything 13
  • 14. 處理巨量資料的三類技術 (1) Data at Rest – MapReduce Framework e Volume e uc ed R ap M Batch Hadoop HPCC Unstructured Variety EB PB TB Structured Petabyte File System m ra F k or w Realtime Velocity 14 14
  • 15. 巨量資料處理的資訊架構 The SMAQ stack for big data LAMP is for Web Services. 未來處理海量資料的人必需知道 You will need a big data stack called SMAQ ( Storage, MapReduce and Query ) 參考來源: The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,          http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html 圖片來源: http://smashingweb.ge6.org/wp-content/uploads/2011/10/apache-php-mysql-ubuntu.png 15
  • 16. 高資料通量處理平台 Hadoop Key Concept : Data Locality Hadoop is a framework for developer to wrote and execute massive data processing applications easily. Hadoop includes two parts: HDFS and MapReduce. Warehouse for data source and output results. HDFS stores unstructure data and structure data Processing Grouping Map One in One out Reduce Multiple in, One out 16
  • 18. 演講大綱 Agenda Linux is everywhere 開放無所不在 What is Big Data ? 何謂巨量資料 Big Data in Motion ! 即時巨資應用 The Next Big Thing ? 下半場的重點 Conclusion 三大結論回顧 18
  • 19. NOW: Big Data in Motion [ 金融 ] Trading Robot [ 災防 ] 海嘯、土石流 Disaster Prevention Tsunami Forecast [ 資訊 ] 機房即時用電資訊監控、警訊 Realtime Data Center Power Usage and related notifications http://www.newmobilelife.com/2013/04/21/facebook-pue-real-time-charts/ 19
  • 20. 處理巨量資料的三類技術 (2) Data in Motion – In-Memory Processing Volume EB Batch HBase / Drill / Impala PB Unstructured Variety Structured Realtime TB Velocity 20 20
  • 21. Google 的技術演進 vs Apache 專案 Big Query (JSON, SQL-like) Incremental Index Update (Caffeine) Graph Database Dremel (2010) Apache Drill (2012) Percolator (2010) Pregel (2009) Apache Giraph (2011) BigTable (2006) Apache HBase (2007) MapReduce (2004) Hadoop MapReduce (2006) Google File System (2003) HDFS (2006) 21
  • 24. 處理巨量資料的三類技術 (3) Streaming Data Collection Volume EB Structured Batch PB Unstructured Variety Realtime TB Message Queue Storm / Kafka Velocity 24 24
  • 25. Twitter Storm + Apache Kafka http://blog.infochimps.com/2012/10/30/next-gen-real-time-streaming-storm-kafka-integration/ 25
  • 26. 混合模式的巨量資料處理架構 Lambda Architecture for Big Data after 2013 HBase Storm ElephantDB Or Voldemort Source: Lambda Architecture, 8. March 2013 http://www.ymc.ch/en/lambda-architecture-part-1 26
  • 27. 結論二: Golden Ratio for Big Data 1 core : 2+ GB RAM : 1 SSD Disk FLOPS=~IOPS 27
  • 28. 演講大綱 Agenda Linux is everywhere 開放無所不在 What is Big Data ? 何謂巨量資料 Big Data in Motion ! 即時巨資應用 The Next Big Thing ? 下半場的重點 Conclusion 三大結論回顧 28
  • 29. 市場現況: Gartner Hype Cycle 2013 萌芽期 夢幻期 幻滅期 平原期 高原期 29
  • 30. NEXT: Big Data Security 品質管控 當我們緊密相連 ..... 世界政經:歐盟想分 Tweeter 找出經濟、政治的脈動 國家安全:美國 PRISM 計劃 ( 網軍 ! 終極警探 4.0 ) 權限管控 組織如何因應 APT ? Big Data 平台本身的安全性 ? 有太多安全的問題等待解決! 數量管控 Source: Gartner (March 2011), 'Big Data' Is Only the Beginning of Extreme Information Management, April 2011, http://www.gartner.com/id=1622715 30
  • 31. 結論三: For Big Data Security, buy hardware with encryption support Power Hardware PowerVM PowerSC Systems Virtualization Security & Compliance Power Hardware with security built in & hardware assists for encryption Secure Virtualization Platform ensuring isolation integrity Virtualization Centric Security Extensions to protect the Cloud and Virtual Data Center AIX Operating System Systems Software Secure Operating Systems– defense in depth: role based access, trusted execution, encrypted file system 31
  • 32. 演講大綱 Agenda Linux is everywhere 開放無所不在 What is Big Data ? 何謂巨量資料 Big Data in Motion ! 即時巨資應用 The Next Big Thing ? 下半場的重點 Conclusion 三大結論回顧 32
  • 33. Conclusion: 1.Switch to Linux, it helps to reduce TCO (total cost of ownership) about 34% ! 2.Hadoop only solve the problem for Big Data at Rest. You will need In-Memory Processing tools such as HBase, Spark, Impala for Big Data in Motion. To cleaning Big Data in realtime, you could try Message Queue, Storm and Kafka. 3.Consider to buy servers with SSD disk, it helps to improve processing time for Big Data at Rest. For Big Data in Motion, you will need tools like In-Memory Processing. Remember to buy more RAM for your cluster. 4.To protect your big data, please survey hardwares and operation system (OS) which have hardware encryption support ! 33