Big Data and Data Intensive Computing: Education and Training

Graduate School of Communication & Art
Yonsei University
Shinchon, Korea
Sept 5th 2013
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles

High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
 Introduction
 Big Data Use Cases
 Data Issues
 Big Data
 Data-Intensive Computing: Hadoop
 Training in Big Data
 Big Data Supporters

Me
 Name: Jongwook Woo
 Position:
 Professor (rank: Associate Professor), California State University, Los Angeles
– Capital City of Entertainment
 Career:
 Faculty member since 2002: Computer Information Systems Dept, College of
Business and Economics
– www.calstatela.edu/faculty/jwoo5
 Consulting for many companies in and around Hollywood since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
 Interested in Hadoop and Big Data since around 2009

Me
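Career (continued):
Currently advising IglooSecurity (summer 2013):
– Training on Hadoop and its ecosystem
– R&D on a system for fast search over security log files
generated at 30GB – 100GB per day
• Using Hadoop, Solr, Java, Cloudera
Mid-September 2013: Samsung Advanced Institute of Technology
– Scheduled to give 3 days of training on Hadoop and its ecosystem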
– Using Cloudera material in Korea as far as I know

Experience in Big Data
 Grants
 Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
 Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
 Partnership
 Received Academic Education Partnership with Cloudera since
June 2012
 In contact with Hortonworks since May 2013
– Positive about providing a partnership

 Certificate
 Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
 Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
 Blog and Github for Hadoop and its ecosystems
 http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
 https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
 https://github.com/dalgual

 Several publications regarding Hadoop and NoSQL
 “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA
2012, Las Vegas (July 16-19, 2012)
 “Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon
Ho Kim, EDB 2011, Incheon, Aug. 25-27, 2011
 “Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011,
Las Vegas (July 18-21, 2011)
 Jongwook Woo, “Introduction to Cloud Computing”, in the
10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009
 Talks in Korean Universities and companies
 Yonsei, Sookmyung, KAIST, Korean Polytech Univ
– Winter 2011
 VanillaBreeze
– Winter 2011

What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing

Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”

Use Cases in Korea
SK Telecom
Seoul
Credit Cards
Hyundai Motors

SK Telecom
T Map
 Collect GPS traffic data from taxis, buses,
and rental cars
– Traffic data from 50,000 cars every 5 mins
 Give the quickest route to the
destination

Seoul
Night Bus
 Collect GPS traffic data from taxis
 Find the most heavily traveled routes
– Build night bus lines along them

Credit Cards
Apps to find popular restaurants
Collect customer behavior from card use
at restaurants
Logic: how often customers revisit the same
restaurant within 3 months
Show the popular restaurants
Credit cards for gas station discounts
Detect customers using a card at gas stations that
give no discount
Offer them a new card that gives a discount at any station

Hyundai Motors
Improve the present and future models
Collect drivers’ behavior and the status of the cars
Collect any errors in the car

Use Cases
Presidential Election
Amazon AWS
HuffPost | AOL

Presidential Election
People Behavior Analysis
Collect people’s data on credit card usage, car
models, newspapers read, Facebook, Twitter
For example, a pro-environmental campaign targeting
– Moms
• who send their kids to public school,
• who tweet about organic foods,

HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
–Evaluate All New HuffPost User Comments
Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% of Comments Every
Day
Article Classification
–Tag Articles for Advertising
• E.g.: scary, salacious, …

HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is
already available.
– There are Mallet, libsvm, Weka, … that support
necessary algorithms.
Bad news:
– Mahout doesn’t support necessary algorithms
yet.
– Other algorithms do not run natively on Hadoop.
Approach: build a flexible ML platform running on
Hadoop
Use Pig for the Hadoop implementation.

Others
amazon.com
Recommend books to customers
Google
Detect influenza outbreaks much earlier
– by analyzing the affected areas
Translator
– by analyzing data from many people
Siri of Apple
Natural language processing built on data from
many people

New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data

Data Issues
Large-Scale data
Terabyte (10^12), Petabyte (10^15)
– Because of web
– Sensor Data, Bioinformatics, Social Computing,
smart phone, online game…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive

Two Cores in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel computing with multiple inexpensive
computers
– Build your own supercomputer

Big Data for RDBMS
Issues in RDBMS
Hard to scale
– Relation gets broken
• Partitioning for scalability
• Replication for availability
Speed
– The seek times of physical storage
• Slower than N/W speed
• 1TB disk: 10MB/s transfer rate
– 100K sec => 27.8 hrs
– With multiple data sources at different places
• 100 10GB disks: each 10MB/s transfer rate
– 1K sec => 16.7 min
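The transfer-time figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming 1 TB = 10^12 bytes, a sustained 10 MB/s per disk, and data split evenly across disks that are read in parallel; the function name `transfer_seconds` is hypothetical.

```python
# Sketch: time to read 1 TB from one disk vs. from 100 disks read in parallel.
# Assumptions: 1 TB = 10**12 bytes, 10 MB/s sustained per disk, even split.

def transfer_seconds(total_bytes: int, bytes_per_sec: int, disks: int = 1) -> float:
    """Seconds to read total_bytes split evenly over `disks` parallel disks."""
    return total_bytes / disks / bytes_per_sec

TB = 10**12
RATE = 10 * 10**6  # 10 MB/s per disk

single = transfer_seconds(TB, RATE)          # one 1 TB disk
parallel = transfer_seconds(TB, RATE, 100)   # 100 disks holding 10 GB each

print(f"single disk: {single:,.0f} s = {single / 3600:.1f} h")
print(f"100 disks:   {parallel:,.0f} s = {parallel / 60:.1f} min")
```

The 100x speed-up comes only from reading in parallel; each individual disk is no faster.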

Big Data for RDBMS (Cont’d)
Issues in RDBMS (Cont’d)
Data Integration
– Not good for un-/semi-structured data
• Much data is unstructured
– Web or log data, etc.
RDB is not good at parallelization
– Cannot efficiently split 1000 tasks across
1000 inexpensive PCs

RDBMS Issues
Solution
 Before: Data Warehouse
 Now and future: Big Data
Hadoop framework
Data Computation (MapReduce, Pig)
Data Repositories (NoSQL DB: HBase,
Cassandra, MongoDB)
Business Intelligence (Data Mining,
OLAP, Data Visualization, Reporting):
Hive, Mahout

Big Data
Definition
 Systems that provide an inexpensive
platform to store and
compute large-scale, un-/semi-structured
data

Use Cases for NoSQL DB [1]
RDBMS replacement
for high-traffic web applications
Semi-structured content management
Real-time analytics & high-speed logging
Web Infrastructure
Web 2.0, Media, SaaS, Gaming,
Finance, Telecom, Healthcare, Government
Three NoSQL DB Approaches
Key/Value, Column, Document

Data Store of NoSQL DB
Key/Value store
(Key, Value)
Functions
– Index, versioning, sorting, locking, transaction,
replication
Apache Cassandra, Memcached

Data Store of NoSQL DB (Cont’d)
Column-Oriented Stores (Extensible Record
Stores)
stores data tables as sections of columns of data
– rather than as rows of data, like most RDBMS
• Sparse fields in RDBMS
– well-suited for OLAP-like workloads (e.g., data
warehouses)
Extensible record horizontally and vertically
partitioned across nodes
– Rows and Columns are distributed over multiple
nodes
BigTable, HBase, Cassandra, Hypertable

 Row Oriented
– 1,Smith, Joe, smith@hi.com;
– 2,Jones, Mary, mary@hi.com;
– 3,Johnson, Cathy, cathy@hi.com;
 Column Oriented
– 1,2,3;
– Smith, Jones, Johnson;
– Joe, Mary, Cathy;
– smith@hi.com, mary@hi.com, cathy@hi.com;
StudentId  Lastname  Firstname  email
1          Smith     Joe        smith@hi.com
2          Jones     Mary       mary@hi.com
3          Johnson   Cathy      cathy@hi.com
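A small sketch of the two layouts, using the student table from this slide. The variable names are hypothetical, and real column stores add compression and on-disk partitioning that this ignores.

```python
# Row-oriented layout: one tuple per record.
rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]
fields = ("StudentId", "Lastname", "Firstname", "email")

# Column-oriented layout: one list per field, built by transposing the rows.
columns = {name: list(values) for name, values in zip(fields, zip(*rows))}

# A query touching one field reads a single contiguous list
# instead of scanning every full record.
print(columns["Lastname"])
```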

HBase Schema Example (Student/Course)
 RDBMS
 Students: (id, name, sex, age)
 Courses: (id, title, desc, teacher_id)
 S_C: (s_id, c_id, type)
 HBase
Students table
Row key       | Column family Info                      | Column family Course
<student_id>  | Info:name, Info:sex, Info:age           | Course:<course_id>=type
Courses table
Row key       | Column family Info                      | Column family student
<course_id>   | Info:title, Info:desc, Info:teacher_id  | student:<student_id>=type
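One way to picture the HBase layout is as nested maps: row key, then column family, then column qualifier. The sketch below models that with plain Python dicts. It is not the HBase API, and all IDs and values are hypothetical sample data.

```python
# Plain-dict model of the two HBase tables: row key -> family -> qualifier -> value.
# Not the HBase client API; IDs and values are made-up sample data.
students = {
    "s001": {
        "Info": {"name": "Mary Jones", "sex": "F", "age": "21"},
        "Course": {"c101": "elective", "c205": "required"},  # <course_id> = type
    },
}
courses = {
    "c101": {
        "Info": {"title": "Databases", "desc": "Intro course", "teacher_id": "t9"},
        "student": {"s001": "elective"},                      # <student_id> = type
    },
}

def courses_of(student_id: str) -> list:
    """Course IDs of one student, read from a single row with no join table."""
    return sorted(students[student_id]["Course"])

print(courses_of("s001"))
```

The many-to-many S_C relation is stored twice, once per direction, so each lookup touches only one row.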

Document Store
Collections and Documents
– vs Tables and Records of RDB
Used in Search Engine/Repository
Multiple index to store indexed document
– no fixed fields
Not simple key-value lookup
– Use API
Functions
– No locking, Replication, Transaction
MongoDB, CouchDB, ThruDB, SimpleDB

Understanding the Document Model [1]
{
_id: "A4304",
author: "nosh",
date: "22/6/2010",
title: "Intro to MongoDB",
text: "MongoDB is an open source..",
tags: ["webinar", "opensource"],
comments: [{author: "mike",
date: "11/18/2010",
txt: "Did you see the…",
votes: 7}, …]
}
Documents->Collections->Databases

Document Model Makes Queries Simple [1]
Operators:
$gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit,
skip, group
Example:
db.posts.find({author: "nosh",
tags: "webinar"})
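To illustrate what the query above selects, here is a pure-Python stand-in for the matching rule, not MongoDB or a driver. It assumes only equality conditions, with a scalar condition on an array field matching when the array contains that value. The second post is hypothetical sample data, and `matches` and `find` are illustrative names.

```python
# Pure-Python stand-in for an equality-only find(); not the MongoDB API.
posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"]},
    {"_id": "B0001", "author": "nosh", "title": "Another post",  # hypothetical
     "tags": ["release"]},
]

def matches(doc: dict, query: dict) -> bool:
    """True if every query field equals the doc field, or is contained in it
    when the doc field is a list (as with tags)."""
    for field, wanted in query.items():
        actual = doc.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True

def find(collection: list, query: dict) -> list:
    return [doc for doc in collection if matches(doc, query)]

result = find(posts, {"author": "nosh", "tags": "webinar"})
print([doc["_id"] for doc in result])
```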

Selected Users [1]

The Great Divide [1]
MongoDB sweet spot: Easy, Flexible,
Scalable
HBase
MongoDB

Solutions in Big Data Computation
 Map/Reduce by Google
(Key, Value) parallel computing
 Apache Hadoop
 Big Data
Data Computation (MapReduce, Pig)
 Integrating MapReduce and RDB
Oracle + Hadoop
Sybase IQ
Vertica + Hadoop
Hadoop DB
Greenplum
Aster Data
 Integrating MapReduce and NoSQL DB
MongoDB MapReduce
HBase

Apache Hadoop
 Motivated by Google Map/Reduce and GFS
 open source project of the Apache Foundation.
 framework written in Java
– originally developed by Doug Cutting
• who named it after his son's toy elephant.
 Two core Components
 Storage: HDFS
– High Bandwidth Clustered storage
 Processing: Map/Reduce
– Fault Tolerant Distributed Processing
 Hadoop scales linearly with
 data size
 Analysis complexity

Hadoop issues
Map/Reduce is not DB
Algorithm in Restricted Parallel Computing
HDFS and HBase
Cannot compete with the functions in RDBMS
But useful for
Huge (peta- or terabytes) but uncomplicated
data
– Web crawling
– log analysis
• Log file for web companies
– New York Times case

MapReduce Pros & Cons Summary
Good when
Huge data for input, intermediate, output
Little synchronization required
Read once; batch oriented datasets (ETL)
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream

MapReduce in Detail
Functions borrowed from functional
programming languages (e.g., Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing

Map
Convert input data to (key, value) pairs
map() functions run in parallel,
 creating different intermediate (key, value)
pairs from different input data sets

Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.

Training in Big Data
 Learn by yourself?
You may miss many important topics
Two main training providers:
–Cloudera, Hortonworks
• With hands-on exercises
Brief introduction to the Cloudera course materials
Especially MapReduce example

Example: Sort URLs by number of hits, largest first
Find the most-hit URLs
Stored in log files
Map()
Input: <logFilename, file text>
Output: Parses file and emits <url, hit count> pairs
– e.g. <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map
nodes
Output: Sums all values for the same key and emits
<url, TotalCount>
– e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
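The Map() and Reduce() steps above can be simulated in a single process. This is a sketch of the data flow only, not Hadoop code. It assumes a hypothetical log format with one request per line and the URL as the first whitespace-separated field; the function names and sample logs are illustrative.

```python
from collections import defaultdict

def map_log(filename, text):
    """Map: emit (url, 1) for every request line in one log file."""
    for line in text.splitlines():
        parts = line.split()
        if parts:
            yield parts[0], 1  # assumption: URL is the first field

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_hits(url, counts):
    """Reduce: sum all hit counts for one URL."""
    return url, sum(counts)

def run(logs):
    pairs = [kv for name, text in logs.items() for kv in map_log(name, text)]
    totals = [reduce_hits(url, counts) for url, counts in shuffle(pairs).items()]
    # Largest hit count first; ties broken by URL for a stable order.
    return sorted(totals, key=lambda kv: (-kv[1], kv[0]))

logs = {  # hypothetical sample input
    "a.log": "http://hi.com\nhttp://hello.com\nhttp://hi.com\n",
    "b.log": "http://hi.com\nhttp://halo.com\n",
}
ranking = run(logs)
print(ranking)
```

On a cluster, each `map_log` call would run on the node holding that file and each `reduce_hits` call on the node assigned that key. The final sort is a separate step here.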

Map/Reduce for URL visits
Input Log Data
Map1(), Map2(), …, Mapm() emit:
(http://hi.com, 1)
…
(http://halo.com, 1)
…
Data Aggregation/Combine:
(http://hi.com, <1, 1, …, 1>)
(http://hello.com, <3, 5, 2, 7>)
(http://halo.com, <1, 5>)
Reduce1(), Reduce2(), …, Reducel() emit:
(http://hi.com, 32)
(http://hello.com, 17)
(http://halo.com, 6)

Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
The archive was a four-terabyte pile of images in TIFF format.
The Times needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.

Legacy Example (Cont’d)
A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage Service (S3)
• In less than 24 hours, 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
 The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
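The cost figure follows directly from the three numbers on the slide. A minimal check, working in integer cents to avoid floating-point rounding (the variable names are hypothetical):

```python
# Cost of the job: rate per machine-hour x machines x hours, in integer cents.
RATE_CENTS_PER_MACHINE_HOUR = 10
MACHINES = 100
HOURS = 24

total_cents = RATE_CENTS_PER_MACHINE_HOUR * MACHINES * HOURS
print(f"${total_cents // 100}")
```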

Supporters of Big Data: Hadoop Ecosystems
 Apache Hadoop Supporters
 Cloudera
– Like Linux and Redhat
– HiPIC is an Academic Partner
 Hortonworks
– Pig,
– Consulting and training
 Facebook
– Hive
 IBM
– Jaql
 NoSQL DB supporters
 MongoDB
 HBase, CouchDB, Apache Cassandra (originally by FB) etc

Pig
• developed at Yahoo Research around 2006
o moved into the Apache Software Foundation in
2007.
• PigLatin,
o Pig's language
o a data flow language
o well suited to processing unstructured data
 Unlike SQL, does not require that the data have a
schema
 However, can still leverage the value of a schema

Hive
• developed at Facebook
o turns Hadoop into a data warehouse
o complete with a dialect of SQL for querying.
• HiveQL
o a declarative language (SQL dialect)
• Difference from PigLatin,
o you do not specify the data flow,
 but instead describe the result you want
 Hive figures out how to build a data flow to
achieve it.
o a schema is required,
 but not limited to one schema.
o data can have many schemas

Hive (Cont'd)
• Similarity with PigLatin and SQL,
o HiveQL on its own is a relationally complete
language
 but not a Turing complete language
 (a Turing complete language can express any computation)
o can be extended through UDFs (User Defined
Functions) written in Java
 to become Turing complete, just like Pig

Jaql
• developed at IBM.
• a data flow language
o its native data structure format is JSON (JavaScript
Object Notation).
• Schemas are optional
• Turing complete on its own
o without the need for extension through UDFs.

MapReduce Cons and Future
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
Hadoop 2.0: YARN
Not a product yet but will be soon

Hadoop 2.0: YARN
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other
commercial platforms
Tez – Generic framework to run a complex DAG
 MPI: OpenMPI, MPICH2
 Master-Worker
 Machine Learning: Spark
 Graph processing: Giraph
 Enabled by allowing the use of paradigm-specific
application master
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]

Big Data Supporters
Amazon AWS
Facebook
Twitter
Craigslist

Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB

Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching
grants from AWS

Facebook [7]
Using Apache HBase
 For Titan and Puma
– Message Services
– ETL
 HBase for FB
– Provide excellent write performance and good reads
– Nice features
• Scalable
• Fault Tolerance
• MapReduce

Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase

Puma: Facebook
 ETL
 Extract, Transform, Load
– Integrating data from many data sources into a Data Warehouse
 Data analytics
– Domain owners’ web analytics for Ad and apps
• clicks, likes, shares, comments etc
 ETL before Puma
 8 – 24 hours
– Procedures: Scribe, HDFS, Hive, MySQL
 ETL after Puma
 Puma
– Real time MapReduce framework
 2 – 30 secs
– Procedures: Scribe, HDFS, Puma, HBase

Twitter [8]
Three Challenges
Collecting Data
– Scribe, as at FB
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time

Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly

Hadoop Streaming
 Hadoop MapReduce for non-Java code: Python,
Ruby
 Requirements
 A running Hadoop installation
 The Hadoop Streaming API
– hadoop-streaming.jar
 Mapper and Reducer programs
– Simple conversion from sequential code
 STDIN > mapper > reducer > STDOUT

Hadoop Streaming
 MapReduce Python execution
 http://wiki.apache.org/hadoop/HadoopStreaming
 Syntax
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
 Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
-file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
-file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
-input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
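The contents of mapper.py and reducer.py are not shown in the slides. As an illustration only, a word-count pair for the Shakespeare input might look like the sketch below. It assumes the usual streaming convention of tab-separated key/value lines, with the reducer receiving its input sorted by key. Both roles are combined in one file here for brevity, and the `map`/`reduce` command-line argument is a hypothetical convention, not a Hadoop one.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key,
    which Hadoop Streaming guarantees between the two phases."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hypothetical usage: `python wordcount.py map` or `python wordcount.py reduce`,
    # reading from STDIN and writing to STDOUT as streaming expects.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role in ("map", "reduce"):
        step = mapper if role == "map" else reducer
        for out in step(sys.stdin):
            print(out)
```

The same pipeline can be tried locally without Hadoop by piping the mapper output through `sort` into the reducer.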

Conclusion
 Era of Big Data
 Need to store and compute Big Data
 Many solutions, but mainly Hadoop
 Storage: NoSQL DB
 Computation: Hadoop MapReduce
 Need to analyze Big Data in mobile computing, SNS
for Ad, User Behavior, Patterns …

Question?
