This document provides an overview of Jongwook Woo's background and experience working with big data and Hadoop. It discusses Woo's role as a professor teaching big data courses, partnerships with Cloudera and Amazon AWS, publications on Hadoop and NoSQL databases, and certificates earned in big data training. It also summarizes key aspects of big data, including the rise of unstructured and large-scale data, issues with relational databases at scale, and the two core components of Hadoop - HDFS for storage and MapReduce for distributed processing. Finally, it provides an example MapReduce job for sorting URLs by number of hits.
Big Data and Data Intensive Computing: Education and Training
1. Big Data and Data Intensive Computing: Education and Training
Jongwook Woo (PhD)
High-Performance Internet Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles (CSULA)
Naver Labs, Bundang, Korea
Aug 30th 2013
2. High Performance Internet Computing Center
jwoo Woo
CSULA
Contents
Introduction
Data Issues
Big Data
Data-Intensive Computing: Hadoop
Training in Big Data
Big Data Supporters and Use Cases
3. Me
Name: Jongwook Woo
Occupation:
Professor (rank: Associate Professor), California State University, Los Angeles
– in the capital city of entertainment
Experience:
Professor since 2002: Computer Information Systems Dept, College of
Business and Economics
– www.calstatela.edu/faculty/jwoo5
Consulting for many companies, in Hollywood and elsewhere, since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
Interested in Hadoop and Big Data since about 2009
4. Me
Experience (continued):
Consulting for IglooSecurity as of summer 2013:
– Training on Hadoop and its ecosystem
– R&D on a system for fast search over security log files
generated at 30 GB – 100 GB per day
• Using Hadoop, Solr, Java, and Cloudera
Mid-September 2013: Samsung Advanced Institute of Technology
– Three days of training on Hadoop and its ecosystem planned
– Using Cloudera material in Korea, as far as I know
5. Experience in Big Data
Grants
Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Received Academic Education Partnership with Cloudera since
June 2012
Linked with Hortonworks since May 2013
– Positive about providing a partnership
6. Experience in Big Data
Certificate
Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
Blog and Github for Hadoop and its ecosystems
http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
https://github.com/dalgual
7. Experience in Big Data
Several publications regarding Hadoop and NoSQL
“Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA
2012, Las Vegas (July 16-19, 2012)
“Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon
Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
“Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011,
Las Vegas (July 18-21, 2011)
Jongwook Woo, “Introduction to Cloud Computing”, in the
10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009
Talks at Korean universities and companies
Yonsei, Sookmyung, KAIST, Korean Polytech Univ
– Winter 2011
VanillaBreeze
– Winter 2011
8. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
9. Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
10. New Data Trend
Sparsity
Unstructured
Schema-free data with sparse attributes
– Semantic or social relations
No relational properties
– nor complex join queries
• Log data
Immutable
No need to update or delete data
11. Data Issues
Large-scale data
Terabytes (10^12), Petabytes (10^15)
– Because of the web
– Sensor data, bioinformatics, social computing,
smart phones, online games …
Cannot be handled with the legacy approach:
Too big
Un-/semi-structured data
Too expensive
Need new, inexpensive systems
12. Two Cores in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel computing with multiple inexpensive computers
– in effect, your own supercomputer
13. Big Data for RDBMS
Issues in RDBMS
Hard to scale
– Relations get broken
• Partitioning for scalability
• Replication for availability
Speed
– The seek and transfer times of physical storage
• Slower than network speed
• One 1 TB disk at a 10 MB/s transfer rate:
– 100K sec => 27.8 hrs
– With multiple data sources at different places:
• 100 disks of 10 GB each, 10 MB/s apiece:
– 1K sec => 16.7 min
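The slide's arithmetic can be sanity-checked in a few lines (a sketch; it assumes the slide's "10Mbps" means a sustained 10 MB/s transfer rate):

```python
# Back-of-the-envelope check of the disk-transfer numbers on the slide.
# Assumption: "10Mbps" on the slide means 10 MB/s of sustained transfer.

MB = 10**6

def transfer_seconds(disk_bytes, rate_bytes_per_sec):
    """Time to read a whole disk sequentially at the given rate."""
    return disk_bytes / rate_bytes_per_sec

# One 1 TB disk read at 10 MB/s: 100,000 s, about 27.8 hours.
one_disk = transfer_seconds(10**12, 10 * MB)

# 100 disks of 10 GB each, read in parallel at 10 MB/s apiece:
# each finishes in 1,000 s, about 16.7 minutes.
parallel = transfer_seconds(10 * 10**9, 10 * MB)

print(one_disk / 3600)   # hours for the single disk
print(parallel / 60)     # minutes per disk in the parallel case
```

Reading in parallel across many cheap disks, not faster single disks, is what buys the 100x speedup.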
14. Big Data for RDBMS (Cont’d)
Issues in RDBMS (Cont’d)
Data integration
– Not good for un-/semi-structured data
• Much data is unstructured
– Web or log data, etc.
RDBs are not good at parallelization
– Cannot split 1,000 tasks across 1,000 inexpensive PCs efficiently
15. RDBMS Issues
Solution
Before: Data Warehouse
Now and future: Big Data
Hadoop framework
Data Computation (MapReduce, Pig)
Data Repositories (NoSQL DB: HBase,
Cassandra, MongoDB)
Business Intelligence (Data Mining,
OLAP, Data Visualization, Reporting):
Hive, Mahout
16. Big Data
Definition
Systems that support an inexpensive platform to store and
compute large-scale, un-/semi-structured data
17. Use Cases for NoSQL DB [1]
RDBMS replacement
for high-traffic web applications
Semi-structured content management
Real-time analytics & high-speed logging
Web Infrastructure
Web 2.0, Media, SaaS, Gaming,
Finance, Telecom, Healthcare, Government
Three NoSQL DB Approaches
Key/Value, Column, Document
18. Data Store of NoSQL DB
Key/Value store
(Key, Value)
Functions
– Index, versioning, sorting, locking, transaction,
replication
Apache Cassandra, Memcached
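As a sketch of the (key, value) model with one of the listed functions, versioning (an illustration only; the class and its methods are invented here and are not the API of Cassandra or Memcached):

```python
# Minimal key/value store with versioning: each put() appends a new
# (version, value) pair instead of overwriting the old one.

class KVStore:
    def __init__(self):
        self._data = {}                      # key -> list of (version, value)

    def put(self, key, value):
        versions = self._data.setdefault(key, [])
        versions.append((len(versions) + 1, value))

    def get(self, key, version=None):
        versions = self._data[key]
        if version is None:
            return versions[-1][1]           # latest value
        return versions[version - 1][1]      # a specific earlier version

store = KVStore()
store.put("user:42", {"name": "Joe"})
store.put("user:42", {"name": "Joe Smith"})  # new version; old one is kept
```

Real stores layer the other listed functions (indexing, sorting, replication) on top of the same simple key-to-value mapping.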
19. Data Store of NoSQL DB (Cont’d)
Column-Oriented Stores (Extensible Record
Stores)
stores data tables as sections of columns of data
– rather than as rows of data, like most RDBMS
• Sparse fields in RDBMS
– well-suited for OLAP-like workloads (e.g., data
warehouses)
Extensible records are horizontally and vertically
partitioned across nodes
– Rows and columns are distributed over multiple
nodes
BigTable, HBase, Cassandra, Hypertable
20. Data Store of NoSQL DB (Cont’d)
Row Oriented
– 1,Smith, Joe, smith@hi.com;
– 2,Jones, Mary, mary@hi.com;
– 3,Johnson, Cathy, cathy@hi.com;
Column Oriented
– 1,2,3;
– Smith, Jones, Johnson;
– Joe, Mary, Cathy;
– smith@hi.com, mary@hi.com, cathy@hi.com;
StudentId Lastname Firstname email
1 Smith Joe smith@hi.com
2 Jones Mary mary@hi.com
3 Johnson Cathy cathy@hi.com
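The contrast above can be made concrete in a few lines (a sketch using the slide's sample table):

```python
# The same student table in both layouts. A row store keeps whole records
# together; a column store keeps each attribute's values together.

rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]

# Column-oriented: one sequence per attribute.
columns = {
    "StudentId": [r[0] for r in rows],
    "Lastname":  [r[1] for r in rows],
    "Firstname": [r[2] for r in rows],
    "email":     [r[3] for r in rows],
}

# An OLAP-style scan of one attribute touches a single contiguous list
# instead of every record, which is why column stores suit warehouses.
lastnames = columns["Lastname"]
```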
21. HBase Schema Example (Student/Course)
RDBMS
Students: (id, name, sex, age)
Courses: (id, title, desc, teacher_id)
S_C: (s_id, c_id, type)
HBase
Students table, column families Info and Course:
– Row key <student_id>: Info:name, Info:sex, Info:age, Course:<course_id>=type
Courses table, column families Info and Student:
– Row key <course_id>: Info:title, Info:desc, Info:teacher_id, Student:<student_id>=type
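The student side of this schema can be sketched as an HBase-style map, with cells addressed by family:qualifier (the row key and sample values below are invented for illustration):

```python
# One student row in the HBase-style schema above: a single row key,
# cells addressed "family:qualifier". Sample values are made up.

students = {
    "s1": {
        "Info:name": "Joe Smith",
        "Info:sex": "M",
        "Info:age": 20,
        "Course:c101": "elective",   # Course:<course_id> = type
        "Course:c205": "required",
    }
}

# Enrollment is read off the qualifiers of the Course column family,
# so adding a course adds a cell rather than a join-table row.
enrolled = [cell.split(":")[1]
            for cell in students["s1"] if cell.startswith("Course:")]
```

This is how the S_C join table of the relational schema collapses into extra columns on each row.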
22. Data Store of NoSQL DB (Cont’d)
Document Store
Collections and Documents
– vs Tables and Records of RDB
Used in Search Engine/Repository
Multiple indexes over stored documents
– no fixed fields
Not a simple key-value lookup
– uses an API
Functions
– No locking, Replication, Transaction
MongoDB, CouchDB, ThruDB, SimpleDB
23. Understanding the Document Model [1]
{
  _id: "A4304",
  author: "nosh",
  date: "22/6/2010",
  title: "Intro to MongoDB",
  text: "MongoDB is an open source..",
  tags: ["webinar", "opensource"],
  comments: [{author: "mike",
              date: "11/18/2010",
              txt: "Did you see the…",
              votes: 7}, ….]
}
Documents->Collections->Databases
24. Document Model Makes Queries Simple [1]
Operators:
$gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit, skip, group
Example:
db.posts.find({author: "nosh", tags: "webinar"})
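A rough emulation of this matching in plain Python (an illustrative sketch, not MongoDB's engine; the `matches` helper and the sample `posts` are invented here, and only two of the listed operators are shown):

```python
# Mongo-style matching: plain equality, array containment
# ({"tags": "webinar"} matches a document whose tags array contains it),
# and the operator form for $gt and $in.

def matches(doc, query):
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):              # operator form, e.g. {"$gt": 3}
            for op, arg in cond.items():
                if op == "$gt" and not (value is not None and value > arg):
                    return False
                if op == "$in" and value not in arg:
                    return False
        elif isinstance(value, list):           # array containment
            if cond not in value:
                return False
        elif value != cond:
            return False
    return True

posts = [
    {"author": "nosh", "tags": ["webinar", "opensource"], "votes": 7},
    {"author": "mike", "tags": ["opensource"], "votes": 2},
]

# Equivalent of db.posts.find({author: "nosh", tags: "webinar"})
hits = [p for p in posts if matches(p, {"author": "nosh", "tags": "webinar"})]
```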
26. The Great Divide [1]
MongoDB sweet spot: easy, flexible, and scalable, in contrast with HBase
27. Solutions in Big Data Computation
Map/Reduce by Google
(Key, Value) parallel computing
Apache Hadoop
Big Data
Data Computation (MapReduce, Pig)
Integrating MapReduce and RDB
Oracle + Hadoop
Sybase IQ
Vertica + Hadoop
Hadoop DB
Greenplum
Aster Data
Integrating MapReduce and NoSQL DB
MongoDB MapReduce
HBase
28. Apache Hadoop
Motivated by Google's Map/Reduce and GFS
An open-source project of the Apache Foundation
A framework written in Java
– originally developed by Doug Cutting
• who named it after his son's toy elephant
Two core Components
Storage: HDFS
– High Bandwidth Clustered storage
Processing: Map/Reduce
– Fault Tolerant Distributed Processing
Hadoop scales linearly with
data size and
analysis complexity
29. Hadoop issues
Map/Reduce is not DB
Algorithm in Restricted Parallel Computing
HDFS and HBase
Cannot compete with the functions in RDBMS
But useful for
huge (peta- or terabyte) but non-complicated data
– Web crawling
– log analysis
• Log file for web companies
– New York Times case
30. MapReduce Pros & Cons Summary
Good when
Huge data for input, intermediate results, and output
Little synchronization is required
Read once; batch-oriented datasets (ETL)
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
31. MapReduce in Detail
Functions borrowed from functional programming languages (e.g., Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
32. Map
Convert input data to (key, value) pairs
map() functions run in parallel,
creating different intermediate (key, value)
values from different input data sets
33. Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.
34. Training in Big Data
Learn by yourself?
You would miss many important topics
Two main vendors:
– Cloudera, Hortonworks
• With hands-on exercises
A brief introduction to Cloudera's course material
Especially the MapReduce example
35. Example: Sort URLs in the largest hit order
Compute the most-frequently hit URLs
Stored in log files
Map()
Input: <logFilename, file text>
Output: parses the file and emits <url, hit counts> pairs
– e.g., <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map
nodes
Output: sums all values for the same key and emits
<url, TotalCount>
– e.g., <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
36. Map/Reduce for URL visits
(Dataflow, reconstructed from the slide's diagram)
Input log data is split across map tasks Map1() … Mapm(), each emitting per-URL counts:
(http://hi.com, 1), (http://hello.com, 3), …
(http://halo.com, 1), (http://hello.com, 5), …
Data Aggregation/Combine groups the values by key:
(http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>)
Reduce tasks Reduce1() … Reducel() sum each list:
(http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6)
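This dataflow can be replayed in memory with a few lines of Python (a sketch; real Hadoop distributes the map and reduce tasks across nodes, and the sample pairs below are only a subset of the diagram's):

```python
# In-memory walk-through of the URL-visit job: map emits (url, count)
# pairs, a shuffle groups values by key, reduce sums each group.
from collections import defaultdict

def map_fn(record):
    url, count = record          # per-file partial counts, as in the slide
    yield (url, count)

def reduce_fn(url, counts):
    return (url, sum(counts))

# Intermediate pairs as emitted by the map tasks in the diagram
mapped = [("http://hi.com", 1), ("http://hello.com", 3),
          ("http://halo.com", 1), ("http://hello.com", 5)]

# Shuffle/combine: group values by key. Reduce cannot start until
# every map task has finished (the bottleneck noted on slide 33).
grouped = defaultdict(list)
for record in mapped:
    for key, value in map_fn(record):
        grouped[key].append(value)

totals = dict(reduce_fn(k, v) for k, v in grouped.items())
# e.g. totals["http://hello.com"] == 8 here (3 + 5)
```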
37. Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
four-terabyte pile of images in TIFF format.
needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.
38. Legacy Example (Cont’d)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
a software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage System (S3)
• In less than 24 hours, he had 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
39. Supporters of Big Data: Hadoop Ecosystems
Apache Hadoop Supporters
Cloudera
– Like Red Hat is to Linux
– HiPIC is an Academic Partner
Hortonworks
– Pig,
– Consulting and training
Facebook
– Hive
IBM
– Jaql
NoSQL DB supporters
MongoDB
HBase, CouchDB, Apache Cassandra (originally by FB) etc
40. Pig
• developed at Yahoo Research around 2006
o moved into the Apache Software Foundation in
2007.
• PigLatin,
o Pig's language
o a data flow language
o well suited to processing unstructured data
Unlike SQL, it does not require that the data have a schema
However, it can still leverage the value of a schema
41. Hive
• developed at Facebook
o turns Hadoop into a data warehouse
o complete with a dialect of SQL for querying.
• HiveQL
o a declarative language (SQL dialect)
• Difference from PigLatin,
o you do not specify the data flow,
but instead describe the result you want
Hive figures out how to build a data flow to
achieve it.
o a schema is required,
but not limited to one schema.
o data can have many schemas
42. Hive (Cont'd)
• Similarity with PigLatin and SQL:
o HiveQL on its own is a relationally complete
language
but not a Turing-complete language
(one that can express any computation)
o it can be extended through UDFs (User-Defined
Functions) written in Java
just like Pig, to become Turing complete
43. Jaql
• developed at IBM.
• a data flow language
o its native data structure format is JSON (JavaScript
Object Notation).
• Schemas are optional
• Turing complete on its own
o without the need for extension through UDFs.
44. MapReduce Cons and Future
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
Hadoop 2.0: YARN
Not a product yet but will be soon
45. Hadoop 2.0: YARN
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other
commercial platforms
Tez – Generic framework to run a complex DAG
MPI: OpenMPI, MPICH2
Master-Worker
Machine Learning: Spark
Graph processing: Giraph
Enabled by allowing the use of paradigm-specific
application masters
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
46. Use Cases
Amazon AWS
Facebook
Twitter
Craigslist
HuffPost | AOL
47. Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB
48. Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching
grants from AWS
49. Facebook [7]
Using Apache HBase
For Titan and Puma
– Message Services
– ETL
HBase for FB
– Provide excellent write performance and good reads
– Nice features
• Scalable
• Fault Tolerance
• MapReduce
50. Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase
51. Puma: Facebook
ETL
Extract, Transform, Load
– Data Integrating from many data sources to Data Warehouse
Data analytics
– Domain owners’ web analytics for Ad and apps
• clicks, likes, shares, comments etc
ETL before Puma
8 – 24 hours
– Procedures: Scribe, HDFS, Hive, MySQL
ETL after Puma
Puma
– Real time MapReduce framework
2 – 30 secs
– Procedures: Scribe, HDFS, Puma, HBase
52. Twitter [8]
Three Challenges
Collecting data
– Scribe, as at Facebook
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time
53. Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly
54. HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
–Evaluate All New HuffPost User Comments
Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% Comments Every
Day
Article Classification
–Tag Articles for Advertising
• E.g.: scary, salacious, …
55. HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is
already available.
– There are Mallet, libsvm, Weka, … that support
necessary algorithms.
Bad news:
– Mahout doesn’t support necessary algorithms
yet.
– Other algorithms do not run natively on Hadoop.
Solution: build a flexible ML platform running on
Hadoop
– using Pig for the Hadoop implementation
56. Hadoop Streaming
Hadoop MapReduce for non-Java code: Python,
Ruby
Requirement
Running Hadoop
Needs Hadoop Streaming API
– hadoop-streaming.jar
Needs to build Mapper and Reducer codes
– Simple conversion from sequential codes
STDIN > mapper > reducer > STDOUT
57. Hadoop Streaming
MapReduce Python execution
http://wiki.apache.org/hadoop/HadoopStreaming
Syntax:
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the job jar file
Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
  -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
  -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
  -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
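As a sketch of what the shipped scripts might contain, here is a word-count mapper/reducer pair in the streaming style (the function names and the word-count task are illustrative, not from the slides; in the real `mapper.py`/`reducer.py` each function would iterate over `sys.stdin` and `print` its lines):

```python
# Hypothetical mapper.py / reducer.py bodies for Hadoop Streaming.
# Hadoop feeds each script lines on STDIN and collects tab-separated
# key/value pairs from STDOUT; mapper output is sorted by key before reduce.

def mapper(lines):
    # mapper.py: emit "<word>\t1" for every word seen
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # reducer.py: input arrives grouped by key, so a running total works
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

The sort between the two stages is exactly the STDIN > mapper > reducer > STDOUT pipeline described on the previous slide, with Hadoop supplying the sort.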
58. Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions, but Hadoop above all
Storage: NoSQL DB
Computation: Hadoop MapReduce
Need to analyze Big Data in mobile computing and SNS
for ads, user behavior, patterns …