I talk about the fundamentals of Big Data, covering Hadoop, data-intensive computing, and NoSQL databases, which have drawn attention as ways to compute and store Big Data, typically larger than a petabyte. I also
introduce case studies that use Hadoop and NoSQL databases.
Big Data Fundamentals in the Emerging New Data World
1. HiPIC
Big Data Fundamentals in the
Emerging New Data World
PIT (Product Innovation Team)
Samsung Electronics America
San Jose, CA
Aug 17th 2012
Jongwook Woo (PhD)
High-Performance Internet Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
Jongwook Woo
CSULA
2. HiPIC Contents
Fundamentals of Big Data
NoSQL DB: HBase, MongoDB
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases
3. HiPIC Experience in Big Data
Several publications regarding Hadoop and NoSQL
“Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA
2012, Las Vegas (July 16-19, 2012)
“Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon
Ho Kim, EDB 2011, Incheon, Aug. 25-27, 2011
“Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011,
Las Vegas (July 18-21, 2011)
Jongwook Woo, “Introduction to Cloud Computing”, in the
10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009
Talks in Korean Universities and companies
Yonsei, Sookmyung, KAIST, Korean Polytech Univ
– Winter 2011
VanillaBreeze
– Winter 2011
4. HiPIC Experience in Big Data (Cont’d)
Grants
Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Received Academic Education Partnership with Cloudera since
June 2012
Certificate
Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
Cloud Computing Blog
http://dal-cloudcomputing.blogspot.com/
5. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
HiPIC Cloud Computing
[Diagram: logos of Cloudera, AWS, Hortonworks, and NoSQL DB]
6. HiPIC Big Data
Too much data
Tera-byte (10^12), Peta-byte (10^15)
– Because of web
– Sensor Data, Bioinformatics, Social
Computing, smart phone, online game…
Cannot be handled with the legacy
approach
Too big
Un-/Semi-structured data
7. HiPIC Two Issues in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel Computing with multiple cheap
computers
– No need for supercomputers
8. HiPIC Contents
Fundamentals of Big Data
NoSQL DB: HBase, MongoDB
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases
9. HiPIC New Data Trend
Sparsity
Schema free data with sparse attributes
– Document Term vector
– User-Item matrix
– Semantic or social relations
No relational properties
– No complex join queries
• Log data
10. HiPIC New Data Trend (Cont’d)
Immutable
No need to update and delete data
– Only insert with versions
• Tracking history
• Lock-free (key-based atomicity)
11. HiPIC Big Data for RDBMS
Issues in RDBMS
Hard to scale
– Relation gets broken
• Partitioning for scalability
• Replication for availability
Speed
– The Seek times of physical storage
• Slower than N/W speed
• 1 TB disk at a 10 MB/s transfer rate
– 100K sec => 27.8 hrs
– With multiple data sources at different places
• 100 x 10 GB disks, each at a 10 MB/s transfer rate
– 1K sec => 16.7 min
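The transfer times above can be checked with back-of-the-envelope arithmetic, assuming a sustained read rate of 10 MB/s per disk:

```python
# Back-of-the-envelope check of the seek/transfer-time argument above,
# assuming a sustained rate of 10 MB/s per disk.
TB = 10**12          # bytes in a terabyte
MBps = 10**6         # bytes per second in 1 MB/s

# One 1 TB disk read sequentially at 10 MB/s:
one_disk_sec = (1 * TB) / (10 * MBps)        # 100,000 s
print(one_disk_sec / 3600)                   # ~27.8 hours

# One hundred 10 GB disks read in parallel, 10 MB/s each:
parallel_sec = (10 * 10**9) / (10 * MBps)    # 1,000 s
print(parallel_sec / 60)                     # ~16.7 minutes
```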
12. HiPIC Big Data for RDBMS (Cont’d)
Issues in RDBMS (Cont’d)
Data Integration
– Not good for un-/semi-structured data
• Many unstructured data
– Web or log data etc
RDB is not good at parallelization
– Cannot split 1,000 tasks across 1,000 cheap PCs
efficiently
13. HiPIC RDBMS Issues
Solution
Big Data
⇒Data Cleansing by Hadoop
⇒ Data Computation (MapReduce, Pig)
⇒ Data Repositories (NoSQL DB: HBase,
Cassandra, MongoDB)
⇒Business Intelligence (Data Mining,
OLAP, Data Visualization, Reporting):
Hive, Mahout
14. HiPIC NoSQL DBs
not primarily built on tables,
generally do not use SQL for data manipulation
non-relational, distributed data stores
– often do not provide ACID (atomicity, consistency, isolation,
durability)
• which are the key attributes of classic RDB
Fast indexing on large amounts of data
Lookup by keys (key/value)
NoSQL normally supports MapReduce
Parallel computation
15. HiPIC Use Cases for NoSQL DB [1]
RDBMS replacement
for high-traffic web applications
Semi-structured content management
Real-time analytics & high-speed logging
Web Infrastructure
Web 2.0, Media, SaaS, Gaming,
Finance, Telecom, Healthcare, Government
Three NoSQL DB Approaches
Key/Value, Column, Document
16. HiPIC Data Store of NoSQL DB
Key/Value store
(Key, Value)
Functions
– Index, versioning, sorting, locking, transaction,
replication
Apache Cassandra, Memcached
17. HiPIC Data Store of NoSQL DB (Cont’d)
Column-Oriented Stores (Extensible Record
Stores)
stores data tables as sections of columns of data
– rather than as rows of data, like most RDBMS
• Sparse fields in RDBMS
– well-suited for OLAP-like workloads (e.g., data
warehouses)
Extensible records are horizontally and vertically
partitioned across nodes
– Rows and Columns are distributed over multiple
nodes
BigTable, HBase, Cassandra, Hypertable
18. HiPIC Data Store of NoSQL DB (Cont’d)
StudentId Lastname Firstname email
1 Smith Joe smith@hi.com
2 Jones Mary mary@hi.com
3 Johnson Cathy cathy@hi.com
Row Oriented
– 1,Smith, Joe, smith@hi.com;
– 2,Jones, Mary, mary@hi.com;
– 3,Johnson, Cathy, cathy@hi.com;
Column Oriented
– 1,2,3;
– Smith, Jones, Johnson;
– Joe, Mary, Cathy;
– smith@hi.com, mary@hi.com, cathy@hi.com;
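The row-vs-column layouts above can be sketched in a few lines of plain Python (an illustrative example, not from the slides): the same student table held row-oriented and then pivoted into one sequence per attribute, aligned by position.

```python
# Row-oriented storage: one tuple per record.
rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]

# Column-oriented storage: one sequence per attribute, positionally aligned.
columns = {
    name: [r[i] for r in rows]
    for i, name in enumerate(["StudentId", "Lastname", "Firstname", "email"])
}

# Scanning a single attribute now touches only that column's data,
# which is why the layout suits OLAP-style workloads.
print(columns["Lastname"])   # ['Smith', 'Jones', 'Johnson']
```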
20. HiPIC Data Store of NoSQL DB (Cont’d)
Document Store
Collections and Documents
– vs Tables and Records of RDB
Used in search engines/repositories
Multiple indexes to store indexed documents
– no fixed fields
Not simple key-value lookup
– Use API
Functions
– No locking, Replication, Transaction
MongoDB, CouchDB, ThruDB, SimpleDB
21. HiPIC The Great Divide [1]
MongoDB
HBase
MongoDB sweet spot: Easy, Flexible, Scalable
22. HiPIC Understanding the Document Model [1]
{
  _id: "A4304",
  author: "nosh",
  date: "22/6/2010",
  title: "Intro to MongoDB",
  text: "MongoDB is an open source..",
  tags: ["webinar", "opensource"],
  comments: [{author: "mike",
              date: "11/18/2010",
              txt: "Did you see the…",
              votes: 7}, ….]
}
Documents->Collections->Databases
23. HiPIC Document Model Makes Queries Simple [1]
Operators:
$gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit,
skip, group
Example:
db.posts.find({author: “nosh”,
tags: “webinar”})
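What that `find` matches can be simulated over plain Python dicts (a hypothetical sketch; the real query runs in the mongo shell). The key point is that matching a scalar against an array field succeeds when the array contains that value:

```python
# Sketch of db.posts.find({author: "nosh", tags: "webinar"}) semantics,
# simulated over plain dicts. Field names and _ids are made up.
posts = [
    {"_id": "A4304", "author": "nosh", "tags": ["webinar", "opensource"]},
    {"_id": "B1001", "author": "mike", "tags": ["webinar"]},
]

def matches(doc, query):
    for field, want in query.items():
        have = doc.get(field)
        if isinstance(have, list):
            if want not in have:      # array-contains semantics
                return False
        elif have != want:            # exact scalar match
            return False
    return True

result = [d for d in posts if matches(d, {"author": "nosh", "tags": "webinar"})]
print([d["_id"] for d in result])    # ['A4304']
```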
25. HiPIC Contents
Fundamentals of Big Data
NoSQL DB: HBase, MongoDB
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases
26. HiPIC Data nowadays
• Data Issues
o data grows to 10TB, and then 100TB.
o Unstructured data coming from sources
like Facebook, Twitter, RFID readers, sensors,
and so on.
Need to derive information from both the
relational data and the unstructured data
• as soon as possible.
• Solution to efficiently compute Big
Data
o Hadoop Map/Reduce
27. HiPIC Solutions in Big Data Computation
Map/Reduce by Google
(Key, Value) parallel computing
Apache Hadoop
Big Data
⇒Data Computation (MapReduce, Pig)
Integrating MapReduce and RDB
Oracle + Hadoop
Sybase IQ
Vertica + Hadoop
HadoopDB
Greenplum
Aster Data
Integrating MapReduce and NoSQL DB
MongoDB MapReduce
HBase
28. HiPIC Apache Hadoop
Motivated by Google Map/Reduce and GFS
open source project of the Apache Foundation.
framework written in Java
– originally developed by Doug Cutting
• who named it after his son's toy elephant.
Two core Components
Storage: HDFS
– High Bandwidth Clustered storage
Processing: Map/Reduce
– Fault Tolerant Distributed Processing
Hadoop scales linearly with
data size
Analysis complexity
29. HiPIC Hadoop issues
Map/Reduce is not DB
Algorithm in Restricted Parallel Computing
HDFS and HBase
Cannot compete with the functions in RDBMS
But, useful for
Semi-structured data model and high-level dataflow query
language on top of MapReduce
– Pig, Hive, Jsql, Cascading, Cloudbase
Useful for huge (peta- or tera-byte) but non-complicated data
– Web crawling
– log analysis
• Log file for web companies
– New York Times case
30. HiPIC MapReduce Pros & Cons Summary
Good when
Huge data for input, intermediate, output
Little synchronization required
Read once; batch oriented datasets (ETL)
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
31. HiPIC MapReduce in Detail
Functions borrowed from functional
programming languages (e.g., Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
– Parallelization
– Fault Tolerance
– Data Distribution
– Load Balancing
32. HiPIC Map
Convert input data to (key, value) pairs
map() functions run in parallel,
creating different intermediate (key, value)
values from different input data sets
33. HiPIC Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.
34. HiPIC Example: Sort URLs in the largest hit order
Compute the largest hit URLs
Stored in log files
Map()
Input <logFilename, file text>
Output: Parses file and emits <url, hit counts> pairs
– eg. <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map
nodes
Output: Sums all values for the same key and emits
<url, TotalCount>
– eg.<http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
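The URL hit-count job above can be sketched in-process (in Hadoop this would be a Java Mapper/Reducer; here the same (key, value) flow is simulated with plain Python functions and an explicit shuffle, using a made-up log format where the URL is the first field of each line):

```python
from collections import defaultdict

def map_fn(filename, text):
    # Parse the log file and emit (url, 1) for every hit.
    for line in text.splitlines():
        url = line.split()[0]            # assumed: URL is the first field
        yield (url, 1)

def reduce_fn(url, counts):
    # Sum all partial counts for the same URL.
    return (url, sum(counts))

log = "http://hello.com a\nhttp://bye.com b\nhttp://hello.com c"

# Shuffle phase: group intermediate values by key.
grouped = defaultdict(list)
for key, value in map_fn("access.log", log):
    grouped[key].append(value)

totals = dict(reduce_fn(u, c) for u, c in grouped.items())
print(totals)   # {'http://hello.com': 2, 'http://bye.com': 1}
```

Sorting `totals` by value then gives the largest-hit ordering the slide asks for.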
36. HiPIC Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
four-terabyte pile of images in TIFF format.
needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.
37. HiPIC Legacy Example (Cont’d)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
a software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage System (S3)
• In less than 24 hours, 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
38. HiPIC Contents
Fundamentals of Big Data
NoSQL DB: HBase, MongoDB
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases
39. HiPIC
Supporters of Big Data
Apache Hadoop Supporters
Cloudera
– Like Linux and Red Hat
– HiPIC is an Academic Partner
Hortonworks
– Pig
Facebook
– Hive
IBM
– Jaql
NoSQL DB supporters
MongoDB
– HiPIC tries to collaborate
HBase, CouchDB, Apache Cassandra (originally by FB) etc
40. HiPIC Similarities in Pig, Hive, and Jaql
• translate high-level languages into MapReduce jobs
o the programmer can work at a higher level
than writing MapReduce jobs in Java or other
lower-level languages
• programs are much smaller than Java code.
• option to extend these languages,
o often by writing user-defined functions in Java.
• Interoperability
o programs written in these high-level languages can
be embedded inside other languages as well.
• share the same limitations as Hadoop
o no support for random reads and writes
o or low-latency queries.
41. HiPIC Pig
• developed at Yahoo Research around 2006
o moved into the Apache Software Foundation in
2007.
• PigLatin,
o Pig's language
o a data flow language
o well suited to processing unstructured data
Unlike SQL, PigLatin does not require that the data have a
schema
However, it can still leverage the value of a schema
42. HiPIC Hive
• developed at Facebook
o turns Hadoop into a data warehouse
o complete with a dialect of SQL for querying.
• HiveQL
o a declarative language (SQL dialect)
• Difference from PigLatin,
o you do not specify the data flow,
but instead describe the result you want
Hive figures out how to build a data flow to
achieve it.
o a schema is required,
but not limited to one schema.
o data can have many schemas
43. HiPIC Hive (Cont'd)
• Similarity with PigLatin and SQL,
o HiveQL on its own is a relationally complete
language
but not a Turing complete language,
i.e., it cannot express arbitrary computation
o can be extended through UDFs (User-Defined
Functions) in Java
just like Pig, to become Turing complete
44. HiPIC Jaql
• developed at IBM.
• a data flow language
o its native data structure format is JSON (JavaScript
Object Notation).
• Schemas are optional
• Turing complete on its own
o without the need for extension through UDFs.
46. HiPIC Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
– Pay for servers by the hour
– Pay for storage per gigabyte per month
– Pay for data transfer per gigabyte
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB
47. HiPIC Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching grants
from AWS
48. HiPIC Facebook [7]
Using Apache HBase
For Titan and Puma
HBase for FB
– Provide excellent write performance and good reads
– Nice features
• Scalable
• Fault Tolerance
• MapReduce
49. HiPIC Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase
50. HiPIC Puma: Facebook
ETL
Extract, Transform, Load
– Data Integrating from many data sources to Data Warehouse
Data analytics
– Domain owners’ web analytics for Ad and apps
• clicks, likes, shares, comments etc
ETL before Puma
8 – 24 hours
– Procedures: Scribe, HDFS, Hive, MySQL
ETL after Puma
Puma
– Real time MapReduce framework
2 – 30 secs
– Procedures: Scribe, HDFS, Puma, HBase
51. HiPIC Twitter [8]
Three Challenges
Collecting Data
– Scribe as FB
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time
52. HiPIC Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly
53. HiPIC
HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
– Evaluate All New HuffPost User Comments
Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% Comments Every
Day
Article Classification
– Tag Articles for Advertising
• E.g.: scary, salacious, …
54. HiPIC HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is
already available.
– There are Mallet, libsvm, Weka, … that support
necessary algorithms.
Bad news:
– Mahout doesn’t support necessary algorithms
yet.
– Other algorithms do not run natively on Hadoop.
build a flexible ML platform running on
Hadoop
Pig for Hadoop implementation.
55. HiPIC
MapReduce Example
Word Count in the previous slide
Shortest Path in the graph
Graph algorithms are well suited to M/R, especially BFS
– Spreading activation type of processing
Map:
– Input: a node n as a key, and (D, points-to) as its value
• D is the distance to the node from the start
• points-to is a list of nodes reachable from n
– Output: ∀p ∈ points-to, emit (p, D+1)
Reduce:
– Input: possible distances to a given p
– Output: selects the minimum one
• Perform multiple iterations
Iterative process for matrix, graph, network
– Apache HAMA needed?
• Iterative Process on Hadoop
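One iteration of the shortest-path Map/Reduce above can be sketched in-process (a hypothetical toy graph; each record is a node keyed to (distance-from-start, points-to), with unreached nodes at infinity):

```python
from collections import defaultdict

INF = float("inf")
graph = {
    "A": (0,   ["B", "C"]),   # start node, distance 0
    "B": (INF, ["D"]),
    "C": (INF, []),
    "D": (INF, []),
}

def map_fn(node, value):
    dist, points_to = value
    yield (node, dist)               # re-emit the node's own distance
    if dist != INF:
        for p in points_to:
            yield (p, dist + 1)      # candidate distance D+1 via this node

def reduce_fn(node, dists):
    return node, min(dists)          # select the minimum candidate

# Shuffle + reduce for a single iteration.
grouped = defaultdict(list)
for n, (d, adj) in graph.items():
    for k, v in map_fn(n, (d, adj)):
        grouped[k].append(v)

new_dist = dict(reduce_fn(n, ds) for n, ds in grouped.items())
print(new_dist)   # {'A': 0, 'B': 1, 'C': 1, 'D': inf}
```

Repeating the iteration propagates distances one hop further each pass, which is why the slide notes the need for multiple iterations (and why an iterative framework like Apache Hama can help).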
56. HiPIC
MapReduce Example (Cont’d)
Social N/W analysis
Recommend new friends (friend of a friend: FOAF)
Map
– In: (x, <friendsx>)
– Out: if (u, x) are friends
• (u, < friendsx / friendsu >)
– < friendsx / friendsu >: friends of x but not friends of u
– Otherwise
• nil
Reduce
– In: (u, < < friendsa / friendsu >, < friendsa / friendsu >, …>)
• Friends list of all users a, b, … who are friends of u
– Out: (u, < (X1 , N1 ), (X2 , N2 ), …>)
• Xm : FOAF of u
• Nm : Total number of occurrences in all FOAF lists
– To sort or rank the results
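A simplified in-process sketch of the FOAF job above, on a made-up friend graph. One deviation from the slide, for the sake of a self-contained example: the mapper sends each user's full friend list to each of their friends, and the "friends of x but not friends of u" filtering happens in the reducer, where u's own list is available.

```python
from collections import Counter, defaultdict

friends = {
    "u": {"a", "b"},
    "a": {"u", "c", "d"},
    "b": {"u", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def map_fn(x, friends_x):
    for u in friends_x:                  # (u, x) are friends
        yield (u, friends_x)             # send x's friend list to u

def reduce_fn(u, friend_lists):
    counts = Counter()
    for fl in friend_lists:
        for candidate in fl - friends[u] - {u}:
            counts[candidate] += 1       # occurrences across FOAF lists (N_m)
    return u, counts.most_common()       # ranked (X_m, N_m) pairs

grouped = defaultdict(list)
for x, fx in friends.items():
    for u, fl in map_fn(x, fx):
        grouped[u].append(fl)

recs = dict(reduce_fn(u, fls) for u, fls in grouped.items())
print(recs["u"])   # [('c', 2), ('d', 1)] — c is the strongest suggestion
```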
59. HiPIC Conclusion
Era of Big Data
Need to store and compute Big Data
Storage: NoSQL DB
Computation: Hadoop MapReduce
Need to analyze Big Data in mobile computing, SNS
for Ad, User Behavior, Patterns …
61. HiPIC References
1) Introduction to MongoDB, Nosh Petigara, Jan 11, 2011
2) Hadoop Fundamental I, Big Data University
3) “Large Scale Data Analysis with Map/Reduce”, Marin
Dimitrov, Feb 2010
4) “BFS & MapReduce”, Edward J Yoon
http://blog.udanax.org/2009/02/breadth-first-search-
mapreduce.html, Feb 26 2009
5) “Market Basket Analysis Algorithm with no-SQL DB HBase
and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang
Xu, Seon Ho Kim, The Third International Conference on
Emerging Databases (EDB 2011), Songdo Park Hotel,
Incheon, Korea, Aug. 25-27, 2011
62. HiPIC References
6) “Market Basket Analysis Algorithm with Map/Reduce of
Cloud Computing”, Jongwook Woo and Yuhang Xu, The
2011 international Conference on Parallel and Distributed
Processing Techniques and Applications (PDPTA 2011),Las
Vegas (July 18-21, 2011)
7) Building Realtime Big Data Services at Facebook with
Hadoop and Hbase, Jonathan Gray, Facebook, Nov 11, 2011,
Hadoop World NYC
8) Analyzing Big Data at Twitter, Kevin Weil, Web 2.0 Expo,
NYC, Sep 2010
9) Lessons Learned from Migrating 2+ Billion Documents at
Craigslist, Jeremy Zawodny, 2011
10) Machine Learning on Hadoop at Huffington Post | AOL, Thu
Kyaw and Sang Chul Song, Hadoop DC, Oct 4, 2011
63. HiPIC References
11) “MapReduce Debates and Schema-Free”, Woohyun Kim,
www.coordguru.com, http://blog.naver.com/wisereign, March
3 2010
12) “Large Scale Data Analysis with Map/Reduce”, Marin
Dimitrov, Feb 2010
13) “HBase Schema Design Case Studies”, Qingyan Liu, July 13
2009
Speaker notes
PigLatin can leverage a schema if you choose to supply one. Like SQL, PigLatin is relationally complete, which means it is at least as powerful as relational algebra. Turing completeness requires looping constructs, an infinite memory model, and conditional constructs. PigLatin is not Turing complete on its own, but is Turing complete when extended with User-Defined Functions in Java.