jwoo Woo
HiPIC
CSULA
Big Data and Data Intensive Computing:
Education and Training
Naver Labs
Bundang, Korea
Aug 30th 2013
Jongwook Woo (PhD)
High-Performance Internet Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
High Performance Internet Computing Center
jwoo Woo
CSULA
Contents
Introduction
 Data Issues
 Big Data
 Data-Intensive Computing: Hadoop
 Training in Big Data
 Big Data Supporters and Use Cases
Me
 Name: Jongwook Woo
 Occupation:
 Professor (rank: Associate Professor), California State University Los Angeles
– Capital City of Entertainment
 Experience:
 Professor since 2002: Computer Information Systems Dept, College of
Business and Economics
– www.calstatela.edu/faculty/jwoo5
 Consulting for many companies in Hollywood and elsewhere since 1998
– Mainly building eBusiness applications using J2EE middleware
– Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
 Interested in Hadoop and Big Data since around 2009
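Me
Experience (Cont'd):
Advising IglooSecurity as of summer 2013:
– Training on Hadoop and its ecosystem
– R&D on a system for fast data search over security-related
log files generated at 30GB – 100GB per day
• Using Hadoop, Solr, Java, Cloudera
Mid-September 2013: Samsung Advanced Institute of Technology
– 3-day training on Hadoop and its ecosystem planned
– Using Cloudera material in Korea as far as I know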
Experience in Big Data
 Grants
 Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
 Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
 Partnership
 Received Academic Education Partnership with Cloudera since
June 2012
 Linked with Hortonworks since May 2013
– Positive to provide partnership
Experience in Big Data
 Certificate
 Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
 Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
 Blog and Github for Hadoop and its ecosystems
 http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
 https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
 https://github.com/dalgual
Experience in Big Data
 Several publications regarding Hadoop and NoSQL
 “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA
2012, Las Vegas (July 16-19, 2012)
 “Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon
Ho Kim, EDB 2011, Incheon, Aug. 25-27, 2011
 “Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011,
Las Vegas (July 18-21, 2011)
 Jongwook Woo, “Introduction to Cloud Computing”, in the
10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009
 Talks in Korean Universities and companies
 Yonsei, Sookmyung, KAIST, Korean Polytech Univ
– Winter 2011
 VanillaBreeze
– Winter 2011
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
Data Issues
Large-Scale data
Tera-byte (10^12), Peta-byte (10^15)
– Because of web
– Sensor Data, Bioinformatics, Social Computing,
smart phone, online game…
Cannot handle with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Non-expensive
Two Cores in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel Computing with multiple non-
expensive computers
–Own super computers
Big Data for RDBMS
Issues in RDBMS
Hard to scale
– Relation gets broken
• Partitioning for scalability
• Replication for availability
Speed
– The Seek times of physical storage
• Slower than N/W speed
• 1TB disk: 10MB/s transfer rate
– 100K sec => 27.8 hrs
– With multiple data sources at different places
• 100 10GB disks: each 10MB/s transfer rate
– 1K sec => 16.7 min
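The transfer-time figures on this slide can be checked with a short calculation. This is only a sketch: it assumes the quoted rate means 10 megabytes per second and decimal units (1TB = 10^12 bytes), since that is the reading that reproduces the 100K sec and 1K sec figures.

```python
# Assumption: the transfer rate is 10 megabytes per second (decimal units).
MB = 10**6
GB = 10**9
TB = 10**12


def transfer_seconds(size_bytes, rate_bytes_per_sec):
    """Time to read size_bytes sequentially at the given rate."""
    return size_bytes / rate_bytes_per_sec


rate = 10 * MB

# One 1TB disk read from start to end.
single_disk = transfer_seconds(1 * TB, rate)

# The same 1TB spread over 100 disks of 10GB each, all read at the same time,
# so the elapsed time is the time for a single 10GB disk.
parallel_disks = transfer_seconds(10 * GB, rate)

print(f"1 x 1TB disk:     {single_disk:,.0f} s = {single_disk / 3600:.1f} hours")
print(f"100 x 10GB disks: {parallel_disks:,.0f} s = {parallel_disks / 60:.1f} minutes")
```

Splitting the data over 100 disks cuts the elapsed time by a factor of 100, which is the motivation for distributed storage in the slides that follow.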
Big Data for RDBMS (Cont’d)
Issues in RDBMS (Cont’d)
Data Integration
–Not good for un-/semi-structured data
• Many unstructured data
–Web or log data etc
RDB not good in parallelization
–Cannot split 1000 tasks to non-expensive
1000 PCs efficiently
RDBMS Issues
Solution
 Before: Data Warehouse
 Now and future: Big Data
Hadoop framework
Data Computation (MapReduce, Pig)
Data Repositories (NoSQL DB: HBase,
Cassandra, MongoDB)
Business Intelligence (Data Mining,
OLAP, Data Visualization, Reporting):
Hive, Mahout
Big Data
Definition
 Systems that support a non-expensive
platform to store and
compute large-scale,
un-/semi-structured data
Use Cases for NoSQL DB [1]
RDBMS replacement
for high-traffic web applications
Semi-structured content management
Real-time analytics & high-speed logging
Web Infrastructure
Web 2.0, Media, SaaS, Gaming,
Finance, Telecom, Healthcare, Government
Three NoSQL DB Approaches
Key/Value, Column, Document
Data Store of NoSQL DB
Key/Value store
(Key, Value)
Functions
– Index, versioning, sorting, locking, transaction,
replication
Apache Cassandra, Memcached
Data Store of NoSQL DB (Cont’d)
Column-Oriented Stores (Extensible Record
Stores)
stores data tables as sections of columns of data
– rather than as rows of data, like most RDBMS
• Sparse fields in RDBMS
– well-suited for OLAP-like workloads (e.g., data
warehouses)
Extensible record horizontally and vertically
partitioned across nodes
– Rows and Columns are distributed over multiple
nodes
BigTable, HBase, Cassandra, Hypertable
Data Store of NoSQL DB (Cont’d)
 Row Oriented
– 1,Smith, Joe, smith@hi.com;
– 2,Jones, Mary, mary@hi.com;
– 3,Johnson, Cathy, cathy@hi.com;
 Column Oriented
– 1,2,3;
– Smith, Jones, Johnson;
– Joe, Mary, Cathy;
– smith@hi.com, mary@hi.com, cathy@hi.com;
StudentId  Lastname  Firstname  email
1          Smith     Joe        smith@hi.com
2          Jones     Mary       mary@hi.com
3          Johnson   Cathy      cathy@hi.com
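The difference between the two layouts can be sketched in a few lines of Python. This is an in-memory illustration only, not how any particular column store lays out data on disk; the function names are made up for the example.

```python
# The three student records from the slide, stored row by row.
rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]
columns = ("StudentId", "Lastname", "Firstname", "email")


def to_column_oriented(rows, columns):
    """Transpose row tuples into one list of values per column."""
    return {name: list(values) for name, values in zip(columns, zip(*rows))}


def to_row_oriented(store, columns):
    """Rebuild the row tuples from the per-column lists."""
    return list(zip(*(store[name] for name in columns)))


column_store = to_column_oriented(rows, columns)

# A scan over one attribute touches a single list instead of every record.
print(column_store["Lastname"])
```

Reading one column without touching the others is what makes this layout attractive for OLAP-like workloads that aggregate a few attributes over many records.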
HBase Schema Example (Student/Course)
 RDBMS
 Students: (id, name, sex, age)
 Courses: (id, title, desc, teacher_id)
 S_C: (s_id, c_id, type)
 HBase
Student table
Row key: <student_id>
Column family Info: Info:name, Info:sex, Info:age
Column family Course: Course:<course_id> = type
Course table
Row key: <course_id>
Column family Info: Info:title, Info:desc, Info:teacher_id
Column family student: student:<student_id> = type
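The Student/Course schema above can be modeled with nested dictionaries to show how the join table S_C disappears. This is a plain-Python sketch, not the HBase client API; the sample values and the helper names `columns_in_family` and `enroll` are hypothetical.

```python
# table -> row key -> "family:qualifier" -> value (sample values are made up)
students = {
    "s1": {
        "Info:name": "Joe Smith",
        "Info:sex": "M",
        "Info:age": "21",
        "Course:c1": "elective",
    },
}
courses = {
    "c1": {
        "Info:title": "Big Data",
        "Info:desc": "Intro course",
        "Info:teacher_id": "t9",
        "student:s1": "elective",
    },
}


def columns_in_family(row, family):
    """Return {qualifier: value} for every column of one family in a row."""
    prefix = family + ":"
    return {col[len(prefix):]: val for col, val in row.items() if col.startswith(prefix)}


def enroll(student_id, course_id, enroll_type):
    """Record an enrollment on both sides, replacing a row in the S_C join table."""
    students.setdefault(student_id, {})["Course:" + course_id] = enroll_type
    courses.setdefault(course_id, {})["student:" + student_id] = enroll_type


enroll("s1", "c2", "required")
print(columns_in_family(students["s1"], "Course"))
```

Each new course becomes a new column qualifier in the student's row, so rows can have different numbers of columns without a schema change.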
Data Store of NoSQL DB (Cont’d)
Document Store
Collections and Documents
– vs Tables and Records of RDB
Used in Search Engine/Repository
Multiple index to store indexed document
– no fixed fields
Not simple key-value lookup
– Use API
Functions
– No locking, Replication, Transaction
MongoDB, CouchDB, ThruDB, SimpleDB
Understanding the Document Model [1]
{
_id: "A4304",
author: "nosh",
date: "22/6/2010",
title: "Intro to MongoDB",
text: "MongoDB is an open source..",
tags: ["webinar", "opensource"],
comments: [{author: "mike",
date: "11/18/2010",
txt: "Did you see the…",
votes: 7}, …]
}
Documents->Collections->Databases
Document Model Makes Queries Simple [1]
Operators:
$gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit,
skip, group
Example:
db.posts.find({author: "nosh",
tags: "webinar"})
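To illustrate what such a query means, here is a toy matcher in plain Python. It is not MongoDB's implementation and does not use a MongoDB driver; it only mimics equality matching (including "array contains") and a few of the operators listed above on scalar fields. The second document and the `views` field are hypothetical.

```python
import operator

COMPARISONS = {
    "$gt": operator.gt,
    "$lt": operator.lt,
    "$gte": operator.ge,
    "$lte": operator.le,
}


def field_matches(value, condition):
    """Check one field of a document against one query condition."""
    if isinstance(condition, dict):
        for op, operand in condition.items():
            if op in COMPARISONS:
                if value is None or not COMPARISONS[op](value, operand):
                    return False
            elif op == "$ne":
                if value == operand:
                    return False
            elif op == "$in":
                if value not in operand:
                    return False
            elif op == "$nin":
                if value in operand:
                    return False
            elif op == "$all":
                if not isinstance(value, list) or not all(x in value for x in operand):
                    return False
            else:
                raise ValueError(f"unsupported operator: {op}")
        return True
    if isinstance(value, list):
        # Equality against an array field matches if any element is equal.
        return condition in value
    return value == condition


def find(collection, query):
    """Return the documents for which every queried field matches."""
    return [
        doc for doc in collection
        if all(field_matches(doc.get(field), cond) for field, cond in query.items())
    ]


posts = [
    {"_id": "A4304", "author": "nosh", "tags": ["webinar", "opensource"], "views": 12},
    {"_id": "B0001", "author": "mike", "tags": ["opensource"], "views": 3},  # hypothetical
]

print(find(posts, {"author": "nosh", "tags": "webinar"}))
print(find(posts, {"views": {"$gt": 5}}))
```

The query itself is just a document, which is why no separate query language or join syntax is needed for simple lookups.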
Selected Users [1]
The Great Divide [1]
MongoDB sweet spot: Easy, Flexible,
Scalable
HBase
MongoDB
Solutions in Big Data Computation
 Map/Reduce by Google
(Key, Value) parallel computing
 Apache Hadoop
 Big Data
Data Computation (MapReduce, Pig)
 Integrating MapReduce and RDB
Oracle + Hadoop
Sybase IQ
Vertica + Hadoop
Hadoop DB
Greenplum
Aster Data
 Integrating MapReduce and NoSQL DB
MongoDB MapReduce
HBase
Apache Hadoop
 Motivated by Google Map/Reduce and GFS
 open source project of the Apache Foundation.
 framework written in Java
– originally developed by Doug Cutting
• who named it after his son's toy elephant.
 Two core Components
 Storage: HDFS
– High Bandwidth Clustered storage
 Processing: Map/Reduce
– Fault Tolerant Distributed Processing
 Hadoop scales linearly with
 data size
 Analysis complexity
Hadoop issues
Map/Reduce is not DB
Algorithm in Restricted Parallel Computing
HDFS and HBase
Cannot compete with the functions in RDBMS
But useful for
Huge (peta- or terabytes) but non-complicated data
– Web crawling
– log analysis
• Log file for web companies
– New York Times case
MapReduce Pros & Cons Summary
Good when
Huge data for input, intermediate, output
Little synchronization required
Read once; batch oriented datasets (ETL)
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
MapReduce in Detail
Functions borrowed from functional
programming languages (e.g., Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
Map
Convert input data to (key, value) pairs
map() functions run in parallel,
 creating different intermediate (key, value)
pairs from different input data sets
Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.
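The map, group, and reduce steps described on these two slides can be sketched as a single-process Python function. This is only an illustration of the programming model: real Hadoop runs the map and reduce calls on many nodes, and the names `run_mapreduce`, `map_word_lengths`, and `reduce_sum` are made up for this example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, then group by key, then reduce, all in one process."""
    # Map phase: every record is turned into zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Grouping happens only after every map call has finished,
    # which is the bottleneck noted on the Reduce slide.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_word_lengths(line):
    """Emit (word length, 1) for each word in a line of text."""
    return [(len(word), 1) for word in line.split()]


def reduce_sum(key, values):
    """Combine all intermediate values of one key into a single total."""
    return sum(values)


lines = ["big data on hadoop", "map and reduce"]
histogram = run_mapreduce(lines, map_word_lengths, reduce_sum)
print(histogram)
```

The user writes only the two small functions; the grouping, distribution, and fault handling are what the framework supplies.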
Training in Big Data
 Learn by yourself?
Miss many important topics
Two main:
–Cloudera, Hortonworks
• With hands-on exercises
Brief introduction to Cloudera course material
Especially MapReduce example
Example: Sort URLs in the largest hit order
Compute the largest hit URLs
Stored in log files
Map()
Input: <logFilename, file text>
Output: Parses file and emits <url, hit counts> pairs
– e.g., <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map
nodes
Output: Sums all values for the same key and emits
<url, TotalCount>
– e.g., <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
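A small Python sketch of this example follows. It assumes a hypothetical log format in which the URL is the first whitespace-separated token of each line, and it runs in one process with a sort standing in for Hadoop's shuffle between the map and reduce steps.

```python
from itertools import groupby
from operator import itemgetter


def map_log(filename, text):
    """Map(): emit (url, 1) for each log line.

    Assumption: the URL is the first token on the line.
    """
    for line in text.splitlines():
        fields = line.split()
        if fields:
            yield fields[0], 1


def reduce_hits(url, counts):
    """Reduce(): sum all hit counts for one URL."""
    return url, sum(counts)


def top_urls(log_files):
    """Return (url, total) pairs, largest hit count first."""
    pairs = [pair for name, text in log_files.items() for pair in map_log(name, text)]
    # Sorting by key stands in for the shuffle step between map and reduce.
    pairs.sort(key=itemgetter(0))
    totals = [
        reduce_hits(url, (count for _, count in group))
        for url, group in groupby(pairs, key=itemgetter(0))
    ]
    # Final ordering: most hits first, ties broken alphabetically.
    return sorted(totals, key=lambda item: (-item[1], item[0]))


logs = {
    "a.log": "http://hello.com\nhttp://hi.com\nhttp://hello.com\n",
    "b.log": "http://hello.com\nhttp://halo.com\n",
}
print(top_urls(logs))
```

In a real cluster the final ordering by hit count would typically be a second job or a post-processing step, since each reducer only sees its own keys.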
Map/Reduce for URL visits
Input Log Data
Map1() Map2() … Mapm()
(http://hi.com, 1), (http://hello.com, 3), …
(http://halo.com, 1), (http://hello.com, 5), …
Data Aggregation/Combine
(http://hi.com, <1, 1, …, 1>)
(http://hello.com, <3, 5, 2, 7>)
(http://halo.com, <1, 5>)
Reduce1() Reduce2() … Reducel()
(http://hi.com, 32)
(http://hello.com, 17)
(http://halo.com, 6)
Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
four-terabyte pile of images in TIFF format.
needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.
Legacy Example (Cont’d)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
a software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage Service (S3)
• In less than 24 hours, 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
 The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
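The cost figure can be reproduced directly from the numbers on the slide. The per-computer throughput at the end is a derived estimate, assuming the work was spread evenly over all 100 computers for the full 24 hours.

```python
# Figures taken from the slide; prices are kept in cents to avoid float rounding.
price_cents_per_computer_hour = 10
computers = 100
hours = 24
pdf_count = 11_000_000

computer_hours = computers * hours
total_cents = price_cents_per_computer_hour * computer_hours

# Derived estimate, assuming an even spread of work over all computer-hours.
pdfs_per_computer_hour = pdf_count / computer_hours

print(f"Total cost: ${total_cents / 100:.2f} for {computer_hours} computer-hours")
print(f"About {pdfs_per_computer_hour:,.0f} PDFs per computer-hour")
```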
Supporters of Big Data: Hadoop Ecosystems
 Apache Hadoop Supporters
 Cloudera
– Like Linux and Redhat
– HiPIC is an Academic Partner
 Hortonworks
– Pig,
– Consulting and training
 Facebook
– Hive
 IBM
– Jaql
 NoSQL DB supporters
 MongoDB
 HBase, CouchDB, Apache Cassandra (originally by FB) etc
Pig
• developed at Yahoo Research around 2006
o moved into the Apache Software Foundation in
2007.
• PigLatin,
o Pig's language
o a data flow language
o well suited to processing unstructured data
 Unlike SQL, does not require that the data have a
schema
 However, can still leverage the value of a schema
Hive
• developed at Facebook
o turns Hadoop into a data warehouse
o complete with a dialect of SQL for querying.
• HiveQL
o a declarative language (SQL dialect)
• Difference from PigLatin,
o you do not specify the data flow,
 but instead describe the result you want
 Hive figures out how to build a data flow to
achieve it.
o a schema is required,
 but not limited to one schema.
o data can have many schemas
Hive (Cont'd)
• Similarity with PigLatin and SQL,
o HiveQL on its own is a relationally complete
language
 but not a Turing-complete language
(one that can express any computation)
o can be extended through UDFs (User Defined
Functions) written in Java
 to become Turing complete, just like Pig
Jaql
• developed at IBM.
• a data flow language
o its native data structure format is JSON (JavaScript
Object Notation).
• Schemas are optional
• Turing complete on its own
o without the need for extension through UDFs.
MapReduce Cons and Future
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
Hadoop 2.0: YARN
Not a product yet but will be soon
Hadoop 2.0: YARN
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other
commercial platforms
Tez – Generic framework to run a complex DAG
 MPI: OpenMPI, MPICH2
 Master-Worker
 Machine Learning: Spark
 Graph processing: Giraph
 Enabled by allowing the use of paradigm-specific
application master
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
Use Cases
Amazon AWS
Facebook
Twitter
Craigslist
HuffPost | AOL
Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB
Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching
grants from AWS
Facebook [7]
Using Apache HBase
 For Titan and Puma
– Message Services
– ETL
 HBase for FB
– Provide excellent write performance and good reads
– Nice features
• Scalable
• Fault Tolerance
• MapReduce
Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase
Puma: Facebook
 ETL
 Extract, Transform, Load
– Integrating data from many data sources into a Data Warehouse
 Data analytics
– Domain owners’ web analytics for Ad and apps
• clicks, likes, shares, comments etc
 ETL before Puma
 8 – 24 hours
– Procedures: Scribe, HDFS, Hive, MySQL
 ETL after Puma
 Puma
– Real time MapReduce framework
 2 – 30 secs
– Procedures: Scribe, HDFS, Puma, HBase
Twitter [8]
Three Challenges
Collecting Data
– Scribe, as at FB
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time
Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly
HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
–Evaluate All New HuffPost User Comments
Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% Comments Every
Day
Article Classification
–Tag Articles for Advertising
• E.g.: scary, salacious, …
HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is
already available.
– There are Mallet, libsvm, Weka, … that support
necessary algorithms.
Bad news:
– Mahout doesn’t support necessary algorithms
yet.
– Other algorithms do not run natively on Hadoop.
build a flexible ML platform running on
Hadoop
Pig for Hadoop implementation.
Hadoop Streaming
 Hadoop MapReduce for non-Java code: Python,
Ruby
 Requirement
 Running Hadoop
 Needs Hadoop Streaming API
– hadoop-streaming.jar
 Needs to build Mapper and Reducer code
– Simple conversion from sequential code
 STDIN > mapper > reducer > STDOUT
Hadoop Streaming
 MapReduce Python execution
 http://wiki.apache.org/hadoop/HadoopStreaming
 Syntax
$HADOOP_HOME/bin/hadoop jar
$HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar
[options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
 Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
-file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
-file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
-input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
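The slides do not show the contents of mapper.py and reducer.py. A minimal word-count sketch of what such scripts might contain is given below, written as two functions so it can be tried locally without a cluster. It assumes the usual Streaming convention of tab-separated key and value on each line; in a real job each function would sit in its own script, read from sys.stdin, and print its output lines.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a mapper script would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key.

    Relies on the input being sorted by key, which Hadoop does
    between the map and reduce steps.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate_streaming(lines):
    """Local stand-in for: STDIN > mapper > sort > reducer > STDOUT."""
    return list(reducer(sorted(mapper(lines))))


sample = ["To be or not to be", "that is the question"]
result = simulate_streaming(sample)
for line in result:
    print(line)
```

The same pipeline can be approximated on a shell with `cat input | mapper | sort | reducer`, which is a common way to test Streaming scripts before submitting a job.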
Conclusion
 Era of Big Data
 Need to store and compute Big Data
 Many solutions but Hadoop
 Storage: NoSQL DB
 Computation: Hadoop MapReduce
 Need to analyze Big Data in mobile computing, SNS
for Ad, User Behavior, Patterns …
Question?

Weitere ähnliche Inhalte

Was ist angesagt?

Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big dataSteven Francia
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data scienceDeepak Singh
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Jonathan Seidman
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Jonathan Seidman
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
A unified data modeler in the world of big data
A unified data modeler in the world of big dataA unified data modeler in the world of big data
A unified data modeler in the world of big dataWilliam Luk
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on ReadKent Graziano
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandAndrew Brust
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Jonathan Seidman
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 

Was ist angesagt? (20)

Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big data
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data science
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
A unified data modeler in the world of big data
A unified data modeler in the world of big dataA unified data modeler in the world of big data
A unified data modeler in the world of big data
 
Steve Watt Presentation
Steve Watt PresentationSteve Watt Presentation
Steve Watt Presentation
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on Read
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache Hadoop
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-Land
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 

Ähnlich wie Big Data and Data Intensive Computing: Education and Training

Introduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryIntroduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryJongwook Woo
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Jongwook Woo
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open PlatformJongwook Woo
 
Big Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using SparkBig Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using SparkJongwook Woo
 
Big Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingBig Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingJongwook Woo
 
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsIntroduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsJongwook Woo
 
Big Data Fundamentals in the Emerging New Data World
Big Data Fundamentals in the Emerging New Data WorldBig Data Fundamentals in the Emerging New Data World
Big Data and Data Intensive Computing: Education and Training

  • 1. Big Data and Data Intensive Computing: Education and Training
      Naver Labs, Bundang, Korea, Aug 30th 2013
      Jongwook Woo (PhD)
      High-Performance Internet Computing Center (HiPIC), Educational Partner with Cloudera and Grants Awardee of Amazon AWS
      Computer Information Systems Department, California State University, Los Angeles
  • 2. Contents
      – Introduction
      – Data Issues
      – Big Data
      – Data-Intensive Computing: Hadoop
      – Training in Big Data
      – Big Data Supporters and Use Cases
  • 3. Me
      – Name: Jongwook Woo (우종욱)
      – Occupation: Professor (rank: Associate Professor), California State University Los Angeles, in the "Capital City of Entertainment"
      – Career:
          • Professor since 2002: Computer Information Systems Dept, College of Business and Economics (www.calstatela.edu/faculty/jwoo5)
          • Consulting since 1998 for many companies in Hollywood and elsewhere: mainly building eBusiness applications with J2EE middleware; information extraction and information integration with the FAST, Lucene/Solr, and Sphinx search engines; clients include Warner Bros (Matrix online game), E!, citysearch.com, and ARM
          • Interested in Hadoop and Big Data since around 2009
  • 4. Me
      – Career (cont'd):
          • Summer 2013, currently advising IglooSecurity: training on Hadoop and its ecosystem; R&D on a system for fast search over security-related log files generated at 30 GB to 100 GB per day (using Hadoop, Solr, Java, Cloudera)
          • Mid-September 2013, Samsung Advanced Institute of Technology: a 3-day training on Hadoop and its ecosystem is planned
          • Using Cloudera material in Korea as far as I know
  • 5. Experience in Big Data
      – Grants
          • Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
          • Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
      – Partnership
          • Academic Education Partnership with Cloudera since June 2012
          • Linked with Hortonworks since May 2013; Hortonworks is positive about providing a partnership
  • 6. Experience in Big Data
      – Certificate
          • Certificate of Achievement in the Big Data University Training Course, "Hadoop Fundamentals I" (July 8 2012)
          • Certificate of 10gen Training Course, "M101: MongoDB Development" (Dec 24 2012)
      – Blog and GitHub for Hadoop and its ecosystems
          • http://dal-cloudcomputing.blogspot.com/ : Hadoop, AWS, Cloudera
          • https://github.com/hipic : Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
          • https://github.com/dalgual
  • 7. Experience in Big Data
      – Several publications regarding Hadoop and NoSQL
          • "Apriori-Map/Reduce Algorithm", Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)
          • "Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop", Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
          • "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)
          • Jongwook Woo, "Introduction to Cloud Computing", in the 10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009
      – Talks in Korean universities and companies
          • Yonsei, Sookmyung, KAIST, Korean Polytech Univ (Winter 2011)
          • VanillaBreeze (Winter 2011)
  • 8. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  • 9. Data
      – Google: "We don't have a better algorithm than others but we have more data than others"
  • 10. New Data Trend
      – Sparsity
      – Unstructured
          • Schema-free data with sparse attributes, such as semantic or social relations
          • No relational properties and no complex join queries, such as log data
      – Immutable
          • No need to update or delete data
  • 11. Data Issues
      – Large-scale data
          • Terabytes (10^12 bytes), petabytes (10^15 bytes)
          • Because of the web
          • Sensor data, bioinformatics, social computing, smartphones, online games, …
      – Cannot be handled with the legacy approach
          • Too big
          • Un-/semi-structured data
          • Too expensive
      – Need new systems
          • Inexpensive
  • 12. Two Cores in Big Data
      – How to store Big Data
          • NoSQL DB
      – How to compute Big Data
          • Parallel computing with multiple inexpensive computers
          • Own supercomputers
  • 13. Big Data for RDBMS
      – Issues in RDBMS
          • Hard to scale: relations get broken by partitioning (for scalability) and replication (for availability)
          • Speed: the seek times of physical storage are slower than network speed
          • One 1 TB disk at a 10 MB/s transfer rate: 100K sec => 27.8 hrs
          • With multiple data sources at different places, 100 disks of 10 GB, each at a 10 MB/s transfer rate: 1K sec => 16.7 min
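The two read times on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the slide's transfer rate means 10 megabytes per second (the reading that reproduces its 100K sec and 1K sec figures) and decimal units (1 TB = 10^12 bytes); the function and constant names are illustrative only.

```python
# Assumption: the slide's rate is read as 10 megabytes per second, which is
# the value that reproduces its 100K sec and 1K sec figures.
TRANSFER_RATE_BYTES_PER_SEC = 10 * 10**6


def read_time_seconds(bytes_per_disk, rate=TRANSFER_RATE_BYTES_PER_SEC):
    """Time to scan one disk sequentially; disks read in parallel finish together."""
    return bytes_per_disk / rate


single_disk = read_time_seconds(10**12)        # one 1 TB disk
hundred_disks = read_time_seconds(10 * 10**9)  # 100 disks of 10 GB, read in parallel

print(f"1 x 1 TB disk:     {single_disk:,.0f} s = {single_disk / 3600:.1f} h")
print(f"100 x 10 GB disks: {hundred_disks:,.0f} s = {hundred_disks / 60:.1f} min")
```

The same terabyte is read 100 times faster simply because each disk only has to deliver one hundredth of the data, which is the motivation for spreading data across many inexpensive machines.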
  • 14. Big Data for RDBMS (Cont'd)
      – Issues in RDBMS (Cont'd)
          • Data integration: not good for un-/semi-structured data, and much data is unstructured (web data, log data, etc.)
          • RDB is not good at parallelization: it cannot efficiently split 1000 tasks across 1000 inexpensive PCs
  • 15. RDBMS Issues Solution
      – Before: Data Warehouse
      – Now and future: Big Data
          • Hadoop framework
          • Data computation (MapReduce, Pig)
          • Data repositories (NoSQL DB: HBase, Cassandra, MongoDB)
          • Business intelligence (data mining, OLAP, data visualization, reporting): Hive, Mahout
  • 16. Big Data Definition
      – A system that supports an inexpensive platform to store and compute large-scale, un-/semi-structured data
  • 17. Use Cases for NoSQL DB [1]
      – RDBMS replacement for high-traffic web applications
      – Semi-structured content management
      – Real-time analytics and high-speed logging
      – Web infrastructure: Web 2.0, media, SaaS, gaming, finance, telecom, healthcare, government
      – Three NoSQL DB approaches: key/value, column, document
  • 18. Data Store of NoSQL DB
      – Key/value store
          • (Key, Value)
          • Functions: index, versioning, sorting, locking, transaction, replication
          • Apache Cassandra, Memcached
  • 19. Data Store of NoSQL DB (Cont'd)
      – Column-oriented stores (extensible record stores)
          • Store data tables as sections of columns of data rather than as rows of data, as most RDBMS do
          • Handle fields that would be sparse in an RDBMS
          • Well suited for OLAP-like workloads (e.g., data warehouses)
      – Extensible records are horizontally and vertically partitioned across nodes
          • Rows and columns are distributed over multiple nodes
      – BigTable, HBase, Cassandra, Hypertable
  • 20. Data Store of NoSQL DB (Cont'd)
      StudentId | Lastname | Firstname | email
      1         | Smith    | Joe       | smith@hi.com
      2         | Jones    | Mary      | mary@hi.com
      3         | Johnson  | Cathy     | cathy@hi.com
      – Row oriented: 1, Smith, Joe, smith@hi.com; 2, Jones, Mary, mary@hi.com; 3, Johnson, Cathy, cathy@hi.com;
      – Column oriented: 1, 2, 3; Smith, Jones, Johnson; Joe, Mary, Cathy; smith@hi.com, mary@hi.com, cathy@hi.com;
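The row and column layouts above hold the same data in a different order. Here is a small sketch using the slide's student table; the variable names are illustrative, and the in-memory lists only stand in for what a real column store does on disk.

```python
# Row-oriented layout: one tuple per student, as in the slide's table.
rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]
fields = ("StudentId", "Lastname", "Firstname", "email")

# Column-oriented layout: one contiguous sequence per field.
columns = {name: [row[i] for row in rows] for i, name in enumerate(fields)}

# An analytic scan over one field touches only that column.
lastnames = columns["Lastname"]

# Rebuilding a full record means reading the same position in every column.
record_2 = tuple(columns[name][1] for name in fields)

print(lastnames)
print(record_2)
```

Scanning a single attribute is cheap in the column layout, while fetching a whole record is cheap in the row layout, which is why column stores suit OLAP-like workloads.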
  • 21. HBase Schema Example (Student/Course)
      – RDBMS
          • Students: (id, name, sex, age)
          • Courses: (id, title, desc, teacher_id)
          • S_C: (s_id, c_id, type)
      – HBase, student table
          • Row id: <student_id>
          • Column family Info: Info:name, Info:sex, Info:age
          • Column family Course: Course:<course_id> = type
      – HBase, course table
          • Row id: <course_id>
          • Column family Info: Info:title, Info:desc, Info:teacher_id
          • Column family student: student:<student_id> = type
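One way to picture this schema is as nested maps from row key to "family:qualifier" cells. The sketch below is a hypothetical in-memory model, not HBase client code; the sample values and the `family` helper are made up for illustration.

```python
# Hypothetical in-memory model: row key -> {"family:qualifier": value}.
# All sample values below are made up for illustration.
students = {
    "s1": {
        "info:name": "Joe", "info:sex": "M", "info:age": "21",
        "course:c1": "elective", "course:c2": "required",
    },
}
courses = {
    "c1": {"info:title": "Databases", "info:teacher_id": "t9", "student:s1": "elective"},
    "c2": {"info:title": "Networks", "info:teacher_id": "t4", "student:s1": "required"},
}


def family(row, name):
    """Return {qualifier: value} for one column family of a row."""
    prefix = name + ":"
    return {key[len(prefix):]: value for key, value in row.items() if key.startswith(prefix)}


enrolled = family(students["s1"], "course")  # courses taken by student s1
roster = family(courses["c1"], "student")    # students enrolled in course c1

print(enrolled)
print(roster)
```

The S_C join table of the relational design disappears: the relationship is stored twice, once per direction, as dynamic column qualifiers, so either lookup is a single row read instead of a join.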
  • 22. Data Store of NoSQL DB (Cont'd)
      – Document store
          • Collections and documents, vs. tables and records in an RDB
          • Used in search engines and repositories
          • Multiple indexes to store indexed documents, with no fixed fields
          • Not a simple key-value lookup: use the API
          • Functions: no locking, replication, transaction
          • MongoDB, CouchDB, ThruDB, SimpleDB
  • 23. Understanding the Document Model [1]
      {
        _id: "A4304",
        author: "nosh",
        date: 22/6/2010,
        title: "Intro to MongoDB",
        text: "MongoDB is an open source..",
        tags: ["webinar", "opensource"],
        comments: [{author: "mike", date: 11/18/2010, txt: "Did you see the…", votes: 7}, ….]
      }
      Documents -> Collections -> Databases
  • 24. Document Model Makes Queries Simple [1]
      – Operators: $gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit, skip, group
      – Example: db.posts.find({author: "nosh", tags: "webinar"})
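To show what the example query means, here is a toy pure-Python emulation of the matching rule. It is not MongoDB or its driver: `find`, `field_matches`, and the second sample document are assumptions for illustration, and only a handful of the operators listed above are sketched. The one behaviour taken from MongoDB itself is that an equality condition on an array field matches when any element equals the value.

```python
# Sample documents; the first follows the slide, the second is made up.
posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"]},
    {"_id": "B0001", "author": "mike", "title": "Another post",
     "tags": ["opensource"]},
]

# A few operators, emulated on scalar fields only.
OPERATORS = {
    "$gt": lambda field, arg: field > arg,
    "$lt": lambda field, arg: field < arg,
    "$ne": lambda field, arg: field != arg,
    "$in": lambda field, arg: field in arg,
}


def field_matches(value, condition):
    if isinstance(condition, dict):   # operator form, e.g. {"$ne": "nosh"}
        return all(OPERATORS[op](value, arg) for op, arg in condition.items())
    if isinstance(value, list):       # array field: any element may match
        return condition in value
    return value == condition


def find(collection, query):
    """Return documents for which every key in the query matches."""
    return [doc for doc in collection
            if all(key in doc and field_matches(doc[key], cond)
                   for key, cond in query.items())]


print(find(posts, {"author": "nosh", "tags": "webinar"}))
```

The query is just a document-shaped filter, so no join or schema knowledge is needed to ask for posts by an author that carry a given tag.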
  • 25. Selected Users [1]
  • 26. The Great Divide [1]
      – MongoDB sweet spot: easy, flexible, scalable
      – (Figure labels: HBase, MongoDB)
  • 27. Solutions in Big Data Computation
      – Map/Reduce by Google: (key, value) parallel computing
      – Apache Hadoop
      – Big Data: data computation (MapReduce, Pig)
      – Integrating MapReduce and RDB: Oracle + Hadoop, Sybase IQ, Vertica + Hadoop, Hadoop DB, Greenplum, Aster Data
      – Integrating MapReduce and NoSQL DB: MongoDB MapReduce, HBase
  • 28. Apache Hadoop
      – Motivated by Google Map/Reduce and GFS
      – Open source project of the Apache Foundation
      – Framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant
      – Two core components
          • Storage: HDFS (high-bandwidth clustered storage)
          • Processing: Map/Reduce (fault-tolerant distributed processing)
      – Hadoop scales linearly with
          • Data size
          • Analysis complexity
  • 29. Hadoop issues
      – Map/Reduce is not a DB
          • It is an algorithm in restricted parallel computing
      – HDFS and HBase
          • Cannot compete with the functions of an RDBMS
      – But useful for huge (peta- or terabytes) but uncomplicated data
          • Web crawling
          • Log analysis: log files for web companies
          • New York Times case
  • 30. MapReduce Pros & Cons Summary
      – Good when
          • Huge data for input, intermediate, and output
          • Little synchronization is required
          • Read once; batch-oriented datasets (ETL)
      – Bad for
          • Fast response time
          • Large amounts of shared data
          • Fine-grained synchronization needed
          • CPU-intensive rather than data-intensive work
          • Continuous input streams
  • 31. MapReduce in Detail
      – Functions borrowed from functional programming languages (e.g., Lisp)
      – Provides a restricted parallel programming model on Hadoop
          • User implements Map() and Reduce()
          • Libraries (Hadoop) take care of EVERYTHING else: parallelization, fault tolerance, data distribution, load balancing
  • 32. Map
      – Converts input data to (key, value) pairs
      – map() functions run in parallel, creating different intermediate (key, value) values from different input data sets
  • 33. Reduce
      – reduce() combines the intermediate values into one or more final values for the same key
      – reduce() functions also run in parallel, each working on a different output key
      – Bottleneck: the reduce phase can't start until the map phase is completely finished
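The Map and Reduce slides can be illustrated with a small in-memory sketch. This is not Hadoop code: `run_mapreduce`, `word_map`, and `word_reduce` are hypothetical names chosen for illustration, and a thread pool stands in for the cluster. It shows the three points from the slides: map tasks run in parallel on different inputs, reduce tasks run in parallel on different keys, and the reduce phase waits until every map task has finished.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(inputs, map_fn, reduce_fn, workers=4):
    """Toy single-machine MapReduce; a thread pool stands in for the cluster."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: each input split is processed independently.
        # list() waits for every map task, which is the barrier before reduce.
        mapped = list(pool.map(map_fn, inputs))

        # Shuffle: group all intermediate values by key.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # Reduce phase: one call per key, also run in parallel.
        keys = sorted(groups)
        results = list(pool.map(lambda k: reduce_fn(k, groups[k]), keys))
    return dict(results)


def word_map(line):
    """Emit (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def word_reduce(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)


counts = run_mapreduce(["big data", "Big Data computing", "data"],
                       word_map, word_reduce)
print(counts)
```

In real Hadoop the shuffle moves data between machines and the framework handles failures; the sketch only mirrors the programming model.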
  • 34. Training in Big Data
      – Learn by yourself?
          • You miss many important topics
      – Two main training providers: Cloudera, Hortonworks
          • With hands-on exercises
      – Brief introduction to the Cloudera course material
          • Especially the MapReduce example
  • 35. Example: Sort URLs in the Largest Hit Order
      – Compute the most-hit URLs
          • Stored in log files
      – Map()
          • Input: <logFilename, file text>
          • Output: parses the file and emits <url, hit count> pairs, e.g. <http://hello.com, 1>
      – Reduce()
          • Input: <url, list of hit counts> from multiple map nodes
          • Output: sums all values for the same key and emits <url, TotalCount>, e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
  • 36. Map/Reduce for URL visits
      – Input log data is split across Map1(), Map2(), …, Mapm()
      – Map outputs: (http://hi.com, 1), (http://hello.com, 3), …; (http://halo.com, 1), (http://hello.com, 5), …
      – Data aggregation/combine: (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>)
      – Reduce1(), Reduce2(), …, Reducel() outputs: (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6)
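The URL-hit example can be traced end to end with a short sketch. It assumes a made-up log format with one URL per line; the function names `map_log`, `shuffle`, `reduce_hits`, and `top_urls` and the file names are hypothetical. The sample logs are built so that the per-file counts match the slide: http://hello.com gets (3, 5, 2, 7) and http://halo.com gets (1, 5).

```python
from collections import defaultdict


def map_log(filename, text):
    """Map: parse one log file and emit (url, hit_count) pairs for that file."""
    counts = defaultdict(int)
    for line in text.splitlines():
        url = line.strip()
        if url:
            counts[url] += 1
    return list(counts.items())


def shuffle(pairs):
    """Aggregate: collect the hit counts from all mappers under each URL."""
    grouped = defaultdict(list)
    for url, count in pairs:
        grouped[url].append(count)
    return grouped


def reduce_hits(url, counts):
    """Reduce: sum all hit counts for one URL."""
    return url, sum(counts)


def top_urls(logs):
    """Run map, shuffle, and reduce, then sort URLs by total hits, largest first."""
    intermediate = []
    for filename, text in logs.items():
        intermediate.extend(map_log(filename, text))
    totals = [reduce_hits(url, counts)
              for url, counts in shuffle(intermediate).items()]
    return sorted(totals, key=lambda pair: (-pair[1], pair[0]))


HELLO, HALO = "http://hello.com", "http://halo.com"
# Hypothetical log files, one URL per line.
sample_logs = {
    "log1.txt": "\n".join([HELLO] * 3 + [HALO] * 1),
    "log2.txt": "\n".join([HELLO] * 5 + [HALO] * 5),
    "log3.txt": "\n".join([HELLO] * 2),
    "log4.txt": "\n".join([HELLO] * 7),
}
print(top_urls(sample_logs))
```

The final sort is done on one machine here; on a cluster, ordering the reduced totals would typically be a second job or a post-processing step.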
  • 37. Legacy Example
      – In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all, dating back to 1851, available over the web
      – The archive was a four-terabyte pile of images in TIFF format
      – The Times needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files
          • Not a particularly complicated but a large computing chore, requiring a whole lot of computer processing time
  • 38. Legacy Example (Cont'd)
      – A software programmer at the Times, Derek Gottfrid, was playing around with Amazon Web Services' Elastic Compute Cloud (EC2)
          • Uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
          • In less than 24 hours: 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site
      – The total cost for the computing job? $240
          • 10 cents per computer-hour times 100 computers times 24 hours
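The cost figure on this slide can be checked with a few lines of arithmetic. The inputs are the slide's own numbers; the per-thousand-PDF figure is derived here and is not on the slide. Amounts are kept in integer cents to avoid floating-point rounding.

```python
CENTS_PER_COMPUTER_HOUR = 10   # "10 cents per computer-hour"
COMPUTERS = 100                # "100 computers"
HOURS = 24                     # "24 hours"
PDF_COUNT = 11_000_000         # "11 million" articles converted

total_cents = CENTS_PER_COMPUTER_HOUR * COMPUTERS * HOURS
total_dollars = total_cents / 100

# Derived figure (not on the slide): cost of converting 1,000 PDFs, in cents.
cents_per_thousand_pdfs = total_cents * 1000 / PDF_COUNT

print(f"total: ${total_dollars:.0f}")
print(f"per 1,000 PDFs: {cents_per_thousand_pdfs:.2f} cents")
```

At roughly two cents per thousand documents, the job shows why batch work on large but uncomplicated data suits pay-as-you-go clusters.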
  • 39. Supporters of Big Data: Hadoop Ecosystems
      – Apache Hadoop supporters
          • Cloudera: like Red Hat for Linux; HiPIC is an Academic Partner
          • Hortonworks: Pig; consulting and training
          • Facebook: Hive
          • IBM: Jaql
      – NoSQL DB supporters
          • MongoDB
          • HBase, CouchDB, Apache Cassandra (originally by FB), etc.
  • 40. Pig
      – Developed at Yahoo Research around 2006
          • Moved into the Apache Software Foundation in 2007
      – Pig Latin: Pig's language
          • A data flow language
          • Well suited to processing unstructured data
          • Unlike SQL, does not require that the data have a schema
          • However, can still leverage the value of a schema
  • 41. Hive
      – Developed at Facebook
          • Turns Hadoop into a data warehouse
          • Complete with a dialect of SQL for querying
      – HiveQL
          • A declarative language (SQL dialect)
      – Differences from Pig Latin
          • You do not specify the data flow, but instead describe the result you want; Hive figures out how to build a data flow to achieve it
          • A schema is required, but not limited to one schema: data can have many schemas
  • 42. Hive (Cont'd)
      – Similarity with Pig Latin and SQL
          • HiveQL on its own is a relationally complete language, but not a Turing-complete language (one that can express any computation)
          • Can be extended through UDFs (user-defined functions) written in Java to become Turing complete, just like Pig
  • 43. Jaql
      – Developed at IBM
      – A data flow language
          • Its native data structure format is JSON (JavaScript Object Notation)
      – Schemas are optional
      – Turing complete on its own
          • No need for extension through UDFs
  • 44. MapReduce Cons and Future
      – Bad for
          • Fast response time
          • Large amounts of shared data
          • Fine-grained synchronization needed
          • CPU-intensive rather than data-intensive work
          • Continuous input streams
      – Hadoop 2.0: YARN
          • Not a product yet, but will be soon
  • 45. Hadoop 2.0: YARN
      – Data processing applications and services
          • Online serving: HOYA (HBase on YARN)
          • Real-time event processing: Storm, S4, other commercial platforms
          • Tez: generic framework to run a complex DAG
          • MPI: OpenMPI, MPICH2
          • Master-Worker
          • Machine learning: Spark
          • Graph processing: Giraph
      – Enabled by allowing the use of paradigm-specific application masters
      [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  • 46. Use Cases
      – Amazon AWS
      – Facebook
      – Twitter
      – Craigslist
      – HuffPost | AOL
  • 47. Amazon AWS
      – amazon.com: consumer and seller business
      – aws.amazon.com: IT infrastructure business
          • Focus on your business, not IT management
      – Pay as you go
      – Services with many APIs
          • S3: Simple Storage Service
          • EC2: Elastic Compute Cloud; provides many virtual Linux servers and can run Hadoop, HBase, and MongoDB on multiple nodes
  • 48. Amazon AWS (Cont'd)
      – Customers on aws.amazon.com
          • Samsung: Smart TV hub sites; TV applications are on AWS
          • Netflix: ~25% of US internet traffic; ~100% on AWS
          • NASA JPL: analyzes more than 200,000 images
          • NASDAQ: uses AWS S3
      – HiPIC received research and teaching grants from AWS
  • 49. Facebook [7]
      – Uses Apache HBase for Titan and Puma
          • Message services
          • ETL
      – HBase for FB
          • Provides excellent write performance and good reads
          • Nice features: scalable, fault tolerant, MapReduce
  • 50. Titan: Facebook
      – Message services in FB
          • Hundreds of millions of active users
          • 15+ billion messages a month
          • 50K instant messages a second
      – Challenges
          • High write throughput: every message, instant message, SMS, email
          • Massive clusters: must be easily scalable
      – Solution
          • Clustered HBase
  • 51. Puma: Facebook
      – ETL
          • Extract, Transform, Load: integrating data from many data sources into a data warehouse
          • Data analytics: domain owners' web analytics for ads and apps (clicks, likes, shares, comments, etc.)
      – ETL before Puma
          • 8–24 hours
          • Procedures: Scribe, HDFS, Hive, MySQL
      – ETL after Puma
          • Puma: real-time MapReduce framework
          • 2–30 secs
          • Procedures: Scribe, HDFS, Puma, HBase
  • 52. Twitter [8]
      – Three challenges
          • Collecting data: Scribe, as at FB
          • Large-scale storage and analysis: Cassandra (ColumnFamily key-value store), Hadoop
          • Rapid learning over Big Data: Pig (5% of the Java code, 5% of the dev time, within 20% of the running time)
  • 53. Craigslist in MongoDB [9]
      – Craigslist
          • ~700 cities, worldwide
          • ~1 billion hits/day
          • ~1.5 million posts/day
          • Servers: ~500 servers, ~100 MySQL servers
      – Migrate to MongoDB
          • Scalable, fast, proven, friendly
  • 54. HuffPost | AOL [10]
      – Two machine learning use cases
          • Comment moderation: evaluate all new HuffPost user comments every day; identify abusive/aggressive comments; auto delete/publish ~25% of comments every day
          • Article classification: tag articles for advertising (e.g., scary, salacious, …)
  • 55. HuffPost | AOL [10]
      – Parallelize on Hadoop
          • Good news: Mahout, a parallel machine learning tool, is already available; Mallet, libsvm, Weka, … support the necessary algorithms
          • Bad news: Mahout doesn't support the necessary algorithms yet; the other algorithms do not run natively on Hadoop
      – Build a flexible ML platform running on Hadoop
          • Pig for the Hadoop implementation
  • 56. Hadoop Streaming
      – Hadoop MapReduce for non-Java code: Python, Ruby
      – Requirements
          • A running Hadoop installation
          • The Hadoop Streaming API: hadoop-streaming.jar
          • Mapper and Reducer code: a simple conversion from sequential code
      – STDIN > mapper > reducer > STDOUT
  • 57. Hadoop Streaming
      – MapReduce Python execution
          • http://wiki.apache.org/hadoop/HadoopStreaming
      – Syntax
          $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
          Options:
            -input <path>                     DFS input file(s) for the Map step
            -output <path>                    DFS output directory for the Reduce step
            -mapper <cmd|JavaClassName>       The streaming command to run
            -reducer <cmd|JavaClassName>      The streaming command to run
            -file <file>                      File/dir to be shipped in the Job jar file
      – Example
          $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
              -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
              -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
              -input /user/jwoo/shakespeare/* \
              -output /user/jwoo/shakespeare-output
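The slides name mapper.py and reducer.py but do not show their contents. The following is a minimal word-count sketch of what such a pair might look like, under the assumption that the job counts words in the Shakespeare input. Both parts are written as functions in one block so they can be tried without a cluster; `run_locally` is a hypothetical helper that imitates the STDIN > mapper > sort > reducer > STDOUT pipeline.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a mapper.py would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive records sharing a key.

    Streaming delivers mapper output to the reducer sorted by key,
    so equal keys arrive next to each other.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate the streaming pipeline on one machine: map, sort, reduce."""
    return list(reducer(sorted(mapper(lines))))


# In separate script files, each function would instead be driven by STDIN:
#     import sys
#     for record in mapper(sys.stdin):
#         print(record)
print(run_locally(["to be or not to be"]))
```

A quick local check of the same idea with real script files is `cat input.txt | ./mapper.py | sort | ./reducer.py`, which mirrors what the streaming jar does across nodes.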
  • 58. Conclusion
      – Era of Big Data
      – Need to store and compute Big Data
          • Many solutions, but Hadoop
          • Storage: NoSQL DB
          • Computation: Hadoop MapReduce
      – Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns, …
  • 59. Question?