jwoo Woo
HiPIC
CSULA
Big Data and Data Intensive Computing:
Education and Training
Graduate School of Communication & Art
Yonsei University
Shinchon, Korea
Sept 5th 2013
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
• Introduction
• Big Data Use Cases
• Data Issues
• Big Data
• Data-Intensive Computing: Hadoop
• Training in Big Data
• Big Data Supporters
Me
• Name: Jongwook Woo
• Occupation:
  • Professor (rank: Associate Professor), California State University, Los Angeles
    – Capital City of Entertainment
• Experience:
  • Professor since 2002: Computer Information Systems Dept, College of Business and Economics
    – www.calstatela.edu/faculty/jwoo5
  • Consulting for many companies in Hollywood and elsewhere since 1998
    – Mainly building eBusiness applications with J2EE middleware
    – Information extraction and integration with the FAST, Lucene/Solr, and Sphinx search engines
    – Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others
  • Interested in Hadoop and Big Data since around 2009
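Me
Experience (continued):
As of summer 2013, advising IglooSecurity:
– Training on Hadoop and its ecosystem
– R&D on a system for fast search over security log files that grow by 30GB – 100GB per day
• Using Hadoop, Solr, Java, Cloudera
Mid-September 2013: Samsung Advanced Institute of Technology
– Scheduled to teach a 3-day course on Hadoop and its ecosystem
– Using Cloudera material in Korea as far as I know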
Experience in Big Data
• Grants
  • Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
  • Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
• Partnership
  • Academic Education Partner with Cloudera since June 2012
  • Linked with Hortonworks since May 2013
    – Hortonworks is positive about providing a partnership
Experience in Big Data
• Certificate
  • Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8 2012)
  • Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24 2012)
• Blog and GitHub for Hadoop and its ecosystems
  • http://dal-cloudcomputing.blogspot.com/
    – Hadoop, AWS, Cloudera
  • https://github.com/hipic
    – Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
  • https://github.com/dalgual
Experience in Big Data
• Several publications regarding Hadoop and NoSQL
  • “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)
  • “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2011, Incheon (Aug. 25-27, 2011)
  • “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)
  • “Introduction to Cloud Computing”, Jongwook Woo, 10th KOCSEA Technical Symposium, UNLV (Dec 18 - 19, 2009)
• Talks in Korean universities and companies
  • Yonsei, Sookmyung, KAIST, Korean Polytech Univ
    – Winter 2011
  • VanillaBreeze
    – Winter 2011
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
Use Cases in Korea
SK Telecom
Seoul
Credit Cards
Hyundai Motors
SK Telecom
T Map
• Collects GPS traffic data from taxis, buses, and rental cars
  – Traffic data from 50,000 cars every 5 minutes
• Recommends the quickest route to the destination
Seoul
Night Bus
• Collects GPS traffic data from taxis
• Finds the most heavily traveled routes
  – Builds night bus lines along them
Credit Cards
Apps to find popular restaurants
Collect customer behavior from card use at restaurants
Logic: how often customers visit the same restaurant within 3 months
Show the popular restaurants
Credit cards for gas station discounts
Detect customers who use a card at a gas station that gives them no discount
Sell them a new card that gives a discount at any station
Hyundai Motors
Improve the present and future models
Collect drivers’ behavior and the status of the cars
Collect any errors in the car
Use Cases
Presidential Election
Amazon AWS
HuffPost | AOL
Presidential Election
Behavior analysis of people
Collect data on people’s credit card usage, car models, newspapers they read, Facebook, Twitter
For example, a pro-environmental campaign for
– Moms
• who send their kids to public school,
• who tweet about organic foods,
HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
–Evaluate All New HuffPost User Comments
Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% Comments Every
Day
Article Classification
–Tag Articles for Advertising
• E.g.: scary, salacious, …
HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is
already available.
– There are Mallet, libsvm, Weka, … that support
necessary algorithms.
Bad news:
– Mahout doesn’t support necessary algorithms
yet.
– Other algorithms do not run natively on Hadoop.
Build a flexible ML platform running on Hadoop
Use Pig for the Hadoop implementation.
Others
amazon.com
Recommends books to customers
Google
Detects influenza outbreaks much earlier
– by analyzing the areas affected by influenza
Translator
– by analyzing data from many people
Siri of Apple
Natural language processing based on data from many people
New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
Data Issues
Large-Scale data
Terabytes (10^12 bytes), Petabytes (10^15 bytes)
– Because of the web
– Sensor data, bioinformatics, social computing, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel Computing with multiple inexpensive computers
– Build your own supercomputer
Big Data for RDBMS
Issues in RDBMS
Hard to scale
– Relation gets broken
• Partitioning for scalability
• Replication for availability
Speed
– The seek times of physical storage
• Slower than N/W speed
• 1TB disk: 10MB/s transfer rate
– 100K sec => 27.8 hrs
– With multiple data sources at different places
• 100 10GB disks: each 10MB/s transfer rate
– 1K sec => 16.7 min
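The transfer-time figures above can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes) and reads the per-disk rate as 10 megabytes per second, which is the reading consistent with the 100K-second figure; the function name `scan_seconds` is made up for this illustration.

```python
# Sketch: time to scan data from disk, serially vs. spread over many disks.
# Assumptions: decimal units and a sustained 10 MB/s per disk.

MB = 10**6
TB = 10**12


def scan_seconds(total_bytes: int, disks: int, bytes_per_sec_per_disk: int) -> float:
    """Seconds to read total_bytes split evenly over `disks` disks read in parallel."""
    per_disk_bytes = total_bytes / disks
    return per_disk_bytes / bytes_per_sec_per_disk


one_disk = scan_seconds(1 * TB, disks=1, bytes_per_sec_per_disk=10 * MB)
hundred_disks = scan_seconds(1 * TB, disks=100, bytes_per_sec_per_disk=10 * MB)

print(f"1 disk:    {one_disk:,.0f} s = {one_disk / 3600:.1f} h")
print(f"100 disks: {hundred_disks:,.0f} s = {hundred_disks / 60:.1f} min")
```

The speedup comes only from reading the 100 pieces at the same time; each disk is no faster than before.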
Big Data for RDBMS (Cont’d)
Issues in RDBMS (Cont’d)
Data Integration
–Not good for un-/semi-structured data
• Many unstructured data
–Web or log data etc
RDB not good at parallelization
–Cannot efficiently split 1000 tasks across 1000 inexpensive PCs
RDBMS Issues
Solution
• Before: Data Warehouse
• Now and future: Big Data
Hadoop framework
Data Computation (MapReduce, Pig)
Data Repositories (NoSQL DB: HBase, Cassandra, MongoDB)
Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting): Hive, Mahout
Big Data
Definition
• Systems that support an inexpensive platform to store and compute large-scale, un-/semi-structured data
Use Cases for NoSQL DB [1]
RDBMS replacement
for high-traffic web applications
Semi-structured content management
Real-time analytics & high-speed logging
Web Infrastructure
Web 2.0, Media, SaaS, Gaming,
Finance, Telecom, Healthcare, Government
Three NoSQL DB Approaches
Key/Value, Column, Document
Data Store of NoSQL DB
Key/Value store
(Key, Value)
Functions
– Index, versioning, sorting, locking, transaction,
replication
Apache Cassandra, Memcached
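To make the (Key, Value) idea concrete, here is a minimal in-memory sketch that covers only two of the functions listed above, versioning and sorted key access. The class `TinyKVStore` and its methods are hypothetical and do not mirror the API of Cassandra, Memcached, or any other product.

```python
from typing import Any, Dict, List, Optional


class TinyKVStore:
    """Hypothetical in-memory key/value store that keeps every version."""

    def __init__(self) -> None:
        # key -> list of values, oldest first; position + 1 is the version
        self._data: Dict[str, List[Any]] = {}

    def put(self, key: str, value: Any) -> int:
        """Store a new version of key and return its version number (from 1)."""
        versions = self._data.setdefault(key, [])
        versions.append(value)
        return len(versions)

    def get(self, key: str, version: Optional[int] = None) -> Any:
        """Return the latest value, or a specific version if one is given."""
        versions = self._data[key]  # raises KeyError for unknown keys
        if version is None:
            return versions[-1]
        if version < 1:
            raise ValueError("versions start at 1")
        return versions[version - 1]

    def keys_sorted(self) -> List[str]:
        """Keys in sorted order, as a store with ordered keys would scan them."""
        return sorted(self._data)


store = TinyKVStore()
store.put("user:42", {"name": "Joe"})
store.put("user:42", {"name": "Joe", "email": "smith@hi.com"})
store.put("user:7", {"name": "Mary"})

print(store.get("user:42"))             # latest version
print(store.get("user:42", version=1))  # first version
print(store.keys_sorted())
```

The store never inspects the value; everything it offers is keyed access, which is what separates this model from the column and document stores.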
Data Store of NoSQL DB (Cont’d)
Column-Oriented Stores (Extensible Record
Stores)
stores data tables as sections of columns of data
– rather than as rows of data, like most RDBMS
• Sparse fields in RDBMS
– well-suited for OLAP-like workloads (e.g., data
warehouses)
Extensible record horizontally and vertically
partitioned across nodes
– Rows and Columns are distributed over multiple
nodes
BigTable, HBase, Cassandra, Hypertable
Data Store of NoSQL DB (Cont’d)
• Row Oriented
– 1, Smith, Joe, smith@hi.com;
– 2, Jones, Mary, mary@hi.com;
– 3, Johnson, Cathy, cathy@hi.com;
• Column Oriented
– 1, 2, 3;
– Smith, Jones, Johnson;
– Joe, Mary, Cathy;
– smith@hi.com, mary@hi.com, cathy@hi.com;
StudentId  Lastname  Firstname  email
1          Smith     Joe        smith@hi.com
2          Jones     Mary       mary@hi.com
3          Johnson   Cathy      cathy@hi.com
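The two layouts can be compared with a small sketch that pivots the same three student records from rows into columns. It uses plain Python lists and a hypothetical helper `to_columns`; it says nothing about how a real column store lays out bytes on disk.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str, str, str]

COLUMNS = ("StudentId", "Lastname", "Firstname", "email")

# Row-oriented layout: one tuple per student
rows: List[Row] = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]


def to_columns(records: List[Row]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    return {name: [r[i] for r in records] for i, name in enumerate(COLUMNS)}


columns = to_columns(rows)

# A column scan touches a single list ...
print(columns["Lastname"])
# ... while the row layout visits every record to read one field.
print([r[1] for r in rows])
```

Reading one attribute for all students is a single list access in the column layout, which is why it suits OLAP-like scans.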
HBase Schema Example (Student/Course)
• RDBMS
  • Students: (id, name, sex, age)
  • Courses: (id, title, desc, teacher_id)
  • S_C: (s_id, c_id, type)
• HBase
Student table
  id (row key)   Column family Info:                      Column family Course:
  <student_id>   Info:name, Info:sex, Info:age            Course:<course_id>=type
Course table
  id (row key)   Column family Info:                      Column family student:
  <course_id>    Info:title, Info:desc, Info:teacher_id   student:<student_id>=type
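One way to picture the HBase side of this schema is as nested maps: row key, then column family, then column qualifier, then value. The sketch below models that shape with plain Python dictionaries; it does not use an HBase client, and the helper names and sample data are invented for illustration.

```python
from collections import defaultdict


def new_table():
    """row key -> column family -> qualifier -> value (all plain dicts)."""
    return defaultdict(lambda: defaultdict(dict))


students = new_table()
courses = new_table()


def enroll(student_id: str, course_id: str, enrollment_type: str) -> None:
    """Record the many-to-many link on both sides; no join table is needed."""
    students[student_id]["Course"][course_id] = enrollment_type
    courses[course_id]["student"][student_id] = enrollment_type


# Hypothetical sample data
students["s1"]["Info"].update(name="Joe Smith", sex="M", age="21")
courses["c1"]["Info"].update(title="Big Data", desc="Intro to Hadoop", teacher_id="t9")
enroll("s1", "c1", "elective")

print(dict(students["s1"]["Course"]))  # courses taken by student s1
print(dict(courses["c1"]["student"]))  # students enrolled in course c1
```

The S_C join table of the relational design disappears: the course id itself becomes a column qualifier in the student row, and the student id becomes one in the course row.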
Data Store of NoSQL DB (Cont’d)
Document Store
Collections and Documents
– vs Tables and Records of RDB
Used in Search Engine/Repository
Multiple index to store indexed document
– no fixed fields
Not simple key-value lookup
– Use API
Functions
– No locking, Replication, Transaction
MongoDB, CouchDB, ThruDB, SimpleDB
Understanding the Document Model [1]
{
_id: "A4304",
author: "nosh",
date: "22/6/2010",
title: "Intro to MongoDB",
text: "MongoDB is an open source..",
tags: ["webinar", "opensource"],
comments: [{author: "mike",
date: "11/18/2010",
txt: "Did you see the…",
votes: 7},….]
}
Documents->Collections->Databases
Document Model Makes Queries Simple [1]
Operators:
$gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit,
skip, group
Example:
db.posts.find({author: "nosh",
tags: "webinar"})
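The query above matches documents whose author is "nosh" and whose tags array contains "webinar". The following sketch imitates only that equality-matching behavior on a list of Python dictionaries; `matches` and `find` are hypothetical helpers, not the MongoDB driver, and the second and third documents are made-up sample data.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


def matches(doc: Document, query: Document) -> bool:
    """True if every queried field equals the wanted value, or, for an
    array field, contains the wanted value."""
    for field, wanted in query.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection: List[Document], query: Document) -> List[Document]:
    """Return all documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"]},
    {"_id": "B0001", "author": "nosh", "title": "Sharding notes",
     "tags": ["ops"]},
    {"_id": "B0002", "author": "mike", "title": "Webinar recap",
     "tags": ["webinar"]},
]

result = find(posts, {"author": "nosh", "tags": "webinar"})
print([doc["_id"] for doc in result])
```

Operators such as $gt or $in are left out; the point is only that a query is itself a document that is compared field by field.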
Selected Users [1]
The Great Divide [1]
MongoDB sweet spot: Easy, Flexible,
Scalable
HBase
MongoDB
Solutions in Big Data Computation
• Map/Reduce by Google
  (Key, Value) parallel computing
• Apache Hadoop
• Big Data
  Data Computation (MapReduce, Pig)
• Integrating MapReduce and RDB
  Oracle + Hadoop
  Sybase IQ
  Vertica + Hadoop
  HadoopDB
  Greenplum
  Aster Data
• Integrating MapReduce and NoSQL DB
  MongoDB MapReduce
  HBase
Apache Hadoop
• Motivated by Google Map/Reduce and GFS
  • open source project of the Apache Foundation
  • framework written in Java
    – originally developed by Doug Cutting
      • who named it after his son's toy elephant
• Two core components
  • Storage: HDFS
    – High Bandwidth Clustered storage
  • Processing: Map/Reduce
    – Fault Tolerant Distributed Processing
• Hadoop scales linearly with
  • data size
  • analysis complexity
Hadoop issues
Map/Reduce is not DB
Algorithm in Restricted Parallel Computing
HDFS and HBase
Cannot compete with the functions in RDBMS
But useful for
Huge (petabytes or terabytes) but uncomplicated data
– Web crawling
– log analysis
• Log file for web companies
– New York Times case
MapReduce Pros & Cons Summary
Good when
Huge data for input, intermediate, output
Little synchronization required
Read once; batch-oriented datasets (ETL)
Bad for
Fast response time
Large amounts of shared data
Fine-grained synchronization needed
CPU-intensive rather than data-intensive work
Continuous input stream
MapReduce in Detail
Functions borrowed from functional
programming languages (e.g., Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
Map
Convert input data to (key, value) pairs
map() functions run in parallel,
• creating different intermediate (key, value) pairs from different input data sets
Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.
Training in Big Data
• Learn by yourself?
Miss many important topics
Two main training providers:
–Cloudera, Hortonworks
• With hands-on exercises
Brief introduction to the Cloudera course material
Especially the MapReduce example
Example: Sort URLs by hit count, largest first
Compute the URLs with the most hits
Stored in log files
Map()
Input: <logFilename, file text>
Output: Parses the file and emits <url, hit count> pairs
– e.g., <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map nodes
Output: Sums all values for the same key and emits <url, TotalCount>
– e.g., <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
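The Map() and Reduce() steps of this example can be imitated in a single process. The sketch below is not Hadoop code: the log format (URL as the first whitespace-separated field), the file names, and the function names are assumptions for illustration, and the mapper emits 1 per hit rather than per-file subtotals, which leaves the reducer's sums unchanged.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_log(filename: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (url, 1) for every non-empty log line."""
    for line in text.splitlines():
        fields = line.split()
        if fields:
            yield fields[0], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(url: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum all hit counts for one URL."""
    return url, sum(counts)


# Hypothetical input: log file name -> file text
logs = {
    "access-1.log": "http://hello.com GET\nhttp://hi.com GET\nhttp://hello.com GET\n",
    "access-2.log": "http://hello.com GET\nhttp://halo.com GET\n",
}

intermediate = [pair for name, text in logs.items() for pair in map_log(name, text)]
totals = [reduce_hits(url, counts) for url, counts in shuffle(intermediate).items()]

# Order by hit count, largest first (ties broken by URL).
ranked = sorted(totals, key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

The final sort happens outside reduce here; on a cluster, ordering by count would need its own step, for example a second job keyed on the count.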
Map/Reduce for URL visits
Input Log Data
Map1(), Map2(), …, Mapm()
Map output:
– (http://hi.com, 1), (http://hello.com, 3), …
– (http://halo.com, 1), (http://hello.com, 5), …
Data Aggregation/Combine
– (http://hi.com, <1, 1, …, 1>)
– (http://hello.com, <3, 5, 2, 7>)
– (http://halo.com, <1, 5>)
Reduce1(), Reduce2(), …, Reducel()
– (http://hi.com, 32)
– (http://hello.com, 17)
– (http://halo.com, 6)
Legacy Example
In late 2007, the New York Times
wanted to make its entire archive of articles,
11 million in all, dating back to 1851,
available over the web.
The archive was a four-terabyte pile of images in TIFF format.
The Times needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated but a large computing chore,
• requiring a whole lot of computer processing time.
Legacy Example (Cont’d)
In late 2007, the New York Times
wanted to make its entire archive of articles available over the web.
A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services' Elastic Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
• In less than 24 hours, 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.
• The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
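The cost figure follows directly from the three numbers on the slide. The sketch below redoes the multiplication in integer cents and adds one derived figure, the time a single machine would need, under the assumption of perfect scaling; the function name is hypothetical.

```python
def job_cost_cents(cents_per_machine_hour: int, machines: int, hours: int) -> int:
    """Total cost in cents for renting `machines` machines for `hours` hours each."""
    return cents_per_machine_hour * machines * hours


cost_cents = job_cost_cents(cents_per_machine_hour=10, machines=100, hours=24)
machine_hours = 100 * 24

# Assumption: the job scales perfectly, so one machine needs all machine-hours.
single_machine_days = machine_hours / 24

print(f"Total cost: ${cost_cents / 100:.2f}")
print(f"Machine-hours used: {machine_hours}")
print(f"Same work on one machine: {single_machine_days:.0f} days")
```

Pay-as-you-go pricing makes the 100-machine run cost the same as the single-machine run while finishing in a day instead of months.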
Supporters of Big Data: Hadoop Ecosystems
• Apache Hadoop Supporters
  • Cloudera
    – Like Red Hat for Linux
    – HiPIC is an Academic Partner
  • Hortonworks
    – Pig
    – Consulting and training
  • Facebook
    – Hive
  • IBM
    – Jaql
• NoSQL DB supporters
  • MongoDB
  • HBase, CouchDB, Apache Cassandra (originally by FB) etc
Pig
• developed at Yahoo Research around 2006
o moved into the Apache Software Foundation in
2007.
• PigLatin,
o Pig's language
o a data flow language
o well suited to processing unstructured data
  • Unlike SQL, does not require the data to have a schema
  • However, can still leverage the value of a schema
Hive
• developed at Facebook
o turns Hadoop into a data warehouse
o complete with a dialect of SQL for querying.
• HiveQL
o a declarative language (SQL dialect)
• Difference from PigLatin,
o you do not specify the data flow,
  • but instead describe the result you want
  • Hive figures out how to build a data flow to achieve it.
o a schema is required,
  • but not limited to one schema.
o data can have many schemas
Hive (Cont'd)
• Similarity with PigLatin and SQL,
o HiveQL on its own is a relationally complete language
  • but not a Turing complete language,
  • that is, one that can express any computation
o can be extended through UDFs (User Defined Functions) written in Java
  • just like Pig, to become Turing complete
Jaql
• developed at IBM.
• a data flow language
o its native data structure format is JSON (JavaScript
Object Notation).
• Schemas are optional
• Turing complete on its own
o without the need for extension through UDFs.
MapReduce Cons and Future
Bad for
Fast response time
Large amounts of shared data
Fine-grained synchronization needed
CPU-intensive rather than data-intensive work
Continuous input stream
Hadoop 2.0: YARN
Not a product yet but will be soon
Hadoop 2.0: YARN
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other commercial platforms
Tez – Generic framework to run a complex DAG
• MPI: OpenMPI, MPICH2
• Master-Worker
• Machine Learning: Spark
• Graph processing: Giraph
• Enabled by allowing the use of paradigm-specific application master
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
Big Data Supporters
Amazon AWS
Facebook
Twitter
Craigslist
Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB
Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching
grants from AWS
Facebook [7]
Using Apache HBase
• For Titan and Puma
  – Message Services
  – ETL
• HBase for FB
  – Provides excellent write performance and good reads
  – Nice features
    • Scalable
    • Fault Tolerance
    • MapReduce
Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase
Puma: Facebook
• ETL
  • Extract, Transform, Load
    – Integrating data from many data sources into a Data Warehouse
  • Data analytics
    – Domain owners’ web analytics for Ad and apps
      • clicks, likes, shares, comments etc
• ETL before Puma
  • 8 – 24 hours
    – Procedures: Scribe, HDFS, Hive, MySQL
• ETL after Puma
  • Puma
    – Real time MapReduce framework
  • 2 – 30 secs
    – Procedures: Scribe, HDFS, Puma, HBase
Twitter [8]
Three Challenges
Collecting Data
– Scribe, as at FB
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time
Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly
Hadoop Streaming
• Hadoop MapReduce for non-Java code: Python, Ruby
• Requirement
  • A running Hadoop installation
  • The Hadoop Streaming API
    – hadoop-streaming.jar
  • Mapper and Reducer code
    – Simple conversion from sequential code
  • STDIN > mapper > reducer > STDOUT
Hadoop Streaming
• MapReduce Python execution
  • http://wiki.apache.org/hadoop/HadoopStreaming
• Syntax
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
• Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
-file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
-file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
-input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
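The slides do not show the contents of mapper.py and reducer.py. As a sketch of what such scripts might contain, the following assumes a word count over the input text; the function names and the sample lines are invented for illustration. Hadoop Streaming passes records as text lines, by default treats the text before the first tab as the key, and hands the reducer its input sorted by key.

```python
import itertools
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of consecutive lines that share a key.
    Relies on the input arriving sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for: STDIN > mapper > sort > reducer > STDOUT
sample = ["To be or not to be", "that is the question"]
output = list(reduce_lines(sorted(map_lines(sample))))
print("\n".join(output))
```

In a real job each function would live in its own script that loops over sys.stdin and prints each result line, for example `for out in map_lines(sys.stdin): print(out)` in mapper.py.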
Conclusion
• Era of Big Data
• Need to store and compute Big Data
• Many solutions, but Hadoop stands out
  • Storage: NoSQL DB
  • Computation: Hadoop MapReduce
• Need to analyze Big Data in mobile computing, SNS for Ad, User Behavior, Patterns …
Questions?

 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Jongwook Woo
 
Introduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesIntroduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesJongwook Woo
 

Mehr von Jongwook Woo (15)

Machine Learning in Quantum Computing
Machine Learning in Quantum ComputingMachine Learning in Quantum Computing
Machine Learning in Quantum Computing
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AI
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and Prediction
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its Trends
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and Spark
 
History and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningHistory and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep Learning
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeWhose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure ML
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015
 
Introduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesIntroduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use Cases
 

Kürzlich hochgeladen

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Kürzlich hochgeladen (20)

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Big Data and Data Intensive Computing: Education and Training

  • 1. Big Data and Data Intensive Computing: Education and Training. Graduate School of Communication & Art, Yonsei University, Shinchon, Korea, Sept 5th 2013. Jongwook Woo (PhD), High-Performance Information Computing Center (HiPIC), Educational Partner with Cloudera and Grants Awardee of Amazon AWS, Computer Information Systems Department, California State University, Los Angeles
  • 2. High Performance Information Computing Center, Jongwook Woo, CSULA. Contents: Introduction; Big Data Use Cases; Data Issues; Big Data; Data-Intensive Computing: Hadoop; Training in Big Data; Big Data Supporters
  • 3. Me. Name: Jongwook Woo. Occupation: Professor (rank: Associate Professor), California State University Los Angeles, in the Capital City of Entertainment. Career: professor since 2002 in the Computer Information Systems Dept, College of Business and Economics (www.calstatela.edu/faculty/jwoo5); since 1998, consulting for many companies in and around Hollywood, mainly building eBusiness applications with J2EE middleware and doing information extraction and integration with the FAST, Lucene/Solr, and Sphinx search engines, for Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others; interested in Hadoop and Big Data since around 2009
  • 4. Me. Career (continued): as of summer 2013, advising IglooSecurity: training on Hadoop and its ecosystems; R&D on a system for fast search over security-related log files generated at 30 GB to 100 GB per day, using Hadoop, Solr, Java, and Cloudera. Mid-September 2013, Samsung Advanced Institute of Technology: a 3-day training on Hadoop and its ecosystems is planned; Using Cloudera material in Korea as far as I know
  • 5. Experience in Big Data. Grants: received Amazon AWS in Education Research Grant (July 2012 - July 2014); received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011). Partnership: Academic Education Partnership with Cloudera since June 2012; linked with Hortonworks since May 2013 (positive to provide partnership)
  • 6. Experience in Big Data. Certificates: Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8 2012); Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24 2012). Blog and GitHub for Hadoop and its ecosystems: http://dal-cloudcomputing.blogspot.com/ (Hadoop, AWS, Cloudera); https://github.com/hipic (Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop); https://github.com/dalgual
  • 7. Experience in Big Data. Several publications regarding Hadoop and NoSQL: “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012); “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011; “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011); Jongwook Woo, “Introduction to Cloud Computing”, in the 10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009. Talks in Korean universities and companies: Yonsei, Sookmyung, KAIST, Korean Polytech Univ (Winter 2011); VanillaBreeze (Winter 2011)
  • 8. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  • 9. Data. Google: “We don’t have a better algorithm than others but we have more data than others”
  • 10. Use Cases in Korea: SK Telecom; Seoul; Credit Cards; Hyundai Motors
  • 11. SK Telecom T Map. Collects GPS traffic data from taxis, buses, and rental cars (every 5 minutes, traffic data from 50,000 cars). Tells the quickest directions to the destination
  • 12. Seoul Night Bus. Collects GPS traffic data from taxis. Finds the most frequent traffic routes and builds night bus lines on them
  • 13. Credit Cards. Apps to find popular restaurants: collect customer behavior that occurred when the cards were used at restaurants; based on the logic of how frequently the same restaurants were visited within 3 months; show the popular restaurants. Credit cards for gas station discounts: when a customer uses a card at a gas station that does not provide discounts, sell a new card that gives a discount at any station
  • 14. Hyundai Motors. Improve the present and future models: collect drivers’ behavior and the status of the cars; collect any errors in the car
  • 15. Use Cases: Presidential Election; Amazon AWS; HuffPost | AOL
  • 16. Presidential Election. People behavior analysis: collect people’s data on credit card usage, car models, newspapers they read, Facebook, and Twitter. For example, a pro-environmental campaign for a mom who sends her kids to public school and who tweets about organic foods
  • 17. HuffPost | AOL [10]. Two Machine Learning Use Cases. Comment Moderation: evaluate all new HuffPost user comments every day; identify abusive / aggressive comments; auto delete / publish ~25% of comments every day. Article Classification: tag articles for advertising, e.g. scary, salacious, …
  • 18. HuffPost | AOL [10]. Parallelize on Hadoop. Good news: Mahout, a parallel machine learning tool, is already available; Mallet, libsvm, Weka, … support the necessary algorithms. Bad news: Mahout doesn’t support the necessary algorithms yet; the other algorithms do not run natively on Hadoop. Therefore: build a flexible ML platform running on Hadoop, with Pig for the Hadoop implementation
  • 19. Others. amazon.com: recommends books to people. Google: finds influenza outbreaks much earlier by analyzing the areas affected by influenza; Translator, by analyzing the data from many people. Siri of Apple: natural language processing from the data of many people
  • 20. New Data Trend. Sparsity: unstructured, schema-free data with sparse attributes (semantic or social relations); no relational property and no complex join queries (e.g. log data). Immutable: no need to update or delete data
  • 21. Data Issues. Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes), because of the web, sensor data, bioinformatics, social computing, smart phones, online games, … The legacy approach cannot handle it: too big; un-/semi-structured data; too expensive. New systems are needed: inexpensive
  • 22. Two Cores in Big Data. How to store Big Data: NoSQL DB. How to compute Big Data: parallel computing with multiple inexpensive computers (own super computers)
  • 23. Big Data for RDBMS. Issues in RDBMS. Hard to scale: relations get broken by partitioning for scalability and replication for availability. Speed: the seek times of physical storage are slower than network speed; a 1 TB disk at a 10 MB/s transfer rate takes 100K sec => 27.8 hrs; with multiple data sources at different places, 100 disks of 10 GB each, every one at a 10 MB/s transfer rate, take 1K sec => 16.7 min
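The transfer-time figures on this slide can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes) and a sustained rate of 10 megabytes per second per disk, which is the rate at which the slide's totals work out; the function name is hypothetical.

```python
def read_time_seconds(total_bytes: float, bytes_per_second: float, disks: int = 1) -> float:
    """Time to scan the data when it is split evenly across `disks` read in parallel."""
    return total_bytes / disks / bytes_per_second

TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

single = read_time_seconds(1 * TB, 10 * MB)               # one 1 TB disk
parallel = read_time_seconds(1 * TB, 10 * MB, disks=100)  # 100 disks of 10 GB each

print(f"1 disk:    {single:,.0f} s = {single / 3600:.1f} h")
print(f"100 disks: {parallel:,.0f} s = {parallel / 60:.1f} min")
```

Spreading the same terabyte over 100 disks cuts the scan time by a factor of 100, which is the motivation for distributing storage across many inexpensive machines.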
  • 24. Big Data for RDBMS (Cont’d). Issues in RDBMS (Cont’d). Data integration: not good for un-/semi-structured data, and much data is unstructured (web data, log data, etc.). RDB is not good at parallelization: it cannot efficiently split 1000 tasks across 1000 inexpensive PCs
  • 25. RDBMS Issues Solution. Before: Data Warehouse. Now and future: Big Data with the Hadoop framework: Data Computation (MapReduce, Pig); Data Repositories (NoSQL DB: HBase, Cassandra, MongoDB); Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting): Hive, Mahout
  • 26. Big Data Definition: systems that support an inexpensive platform to store and compute large-scale, non-/semi-structured data
  • 27. Use Cases for NoSQL DB [1]. RDBMS replacement for high-traffic web applications; semi-structured content management; real-time analytics & high-speed logging; web infrastructure: Web 2.0, Media, SaaS, Gaming, Finance, Telecom, Healthcare, Government. Three NoSQL DB approaches: Key/Value, Column, Document
  • 28. Data Store of NoSQL DB. Key/Value store: (Key, Value) pairs. Functions: index, versioning, sorting, locking, transaction, replication. Examples: Apache Cassandra, Memcached
  • 29. Data Store of NoSQL DB (Cont’d). Column-Oriented Stores (Extensible Record Stores): store data tables as sections of columns of data rather than as rows of data, like most RDBMS (cf. sparse fields in RDBMS); well suited for OLAP-like workloads (e.g., data warehouses). Extensible records are horizontally and vertically partitioned across nodes: rows and columns are distributed over multiple nodes. Examples: BigTable, HBase, Cassandra, Hypertable
  • 30. Data Store of NoSQL DB (Cont’d). Table (StudentId, Lastname, Firstname, email): (1, Smith, Joe, smith@hi.com), (2, Jones, Mary, mary@hi.com), (3, Johnson, Cathy, cathy@hi.com). Row oriented: 1,Smith,Joe,smith@hi.com; 2,Jones,Mary,mary@hi.com; 3,Johnson,Cathy,cathy@hi.com; Column oriented: 1,2,3; Smith,Jones,Johnson; Joe,Mary,Cathy; smith@hi.com,mary@hi.com,cathy@hi.com;
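The difference between the two layouts can be sketched in a few lines, using the student table from this slide. This is only an in-memory illustration of the idea, not how any particular column store lays out data on disk; the variable names are hypothetical.

```python
# Row-oriented layout: one tuple per record.
rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]
fields = ("StudentId", "Lastname", "Firstname", "email")

# Column-oriented layout: one contiguous list per field.
columns = {name: [row[i] for row in rows] for i, name in enumerate(fields)}

# Reading one attribute touches a single list instead of every record.
last_names = columns["Lastname"]

# Rebuilding the second record needs one lookup in every column.
record_2 = tuple(columns[name][1] for name in fields)
```

Scans over a single attribute, typical of OLAP-style aggregation, favor the column layout, while fetching whole records favors the row layout.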
  • 31. HBase Schema Example (Student/Course). RDBMS: Students: (id, name, sex, age); Courses: (id, title, desc, teacher_id); S_C: (s_id, c_id, type). HBase: Students table with row id <student_id> and column families Info (Info:name, Info:sex, Info:age) and Course (Course:<course_id> = type); Courses table with row id <course_id> and column families Info (Info:title, Info:desc, Info:teacher_id) and student (student:<student_id> = type)
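One way to read this schema is that the S_C join table disappears and each enrollment is written twice, once under the student row and once under the course row. The sketch below models that with plain Python dictionaries; it does not use the HBase client API, and the row keys, sample values, and function names are hypothetical.

```python
from collections import defaultdict

# table -> row key -> "family:qualifier" -> value (plain dicts, not the HBase API)
students = defaultdict(dict)
courses = defaultdict(dict)

def enroll(student_id: str, course_id: str, enroll_type: str) -> None:
    """Record one S_C relation on both sides, replacing the RDBMS join table."""
    students[student_id][f"Course:{course_id}"] = enroll_type
    courses[course_id][f"student:{student_id}"] = enroll_type

def courses_of(student_id: str) -> list:
    """List course ids by scanning the Course family of a single student row."""
    row = students[student_id]
    return sorted(q.split(":", 1)[1] for q in row if q.startswith("Course:"))

# Hypothetical sample data
students["s1"].update({"Info:name": "Joe", "Info:sex": "M", "Info:age": 21})
courses["c9"].update({"Info:title": "Databases", "Info:desc": "Intro", "Info:teacher_id": "t3"})
enroll("s1", "c9", "elective")
```

Both "which courses does this student take" and "which students take this course" become single-row reads, at the price of keeping the two copies consistent.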
  • 32. Data Store of NoSQL DB (Cont’d). Document Store: collections and documents (vs. tables and records of RDB); used in search engines/repositories; multiple indexes to store indexed documents, with no fixed fields; not a simple key-value lookup (use the API). Functions: No locking, Replication, Transaction. Examples: MongoDB, CouchDB, ThruDB, SimpleDB
  • 33. Understanding the Document Model [1]. { _id: "A4304", author: "nosh", date: 22/6/2010, title: "Intro to MongoDB", text: "MongoDB is an open source..", tags: ["webinar", "opensource"], comments: [{author: "mike", date: 11/18/2010, txt: "Did you see the…", votes: 7}, ….] } Documents -> Collections -> Databases
  • 34. Document Model Makes Queries Simple [1]. Operators: $gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit, skip, group. Example: db.posts.find({author: "nosh", tags: "webinar"})
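The example query matches documents whose author equals "nosh" and whose tags array contains "webinar". The following pure-Python sketch emulates only that equality-and-containment behavior over an in-memory list; it is not the MongoDB driver, `matches` and `find` are hypothetical helper names, and the second and third documents are made-up sample data.

```python
def matches(doc: dict, query: dict) -> bool:
    """Equality match; a list field matches if it contains the queried value."""
    for field, wanted in query.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(actual, list):
            if wanted not in actual and wanted != actual:
                return False
        elif actual != wanted:
            return False
    return True

def find(collection, query):
    """Return every document in the collection that satisfies the query."""
    return [doc for doc in collection if matches(doc, query)]

posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"]},
    {"_id": "B0001", "author": "nosh", "title": "Other post", "tags": ["opensource"]},
    {"_id": "B0002", "author": "mike", "title": "Third post", "tags": ["webinar"]},
]

result = find(posts, {"author": "nosh", "tags": "webinar"})
```

Only the first document satisfies both conditions, so no join or separate tag table is needed to answer the query.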
  • 35. Selected Users [1]
  • 36. The Great Divide [1]. MongoDB sweet spot: Easy, Flexible, Scalable (figure labels: HBase, MongoDB)
  • 37. Solutions in Big Data Computation. Map/Reduce by Google: (Key, Value) parallel computing. Apache Hadoop: Big Data computation (MapReduce, Pig). Integrating MapReduce and RDB: Oracle + Hadoop, Sybase IQ, Vertica + Hadoop, Hadoop DB, Greenplum, Aster Data. Integrating MapReduce and NoSQL DB: MongoDB MapReduce, HBase
  • 38. Apache Hadoop. Motivated by Google Map/Reduce and GFS; an open source project of the Apache Foundation; a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Two core components: storage (HDFS: high-bandwidth clustered storage) and processing (Map/Reduce: fault-tolerant distributed processing). Hadoop scales linearly with data size and analysis complexity
  • 39. Hadoop issues. Map/Reduce is not a DB: it is an algorithm in restricted parallel computing. HDFS and HBase cannot compete with the functions in RDBMS. But useful for huge (peta- or terabytes) yet uncomplicated data: web crawling; log analysis (log files for web companies); the New York Times case
  • 40. MapReduce Pros & Cons Summary. Good when: the input, intermediate, and output data are huge; little synchronization is required; data sets are read once and batch oriented (ETL). Bad for: fast response time; large amounts of shared data; fine-grained synchronization; CPU-intensive rather than data-intensive work; continuous input streams
  • 41. MapReduce in Detail. Functions borrowed from functional programming languages (e.g. Lisp). Provides a restricted parallel programming model on Hadoop: the user implements Map() and Reduce(); the libraries (Hadoop) take care of EVERYTHING else: parallelization, fault tolerance, data distribution, load balancing
  • 42. Map. Converts input data to (key, value) pairs. map() functions run in parallel, creating different intermediate (key, value) values from different input data sets
  • 43. Reduce. reduce() combines those intermediate values into one or more final values for that same key. reduce() functions also run in parallel, each working on a different output key. Bottleneck: the reduce phase can’t start until the map phase is completely finished
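How each reducer ends up with a different set of keys can be illustrated with a small single-process sketch. It assumes a hash-based partitioner; `zlib.crc32` is only a stand-in for whatever hash the framework uses, and the function names and sample pairs are hypothetical. The grouping step runs only after every map output is available, which mirrors the barrier named on the slide.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; crc32 is a stand-in for the framework's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Group intermediate (key, value) pairs per reducer once ALL maps are done."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers

# Outputs of two map tasks (hypothetical sample data)
map_outputs = [
    [("http://hi.com", 1), ("http://hello.com", 3)],
    [("http://halo.com", 1), ("http://hello.com", 5)],
]

reducers = shuffle(map_outputs, num_reducers=2)

# Each reducer sums the values of its own keys, independently of the others.
totals = {key: sum(values) for r in reducers for key, values in r.items()}
```

Because a key always hashes to the same reducer, all values for that key meet in one place and the reducers never need to coordinate with each other.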
  • 44. Training in Big Data. Learn by yourself? You miss many important topics. Two main providers: Cloudera and Hortonworks, with hands-on exercises. Brief introduction to the Cloudera course materials, especially the MapReduce example
  • 45. Example: Sort URLs in the largest hit order. Compute the URLs with the most hits, stored in log files. Map(): input <logFilename, file text>; output: parses the file and emits <url, hit counts> pairs, e.g. <http://hello.com, 1>. Reduce(): input <url, list of hit counts> from multiple map nodes; output: sums all values for the same key and emits <url, TotalCount>, e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
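The Map() and Reduce() roles described on this slide can be simulated in one process as a minimal sketch. It is not Hadoop code: the log format (one URL per line), the file names, and the function names are assumptions made for illustration, and the final sort stands in for the "largest hit order" step.

```python
from collections import defaultdict

def map_log(log_filename: str, file_text: str):
    """Map: parse one log file and emit a (url, 1) pair per request line."""
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield (url, 1)

def reduce_hits(url: str, counts):
    """Reduce: sum all hit counts for the same url."""
    return (url, sum(counts))

# Hypothetical log files, one URL per line
logs = {
    "a.log": "http://hello.com\nhttp://hi.com\nhttp://hello.com\n",
    "b.log": "http://hello.com\nhttp://halo.com\n",
}

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for name, text in logs.items():
    for url, count in map_log(name, text):
        grouped[url].append(count)

totals = [reduce_hits(url, counts) for url, counts in grouped.items()]

# Largest hit count first; ties broken alphabetically by URL.
ranked = sorted(totals, key=lambda pair: (-pair[1], pair[0]))
```

In a real cluster the map calls would run on different nodes and the grouping would be done by the framework; only the two small functions are user code.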
  • 46. Map/Reduce for URL visits (diagram). Input Log Data -> Map1(), Map2(), …, Mapm() emit (http://hi.com, 1), (http://hello.com, 3), … and (http://halo.com, 1), (http://hello.com, 5), … -> Data Aggregation/Combine groups them into (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>) -> Reduce1(), Reduce2(), …, Reducel() output (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6)
  • 47. Legacy Example. In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all and dating back to 1851, available over the web: a four-terabyte pile of images in TIFF format. It needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files: not a particularly complicated but a large computing chore, requiring a whole lot of computer processing time
  • 48. Legacy Example (Cont’d). A software programmer at the Times, Derek Gottfrid, playing around with Amazon Web Services' Elastic Compute Cloud (EC2), uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3). In less than 24 hours, 11 million PDFs were all stored neatly in S3 and ready to be served up to visitors to the Times site. The total cost for the computing job? $240: 10 cents per computer-hour times 100 computers times 24 hours
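The cost figure follows directly from the three numbers on the slide. The sketch below reproduces it in integer cents to avoid floating-point rounding; the variable names are hypothetical.

```python
cents_per_computer_hour = 10  # 10 cents per computer-hour, as stated on the slide
computers = 100
hours = 24

computer_hours = computers * hours                      # total machine time used
total_cents = cents_per_computer_hour * computer_hours
total_dollars = total_cents // 100

print(f"{computer_hours} computer-hours cost ${total_dollars}")
```

The same 2,400 computer-hours on a single machine would take 100 days, which is why renting many machines for one day was attractive.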
  • 49. Supporters of Big Data: Hadoop Ecosystems. Apache Hadoop supporters: Cloudera (like Linux and Redhat; HiPIC is an Academic Partner); Hortonworks (Pig; consulting and training); Facebook (Hive); IBM (Jaql). NoSQL DB supporters: MongoDB; HBase, CouchDB, Apache Cassandra (originally by FB), etc.
  • 50. Pig. Developed at Yahoo Research around 2006; moved into the Apache Software Foundation in 2007. PigLatin: Pig's language; a data flow language; well suited to processing unstructured data. Unlike SQL, it does not require that the data have a schema; however, it can still leverage the value of a schema
  • 51. Hive • Developed at Facebook o Turns Hadoop into a data warehouse o Complete with a dialect of SQL for querying • HiveQL o A declarative language (SQL dialect) • Differences from PigLatin o You do not specify the data flow – You describe the result you want instead – Hive figures out how to build a data flow to achieve it o A schema is required – But you are not limited to one schema o Data can have many schemas
  • 52. Hive (Cont'd) • Similarity with PigLatin and SQL o HiveQL on its own is a relationally complete language – But not a Turing-complete language, that is, one that can express any computation o Can be extended through UDFs (User Defined Functions) written in Java – Just like Pig, this makes it Turing complete
  • 53. Jaql • Developed at IBM • A data flow language o Its native data structure format is JSON (JavaScript Object Notation) • Schemas are optional • Turing complete on its own o Without the need for extension through UDFs
  • 54. MapReduce Cons and Future • Bad for o Fast response time o Large amounts of shared data o Jobs that need fine-grained synchronization o CPU-intensive rather than data-intensive work o Continuous input streams • Hadoop 2.0: YARN o Not a product yet but will be soon
  • 55. Hadoop 2.0: YARN • Data processing applications and services o Online serving – HOYA (HBase on YARN) o Real-time event processing – Storm, S4, other commercial platforms o Tez – Generic framework to run a complex DAG o MPI: OpenMPI, MPICH2 o Master-Worker o Machine learning: Spark o Graph processing: Giraph o Enabled by allowing the use of a paradigm-specific application master [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  • 56. Big Data Supporters • Amazon AWS • Facebook • Twitter • Craigslist
  • 57. Amazon AWS • amazon.com o Consumer and seller business • aws.amazon.com o IT infrastructure business – Focus on your business, not IT management o Pay as you go • Services with many APIs o S3: Simple Storage Service o EC2: Elastic Compute Cloud – Provides many virtual Linux servers – Can run on multiple nodes: Hadoop and HBase, MongoDB
  • 58. Amazon AWS (Cont’d) • Customers on aws.amazon.com o Samsung – Smart TV hub sites: TV applications are on AWS o Netflix – ~25% of US internet traffic – ~100% on AWS o NASA JPL – Analyzes more than 200,000 images o NASDAQ – Using AWS S3 • HiPIC received research and teaching grants from AWS
  • 59. Facebook [7] • Using Apache HBase o For Titan and Puma – Message services – ETL o HBase for FB – Provides excellent write performance and good reads – Nice features: scalable, fault tolerant, works with MapReduce
  • 60. Titan: Facebook • Message services in FB o Hundreds of millions of active users o 15+ billion messages a month o 50K instant messages a second • Challenges o High write throughput – Every message, instant message, SMS, email o Massive clusters – Must be easily scalable • Solution o Clustered HBase
  • 61. Puma: Facebook • ETL o Extract, Transform, Load – Integrating data from many data sources into a data warehouse o Data analytics – Domain owners’ web analytics for ads and apps: clicks, likes, shares, comments, etc. • ETL before Puma o 8 – 24 hours – Procedure: Scribe, HDFS, Hive, MySQL • ETL after Puma o Puma – Real-time MapReduce framework o 2 – 30 secs – Procedure: Scribe, HDFS, Puma, HBase
  • 62. Twitter [8] • Three challenges o Collecting data – Scribe, as at FB o Large-scale storage and analysis – Cassandra: ColumnFamily key-value store – Hadoop o Rapid learning over Big Data – Pig, compared with Java: 5% of the code, 5% of the dev time, within 20% of the running time
  • 63. Craigslist in MongoDB [9] • Craigslist o ~700 cities, worldwide o ~1 billion hits/day o ~1.5 million posts/day o Servers – ~500 servers – ~100 MySQL servers • Migrated to MongoDB o Scalable, Fast, Proven, Friendly
  • 64. Hadoop Streaming • Hadoop MapReduce for non-Java code: Python, Ruby • Requirements o A running Hadoop installation o The Hadoop Streaming API – hadoop-streaming.jar o Mapper and Reducer programs – Simple conversion from sequential code o STDIN > mapper > reducer > STDOUT
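The STDIN > mapper > reducer > STDOUT flow can be tried locally before submitting a job. The following is a minimal word-count sketch, assuming tab-separated `key<TAB>value` lines between the two steps; the function names and the `run_local` helper are hypothetical and are not part of Hadoop. The in-memory sort stands in for the sort that happens between the map and reduce phases.

```python
import io
from itertools import groupby


def mapper(stdin, stdout):
    """Emit one 'word<TAB>1' line per word read from stdin."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


def reducer(stdin, stdout):
    """Sum counts per word; input lines must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")


def run_local(text):
    """Simulate the pipeline in memory: mapper, sort, then reducer."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    # Sorting the mapper output groups identical keys together.
    shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
    reduced = io.StringIO()
    reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()
```

To turn these into streaming scripts, each function would go in its own file and be called with `sys.stdin` and `sys.stdout`, which is the "simple conversion from sequential code" the slide refers to.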
  • 65. Hadoop Streaming • MapReduce Python execution o http://wiki.apache.org/hadoop/HadoopStreaming • Syntax: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options] • Options: o -input <path> DFS input file(s) for the Map step o -output <path> DFS output directory for the Reduce step o -mapper <cmd|JavaClassName> The streaming command to run as the mapper o -reducer <cmd|JavaClassName> The streaming command to run as the reducer o -file <file> File/dir to be shipped in the Job jar file • Example: $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
  • 66. Conclusion • Era of Big Data • Need to store and compute Big Data o Many solutions, but mainly Hadoop o Storage: NoSQL DB o Computation: Hadoop MapReduce • Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns, …
  • 67. Questions?