A Critique of the Proposed National Education Policy Reform
Hadoop dev 01
1. NYC Data Science Academy
Hadoop Application Development with Real Cases
2. NYC Data Science Academy
Multi-layer Model
3. NYC Data Science Academy
Data Pyramid and Roles
Business personnel
ETL Engineer
Data Warehouse Engineer
Analyst
Data Visualization Engineer
IT support: operations and maintenance, programmers
4. NYC Data Science Academy
Data Analysis
Analyze collected data with statistical methods for a specific purpose, then
interpret the results and put them to use
5. NYC Data Science Academy
Data Mining
Data mining is a technique for retrieving hidden information from data. It is a process that applies
knowledge-discovery algorithms to large databases and presents the associations found to users.
Origins: hypothesis testing, pattern recognition, artificial intelligence, machine learning
Common data mining projects: association rules, clustering, outlier analysis
Case: beer and diapers
Science: Detecting Novel Associations in Large Data Sets
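As a rough illustration of what an association rule such as "diapers -> beer" measures, the sketch below computes support and confidence over a handful of shopping baskets. The baskets are invented for illustration and the function names are hypothetical; this is a minimal sketch, not the method used in the case study.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for basket in baskets if wanted <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Rule under test: diapers -> beer
rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually considered interesting only when both numbers clear thresholds chosen by the analyst; those thresholds are a judgment call and are not specified in the slides.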
6. NYC Data Science Academy
Business Intelligence
BI = Data Warehouse (Storage) + Data Analysis and Data Mining (Analysis) +
Reporting (Presentation)
Our course
7. NYC Data Science Academy
Data Analysis Algorithms
Popular Algorithms
8. NYC Data Science Academy
Regression
9. NYC Data Science Academy
Time Series Analysis
10. NYC Data Science Academy
Classifier
11. NYC Data Science Academy
Clustering
12. NYC Data Science Academy
Association Rules
13. NYC Data Science Academy
Data Analysis
Data Analysis Tools
14. NYC Data Science Academy
Popular Data Analysis Tools Ranking
15. NYC Data Science Academy
Data Analysis Stages
Stage 1: Dominated by business personnel
Stage 2: Dominated by both business personnel and analysts
Stage 3: Dominated by analysts
16. NYC Data Science Academy
Data Analysis in Stage 1
Business staff set all the requirements and most of the analysis plan
Drawing on experience, business staff select features and set thresholds;
IT staff search for and integrate the data; analysts produce the report
Feature selection and the choice of thresholds are based on experience and
personal knowledge
Suitable for simple cases; the analysis technique is equivalent to the simplest
decision tree
Business staff have valuable experience and are hard to replace;
analysts only produce charts and are easily replaced
This is common in traditional industries
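To make the "simplest decision tree" comparison concrete, the sketch below shows a hand-written rule with one feature and one threshold, which is structurally a decision tree of depth one. The feature name, the threshold value, and the customer figures are all hypothetical, chosen only for illustration.

```python
def stage1_rule(monthly_spend, threshold=100.0):
    """Hand-set rule: one feature, one threshold, two outcomes.

    Both the feature and the threshold are picked from experience,
    not learned from data. The values here are hypothetical.
    """
    return "high value" if monthly_spend >= threshold else "regular"


# Hypothetical customers and their monthly spend.
customers = {"A": 250.0, "B": 40.0, "C": 100.0}

# IT staff would pull the data; the analyst only tabulates the outcome.
labels = {name: stage1_rule(spend) for name, spend in customers.items()}
print(labels)
```

Nothing in this rule is estimated from the data, which is why the business staff who choose the feature and threshold carry most of the value at this stage.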
17. NYC Data Science Academy
Data Analysis in Stage 2
More complex. Business staff can analyze a small number of
data records but cannot identify all the features or the
relationships among them. They have no experience with large
numbers of samples.
Analysts clean the data, select features, and finally build a
suitable model to solve the problem.
Business staff and analysts evaluate the results together, so the
analysis is very likely to succeed. Analysts prefer this stage because
their ability and value are recognized.
18. NYC Data Science Academy
Spammers in WordPress
19. NYC Data Science Academy
Data Analysis in Stage 3
Business staff have no experience with the
case and cannot offer any useful prior
knowledge
Analysts use various tools and models to
mine the data in the hope of making an interesting
discovery
This is the analyst's ideal world, but it is likely to
fail
Business staff cannot get involved, and they
dislike this stage
20. NYC Data Science Academy
Step Forward
First stage (gold on the ground) -> second
stage (gold beneath the ground) -> third stage (gold
deeply buried)
If analysts are reckless, business staff will be reluctant to
help
Data analysis is rooted in the business context. The
goal of analysis is to increase profit. Successful analysis
cannot be separated from the business
An interesting topic is more important than the model
21. NYC Data Science Academy
What is Big Data
22. NYC Data Science Academy
Features of Big Data
23. NYC Data Science Academy
Challenges for Analysts
Insertion and querying both become bottlenecks as the amount of data grows
The trend of integrating analysis results into user-facing applications demands faster
real-time computation and shorter response times
More complex models require more expensive computation
24. NYC Data Science Academy
Dilemma of Traditional Data Analysis
Tools
R, SAS, and SPSS are experimental tools
The data size they can handle is limited by memory size
An Oracle database can hold large volumes of data, but it lacks specialized, fast
analysis capabilities
Sampling is a limited solution; it is not useful for clustering or recommendation
systems
Solution: Hadoop cluster and Map-Reduce parallel computing
25. NYC Data Science Academy
Case 1: Analysis and monitoring for a
telecommunications company
26. NYC Data Science Academy
Case 1: Analysis and monitoring for a
telecommunications company
Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core
CPU, RAC with two nodes, one node for insertion and the other for queries
Storage: HP virtual storage, over 1000 disks
Architecture: Oracle RAC with two nodes
Bottlenecks: 1. Insertion 2. Query
27. NYC Data Science Academy
Case 2: DNA database
28. NYC Data Science Academy
Case 3: Social analysis, activity
fingerprint detection
Public voice mail intersect
                IMSI 1   IMSI 2   ……   IMSI n   Total call duration
User A IMSI     20%      12%      ……   5%       365
User B IMSI     15%      13%      ……   2%       310

Public SMS intersect
                IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count
User A IMSI     50%      10%      ……   5%       200
User B IMSI     20%      13%      ……   2%       260

Public base station
                CGI 1    CGI 2    ……   CGI n    Shutdown
User A IMSI     20%      12%      ……   5%       20%
User B IMSI     15%      13%      ……   2%       5%

Public fingerprint (eigenvector)
Voice, User A:          (0.2, 0.12, …, 0.05)
Voice, User B:          (0.15, 0.13, …, 0.02)
SMS, User A:            (0.5, 0.1, …, 0.05)
SMS, User B:            (0.2, 0.13, …, 0.02)
Base station, User A:   (0.2, 0.12, …, 0.05, 0.2)
Base station, User B:   (0.15, 0.13, …, 0.02, 0.05)
29. NYC Data Science Academy
Case 3: Social analysis, activity
fingerprint detection
When the angle θ between two fingerprint vectors equals 90°, the two vectors are independent
When θ equals 0, the two vectors are perfectly dependent
The closer θ is to 0, the more dependent the vectors are
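Assuming the quantity compared against 0 and 90 above is the angle between two fingerprint vectors, it can be computed from the dot product. The sketch below is a minimal illustration under that assumption; the function name is hypothetical, and the sample vectors are shortened three-component stand-ins, because the slide elides the middle components of the real fingerprints.

```python
import math


def angle_degrees(u, v):
    """Angle between vectors u and v in degrees, via the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical, truncated fingerprints (the slide elides most components).
user_a = (0.2, 0.12, 0.05)
user_b = (0.15, 0.13, 0.02)

theta = angle_degrees(user_a, user_b)
print(f"angle between fingerprints: {theta:.1f} degrees")
```

A small angle suggests the two IMSIs share an activity pattern; where to draw the cutoff is not stated in the slides and would have to be chosen per application.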
30. NYC Data Science Academy
Case 3: Social analysis, VIP detection
31. NYC Data Science Academy
The Solution Analysts Are Looking For
Eliminates the bottleneck completely for the foreseeable future
Allows existing techniques, such as SQL and R, to be migrated smoothly
Cost of the new platform: hardware and software, redevelopment, skills training,
maintenance
32. NYC Data Science Academy
Path to Big Data
33. NYC Data Science Academy
Idea of Hadoop
34. NYC Data Science Academy
Map-Reduce Programming
35. NYC Data Science Academy
Map-Reduce program for meteorological
data analysis
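The slide names a Map-Reduce program for meteorological data without showing it in this transcript. As a hedged illustration of the programming model only, the sketch below simulates the map, shuffle, and reduce phases in a single Python process to find the maximum temperature per year. The task, the "year,temperature" record format, and all function names are assumptions; this is not the Hadoop API and not the program from the slide.

```python
from collections import defaultdict

# Hypothetical records in "year,temperature" form. Real weather files use
# their own formats and would need a matching parser.
records = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]


def map_phase(line):
    """Map: parse one record and emit a (year, temperature) pair."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(year, temperatures):
    """Reduce: collapse one key's values to the maximum temperature."""
    return year, max(temperatures)


def run(records):
    pairs = (pair for line in records for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(year, temps) for year, temps in sorted(grouped.items()))


result = run(records)
print(result)
```

On a real cluster the map calls run in parallel across input splits and the shuffle moves data between machines; the single-process version only shows how the work is divided between the two functions.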
36. NYC Data Science Academy
Map-Reduce implementation for popular
algorithms
37. NYC Data Science Academy
Map-Reduce implementation for popular
algorithms
38. NYC Data Science Academy
Why not Hadoop?
Java?
Hard to control?
Hard to integrate data?
Hadoop vs Oracle
39. NYC Data Science Academy
Analysis on a Hadoop system
Mainstream: Java programs
Lightweight scripting language: Pig
Smooth migration from SQL: Hive
NoSQL: HBase
40. NYC Data Science Academy
Family of Hadoop
41. NYC Data Science Academy
Pig
Pig can be treated as client software
for Hadoop: it connects to Hadoop
and runs analyses
Pig is convenient for users unfamiliar
with Java; it uses Pig Latin, a SQL-like
language for describing data flows
Pig Latin can perform sorting, filtering,
summing, grouping, and joining, and lets users define
custom functions. It is a lightweight
scripting language for data manipulation and
analysis
Pig can be treated as the mapping
from Pig Latin to Map-Reduce
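The data-flow style described above (filter, group, sum, sort) can be pictured as a chain of small steps, each feeding the next. The sketch below writes that chain in plain Python as an illustration; it is not Pig Latin and not how Pig is implemented. The records, field layout, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (user, city, amount) records, invented for illustration.
orders = [
    ("ann", "NYC", 30),
    ("bob", "NYC", 5),
    ("ann", "SF", 20),
    ("cat", "NYC", 40),
    ("bob", "SF", 15),
]


def filter_rows(rows, min_amount):
    """Filtering step: keep rows whose amount is at least min_amount."""
    return [row for row in rows if row[2] >= min_amount]


def group_by_user(rows):
    """Grouping step: collect rows under their user key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def sum_amounts(groups):
    """Summing step: one total per group."""
    return {user: sum(row[2] for row in rows) for user, rows in groups.items()}


def order_desc(totals):
    """Sorting step: largest total first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# The flow reads top to bottom, one step per line, like a data-flow script.
kept = filter_rows(orders, min_amount=10)
grouped = group_by_user(kept)
totals = sum_amounts(grouped)
ranking = order_desc(totals)
print(ranking)
```

In Pig the corresponding steps would be written with operators such as FILTER, GROUP, FOREACH, and ORDER, and Pig would translate the script into Map-Reduce jobs on the user's behalf.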
42. NYC Data Science Academy
Hive
A data warehouse tool that can turn
raw data structures in Hadoop into
tables in Hive
Supports HiveQL, a language almost
the same as SQL; its functionality is the
same as SQL except for updating,
indexing and
Can be treated as the mapping from
SQL to Map-Reduce
Offers interfaces for shell,
JDBC/ODBC, Thrift, and Web
43. NYC Data Science Academy
Features of Mahout
Mahout provides scalable machine learning
algorithms (Map-Reduce implementations), but the
Hadoop platform is not required. The
core library also has efficient algorithms
for a single machine
Mature and popular algorithms include
1. Frequent Itemset Mining
2. Clustering
3. Classifier
4. Recommendation System
5. Frequent Subgraph Mining
44. NYC Data Science Academy
Reference Textbooks
45. NYC Data Science Academy
Reference Textbooks
46. NYC Data Science Academy
Reference Textbooks
47. NYC Data Science Academy
Reference Textbooks
48. NYC Data Science Academy
Typical Experiment Environment (with
server)
Server: ESXi, capable of deploying multiple virtual machines and able to run 3
machines at the same time
PC: Linux or Windows+Cygwin; Linux can be standalone or a virtual machine
SSH: Use the ssh command under Linux, and SecureCRT or PuTTY under Windows, to
connect to the remote Linux server
VMware client: Management of ESXi
Hadoop: Use version 1.x or 2.x
49. NYC Data Science Academy
Typical Experiment Environment (with
only a PC or laptop running Windows)
At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit machine can use
only a little more than 3 GB of memory.
Install VMware Workstation or VirtualBox
Deploy 3 virtual machines and run them at the same time. If only two VMs can run,
treat the host as a node (via Cygwin), and use bridged networking for the virtual network
Install Linux and Java
Older computers can use a pseudo-distributed environment
50. NYC Data Science Academy
Experiment Environment
Deploy Pig
Deploy Hive
Deploy Mahout
51. NYC Data Science Academy
List of Cases in the Course
Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)
LBS application for a telecommunications company; analysis of users' mobile phone traces (Map-Reduce)
User analysis for a telecommunications company; labeling duplicate users by their call
fingerprints (Map-Reduce)
Recommendation system for an e-commerce company (Map-Reduce)
Complex recommendation system application (Mahout)
Social network; distance between users; community detection (Pig)
Importance of nodes in a social network (Map-Reduce)
Application of clustering algorithms; VIP analysis (Map-Reduce, Mahout)
Financial data analysis; retrieving reverse repurchase information from historical data (Hive)
Setting stock strategies with data analysis (Map-Reduce, Hive)
GPS application; sign-in data analysis (Pig)
Implementation and optimization of sorting on Map-Reduce
Middleware development; cooperation among multiple Hadoop clusters