Big Data has been popular last 10 years using Hadoop and Spark for data analysis and prediction with large scale data sets in distributed parallel computing systems. Its platform has expanded using NoSQL DB and Search Engine as well and has been more popular along cloud computing. Then, Deep Learning has become a buzzword past several years using GPU and Big Data. It makes even small companies and labs to own supercomputers with a small amount of budgets, which is the situation of “Dream Comes True” in the IT and business. In this talk, the history and trends of Big Data and AI platforms are introduced and Big Data predictive analysis should be presented.
Student profile product demonstration on grades, ability, well-being and mind...
Introduction to Big Data and its Trends
1. Jongwook Woo
HiPIC
CalStateLA
Pacific States University
June 4 2020
Jongwook Woo, PhD, jwoo5@calstatela.edu
Big Data AI Center (BigDAI)
California State University Los Angeles
Introduction
to Big Data and its Trends
2. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Big Data Predictive Analysis
Summary
3. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
4. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Universities in Los Angeles
West
North
5. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
6. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Myself: CDH, Oracle using Hadoop Big Data
https://www.cloudera.com/more/customers/csula.html
7. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Myself: Partners for Services
8. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Myself: Collaborations
SOFTZEN
9. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Big Data Predictive Analysis
Summary
10. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Issues
Large-Scale data
Tera-Byte (1012), Peta-byte (1015)
– Because of web
– IoT (Streaming data, Sensor Data) in SmartX
– Social Computing, smart phone, online game
– Bioinformatics, …
Legacy approach
Can do
– Improve the speed of CPU
Increase the storage size
Only Problem
– Too expensive
11. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Handling: Traditional Way
12. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Handling: Traditional Way
Becomes too Expensive
13. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Issues
Large Scale Data
Too big
Non-/Semi-structured data
3 Vs, 4 Vs,…
– Velocity, Volume, Variety
Traditional Systems can handle them
– But Again, Too expensive
Cannot handle with the legacy approach
Need new systems
Non-expensive
14. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on non-expensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with non-expensive computers
Own super computers
Published papers in 2003, 2004
15. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Handling: Another Way
Not Expensive
From 2017 Korean
Blockbuster Movie,
“The Fortress”
(남한산성)
16. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Handling: Another Way
But Works Well with the crazy massive data set
Battle of Nagashino,
1575, Japan
17. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Handling: Another Way
Not Expensive
http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677
AD 1409 (Year 9 of King Tae-Jong, Chosun Dynasty, Korea) By Choi family:
최해산(崔海山), 아버지 최무선(崔茂宣)
[Ref] 조선의 비밀 병기 : 총통기 화차(銃筒機火車)|작성자 도심
18. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Big Data
Big Data (Hadoop, Spark, Distributed Deep Learning)
Cluster for Compute and Store
(Distributed File Systems: HDFS, GFS)
…
19. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Super Computer vs Big Data vs Cloud
Traditional Super Computer
(Parallel File Systems: Lustre, PVFS, GPFS)
Cluster for Store
Big Data (Hadoop, Spark, Distributed Deep Learning)
Cluster for Compute and Store
(Distributed File Systems: HDFS, GFS)
However, Cloud Computing adopts
this separated architecture:
with High Speed N/W (> 10Gbps)
and Object Storage
Cluster for Compute
20. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Definition: Big Data
Non-expensive platform, which is distributed parallel computing
systems and that can store a large scale data and process it in
parallel [1, 2]
Apache Hadoop
– Non-expensive Super Computer
– More public than the traditional super computers
• Anyone can own super computer as open source
– In your university labs, small companies, research centers
Other solutions with storage and computing services
– Spark
• mostly integrated into Hadoop with Hadoop community
– NoSQL DB (Cassandra, MongoDB, Redis, Hbase,…)
– ElasticSearch
21. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
What is Hadoop?
21
Apache Hadoop Project in
Jan, 2006 split from Nutch
Hadoop Founder:
o Doug Cutting
Apache Committer:
Lucene, Nutch, …
22. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Big Data: Linearly Scalable
Some people questions that the system to handle 1 ~ 3GB of
data set is not Big Data
Well…. add more servers as more data in the future in Big Data platform
– it is linearly scalable once built
– n time more computing power ideally
Data Size: < 3 GB Data Size: 200 TB >
Add n
servers
23. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Big Data is great for Small Business
Your Business data is the value
Customer data
Operational data
You have your specific data
Big Company does not have a specific data as you have
Potentials for your domain
Your customer data
– Smart marketing and Sales
– Advertisement
Your operational data
– Efficient operation, For Example, Smart*:
• Smart Factory, Smart City
24. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Big Data Data Analysis & Visualization
Sentiment Map of Alphago
Positive
Negative
25. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
K-Election 2017
(April 29 – May 9)
26. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Businesses popular in 5 miles of CalStateLA,
USC , UCLA
27. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Jams and other traffic incidents reported
by users in Dec 2017 – Jan 2018:
(Dalyapraz Dauletbak)
28. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Big Data Analysis and Prediction Flow
Data Collection
Batch API: Yelp, Google
Streaming: Twitter, Apache
NiFi, Kafka, StereamSets,
Storm
Open Data: Government
Data Storage
HDFS, S3, Object Storage,
NoSQL DB (Couchbase)…
Data Filtering
Hive, Pig
Data Analysis and Science
Hive, Pig, Spark, Deep Learning,
BI Tools (Qlik, Tableau, …)
Data Visualization
Qlik, Excel PowerMap,
Tableau, Looker, …
- Big Data Engineering
- Big Data Analysis
- Big Data Science Deep Learning
- Data Visualization
29. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Big Data Predictive Analysis
Summary
30. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Big Data Analysis and Prediction
Big Data Analysis
Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,..
Big Data for Data Analysis
– How to store, compute, analyze massive dataset?
Big Data Science
How to predict the future trend and pattern with the massive
dataset? => Machine Learning
31. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Spark
Limitation in MapReduce
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkley AMP Lab
Started by Matei Zaharia in 2009,
– and open sourced in 2010
In-Memory storage for intermediate data
20 ~ 100 times faster than
– MapReduce
Good in Machine Learning => Big Data Science
– Iterative algorithms
32. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Spark (Cont’d)
Spark ML
Supports Machine Learning libraries
Process massive data set to build prediction models
33. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Deep Learning
Machine Learning
Has been popular since Google Tensorflow, Nov 9 2015
Multiple Cores in GPU
– Even with multiple GPUs and CPUs
Parallel Computing in a chip
GPU (Nvidia GTX 1660 Ti)
1280 CUDA cores
Other Deep Learning Libraries
Tensor Flow
PyTorch
Keras
Caffe, Caffe2
Microsoft Cognitive Toolkit (Previously CNTK)
Apache Mxnet
DeepLearning4j
…
35. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Deep Learning
CNN
Image Recognition
Video Analysis
NLP for classification, Prediction
RNN
Time Series Prediction
Speech Recognition/Synthesis
Image/Video Captioning
Text Analysis
– Conversation Q&A
GAN
Media Generation
– Photo Realistic Images
Human Image Synthesis: Fake faces
36. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Data Scale Driving: Deep Learning Process
Deep Learning and Massive Data [3]
“Machine Learning Yearning” Andrew Ng 2016
37. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Deep learning experts
The
Chasm
Big Data Engineers, Scientists, Analysts, etc.
Another Gap between Deep Learning and Big Data
Communities [6]
38. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Leveraging Big Data Cluster
Existing Big Data cluster with massive data set without using
Big Data
Too slow in data
migration and
single server fails
Single GPU
server for Deep
Learning?
Single server for
Python and R
Traditional
Machine Learning?
Big Data Cluster
39. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Deep Learning with Spark
What if we combine Deep Learning and Spark?
40. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Leveraging Big Data Cluster
Existing Big Data cluster
Big Data Engineering
Big Data Analysis
Big Data Science
Distributed Deep Learning
– Integrate Deep Learning to the cluster
Not needs data migration and can leverage the
parallel computing and existing large scale data
Big Data Cluster
41. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Deep Learning with Spark
Deep Learning Pipelines for Apache Spark
Databricks
TensorFlowOnSpark
Yahoo! Inc
BigDL (Distributed Deep Learning Library for Apache Spark)
Intel
DL4J (Deeplearning4j On Spark)
Skymind
Distributed Deep Learning with Keras & Spark
Elephas
42. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Big Data Prediction with DDL
DDL: Distributed Deep Learning
Tensor Flow
Distributed Training and Inference in Spark cluster
DDL
43. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Spark ML and DDL [2-5]
Deep Learning in Spark cluster
Distributed Deep Learning
DDL
DDL lib
DDL lib
Deep Learning in Spark
44. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Big Data Predictive Analysis: Use Case
Summary
45. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
AWS Review Dataset
Predictive Analysis
Prediction of Users’ ratings
– important measures for purchase and selling
Spark ML: ALS (Alternating Least Squares) algorithm
DDL (Distributed Deep Learning): Neural Collaborative Filtering(NCF)
Dataset : - https://s3.amazonaws.com/amazon-reviews-
pds/tsv/index.txt
Products reviewed between 2005 and 2015 are analyzed
Total product reviews : 9.57 million
File Size : 5.26 GB
46. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Summary: Performance
47. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Summary: Mean Absolute Error
48. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Big Data Predictive Analysis
Summary
49. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Summary
Introduction to Big Data
Spark ML for Big Data Science
Distributed Deep Learning with Spark
DDL provides more accuracy with the similar performance by
leveraging the Big Data cluster
50. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Questions?
51. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Precision vs Recall
True Positive (TP): Fraud? Yes it is
False Negative (FN): No fraud? but it is
False Positive (FP): Fraud? but it is not
Precision
TP / (TP + FP)
Recall
TP / (TP + FN)
Ref: https://en.wikipedia.org/wiki/Precision_and_recall
Positive:
Event occurs
(Fraud)
Negative: Event
does not
Occur (non
Fraud)
52. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
References
1. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial
Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (APJIS),
VOL.28│NO.4│December 2018, pp308~319
2. Jongwook Woo, DMKD-00150, “Market Basket Analysis Algorithms with MapReduce”, Wiley
Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp445-
452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-
us/azure/machine-learning/machine-learning-algorithm-choice
5. “Big Data Analysis using Spark for Collision Rate Near CalStateLA” , Manik Katyal, Parag Chhadva, Shubhra
Wahi & Jongwook Woo, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-
google-tensorflow-on-apache-spark
8. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-
spark
53. Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
References
9. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark,
https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-
spark
10. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark,
https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-
apache-spark-keynote-by-ziya-ma
11. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-
learning-with-apache-spark-and-tensorflow.html
12. Tensor Flow Deep Learning Open SAP
13. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-
solutions-68137094/6
14. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
15. https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data
16. https://blogs.msdn.microsoft.com/andreasderuiter/2015/02/09/performance-measures-in-azure-ml-
accuracy-precision-recall-and-f1-score/