AI on Big Data
1. Jongwook Woo
HiPIC
CalStateLA
Dong-Eui University
College of Commerce and Economics, Department of Economics, Professor 임동순 (Lim Dong-Soon)
May 29 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Introduction to AI on Big Data
2. High Performance Information Computing Center
Jongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Artificial Intelligence
Artificial Intelligence and Big Data
Summary
3. High Performance Information Computing Center
Myself
Experience:
Since 2002, Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix Online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present : Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
4. High Performance Information Computing Center
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
5. High Performance Information Computing Center
Myself
Experience (Cont’d): Bringing Big Data R&D and training to Korea since 2009
Collaborating with the City of Los Angeles since 2016
– Collect, search, and analyze city data
• Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Korea: Yonsei, Gachon, DongEui
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
7. High Performance Information Computing Center
Experience in Big Data
Collaboration
Council Member of IBM Spark Technology Center
City of Los Angeles for OpenHub and Open Data
Startup Companies in Los Angeles
External Collaborator and Advisor in Big Data
– IMSC of USC
– Pennsylvania State University
– The Big Link, Softzen, Wiken in Korea
Grants
Research and Education Grants from IBM Bluemix, Microsoft Windows Azure, and Amazon AWS
Partnership
Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
9. High Performance Information Computing Center
Contents
Myself
Introduction To Big Data
Artificial Intelligence
Artificial Intelligence and Big Data
Summary
10. High Performance Information Computing Center
Data Issues
Large-scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
Too expensive
Need new systems
Inexpensive
11. High Performance Information Computing Center
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Google's own supercomputers
Published papers in 2003 (GFS) and 2004 (MapReduce)
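As a rough illustration of the MapReduce idea named above, here is a minimal single-machine sketch in plain Python. It is not Google's or Hadoop's implementation. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for this example only.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # On a real cluster each document (or file block) would be mapped on a
    # different machine; here everything runs in one process.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data on hadoop", "big data on spark"])
```

The map and reduce steps never share state, which is what lets a framework spread them over many inexpensive machines.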
12. High Performance Information Computing Center
What is Hadoop?
Hadoop Founder:
o Doug Cutting
Apache Committer:
Lucene, Nutch, …
13. High Performance Information Computing Center
Super Computer vs Hadoop
Parallel vs. Distributed file systems by Michael Malak
Updated by Jongwook Woo
[Figure: Super Computer uses a Cluster for Store plus a separate Cluster for Compute; Hadoop uses a single Cluster for Compute/Store]
15. High Performance Information Computing Center
Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
16. High Performance Information Computing Center
Definition: Big Data
Inexpensive frameworks, built as distributed parallel systems, that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store your data and run your applications
– In your university labs, small companies, research centers
Others
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch
17. High Performance Information Computing Center
NoSQL DB
Key-Value
Memcached, MemcacheDB, Redis
Column Oriented (Column Family Store)
BigTable, HBase
Cassandra (Key-Value Column Oriented)
Amazon SimpleDB
Document Oriented
MongoDB, Couchbase, CouchDB
Graph Oriented
Neo4j, InfiniteGraph
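To make the four NoSQL families above more concrete, the sketch below models each one with plain Python dictionaries. It only shows the shape of the data and is not the API of any listed product. All keys, names, and values are made up for the example.

```python
# Key-value: one opaque value per key.
key_value = {"user:1": "alice"}

# Column family: row key -> column family -> column -> value.
column_family = {
    "user:1": {
        "profile": {"name": "alice", "org": "HiPIC"},
        "activity": {"last_login": "2018-05-29"},
    }
}

# Document: a nested, self-describing record.
document = {"_id": 1, "name": "alice", "skills": ["Hadoop", "Spark"]}

# Graph: nodes plus labeled edges (source, relation, target).
graph = {
    "nodes": {1: "alice", 2: "HiPIC"},
    "edges": [(1, "WORKS_AT", 2)],
}


def neighbors(graph_data, node_id, relation):
    """Follow edges of one relation type from a node."""
    return [
        graph_data["nodes"][target]
        for source, rel, target in graph_data["edges"]
        if source == node_id and rel == relation
    ]


employer = neighbors(graph, 1, "WORKS_AT")
```

The same fact (alice works at HiPIC) is a single lookup in the key-value and column-family shapes and a traversal in the graph shape, which is the usual reason for choosing one family over another.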
18. High Performance Information Computing Center
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
– 20 to 100 times faster than MapReduce, which relies on network and disk
Good for Machine Learning
– Iterative algorithms
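The toy simulation below sketches why keeping intermediate data in memory helps iterative algorithms. It uses neither Spark nor Hadoop. `SimulatedDisk`, `refine`, `run_from_disk`, and `run_in_memory` are hypothetical names invented for this example, and the "algorithm" is a placeholder that moves an estimate toward the data mean.

```python
class SimulatedDisk:
    """Holds a dataset and counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def refine(estimate, data):
    """One iteration of a toy algorithm: move halfway toward the data mean."""
    return estimate + 0.5 * (sum(data) / len(data) - estimate)


def run_from_disk(disk, iterations):
    """MapReduce-style: every iteration re-reads its input from disk."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, disk.load())
    return estimate


def run_in_memory(disk, iterations):
    """Spark-style: read once, then iterate over the cached copy."""
    cached = disk.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, cached)
    return estimate


disk_a = SimulatedDisk([1.0, 2.0, 3.0])
disk_b = SimulatedDisk([1.0, 2.0, 3.0])
result_a = run_from_disk(disk_a, iterations=10)
result_b = run_in_memory(disk_b, iterations=10)
```

Both runs compute the same estimate, but the first touches the disk once per iteration and the second only once in total.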
19. High Performance Information Computing Center
Spark and Hadoop
Spark
File Systems: Tachyon
Resource Manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop cluster
Cloud Computing
– Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
Hadoop vendors
– HDP, CDH
Databricks: Spark on AWS & Azure
– No Hadoop ecosystems
20. High Performance Information Computing Center
Sentiment Map of AlphaGo
Positive
Negative
21. High Performance Information Computing Center
Sentiment Map of Lee Se-Dol vs AlphaGo
YouTube video: “alphago sentiment” by Google
The sentiment of the World in Geo and Time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
23. High Performance Information Computing Center
Map of Crimes That Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015
24. High Performance Information Computing Center
Review count of popular sub-categories of business
25. High Performance Information Computing Center
Popular businesses within 5 miles of CalStateLA, USC, and UCLA
26. High Performance Information Computing Center
Average Undergraduates Receiving Pell Grant in Each College
East Georgia State College: $2,854 Avg.
PELL grant: 97.285%
27. High Performance Information Computing Center
Big Data Analysis Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
- Big Data Engineering
- Big Data Analysis
- Big Data Science
- Data Visualization
28. High Performance Information Computing Center
Terms
We know
Data Engineering
– Collect, clean, filter data
Data Analysis
– Find insights from the data
Data Science (Predictive Analysis)
– Predict the trend or pattern from the existing data
Do we know?
Big Data Analysis and Science
– Using Big Data for Data Analysis and Science
• Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, …
– For Massive Data Set
• How to store and compute?
30. High Performance Information Computing Center
Contents
Myself
Introduction To Big Data
Artificial Intelligence
Artificial Intelligence and Big Data
Summary
33. High Performance Information Computing Center
Deep Learning [9]
• Neural Networks
• Multi-Layer Perceptron
• Convolutional Neural Networks
34. High Performance Information Computing Center
Convolutional Neural Networks
• Good at problems like image classification.
35. High Performance Information Computing Center
Recurrent Neural Networks (RNN)
• Has 3 types of parameters
▫ W – Hidden weights
▫ U – Hidden to Hidden weights
▫ V – Hidden to Label weights
• Good for text processing such as sentiment analysis:
• My Projects > sapDeepLearningTensorflow > Week_03_Unit_05_S
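A minimal sketch of how the three RNN parameter sets W, U, and V could fit together in one time step, written in plain Python without TensorFlow. It assumes W maps the input to the hidden state, uses tanh and softmax as common textbook choices, and picks toy sizes and weight values arbitrarily. None of this comes from the slide.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x, h_prev, W, U, V):
    """One time step: h = tanh(W x + U h_prev), y = softmax(V h)."""
    pre = [a + b for a, b in zip(matvec(W, x), matvec(U, h_prev))]
    h = [math.tanh(p) for p in pre]
    y = softmax(matvec(V, h))
    return h, y


# Toy sizes (assumptions): 2 input features, 3 hidden units, 2 labels.
W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]                         # input -> hidden
U = [[0.0, 0.1, -0.1], [0.2, 0.0, 0.3], [-0.3, 0.1, 0.0]]          # hidden -> hidden
V = [[0.5, -0.4, 0.1], [-0.2, 0.3, 0.6]]                           # hidden -> label

h = [0.0, 0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    # The hidden state h carries information from earlier inputs forward.
    h, y = rnn_step(x, h, W, U, V)
```

Because `h` is fed back in at every step, the final `y` depends on the whole sequence, which is why this structure suits text tasks such as sentiment analysis.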
36. High Performance Information Computing Center
Key takeaways from modeling Deep Neural Networks
Neural networks are resource intensive
o Typically require huge dedicated hardware (RAM, GPUs)
Parameter space is huge
o 100s of thousands of parameters
o Tuning is important
Architecture choice is important:
o See http://www.asimovinstitute.org/neural-network-zoo/
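To give a feel for the "100s of thousands of parameters" point, the sketch below counts weights and biases in a small fully connected network. The layer sizes (784 inputs, 256 hidden units, 10 outputs) are an assumption chosen for illustration and are not taken from the slides.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix between two layers
        total += fan_out           # one bias per unit in the next layer
    return total


# Assumed example: a 784-256-10 multi-layer perceptron.
mlp_params = count_parameters([784, 256, 10])
```

Even this small network already has about two hundred thousand trainable parameters, before any hyperparameters are tuned.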
37. High Performance Information Computing Center
Contents
Myself
Introduction To Big Data
Artificial Intelligence
Artificial Intelligence and Big Data
Summary
38. High Performance Information Computing Center
Recap
Spark:
an efficient framework for running computations on thousands of computers
TensorFlow:
a high-performance numerical framework
Get the best of both
Simple API for distributed numerical computing
Can leverage the hardware of the cluster
39. High Performance Information Computing Center
Deep Learning + Apache Spark
Investment in Big Data infrastructure
GPUs
o Require specialized hardware
o Niche use cases
Can enterprises reuse existing infrastructure for deep learning applications?
Which deep learning use cases can leverage Apache Spark?
40. High Performance Information Computing Center
Spark using TensorFlow [8, 9]
Neural networks
have seen spectacular progress during the last few years
now the state of the art in image recognition and automated translation.
TensorFlow
a new framework released by Google
– for numerical computations and neural networks.
Spark and TensorFlow
use Spark and a cluster of machines
– to improve deep learning pipelines with TensorFlow
– how to use TensorFlow and Spark together to train and apply deep learning models
Hyperparameter Tuning:
– use Spark to find the best set of hyperparameters for neural network training,
• leading to a 10X reduction in training time and a 34% lower error rate.
Deploying models at scale:
– use Spark to apply a trained neural network model on a large amount of data
41. High Performance Information Computing Center
Spark Cluster with TensorFlow
Accuracy with the default set of hyperparameters
– 99.2%.
Best result with hyperparameter tuning
– has a 99.47% accuracy on the test set,
• which is a 34% reduction of the test error.
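The 34% figure follows from the two accuracies quoted on this slide. The short check below works through the arithmetic. The function name `relative_error_reduction` is made up for this example.

```python
def relative_error_reduction(baseline_accuracy, tuned_accuracy):
    """Fraction by which the test error shrinks after tuning."""
    baseline_error = 1.0 - baseline_accuracy  # 0.8% with default hyperparameters
    tuned_error = 1.0 - tuned_accuracy        # 0.53% after tuning
    return (baseline_error - tuned_error) / baseline_error


reduction = relative_error_reduction(0.992, 0.9947)
```

The error drops from 0.8% to 0.53%, so the reduction is 0.27 / 0.8, which is about 0.34.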
42. High Performance Information Computing Center
Efforts on using Deep Learning Frameworks with Spark
Databricks
Platform for running Spark with TensorFlow
BigDL
Intel’s library for deep learning on existing data frameworks.
TensorFlowOnSpark
Yahoo’s distributed deep learning on Big Data
SparkNet
AMPLab’s framework for training deep networks in Spark
43. High Performance Information Computing Center
Efforts on using Deep Learning Frameworks with Spark
DeepLearning4J
Uses data parallelism to train on separate neural networks
DeepDist
Lightning-fast deep learning on Spark via parallel stochastic gradient updates
IBM DSX
44. High Performance Information Computing Center
Databricks
Deploying trained models
o to make predictions on data stored in Spark RDDs or DataFrames
o Inception model: https://www.tensorflow.org/tutorials/image_recognition
o Each prediction requires about 4.8 billion operations
o Parallelizing with Spark helps scale operations
https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
45. High Performance Information Computing Center
Databricks
• Distributed model training
Use deep learning libraries like TensorFlow to test different model hyperparameters on each worker
Task parallelism
https://databricks.com/blog/2016/12/21/deep-learning-on-databricks.html
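The task-parallel idea above (one hyperparameter setting per worker) can be sketched on a single machine with a thread pool standing in for Spark workers. This is not Spark or TensorFlow code. `train_and_score` is a hypothetical placeholder with a made-up error surface, and the grid values are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical stand-in for training one model; returns (test_error, params)."""
    learning_rate, hidden_units = params
    # Toy error surface whose minimum is at learning_rate=0.1, hidden_units=64.
    error = (learning_rate - 0.1) ** 2 + ((hidden_units - 64) / 1000) ** 2
    return error, params


# Every combination is an independent task, so tasks need no communication.
grid = list(product([0.001, 0.01, 0.1, 1.0], [16, 64, 256]))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_score, grid))

best_error, best_params = min(results)
```

On a cluster the same pattern would hand each grid entry to a different worker, with only the small `(error, params)` result sent back to the driver.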
46. High Performance Information Computing Center
IBM DSX
Data Science Experience (DSX) includes
TensorFlow library
GPU
Easy to develop and run Spark with TensorFlow
No need to configure the library
Databricks’ examples run in DSX
– While Databricks CE does not support GPU
Brunel for visualization lately
47. High Performance Information Computing Center
Spark Cluster with TensorFlow (Cont’d)
Multiple nodes in the cluster:
the computations scaled linearly
a graph
– the computation times (in seconds)
• with respect to the number of machines on the cluster:
– using a 13-node cluster,
• train 13 models in parallel,
• which translates into a 7x speedup compared to training the models one at a time on one machine.
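As a small worked example, the reported 7x speedup on 13 nodes can be expressed as a parallel efficiency. The helper name `parallel_efficiency` is an assumption for this sketch, and only the two numbers from the slide are used.

```python
def parallel_efficiency(speedup, workers):
    """Share of the ideal (one-per-worker) speedup that was actually achieved."""
    return speedup / workers


efficiency = parallel_efficiency(speedup=7, workers=13)
```

This comes to roughly 54% of the ideal 13x, so some time on each node goes to overhead rather than training.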
49. High Performance Information Computing Center
Spark Cluster with TensorFlow (Cont’d)
the learning rate for different numbers of neurons:
The learning rate is critical:
– if it is too low,
• the neural network does not learn anything (high test error).
– If it is too high,
• the training process may oscillate randomly and even diverge in some configurations.
The number of neurons
– is not as important for getting good performance,
• and networks with many neurons
– are much more sensitive to the learning rate.
– This is the Occam’s Razor principle:
• simpler models tend to be “good enough” for most purposes.
• If you have the time and resources to go after the missing 1% test error, you must be willing to invest a lot of resources in training,
• to find the proper hyperparameters that will make the difference.
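The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x² rather than on a neural network, so it only illustrates the too-low and too-high cases described above. The three learning rates and the step count are arbitrary choices for this example.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(x) = x**2 (gradient 2x) and return the final distance from 0."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return abs(x)


too_low = gradient_descent(0.0001)   # barely moves away from the start
reasonable = gradient_descent(0.1)   # converges close to the minimum
too_high = gradient_descent(1.1)     # overshoots more on every step and diverges
```

With the highest rate each update multiplies `x` by -1.2, so the iterate flips sign and grows, which is the oscillate-and-diverge behavior the slide warns about.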
50. High Performance Information Computing Center
Distributed processing of images using TensorFlow
Apache Spark with a Deep Learning library
takes an existing neural network (Inception-v3)
– applies it to a corpus of images.
requires that TensorFlow be installed on the cluster
Run in IBM DSX
– Not in Databricks CE
• Built by Databricks but needs GPU
Spark integration work flow:
define TensorFlow operations as methods, to be used within Spark tasks.
broadcast the model for use within Spark tasks.
parallelize a list of image URLs.
Using Spark, we process the image URLs in parallel:
– Load image.
– Run inference on the image using TensorFlow to predict the image contents.
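The workflow steps above can be mimicked on one machine to show how the pieces connect. In this sketch a thread pool stands in for Spark tasks and a dictionary stands in for the broadcast model, so it is neither PySpark nor TensorFlow code. `MODEL`, `load_image`, `run_inference`, `classify`, and the example.com URLs are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a trained network shared read-only with every task
# (the role a broadcast variable plays in Spark).
MODEL = {"diver": "scuba diver", "shark": "tiger shark"}


def load_image(url):
    """Stand-in for downloading and decoding an image; returns the URL itself."""
    return url


def run_inference(model, image):
    """Stand-in for a forward pass: look for a known keyword in the 'image'."""
    for keyword, label in model.items():
        if keyword in image:
            return label
    return "unknown"


def classify(url, model=MODEL):
    """The per-task work: load one image, then predict its contents."""
    return url, run_inference(model, load_image(url))


# Stand-in for parallelizing a list of image URLs across the cluster.
image_urls = [
    "http://example.com/diver.jpg",
    "http://example.com/shark.jpg",
    "http://example.com/cat.jpg",
]

with ThreadPoolExecutor(max_workers=3) as pool:
    predictions = dict(pool.map(classify, image_urls))
```

Each URL is processed independently against the same shared model, which is what makes the job easy to spread across workers.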
51. High Performance Information Computing Center
Distributed processing of image classification using TensorFlow
use the “Simple image classification with Inception” example from TensorFlow,
which applies the Inception model to predict the contents of a set of images.
For example, given a photo of two scuba divers,
the Inception model will tell us the contents of the image:
('scuba diver', 0.88708681),
('electric ray, crampfish, numbfish, torpedo', 0.012277877),
('sea snake', 0.005639134),
('tiger shark, Galeocerdo cuvieri', 0.0051873429),
('reel', 0.0044495272)
52. High Performance Information Computing Center
Distributed processing of image classification using TensorFlow (Cont’d)
Each of the lines above represents a “synset,”
or a set of synonymous terms
– representing a concept.
The weight given to each synset
– represents a confidence in how applicable the synset is to the image.
– In this case, “scuba diver” is pretty accurate!
Making predictions with Inception-v3 is expensive:
– each prediction requires about 4.8 billion operations (Szegedy et al., 2015).
Even with smaller datasets,
– it is worthwhile to parallelize this computation.
– distribute these costly predictions using Spark.
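A short sketch tying the two points together: picking the most confident synset from the scores quoted earlier, and estimating how the 4.8 billion operations per prediction add up over a batch. The batch size of 1,000 images and the 10 workers are assumptions for illustration, and `top_k` is a helper name made up here.

```python
# Scores as listed for the scuba diver photo.
scored_synsets = [
    ("scuba diver", 0.88708681),
    ("electric ray, crampfish, numbfish, torpedo", 0.012277877),
    ("sea snake", 0.005639134),
    ("tiger shark, Galeocerdo cuvieri", 0.0051873429),
    ("reel", 0.0044495272),
]


def top_k(scored_labels, k=1):
    """Return the k labels with the highest confidence."""
    return sorted(scored_labels, key=lambda pair: pair[1], reverse=True)[:k]


best_label, best_score = top_k(scored_synsets)[0]

OPS_PER_PREDICTION = 4_800_000_000  # about 4.8 billion, as quoted above
num_images = 1_000                  # assumed batch size
num_workers = 10                    # assumed cluster size

total_ops = OPS_PER_PREDICTION * num_images
ops_per_worker = total_ops // num_workers  # ideal split, ignoring overhead
```

Even a thousand images cost several trillion operations, which is why splitting predictions across workers pays off on modest datasets.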
53. High Performance Information Computing Center
Contents
Myself
Introduction To Big Data
Artificial Intelligence
Artificial Intelligence and Big Data
Summary
54. High Performance Information Computing Center
Summary
Introduction to Big Data
Introduction to AI
AI on Big Data
59. High Performance Information Computing Center
References
1. Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, “Market Basket Analysis Algorithms with MapReduce”, DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi and Jongwook Woo, “Big Data Analysis using Spark for Collision Rate Near CalStateLA”, https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
7. Github URL: https://github.com/nmelche/IntroductionToBigDataScience
60. High Performance Information Computing Center
References
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP