3. Big-Data Era
§ Collect, Store & Process information at scale
§ Rise of Open Source Software
– Leverage clusters of commodity computers to process the data
§ Data Science
– Bridges the gap between the data and the tools
– Starts with running simple queries
§ Placing a schema on the data and running SQL queries
– R, Octave, Python scikit-learn
§ Data is partitioned and spread across nodes (HDFS)
– Algorithms with wide data dependencies will suffer from network delays
– The probability of node failure increases
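The HDFS point above can be illustrated with a toy sketch in plain Python (function names are illustrative): a large file is split into fixed-size blocks, and the blocks are spread across nodes. Real HDFS placement additionally handles replication and rack awareness.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks, as HDFS does with large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, nodes):
    """Round-robin placement of blocks onto nodes (a simplification; real HDFS
    placement also considers replication factor and rack locality)."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"x" * 300, block_size=128)   # three blocks: 128, 128, 44 bytes
placement = assign_blocks(blocks, ["node1", "node2"])
```

An algorithm that needs data from blocks on different nodes must fetch them over the network, which is where the delays mentioned above come from.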
5. Parallel Systems vs Distributed Systems
Parallel Systems:
§ Tightly coupled systems
§ Multiple processors share the same memory address space
§ Scale-up servers
§ High Performance Computing (HPC)
§ Disadvantages:
– Limited scalability
– Expensive
Distributed Systems:
§ Loosely coupled systems
§ Communicate with each other over the network
§ Scale-out servers
§ Capable of collaborating to complete a task
§ Disadvantages:
– Difficulty in developing distributed software
– Network problems
– Reliability & fault tolerance
6. Apache Hadoop
§ Hadoop emerges as a leader
– Filesystem Abstraction
– M/R programming model
– Linear scalability
– Automatic failure recovery
– Cheaper solution
§ Challenges
– Transformation APIs needed for feature engineering are missing
– Not well suited for ML modeling
§ Requires multiple passes over the same data sets
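The M/R programming model above can be sketched with the canonical word-count example, here as a plain-Python local simulation (the function names are illustrative, not Hadoop's actual Java API):

```python
from collections import defaultdict
from itertools import chain

# Map phase: each mapper emits (word, 1) pairs for its input split
def mapper(line):
    return [(word, 1) for word in line.split()]

# Shuffle: group all values by key, as the M/R framework does between phases
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts emitted for each word
def reducer(key, values):
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
# counts["hadoop"] == 2
```

In real M/R the output of every pass is written back to disk, which is why the multiple passes required by ML algorithms are expensive.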
25. Challenges with Data Science
§ The majority of the work lies in preprocessing the data
– Feature engineering
– Choosing the algorithms
– Converting data into vectors for ML algorithms
§ Iteration
– Scans over the input vectors until the model converges
– Results are based on experimentation
§ Putting the models into production
– Evaluate their accuracy over time
– Rebuild the models periodically
§ The system should support more flexible transformations
§ Repeated data access from disk should be handled efficiently
§ Model creation should be easy, and models should be suitable for production use
27. SparkML – TF IDF
§ Term Frequency – Inverse Document Frequency
§ Used to build Search Engines
– The score indicates how important a word is to a document within a collection
§ If a word appears frequently in a document, it is important to that document
§ But if a word appears in many documents (stop-words such as "the", "and", "of"), it is not meaningful, so its score is lowered
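The scoring above can be sketched with the plain textbook formulation, TF × log(N / DF), in pure Python. Note this is illustrative only: Spark ML's implementation uses a hashing term-frequency transformer and a smoothed IDF variant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for each term in each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF is log(3/3) = 0 and its score is 0
```

Rarer words like "sat" get a higher score in the one document that contains them, which is exactly the ranking behavior a search engine wants.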
28. SparkML – KMeans Clustering
§ Unsupervised learning
§ Groups items into K different clusters
§ Randomly initialize the centroids for these groups
§ Compute the Euclidean distance between each datapoint and the centroids to assign it to a group
§ Recompute each centroid from all the datapoints assigned to its group
§ Repeat until the centroid movement is negligible
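The steps above can be sketched in plain Python for 2-D points (an illustrative local version; Spark ML's KMeans distributes this over partitioned data):

```python
import random

def kmeans(points, k, iters=100, tol=1e-6, seed=0):
    """Plain k-means: assign points to the nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                       # random initialization
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                            + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i]
               for i, c in enumerate(clusters)]
        # Stop when centroid movement is negligible
        moved = any((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 >= tol
                    for a, b in zip(centroids, new))
        centroids = new
        if not moved:
            break
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cs = kmeans(pts, 2)
```

Each iteration is a full scan over the input vectors, which is the repeated-data-access pattern Spark's in-memory caching is designed to serve.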
35. How to use GPUEnabler Plugin
Available at https://github.com/IBMSparkGPU/GPUEnabler
Build the package: this installs it into the local Maven repository.
To include this package in your Spark application, add the dependency to the application's pom.xml file.
More information on the APIs and sample programs can be found in the GitHub repository.
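A pom.xml dependency entry would look roughly like the following. The coordinates shown are hypothetical placeholders; take the real groupId, artifactId, and version from the GPUEnabler repository's README.

```xml
<!-- Hypothetical coordinates for illustration only; the actual groupId,
     artifactId, and version are listed in the GPUEnabler repository -->
<dependency>
  <groupId>com.ibm</groupId>
  <artifactId>gpu-enabler_2.11</artifactId>
  <version>1.0.0</version>
</dependency>
```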