SlideShare a Scribd company logo
1 of 72
Download to read offline
Java as a fundamental working tool 
of the Data Scientist 
Speaker : Alexey Zinoviev
About 
● I am a <graph theory, machine learning, traffic jams prediction, 
BigData algorythms> scientist 
● But I'm a <Java, JavaScript, Android, NoSQL, Hadoop, Spark> 
programmer
One of these fine days...
We need in Python dev 'cause Data Mining
You're a programmer, not an analyst
Write your backends!
Let’s talk about it, Java-boy...
Data mining 
Mining coal 
in your data
Hey, man, predict me something!
Man or sofa?
Typical questions for DM 
● Which loan applicants are high-risk? 
● How do we detect phone card fraud? 
● Which customers do prefer product A over product B? 
● What is the revenue prediction for next year?
What is Data Mining? 
13/72
Statistics?
Tag cloud?
Data visualization?
Not OLAP, 100%
Magic part of KDD (Knowledge 
Discovery in Databases) 
1. Selection 
2. Pre-processing 
3. Transformation 
4. Data Mining 
5. Interpretation/Evaluation
How it really works 
1. Share your date with us 
2. Our magic manipulations 
3. Building an answering machine 
4. PROFIT!!!
Data 
20/72
Data examples 
● Facebook users, tweets 
● Weather 
● Sea routes 
● Trade transactions 
● Goverment 
● Medicine (genomic data) 
● Telecommuncations (phone call records)
Data sources 
● Relational Databases 
● Data warehouses (Historical data) 
● Files in CSV or in binary format 
● Internet or electronic mails 
● Scientific, research (R, Octave, 
Matlab)
Target Data & Personal Data 
23/72
Pay with your personal data 
● All your personal data (PD) are being 
deeply mined 
● The industry of collecting, aggregating, 
and brokering PD is “database marketing.” 
● 1.1 billion browser cookies, 200 million 
mobile profiles, and an average of 1,500 
pieces of data per consumer in Acxiom
Preprocessing 
26/72
● Select small pieces 
● Define default values for missed data 
● Remove strange signals from data 
● Merge some tables in one if required
Pattern mining 
28/72
Association rule learning
What is Cluster Analysis? 
It is the process of finding model of function that describes and 
distinguishes data class to predict the class of objects whose class 
label is unknown.
Regression 
● Statistical process for estimating the 
relationships among variables 
● The estimation target is function (it can 
be probability distribution) 
● Can be linear, polynomial, nonlinear and 
etc.
Classification 
● Training set of classified examples 
(supervised learning) 
● Test set of non-classified items 
● Main goal: find a function (classifier) that 
maps input data to a category 
● Computer vision, drug discovery, speech 
recognition, biometric indentification, 
credit scoring
Decision trees
Decision trees
kNN (k-nearest neighbor) 
● There are two classes of objects A & B 
● Define the class of new object, based on 
information about its neighbors 
● Changing the boundaries of an new object 
area, we form a set of neighbors. 
● New object is B becuase majority of the 
neighbors is a B.
Skills & Tools 
36/72
Fashion Languages 
39/72
Why not Octave? 
● It’s free 
● Not full implemented stack of ML 
algorythms 
● All your matrix are belong to us! 
● Single thread model 
● Java support
Why not R? 
● 25% of R packages are written in Java 
● Syntax is too sweet 
● You should read 1000 lines in docs to 
write 1 line of code 
● Single thread model for 95% 
algorythms
Why not Python? 
● Now Python is an idol for young scientists 
due to the low barrier to entry 
● We are not Python developers 
● High-level language 
● Have you ever heard about a Jython? 
● Long long way to real Highload 
production
DM libraries in Python
Java ecosystem 
46/72
JDM 
● Java API for Data mining, JSR 73 and JSR 247 
● javax.datamining.supervised defines the supervised 
function-related interfaces 
● javax.datamining.algorithm contains all mining algorithm 
subclass packages 
● JDM 2.0 adds Text Mining, Time series and so on..
Weka 
● Connectors to R, Octave, Matlab, Hadoop, 
NoSQL/SQL databases 
● Source code of all algorythms in Java 
● Preprocessing tools: discretization, 
normalization, resampling, attribute 
selection, transforming and combining
Weka + Hadoop
SPMF 
● It’s codebase of algorythms in pattern 
mining field 
● It has cool examples and implementation 
of 78 algorythms 
● Cool performance results in specific area 
● Codebase grows very fast
Mahout 
● Driven by Ng et al.’s paper “MapReduce for Machine Learning 
on Multicore” 
● Next algorythms were adopted: Locally Weighted Linear 
Regression(LWLR), Naive Bayes (NB), k-means, Logistic 
Regression, Neural Network (NN), Principal Components Analysis 
(PCA), Support Vector Machine (SVM) and so on.. 
● The complexity was reduced in n times for n processors.
Mahout 
● DataModel (File, MySQL, PostgreSQL, 
Mongo, Cassandra) 
● UserSimilarity 
● ItemSimilarity 
● UserNeighborhood 
● Recommender
Mahout 
● Advanced Implementations of Java’s Collections Framework for 
better Performance. 
● Very close to Apache Giraph 
● New algorythms will build on Spark platform 
● Spark shell 
● Spring + Mahout demo 
● Collaborative Filtering, Classification, Clustering, Dimensionality 
Reduction, Miscellaneous are supported
Spark 
● MapReduce in memory 
● Up to 50x faster than Hadoop 
● Support for Shark (like Hive), MLlib 
(Machine learning), GraphX (graph 
processing) 
● RDD is a basic building block (immutable 
distributed collections of objects)
Spark
Spark + Java 8 
Java 7 search example: 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter( 
new Function<String, Boolean>() { 
public Boolean call(String s) { 
return s.contains("Tomcat"); 
} 
}); 
long numErrors = lines.count(); 
Java 8 search example: 
JavaRDD<String> lines = sc.textFile("hdfs://log.txt") 
.filter(s -> s.contains("Tomcat")); 
long numErrors = lines.count();
Mahout’s killer
MLlib 
● Classification and regression. collaborative filtering and 
clustering, Dimensionality reduction and Optimization are 
supported 
● It extends scikit-learn (Python lib) and Mahout and run on 
Spark 
● Well documented and integrated with many Java solutions
Size Classification Tools 
Lines 
Sample Data 
Analysis and Visualization Whiteboard, bash 
KBs - low MBs 
Prototype Data 
Analysis and Visualization Matlab, Octave, R 
MBs - low GBs 
Online Data 
Storage MySQL (DBs) 
MBs - low GBs 
Online Data 
Analysis NumPy, SciPy, Weka, BLAS/LAPACK 
GBs - TBs - PBs 
BigData 
Storage HDFS, HBase, Cassandra 
GBs - TBs - PBs 
Big Data 
Analysis Hive, Mahout, Hama, Giraph,MLlib
Large graph processing tools 
63/72
Graph Number of 
vertexes 
Number of 
edges 
Volume Data/per day 
Web-graph 1,5 * 10^12 1,2 * 10^13 100 PB 300 TB 
Facebook 
(friends 
graph) 
1,1 * 10^9 160 * 10^9 1 PB 15 TB 
Road graph 
of EU 
18 * 10^6 42 * 10^6 20 GB 50 MB 
Road graph 
of this city 
250 000 460 000 500 MB 100 KB
MapReduce for iterative calculations 
● High complexity of graph problem reduction to key-value 
model 
● Iteration algorythms, but multiple chained jobs in M/R with 
full saving and reading of each state 
Think like a vertex…
С++ API
● “Mahout in Action”, Owen et. al., Manning Pub. 
● “Pattern Recognition and Machine Learning”, Christopher Bishop, 
Springer Pub. 
● “Elements of Statistical Learning: Data Mining, Inference, and 
Prediction”, Hastie et. al., Springer Pub. 
● “Collective Intelligence in Action” Satnam Alag et. al., Manning 
Pub. 
Books and papers
Your questions?

More Related Content

What's hot

Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)Farzin Bagheri
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoScyllaDB
 
Comparing pregel related systems
Comparing pregel related systemsComparing pregel related systems
Comparing pregel related systemsPrashant Raaghav
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWSAmazon Web Services
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...Alexey Zinoviev
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...
Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...
Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...ScyllaDB
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScyllaDB
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Dato vs GraphX
Dato vs GraphXDato vs GraphX
Dato vs GraphXKeira Zhou
 
Neo, Titan & Cassandra
Neo, Titan & CassandraNeo, Titan & Cassandra
Neo, Titan & Cassandrajohnrjenson
 

What's hot (20)

Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and Go
 
Comparing pregel related systems
Comparing pregel related systemsComparing pregel related systems
Comparing pregel related systems
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...
Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...
Scylla Summit 2018: Kiwi.com Migration to Scylla - The Why, the How, the Fail...
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Dato vs GraphX
Dato vs GraphXDato vs GraphX
Dato vs GraphX
 
Neo, Titan & Cassandra
Neo, Titan & CassandraNeo, Titan & Cassandra
Neo, Titan & Cassandra
 

Similar to Joker'14 Java as a fundamental working tool of the Data Scientist

Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...Alexey Zinoviev
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining KindergartenAlexey Zinoviev
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSSKevin Crocker
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev PROIDEA
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?Ivo Andreev
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsAlexey Zinoviev
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and SparkFelix Crisan
 
Machine Learning with JavaScript
Machine Learning with JavaScriptMachine Learning with JavaScript
Machine Learning with JavaScriptIvo Andreev
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusBoldRadius Solutions
 

Similar to Joker'14 Java as a fundamental working tool of the Data Scientist (20)

Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining Kindergarten
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projects
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and Spark
 
Machine Learning with JavaScript
Machine Learning with JavaScriptMachine Learning with JavaScript
Machine Learning with JavaScript
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadius
 

More from Alexey Zinoviev

Kafka pours and Spark resolves
Kafka pours and Spark resolvesKafka pours and Spark resolves
Kafka pours and Spark resolvesAlexey Zinoviev
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Alexey Zinoviev
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Alexey Zinoviev
 
HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...Alexey Zinoviev
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Alexey Zinoviev
 
Joker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBJoker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBAlexey Zinoviev
 
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...Alexey Zinoviev
 
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAlexey Zinoviev
 
Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Alexey Zinoviev
 
Big data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsAlexey Zinoviev
 
"Говнокод-шоу"
"Говнокод-шоу""Говнокод-шоу"
"Говнокод-шоу"Alexey Zinoviev
 
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Alexey Zinoviev
 
Алгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиAlexey Zinoviev
 
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)Alexey Zinoviev
 
GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!Alexey Zinoviev
 
How to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSHow to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSAlexey Zinoviev
 
Поездка на IT-DUMP 2012
Поездка на IT-DUMP 2012Поездка на IT-DUMP 2012
Поездка на IT-DUMP 2012Alexey Zinoviev
 
MyBatis и Hibernate на одном проекте. Как подружить?
MyBatis и Hibernate на одном проекте. Как подружить?MyBatis и Hibernate на одном проекте. Как подружить?
MyBatis и Hibernate на одном проекте. Как подружить?Alexey Zinoviev
 
Google I/O туда и обратно.
Google I/O туда и обратно.Google I/O туда и обратно.
Google I/O туда и обратно.Alexey Zinoviev
 

More from Alexey Zinoviev (20)

Kafka pours and Spark resolves
Kafka pours and Spark resolvesKafka pours and Spark resolves
Kafka pours and Spark resolves
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
 
Hadoop Jungle
Hadoop JungleHadoop Jungle
Hadoop Jungle
 
HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15
 
Joker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBJoker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDB
 
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
 
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find you
 
Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8
 
Big data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphs
 
"Говнокод-шоу"
"Говнокод-шоу""Говнокод-шоу"
"Говнокод-шоу"
 
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
 
Алгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерности
 
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
 
GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!
 
How to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSHow to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOS
 
Поездка на IT-DUMP 2012
Поездка на IT-DUMP 2012Поездка на IT-DUMP 2012
Поездка на IT-DUMP 2012
 
MyBatis и Hibernate на одном проекте. Как подружить?
MyBatis и Hibernate на одном проекте. Как подружить?MyBatis и Hibernate на одном проекте. Как подружить?
MyBatis и Hibernate на одном проекте. Как подружить?
 
Google I/O туда и обратно.
Google I/O туда и обратно.Google I/O туда и обратно.
Google I/O туда и обратно.
 

Recently uploaded

Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 

Recently uploaded (16)

Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 

Joker'14 Java as a fundamental working tool of the Data Scientist

  • 1. Java as a fundamental working tool of the Data Scientist Speaker : Alexey Zinoviev
  • 2. About ● I am a <graph theory, machine learning, traffic jams prediction, BigData algorythms> scientist ● But I'm a <Java, JavaScript, Android, NoSQL, Hadoop, Spark> programmer
  • 3.
  • 4. One of these fine days...
  • 5. We need in Python dev 'cause Data Mining
  • 6. You're a programmer, not an analyst
  • 8. Let’s talk about it, Java-boy...
  • 9. Data mining Mining coal in your data
  • 10. Hey, man, predict me something!
  • 12. Typical questions for DM ● Which loan applicants are high-risk? ● How do we detect phone card fraud? ● Which customers do prefer product A over product B? ● What is the revenue prediction for next year?
  • 13. What is Data Mining? 13/72
  • 18. Magic part of KDD (Knowledge Discovery in Databases) 1. Selection 2. Pre-processing 3. Transformation 4. Data Mining 5. Interpretation/Evaluation
  • 19. How it really works 1. Share your date with us 2. Our magic manipulations 3. Building an answering machine 4. PROFIT!!!
  • 21. Data examples ● Facebook users, tweets ● Weather ● Sea routes ● Trade transactions ● Goverment ● Medicine (genomic data) ● Telecommuncations (phone call records)
  • 22. Data sources ● Relational Databases ● Data warehouses (Historical data) ● Files in CSV or in binary format ● Internet or electronic mails ● Scientific, research (R, Octave, Matlab)
  • 23. Target Data & Personal Data 23/72
  • 24.
  • 25. Pay with your personal data ● All your personal data (PD) are being deeply mined ● The industry of collecting, aggregating, and brokering PD is “database marketing.” ● 1.1 billion browser cookies, 200 million mobile profiles, and an average of 1,500 pieces of data per consumer in Acxiom
  • 27. ● Select small pieces ● Define default values for missed data ● Remove strange signals from data ● Merge some tables in one if required
  • 30. What is Cluster Analysis? It is the process of finding model of function that describes and distinguishes data class to predict the class of objects whose class label is unknown.
  • 31. Regression ● Statistical process for estimating the relationships among variables ● The estimation target is function (it can be probability distribution) ● Can be linear, polynomial, nonlinear and etc.
  • 32. Classification ● Training set of classified examples (supervised learning) ● Test set of non-classified items ● Main goal: find a function (classifier) that maps input data to a category ● Computer vision, drug discovery, speech recognition, biometric indentification, credit scoring
  • 35. kNN (k-nearest neighbor) ● There are two classes of objects A & B ● Define the class of new object, based on information about its neighbors ● Changing the boundaries of an new object area, we form a set of neighbors. ● New object is B becuase majority of the neighbors is a B.
  • 36. Skills & Tools 36/72
  • 37.
  • 38.
  • 40.
  • 41. Why not Octave? ● It’s free ● Not full implemented stack of ML algorythms ● All your matrix are belong to us! ● Single thread model ● Java support
  • 42.
  • 43. Why not R? ● 25% of R packages are written in Java ● Syntax is too sweet ● You should read 1000 lines in docs to write 1 line of code ● Single thread model for 95% algorythms
  • 44. Why not Python? ● Now Python is an idol for young scientists due to the low barrier to entry ● We are not Python developers ● High-level language ● Have you ever heard about a Jython? ● Long long way to real Highload production
  • 45. DM libraries in Python
  • 47. JDM ● Java API for Data mining, JSR 73 and JSR 247 ● javax.datamining.supervised defines the supervised function-related interfaces ● javax.datamining.algorithm contains all mining algorithm subclass packages ● JDM 2.0 adds Text Mining, Time series and so on..
  • 48. Weka ● Connectors to R, Octave, Matlab, Hadoop, NoSQL/SQL databases ● Source code of all algorythms in Java ● Preprocessing tools: discretization, normalization, resampling, attribute selection, transforming and combining
  • 49.
  • 50.
  • 52. SPMF ● It’s codebase of algorythms in pattern mining field ● It has cool examples and implementation of 78 algorythms ● Cool performance results in specific area ● Codebase grows very fast
  • 53. Mahout ● Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore” ● Next algorythms were adopted: Locally Weighted Linear Regression(LWLR), Naive Bayes (NB), k-means, Logistic Regression, Neural Network (NN), Principal Components Analysis (PCA), Support Vector Machine (SVM) and so on.. ● The complexity was reduced in n times for n processors.
  • 54. Mahout ● DataModel (File, MySQL, PostgreSQL, Mongo, Cassandra) ● UserSimilarity ● ItemSimilarity ● UserNeighborhood ● Recommender
  • 55. Mahout ● Advanced Implementations of Java’s Collections Framework for better Performance. ● Very close to Apache Giraph ● New algorythms will build on Spark platform ● Spark shell ● Spring + Mahout demo ● Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Miscellaneous are supported
  • 56.
  • 57. Spark ● MapReduce in memory ● Up to 50x faster than Hadoop ● Support for Shark (like Hive), MLlib (Machine learning), GraphX (graph processing) ● RDD is a basic building block (immutable distributed collections of objects)
  • 58. Spark
  • 59. Spark + Java 8 Java 7 search example: JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter( new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("Tomcat"); } }); long numErrors = lines.count(); Java 8 search example: JavaRDD<String> lines = sc.textFile("hdfs://log.txt") .filter(s -> s.contains("Tomcat")); long numErrors = lines.count();
  • 61. MLlib ● Classification and regression. collaborative filtering and clustering, Dimensionality reduction and Optimization are supported ● It extends scikit-learn (Python lib) and Mahout and run on Spark ● Well documented and integrated with many Java solutions
  • 62. Size Classification Tools Lines Sample Data Analysis and Visualization Whiteboard, bash KBs - low MBs Prototype Data Analysis and Visualization Matlab, Octave, R MBs - low GBs Online Data Storage MySQL (DBs) MBs - low GBs Online Data Analysis NumPy, SciPy, Weka, BLAS/LAPACK GBs - TBs - PBs BigData Storage HDFS, HBase, Cassandra GBs - TBs - PBs Big Data Analysis Hive, Mahout, Hama, Giraph,MLlib
  • 63. Large graph processing tools 63/72
  • 64.
  • 65.
  • 66. Graph Number of vertexes Number of edges Volume Data/per day Web-graph 1,5 * 10^12 1,2 * 10^13 100 PB 300 TB Facebook (friends graph) 1,1 * 10^9 160 * 10^9 1 PB 15 TB Road graph of EU 18 * 10^6 42 * 10^6 20 GB 50 MB Road graph of this city 250 000 460 000 500 MB 100 KB
  • 67. MapReduce for iterative calculations ● High complexity of graph problem reduction to key-value model ● Iteration algorythms, but multiple chained jobs in M/R with full saving and reading of each state Think like a vertex…
  • 68.
  • 70.
  • 71. ● “Mahout in Action”, Owen et. al., Manning Pub. ● “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer Pub. ● “Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et. al., Springer Pub. ● “Collective Intelligence in Action” Satnam Alag et. al., Manning Pub. Books and papers