Alexey Zinoviev presented this paper at the Second Thumbtack Technology Expert Day.
This paper covers the following topics: Data Mining, Machine Learning, Octave, and the R language.
YouTube: http://youtu.be/kGIP6XeWiaA
First steps in Data Mining Kindergarten
1. Speaker : Alexey Zinoviev
First Steps in Data Mining Kindergarten
2. About
● I am a scientist. My areas of interest include graph
theory, machine learning, traffic jam prediction, and Big Data
algorithms.
● But I'm also a programmer, so I'm interested in NoSQL databases,
Java, JavaScript, Android, MongoDB, Cassandra, Hadoop,
MapReduce, metaprogramming, and reflection.
● I am a fan of various geo APIs (Maps API, for example)
6. ● Which loan applicants are high-risk?
● Which customer will respond to a planned promotion?
● How do we detect phone card fraud?
● How do customer profiles change over time?
● Which customers prefer product A over product B?
● What is the revenue prediction for next year?
● Which students are more likely to transfer than others?
● Which taxpayers may be cheating the system?
● Who is most likely to violate a probation sentence?
Typical questions for DM
16. ● Relational Databases (transactional
data with many tables)
● Data warehouses (Historical data,
aggregated and updated periodically)
● Files (In special format (e.g., CSV) or
proprietary binary)
● Internet or electronic mail (HTML,
XML, web search results, e-mails)
● Scientific, research (R, Octave, Matlab)
Data sources
19. ● All your personal data (PD) are
being deeply mined
● The industry of collecting,
aggregating, and brokering PD is
“database marketing.”
● 1.1 billion browser cookies, 200
million mobile profiles, and an
average of 1,500 pieces of data per
consumer at Acxiom
Pay with your personal data
23. ● Set of items & transactions
● Need to find rules (simple implication: if A & B then C)
Association rule learning
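The "if A & B then C" idea can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an optimized algorithm such as Apriori. The transactions, the thresholds, and the helper names (`support`, `confidence`, `find_rules`) are all assumptions made up for this example.

```python
from itertools import combinations

# Toy transactions, invented for illustration: each one is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A and C together) / support(A): how often the rule holds."""
    a = support(antecedent, transactions)
    if a == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / a

def find_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Brute-force search for rules of the form: if A & B then C."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for c in items:
            if c in (a, b):
                continue
            s = support({a, b, c}, transactions)
            conf = confidence({a, b}, {c}, transactions)
            if s >= min_support and conf >= min_confidence:
                rules.append(((a, b), c, s, conf))
    return rules

rules = find_rules(transactions)
for (a, b), c, s, conf in rules:
    print(f"if {a} & {b} then {c}  (support={s:.2f}, confidence={conf:.2f})")
```

Support filters out rules that rest on too few transactions, and confidence measures how often C actually appears when A and B do. Both thresholds here are arbitrary choices for the toy data.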
24. It is the process of finding a model or function that describes
and distinguishes data classes, in order to predict the class of objects
whose class label is unknown.
What is Cluster Analysis?
25. ● Statistical process for estimating
the relationships among variables
● The estimation target is a function
(it can be a probability
distribution)
● Can be linear, polynomial,
nonlinear, etc.
Regression
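For the linear case, the estimation target is a line y = slope * x + intercept. A minimal sketch of ordinary least squares for one variable follows. The data points and the function names (`fit_line`, `predict`) are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

# Toy data, invented for illustration: roughly y = 2x + 1 with small noise.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"prediction for x=5: {predict(slope, intercept, 5):.2f}")
```

In R the same fit is usually done with `lm(y ~ x)`, and in Octave with `polyfit(x, y, 1)`.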
26. ● Training set of classified
examples (supervised learning)
● Test set of non-classified items
● Main goal: find a function
(classifier) that maps input data
to a category
● Computer vision, drug discovery,
speech recognition, biometric
identification, credit scoring
Classification
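The training set / test set split can be illustrated with a very simple classifier. The sketch below uses a nearest-centroid rule, which is one possible choice among many and is not named in the slides. The data and the function names (`train`, `classify`) are assumptions for this example.

```python
from collections import defaultdict
from math import dist

def train(examples):
    """examples: list of (features, label). Returns {label: centroid}."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    # The centroid is the per-coordinate mean of all examples of a class.
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(centroids, features):
    """Map input features to the category with the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Training set of classified examples (invented toy data).
training_set = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]
# Test set of non-classified items.
test_set = [(2.0, 2.0), (6.0, 7.0)]

centroids = train(training_set)
predictions = [classify(centroids, item) for item in test_set]
print(predictions)
```

Here `train` plays the role of learning the function from labeled examples, and `classify` is that function applied to items whose category is unknown.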
27. A tree showing survival of passengers
on the Titanic ("sibsp" is the number
of spouses or siblings aboard).
The figures under the leaves show
the probability of survival and the
percentage of observations in the
leaf.
Decision trees
28. ● There are two classes of objects A &
B (red & blue)
● Determine the class of a new object,
based on information about its
neighbors
● By changing the boundaries of the new
object's area, we form a set of
neighbors.
● The new object is B because the majority
of its neighbors are B.
kNN
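The majority vote described above can be sketched as follows. Changing `k` corresponds to widening or narrowing the area around the new object. The points and the function name `knn_classify` are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(training, point, k=3):
    """Label point by majority vote among its k nearest training examples."""
    # Sort labeled examples by Euclidean distance to the new object.
    neighbors = sorted(training, key=lambda ex: dist(ex[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two classes of objects, A (red) and B (blue); toy data for illustration.
training = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((3.0, 4.0), "B"), ((4.0, 3.0), "B"), ((4.0, 4.0), "B"), ((2.5, 3.0), "B"),
]
new_object = (3.0, 3.0)

label_k3 = knn_classify(training, new_object, k=3)
label_k5 = knn_classify(training, new_object, k=5)
print(label_k3, label_k5)
```

An odd `k` avoids tied votes when there are only two classes. Larger values of `k` smooth out noise but can blur the boundary between classes.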
32. ● R is free, and R is a language
● Graphics and data visualization
● A flexible statistical analysis
toolkit
● Access to powerful, cutting-edge
analytics
● A robust, vibrant community
● Unlimited possibilities
R
36. ● Octave is free, and Octave is a
language
● C++ support (you can call Octave
functions from C/C++)
● Easy prototyping for Matlab
● Extensive graphics capabilities
for data visualization and
manipulation
● Java support
Octave
38. In conclusion
● All we need is Math …
● Use R & Octave
● Data is only the first step
● Dependencies are everywhere