SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Speaker : Alexey Zinoviev
First Steps in Data Mining Kindergarten
About
● I am a scientist. The area of my interests includes graph
theory, machine learning, traffic jams prediction, BigData
algorythms.
● But I'm a programmer, so I'm interested in NoSQL databases,
Java, JavaScript, Android, MongoDB, Cassandra, Hadoop,
MapReduce, metaprogramming, reflection.
● I am a fan of variety GEO API (Maps API for example)
Data mining
Mining coal in
your data
Hey, man, predict me something!
Man or sofa?
● Which loan applicants are high-risk?
● Which customer will respond to a planned promotion?
● How do we detect phone card fraud?
● How do customer profile change over time?
● Which customers do prefer product A over product B?
● What is the revenue prediction for next year?
● Which students are most likely to transfer than others?
● Which tax payer may be cheating the system?
● Who is most likely to violate a probation sentence?
Typical questions for DM
What is Data Mining?
Statistics?
Tag cloud?
Data visualization?
Not OLAP, 100%
1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
Magic part of KDD (Knowledge
Discovery in Databases)
1. Share your date with us
2. Thumbtack’s magic manipulations
3. Get Answers Machine
4. PROFIT!!!
Thumbtack’s definition
Data
● Facebook users, tweets
● Weather
● Sea routes
● Trade transactions
● Goverment
● Medicine (genomic data)
● Telecommuncations (phone
call records)
Data examples
● Relational Databases (transactional
data with many tables)
● Data warehouses (Historical data,
aggregated and updated periodically)
● Files (In special format (e.g., CSV) or
proprietary binary)
● Internet or electronic mail (HTML,
XML, web search results, e-mails)
● Scientific, research (R, Octave, Matlab)
Data sources
Target data
Targeting Advertising
● All your personal data (PD) are
being deeply mined
● The industry of collecting,
aggregating, and brokering PD is
“database marketing.”
● 1.1 billion browser cookies , 200
million mobile profiles, and an
average of 1,500 pieces of data per
consumer in Acxiom
Pay with your personal data
Preprocessing
● Select small pieces
● Define default values for missed data
● Remove strange signals from data
● Merge some tables in one if required
Pattern mining
● Set of items & transactions
● Need to find rules (simple implication: if A & B than C)
Association rule learning
It is the process of finding model of function that describes
and distinguishes data class to predict the class of objects
whose class label is unknown.
What is Cluster Analysis?
● Statistical process for estimating
the relationships among variables
● The estimation target is function
(it can be probability
distribution)
● Can be linear, polynomial,
nonlinear and etc.
Regression
● Training set of classified
examples (supervised learning)
● Test set of non-classified items
● Main goal: find a function
(classifier) that maps input data
to a category
● Computer vision, drug discovery,
speech recognition, biometric
indentification, credit scoring
Classification
A tree showing survival of passengers
on the Titanic ("sibsp" is the number
of spouses or siblings aboard).
The figures under the leaves show
the probability of survival and the
percentage of observations in the
leaf.
Decision trees
● There are two classes of objects A &
B (red & blue)
● Define the class of new object,
based on information about its
neighbors
● Changing the boundaries of an new
object area, we form a set of
neighbors.
● New object is B becuase majority of
the neighbors is a B.
kNN
Skills & Tools
Most popular Data Mining
algorythms
● R is free and R is language
● Graphics and data visualization
● A flexible statistical analysis
toolkit
● Access to powerful, cutting-edge
analytics
● A robust, vibrant community
● Unlimited possibilities
R
● Octave is free and Octave is
language
● C++ support (you can call Octave
functions from C/C++)
● Easy prototyping for Matlab
● Extensive graphics capabilities
for data visualization and
manipulation
● Java support
Octave
In conclusion
● All we need in Math …
● Use R & Octave
● Data is only first step
● Dependencies is everywhere
Your questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Marko Grobelnik
 
Henning agt talk-caise-semnet
Henning agt   talk-caise-semnetHenning agt   talk-caise-semnet
Henning agt talk-caise-semnet
caise2013vlc
 
ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data
ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked dataESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data
ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data
eswcsummerschool
 

Was ist angesagt? (20)

Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
 
Data Model vs Ontology Development – a FIBO perspective | Mike Bennett
Data Model vs Ontology Development – a FIBO perspective | Mike BennettData Model vs Ontology Development – a FIBO perspective | Mike Bennett
Data Model vs Ontology Development – a FIBO perspective | Mike Bennett
 
Henning agt talk-caise-semnet
Henning agt   talk-caise-semnetHenning agt   talk-caise-semnet
Henning agt talk-caise-semnet
 
ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data
ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked dataESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data
ESWC SS 2013 - Thursday Keynote Vassilis Christophides: Preserving linked data
 
data science
data sciencedata science
data science
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
 
An Algebraic Data Model for Graphs and Hypergraphs (Category Theory meetup, N...
An Algebraic Data Model for Graphs and Hypergraphs (Category Theory meetup, N...An Algebraic Data Model for Graphs and Hypergraphs (Category Theory meetup, N...
An Algebraic Data Model for Graphs and Hypergraphs (Category Theory meetup, N...
 
Machine learning
Machine learningMachine learning
Machine learning
 
Adding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' ProblemsAdding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' Problems
 
Data mining - Process, Techniques and Research Topics
Data mining - Process, Techniques and Research TopicsData mining - Process, Techniques and Research Topics
Data mining - Process, Techniques and Research Topics
 
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
 
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
 
Programming for data science in python
Programming for data science in pythonProgramming for data science in python
Programming for data science in python
 
Weso research group
Weso research groupWeso research group
Weso research group
 
EC-WEB: Validator and Preview for the JobPosting Data Model of Schema.org
EC-WEB: Validator and Preview for the JobPosting Data Model of Schema.orgEC-WEB: Validator and Preview for the JobPosting Data Model of Schema.org
EC-WEB: Validator and Preview for the JobPosting Data Model of Schema.org
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 
data mining
data miningdata mining
data mining
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 

Andere mochten auch

HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...
Alexey Zinoviev
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15
Alexey Zinoviev
 
CodeFest 2012. Нелюбин Д. — Neo4j — графовая база данных
CodeFest 2012. Нелюбин Д. — Neo4j — графовая база данныхCodeFest 2012. Нелюбин Д. — Neo4j — графовая база данных
CodeFest 2012. Нелюбин Д. — Neo4j — графовая база данных
CodeFest
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
Saif Ullah
 

Andere mochten auch (13)

HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...
 
Big data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphs
 
Data mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsData mining - GDi Techno Solutions
Data mining - GDi Techno Solutions
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
 
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15
 
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
CodeFest 2012. Нелюбин Д. — Neo4j — графовая база данных
CodeFest 2012. Нелюбин Д. — Neo4j — графовая база данныхCodeFest 2012. Нелюбин Д. — Neo4j — графовая база данных
CodeFest 2012. Нелюбин Д. — Neo4j — графовая база данных
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)
 
Innovate Analytics with Oracle Data Mining & Oracle R
Innovate Analytics with Oracle Data Mining & Oracle RInnovate Analytics with Oracle Data Mining & Oracle R
Innovate Analytics with Oracle Data Mining & Oracle R
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 

Ähnlich wie First steps in Data Mining Kindergarten

ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
eswcsummerschool
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Rohit Dubey
 

Ähnlich wie First steps in Data Mining Kindergarten (20)

Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
 
L15.pptx
L15.pptxL15.pptx
L15.pptx
 
General introduction to AI ML DL DS
General introduction to AI ML DL DSGeneral introduction to AI ML DL DS
General introduction to AI ML DL DS
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
Data Analytics Career Paths
Data Analytics Career PathsData Analytics Career Paths
Data Analytics Career Paths
 
Data analytics career path
Data analytics career pathData analytics career path
Data analytics career path
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Data science a practitioner's perspective
Data science  a practitioner's perspectiveData science  a practitioner's perspective
Data science a practitioner's perspective
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
Open Source Geospatial Business Intelligence (GeoBI): Definition, architectur...
Open Source Geospatial Business Intelligence (GeoBI): Definition, architectur...Open Source Geospatial Business Intelligence (GeoBI): Definition, architectur...
Open Source Geospatial Business Intelligence (GeoBI): Definition, architectur...
 
Kp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptxKp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptx
 
Open source for customer analytics
Open source for customer analyticsOpen source for customer analytics
Open source for customer analytics
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
 
Ets train ppt_big_data_basics_v2.0
Ets train ppt_big_data_basics_v2.0Ets train ppt_big_data_basics_v2.0
Ets train ppt_big_data_basics_v2.0
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & Opportunities
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
 

Mehr von Alexey Zinoviev

ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
Alexey Zinoviev
 

Mehr von Alexey Zinoviev (20)

Kafka pours and Spark resolves
Kafka pours and Spark resolvesKafka pours and Spark resolves
Kafka pours and Spark resolves
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
 
Hadoop Jungle
Hadoop JungleHadoop Jungle
Hadoop Jungle
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projects
 
Joker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBJoker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDB
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
 
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find you
 
Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8
 
"Говнокод-шоу"
"Говнокод-шоу""Говнокод-шоу"
"Говнокод-шоу"
 
Алгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерности
 
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
 
GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!
 
How to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSHow to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOS
 
Поездка на IT-DUMP 2012
Поездка на IT-DUMP 2012Поездка на IT-DUMP 2012
Поездка на IT-DUMP 2012
 
MyBatis и Hibernate на одном проекте. Как подружить?
MyBatis и Hibernate на одном проекте. Как подружить?MyBatis и Hibernate на одном проекте. Как подружить?
MyBatis и Hibernate на одном проекте. Как подружить?
 
Google I/O туда и обратно.
Google I/O туда и обратно.Google I/O туда и обратно.
Google I/O туда и обратно.
 
Google Maps. Zinoviev Alexey.
Google Maps. Zinoviev Alexey.Google Maps. Zinoviev Alexey.
Google Maps. Zinoviev Alexey.
 
Google Docs. Zinoviev Alexey
Google Docs. Zinoviev AlexeyGoogle Docs. Zinoviev Alexey
Google Docs. Zinoviev Alexey
 
ORM battle. MyBatis vs Hibernate
ORM battle. MyBatis vs HibernateORM battle. MyBatis vs Hibernate
ORM battle. MyBatis vs Hibernate
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

First steps in Data Mining Kindergarten

  • 1. Speaker : Alexey Zinoviev First Steps in Data Mining Kindergarten
  • 2. About ● I am a scientist. The area of my interests includes graph theory, machine learning, traffic jams prediction, BigData algorythms. ● But I'm a programmer, so I'm interested in NoSQL databases, Java, JavaScript, Android, MongoDB, Cassandra, Hadoop, MapReduce, metaprogramming, reflection. ● I am a fan of variety GEO API (Maps API for example)
  • 3. Data mining Mining coal in your data
  • 4. Hey, man, predict me something!
  • 6. ● Which loan applicants are high-risk? ● Which customer will respond to a planned promotion? ● How do we detect phone card fraud? ● How do customer profile change over time? ● Which customers do prefer product A over product B? ● What is the revenue prediction for next year? ● Which students are most likely to transfer than others? ● Which tax payer may be cheating the system? ● Who is most likely to violate a probation sentence? Typical questions for DM
  • 7. What is Data Mining?
  • 12. 1. Selection 2. Pre-processing 3. Transformation 4. Data Mining 5. Interpretation/Evaluation Magic part of KDD (Knowledge Discovery in Databases)
  • 13. 1. Share your date with us 2. Thumbtack’s magic manipulations 3. Get Answers Machine 4. PROFIT!!! Thumbtack’s definition
  • 14. Data
  • 15. ● Facebook users, tweets ● Weather ● Sea routes ● Trade transactions ● Goverment ● Medicine (genomic data) ● Telecommuncations (phone call records) Data examples
  • 16. ● Relational Databases (transactional data with many tables) ● Data warehouses (Historical data, aggregated and updated periodically) ● Files (In special format (e.g., CSV) or proprietary binary) ● Internet or electronic mail (HTML, XML, web search results, e-mails) ● Scientific, research (R, Octave, Matlab) Data sources
  • 19. ● All your personal data (PD) are being deeply mined ● The industry of collecting, aggregating, and brokering PD is “database marketing.” ● 1.1 billion browser cookies , 200 million mobile profiles, and an average of 1,500 pieces of data per consumer in Acxiom Pay with your personal data
  • 21. ● Select small pieces ● Define default values for missed data ● Remove strange signals from data ● Merge some tables in one if required
  • 23. ● Set of items & transactions ● Need to find rules (simple implication: if A & B than C) Association rule learning
  • 24. It is the process of finding model of function that describes and distinguishes data class to predict the class of objects whose class label is unknown. What is Cluster Analysis?
  • 25. ● Statistical process for estimating the relationships among variables ● The estimation target is function (it can be probability distribution) ● Can be linear, polynomial, nonlinear and etc. Regression
  • 26. ● Training set of classified examples (supervised learning) ● Test set of non-classified items ● Main goal: find a function (classifier) that maps input data to a category ● Computer vision, drug discovery, speech recognition, biometric indentification, credit scoring Classification
  • 27. A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Decision trees
  • 28. ● There are two classes of objects A & B (red & blue) ● Define the class of new object, based on information about its neighbors ● Changing the boundaries of an new object area, we form a set of neighbors. ● New object is B becuase majority of the neighbors is a B. kNN
  • 30.
  • 31. Most popular Data Mining algorythms
  • 32. ● R is free and R is language ● Graphics and data visualization ● A flexible statistical analysis toolkit ● Access to powerful, cutting-edge analytics ● A robust, vibrant community ● Unlimited possibilities R
  • 33.
  • 34.
  • 35.
  • 36. ● Octave is free and Octave is language ● C++ support (you can call Octave functions from C/C++) ● Easy prototyping for Matlab ● Extensive graphics capabilities for data visualization and manipulation ● Java support Octave
  • 37.
  • 38. In conclusion ● All we need in Math … ● Use R & Octave ● Data is only first step ● Dependencies is everywhere