3. 3
Agenda
• Data vs. Information
• What is Data Mining?
• Technical Platform
• Data Mining Process Overview
• Key Concepts and Terminology
• DEMO: Data Mining Process in Detail Using DMX
9. 9
Data Mining
• Technologies for analysis of data and discovery of (very) hidden patterns
• Uses a combination of statistics, probability analysis and database technologies
• Fairly young (<20 years old) but clever algorithms developed through database research
10. 10
What does Data Mining Do?
• Explores your data
• Finds patterns and trends
• Performs predictions
13. 13
We Need More Than Just Database Engine
SQL Server: Integrate, Analyze, Report
Integrate
• Data acquisition and integration from multiple sources
• Data transformation and synthesis
Analyze
• Knowledge and pattern detection through Data Mining
• Data enrichment with logic rules and hierarchical views
Report
• Data presentation and distribution
• Data publishing for mass recipients
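15. 15
Server Mining Architecture
[Diagram: the Analysis Services Server hosts the Mining Model, the Data Mining Algorithm, and the Data Source. Models are deployed to the server from BIDS, Excel, Visio, or SSMS. Excel, Visio, SSRS, or your own app send data to the server and query the model over OLE DB/ADOMD/XMLA.]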
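17. 17
Mining Process
[Diagram: training data is fed through the DM Engine to turn an empty Mining Model into a trained Mining Model. The data to be predicted is then fed through the DM Engine together with the trained Mining Model and comes out with predictions.]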
18. 18
Steps for Building a DM Model
1. Model Creation
• Define columns for cases: visually (BIDS), using DMX, or from PMML
2. Model Training
• Feed lots of data from a real database, or from a system log
Congratulations! We now have a model
3. Model Testing
• Test on sample data to check predictions
• Testing data must be different from training data
• If we get nonsense, adjust the algorithm, its parameters, the model design, or even the data
4. Model Use (Exploration and Prediction)
• Use the model on new data to predict outcomes
19. 19
Many Approaches
• Work the way you like:
  • Database experts and SQL veterans:
    • Write queries in DMX (similar to T-SQL)
  • Everyone else:
    • Use Business Intelligence Development Studio (BIDS), a rich GUI included with SSAS
      • Hosted in Visual Studio (included!)
      • You don't have to program: click-click instead
    • Use Excel 2007 with Data Mining Add-Ins
      • The "Data Mining" tab has everything you need
      • The "Table Analysis" tab is easier but simplified
21. 21
Mining Structure
• Describes data to be mined
• Columns from a data source and their:
  • Data Type
  • Content Type
• Contains Mining Models
  • Often we build several different models in one structure
• Holds training data, known as Cases (if required)
• Holds testing data, known as Holdout (in SQL Server 2008)
23. 23
Data Mining Model
• Container of patterns discovered by a Data Mining Algorithm amongst the training Cases
  • A table containing patterns
  • Expressed by visualisers
• Specifies usage of columns already defined in the Mining Structure
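
Because a trained model exposes its patterns as a table, they can be read with a query as well as through a visualiser. The following is a minimal sketch, assuming the CreditRisk model defined on the later DMX slide has already been created and trained. The columns are part of the standard mining model content rowset, but what each row means depends on the algorithm.

```sql
-- DMX (not T-SQL): run against an Analysis Services database.
-- Assumes a trained mining model named CreditRisk.
SELECT NODE_UNIQUE_NAME,   -- identifier of the node (for a tree: a split or a leaf)
       NODE_CAPTION,       -- short label for the node
       NODE_DESCRIPTION,   -- readable description of the pattern
       NODE_SUPPORT        -- number of training cases behind the node
FROM CreditRisk.CONTENT
```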
24. 24
Cases: The Things We Study
• Case: a set of columns (attributes) you want to analyse
  • Age, Gender, Region, Annual Spending
• Case Key: unique ID of a case
26. 26
Data Mining Extensions (DMX)
• "T-SQL" for Data Mining
• Easy! Like scripting for IT Pros
• Two types of statements:
  • Data Definition
    • CREATE, ALTER, EXPORT, IMPORT, DROP
  • Data Manipulation
    • INSERT INTO, SELECT, DELETE
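27. 27
DMX - Just Like T-SQL
-- Run each statement separately; DMX executes one statement at a time.
-- [CustomerData] is an assumed Analysis Services data source name,
-- added so that the source queries are valid DMX.

-- 1. Create the model: CustID is the case key, Risk is the column to predict.
CREATE MINING MODEL CreditRisk
(CustID LONG KEY,
 Gender TEXT DISCRETE,
 Income LONG CONTINUOUS,
 Profession TEXT DISCRETE,
 Risk TEXT DISCRETE PREDICT)
USING Microsoft_Decision_Trees

-- 2. Train the model with known customers.
INSERT INTO CreditRisk
(CustID, Gender, Income, Profession, Risk)
OPENQUERY([CustomerData],
 'SELECT CustomerID, Gender, Income, Profession, Risk FROM Customers')

-- 3. Predict the risk of new customers.
SELECT NewCustomers.CustomerID, CreditRisk.Risk,
 PredictProbability(CreditRisk.Risk)
FROM CreditRisk PREDICTION JOIN
OPENQUERY([CustomerData],
 'SELECT CustomerID, Gender, Income, Profession FROM NewCustomers') AS NewCustomers
ON CreditRisk.Gender = NewCustomers.Gender
AND CreditRisk.Income = NewCustomers.Income
AND CreditRisk.Profession = NewCustomers.Profession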
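
One way to test the CreditRisk model is to run the same kind of prediction join over rows whose outcome is already known, returning the actual and predicted values side by side. The sketch below is not part of the original demo. The data source [CustomerData], the table TestCustomers, and the literal values in the second query are hypothetical and used only for illustration.

```sql
-- DMX sketch; run each statement separately.
-- [CustomerData] and TestCustomers are hypothetical names.

-- Test: compare known outcomes with predictions for rows not used in training.
SELECT t.CustomerID,
       t.Risk AS ActualRisk,
       Predict(CreditRisk.Risk) AS PredictedRisk,
       PredictProbability(CreditRisk.Risk) AS Probability
FROM CreditRisk
PREDICTION JOIN
OPENQUERY([CustomerData],
  'SELECT CustomerID, Gender, Income, Profession, Risk FROM TestCustomers') AS t
ON  CreditRisk.Gender = t.Gender
AND CreditRisk.Income = t.Income
AND CreditRisk.Profession = t.Profession

-- Singleton query: predict for one hand-typed case, with no source table.
-- NATURAL PREDICTION JOIN maps inputs to model columns by name.
SELECT Predict(Risk) AS PredictedRisk,
       PredictProbability(Risk) AS Probability
FROM CreditRisk
NATURAL PREDICTION JOIN
(SELECT 'Female' AS Gender, 42000 AS Income, 'Teacher' AS Profession) AS t
```

Rows where ActualRisk and PredictedRisk disagree are the ones to examine when deciding whether to adjust the algorithm, its parameters, or the data.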
28. 28
Demo's Steps
1. Create Mining Structure
2. Create Mining Model
3. Process Mining Model
4. Test Model
5. Execute Prediction
SSMS: SQL Server Management Studio
Tools for creating Mining Models. The tools Microsoft provides include Business Intelligence Development Studio, Excel, Visio, and SQL Server Management Studio. Once Mining Models have been created, they must be deployed to Analysis Services (A.S.), which is where models are run and managed. Note that models that have just been deployed to Analysis Services are empty. Before they can be used, they must go through a process called training (or processing) the model. This requires the third component, the Data Source. The data source holds the data needed both to train the model and to test it, so the data must be split into two separate parts, one for each task. The fourth component is the applications that use the Mining Models that have been built. These can be Microsoft software such as Excel or Visio, or applications built by users. They send their data to Analysis Services and receive the results of the data mining process in response.
Approaches: ways of approaching a task
Concepts: the underlying ideas
Terminology: the specialist terms
Mining Structures (Analysis Services - Data Mining)
The mining structure defines the data from which mining models are built: it specifies the source data view, the number and type of columns, and an optional partition into training and testing sets. A single mining structure can support multiple mining models that share the same domain. The following diagram illustrates the relationship of the data mining structure to the data source, and to its constituent data mining models.
The mining structure in the diagram is based on a data source that contains multiple tables, joined on the CustomerID field. One table contains information about customers, such as the geographical region, age, income and gender, while the related nested table contains multiple rows of additional information about each customer, such as products the customer has purchased. The diagram shows that multiple models can be built on one mining structure, and that the models can use different columns from the structure.
Model 1: Uses CustomerID, Income, Age, Region, and filters the data on Region.
Model 2: Uses CustomerID, Income, Age, Region, and filters the data on Age.
Model 3: Uses CustomerID, Age, Gender, and the nested table, with no filter.
Because the models use different columns for input, and because two of the models additionally restrict the data that is used in the model by applying a filter, the models might have very different results even though they are based on the same data. Note that the CustomerID column is required in all models because it is the only available column that can be used as the case key.
This section explains the basic architecture of data mining structures. For more information about how to create, manage, modify, or view data mining structures, see Managing Data Mining Structures and Models.
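
A structure like the one described, with two of its filtered models, might be defined in DMX roughly as follows. This is a sketch under stated assumptions: the object names, the 30 percent holdout, the choice of Income as the predictable column, the algorithm, and the filter values are all illustrative and do not come from the text. Model 3 and its nested table are left out to keep the example short.

```sql
-- DMX sketch; run each statement separately.
-- All names, the holdout size, the PREDICT column, and the filter values
-- are assumptions made for illustration.

-- The structure: column definitions plus an optional training/testing partition.
CREATE MINING STRUCTURE [Customer Structure]
(
    CustomerID LONG KEY,
    Income     LONG CONTINUOUS,
    Age        LONG CONTINUOUS,
    Region     TEXT DISCRETE,
    Gender     TEXT DISCRETE
)
WITH HOLDOUT (30 PERCENT)

-- Model 1: uses CustomerID, Income, Age, Region; filters on Region.
ALTER MINING STRUCTURE [Customer Structure]
ADD MINING MODEL [Model 1]
(
    CustomerID,
    Income PREDICT,
    Age,
    Region
)
USING Microsoft_Decision_Trees
WITH FILTER (Region = 'Europe')

-- Model 2: same columns; filters on Age.
ALTER MINING STRUCTURE [Customer Structure]
ADD MINING MODEL [Model 2]
(
    CustomerID,
    Income PREDICT,
    Age,
    Region
)
USING Microsoft_Decision_Trees
WITH FILTER (Age > 30)
```

Both models list CustomerID because it is the case key, and neither uses Gender even though the structure defines it.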
A data mining model gets data from a mining structure and then analyzes that data by using a data mining algorithm. The mining structure and mining model are separate objects. The mining structure stores information that defines the data source. A mining model stores information derived from statistical processing of the data, such as the patterns found as a result of analysis. A mining model is empty until the data provided by the mining structure has been processed and analyzed. After a mining model has been processed, it contains metadata, results, and bindings back to the mining structure.
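
To see whether a given model is still empty or has been processed, one option is to query the mining models schema rowset. This is a sketch, assuming a SQL Server 2008 or later Analysis Services instance, where schema rowsets can be queried through the $system prefix.

```sql
-- Lists the mining models in the current Analysis Services database.
SELECT MODEL_NAME,        -- name of the mining model
       MINING_STRUCTURE,  -- structure the model is bound to
       SERVICE_NAME,      -- algorithm, for example Microsoft_Decision_Trees
       IS_POPULATED,      -- false until the model has been processed
       LAST_PROCESSED     -- time of the most recent processing
FROM $system.DMSCHEMA_MINING_MODELS
```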