Name: KRIENGSAK CHANINCHOMPOONUT
Date: December 10th, 2010
1
As technology is applied ever more widely across data mining research,
good decision making has become key to a successful organizational
strategy. Data mining gives you access to the information you need to
make intelligent decisions about difficult business problems: it
identifies rules and patterns in data, so that you can determine why
things happen and predict what will happen in the future. The top-down
technique can be used when the data follows a function that can be
calculated with an equation. In real-world scenarios, however, complex
data does not always yield an accurate outcome this way, because many
cases cannot be solved with a mathematical formula that tries to map
the unknown factors into an algorithm. The bottom-up technique offers
another solution, and the results of the two approaches, top-down and
bottom-up, can be cross-validated against each other.
2
Top-Down technique
Bottom-Up technique
As a result, the next numbers in this dataset are likely to be 0, 4, 7
and so on, because we are able to map the known factors into an
equation.
The dataset at the bottom is different: its unknown factors have to be
learned from the bottom up, because the data shows no linear proportion
that an equation can solve. Instead, the points spread out over the
graph with no apparent direction. If we keep using an equation on this
dataset, we will rarely or never detect any pattern or relationship.
That is why the bottom-up technique is the efficient way here: it
learns from the data and recognizes a pattern when a similar one
appears again in the dataset.
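The contrast between the two techniques can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the equation y = 3n + 1, the history values and the nearest-match lookup are all hypothetical and are not taken from the slides. The lookup stands in for "recognizing a similar pattern" in the simplest possible way.

```python
def top_down(n):
    """Top-down: the rule is known, so it is written as an equation.
    The formula y = 3n + 1 is a hypothetical example."""
    return 3 * n + 1


def bottom_up(history, query):
    """Bottom-up: no equation is known, so recall the outcome of the most
    similar observation seen before (a simple nearest-match lookup)."""
    nearest = min(history, key=lambda item: abs(item[0] - query))
    return nearest[1]


# Hypothetical (value, outcome) observations, invented for illustration.
history = [(1.0, "up"), (2.1, "down"), (3.7, "up"), (5.2, "down")]

next_value = top_down(4)               # computed from the equation
recalled = bottom_up(history, 3.5)     # recalled from the closest known case
```

The top-down function needs no data at all, while the bottom-up function is only as good as the history it has seen.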
3
To answer various types of business questions, data mining helps you
find patterns and relationships in data that are not apparent to the
human eye. It analyzes the dataset with mathematical algorithms such as
decision trees, segmentation, clustering, association and time series,
here through Microsoft SQL Server technologies, and confirms the
discovered patterns so that predictions can be based on patterns in the
historical data. The valuable information found can then be used in
applications such as finance, marketing and sales forecasting, CRM and
ERP.
The main topic of this project is using the database as the foundation
for choosing an appropriate model and algorithms, based on pattern
recognition or detection in the historical data.
4
The following development tools are used in this project:
Application
Microsoft SQL Database Server (MSSQL)
Microsoft SQL Server Analysis Services (SSAS)
Microsoft SQL Server Integration Services (SSIS)
Microsoft Visual Studio C#
Microsoft Decision Tree Algorithm
Microsoft Naïve-Bayes Algorithm
Neural Network Algorithm
Hardware
Server running the SQL Database engine and Analysis Services
PC for gathering the source data daily and supplying it to MSSQL
Server running SSIS to update the SSAS server daily
PC for C# coding, database, SSAS and data mining design
5
This project is implemented in 5 phases:
Phase I : Identify the business problems
Phase II : Data source collection
Phase III : Database transformation
Phase IV : Data mining model building
Phase V : Model Assessment
6
[Architecture diagram. Components: Data source; Data converting and SSIS; MSSQL Database Server; SSAS Database Server (data mining); Neural Network. Flows: convert and supply data to MSSQL; produce data mining; query data from the database; NN produces data mining.]
7
To identify the business need, the experiment in this project uses a
financial application that asks the following questions:
To help the finance department manage a currency swap, which factors
most affect the US Dollar / Thai Baht exchange rate?
And what is the next day's exchange rate likely to be?
For the rest of this presentation, this inquiry is referred to as
follows:
Fundamental : The finance department's inquiry, as stated above.
8
To answer the questions from the first phase, the appropriate data
needs to be collected. Ideas can come from people with the relevant
experience, who help narrow the huge volume of raw data down to
meaningful data instead of gathering everything, including meaningless
data.
However, data mining techniques tend to require more historical data
than standard models and, in the case of neural networks, can be
difficult to interpret.
9
Contents                          Data Source
Economic statistical indicators   Bank of Thailand
Daily Thai stock index            The Stock Exchange of Thailand
Daily Thai bank interest rate     Bank of Thailand
Daily exchange rates              Bank of Thailand
Daily gold trading price          Bloomberg; Thai Gold Trader
Daily crude oil prices            Bloomberg
Daily world stock index           Bloomberg
10
Database Tables
Once all the expected data sources are available, the data
transformation begins. I wrote C# scripts that grab the data from the
raw sources and feed it into the MSSQL database server, which is
updated automatically every day.
32 Tables
Only the appropriate tables are selected for this project.
A view named usdVSVariables is created over the selected tables.
Fundamental Database
11
12
-- One row per date: USD exchange rate joined on DateKey with gold, crude oil,
-- SET, Dow Jones and the Bangkok Bank loan and deposit rates.
SELECT DISTINCT TOP (100) PERCENT
    dbo.ExchangeRates.DateKey,
    dbo.GoldMarket.DollarPerOunce,
    dbo.Energy.Value AS CrudeOil,
    dbo.ExchangeRates.BuyingSightBill,
    StockValue.SETValue,
    StockValue.DJValue,
    InterestMRR.MRR,
    DepositRate.OneYearMax
FROM dbo.ExchangeRates
INNER JOIN dbo.Energy
    ON dbo.ExchangeRates.DateKey = dbo.Energy.DateKey
INNER JOIN dbo.GoldMarket
    ON dbo.Energy.DateKey = dbo.GoldMarket.DateKey
INNER JOIN (
    -- Self-join StockMarket to put the SET and Dow Jones values side by side.
    SELECT T.DateKey, T.Value AS SETValue, D.Value AS DJValue
    FROM dbo.StockMarket AS T
    INNER JOIN dbo.StockMarket AS D
        ON T.DateKey = D.DateKey
    WHERE T.Symbol = 'SET' AND D.Symbol = 'DowJones'
) AS StockValue
    ON dbo.GoldMarket.DateKey = StockValue.DateKey
INNER JOIN (
    SELECT DateKey, BankName, MRR
    FROM dbo.LoanInterestRate
    WHERE BankName = 'Bangkok Bank'
) AS InterestMRR
    ON StockValue.DateKey = InterestMRR.DateKey
INNER JOIN (
    SELECT DateKey, BankName, OneYearMax
    FROM dbo.DepositInterestRate
    WHERE BankName = 'Bangkok Bank'
) AS DepositRate
    ON InterestMRR.DateKey = DepositRate.DateKey
WHERE dbo.ExchangeRates.DateKey > 19991231      -- assuming DateKey is a yyyymmdd integer
  AND dbo.ExchangeRates.Currency = 'USD'
  AND dbo.GoldMarket.DollarPerOunce > 0;        -- skip rows without a gold price
SQL Code
13
14
15
SSAS Sample (Internet connection required) Or follow this link
http://www.youtube.com/watch?v=xjEy-zNE9P8
At this point, I divide the demonstration into two sections:
Fundamental : Predicting the USD / Thai Baht exchange rate
Customers : Identifying prospective customers who are a potential
Let us start with the Fundamental data mining implementation. The
standard approach to modeling how the fundamental factors drive the
exchange rate is to take all associated attributes as input variables,
predict Thai Baht per dollar as the result, and analyze which factors
are the most influential.
Mining Structure
Data source: SSAS server
Training and testing data split: 70:30
Data type: discretized
Key : DateKey
16
To find the most important variables for predicting Thai Baht per dollar, I use a
hybrid approach that exploits the advantages of each algorithm: a Decision Tree and
Naïve Bayes classify which variables to use as input to the Neural Network
algorithm. A decision tree is capable of detecting rules like “if A then B”. It does
not handle continuous values well, so instead of a rule like “if A then 2.5” it
splits the node as “if A > 20 then B”. That is why the Neural Network takes over to
produce numeric outcomes, which can be compared against the Decision Tree results.
In this way, the forecast of Thai Baht per dollar is based on the associated
variables and should approximate the next day's rate more accurately and
efficiently.
[Diagram: Variables 1 to 6 are the input to the Decision Tree, which classifies them; the selected variables (Variable 2 and Variable 6 in the diagram) become the input to the Neural Network.]
17
[Diagram: Naïve Bayes with its input variables]
All associated variables can be gathered by survey, through
external data research, or by talking to people with the
relevant experience. The advantage of forecasting with
several factors instead of depending on only one is that the
factors cross-validate the result, which gives a
higher-quality and more precise interpretation of the data.
Variable          Description                       Usage
SETValue          Thai stock index (SET)            Input
DJValue           Dow Jones index                   Input
CrudeOil          Crude Oil dollar per barrel       Input
DollarPerOunce    Gold price dollar per ounce       Input
BuyingSightBill   Thai Baht per USD currency rate   Output – Predicted
DateKey           Date dimension                    Key column
18
To get the whole picture of how each attribute relates to the predicted
value, we typically retrieve the full history of those attributes from the
database. This gives an idea of the main pattern in the big cycle and
determines the ceiling and floor of the data range. Later we can narrow the
data range to look for a pattern in a small cycle within the big cycle.
10 Years Data range
1 Year Data range
19
10 Years Gold Price Dollar Per Ounce and
Baht Per USD Currency Relationship Graph
From Jan-01-2000 To Dec-31-2010
DollarPerOunce
20
CrudeOil
10 Years Crude Oil USD/Barrel and
Baht Per USD Currency Relationship Graph
From Jan-01-2000 To Dec-31-2010
21
10 Years Thai SET Index and
Baht Per USD Currency Relationship Graph
From Jan-01-2000 To Dec-31-2010
SETValue
22
10 Years Dow Jones Index and
Baht Per USD Currency Relationship Graph
From Jan-01-2000 To Dec-31-2010
DJValue
23
A decision tree can help identify which factors to consider and
how each factor has historically been associated with different
outcomes.
Concept : A decision tree is a classification algorithm that makes
predictions based on the relationships between input columns in a
dataset by creating a series of splits, or nodes, in the tree. The
algorithm adds a node to the model every time an input column is
found to be significantly correlated with the predictable column.
To capture the big cycle of the data range, in this scenario the
algorithm discretizes the predictable attribute into 2 buckets, as
follows:
Attribute   Baht per USD
Bucket 1    < 38.32
Bucket 2    >= 38.32
After processing, the decision tree helps determine which variables
most affect values under 38.32 and values at or above 38.32.
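How a split is chosen can be sketched with information gain, a common textbook criterion for decision trees. This is a minimal sketch under stated assumptions: the Microsoft Decision Trees algorithm may score splits differently, the rows are invented for illustration, and the helper names (`entropy`, `bucket`, `information_gain`) and the SET threshold of 800 are hypothetical. Only the 38.32 cut point and the 543.44 gold threshold come from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def bucket(baht_per_usd, cut=38.32):
    """Discretize the predictable attribute into the two buckets."""
    return "<38.32" if baht_per_usd < cut else ">=38.32"


def information_gain(rows, attribute, threshold):
    """Entropy reduction obtained by splitting rows on attribute < threshold."""
    labels = [bucket(r["BahtPerUSD"]) for r in rows]
    left = [bucket(r["BahtPerUSD"]) for r in rows if r[attribute] < threshold]
    right = [bucket(r["BahtPerUSD"]) for r in rows if r[attribute] >= threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
    return entropy(labels) - weighted


# Hypothetical rows, invented for illustration only.
rows = [
    {"DollarPerOunce": 300.0, "SETValue": 700.0, "BahtPerUSD": 42.0},
    {"DollarPerOunce": 400.0, "SETValue": 900.0, "BahtPerUSD": 40.5},
    {"DollarPerOunce": 900.0, "SETValue": 720.0, "BahtPerUSD": 33.0},
    {"DollarPerOunce": 1200.0, "SETValue": 950.0, "BahtPerUSD": 31.0},
]

gold_gain = information_gain(rows, "DollarPerOunce", 543.44)
set_gain = information_gain(rows, "SETValue", 800.0)
```

In this toy data the gold split separates the two buckets completely while the SET split separates nothing, so a tree built this way would split on gold first.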
24
Dependency Network
The dependency network displays the relationships between the attributes,
from the least to the most important factor for the predictable attribute.
The center node of the chart represents the predictable attribute and the
surrounding nodes represent the input attributes. Number 1 is the most
important factor, while 4 is the least. In the diagram, SET Value is the
least influential factor, so it disappears first when the view is adjusted,
followed by Crude Oil, DJ Value and Dollar Per Ounce, in that order.
Accordingly, the decision tree automatically creates its tree nodes in
order from the most to the least important factor.
25
Trees Nodes
Typically, a decision tree is a classification model that holds all cases at the
root node and then splits on the most influential factor into child nodes, here
Value – vEnergy. Each child node then splits on the second most important factor,
and so on, until no more cases can be split. These last, least important nodes are
called leaf nodes, as in the diagram below.
In the diagram, the pink part of each histogram represents values < 38.32 and the
green part represents values >= 38.32. Each node splits into 3 DollarPerOunce
nodes, shown with their data ranges and with colors that indicate the categories.
26
Histogram
Each node may contain a single factor or several factors, and it reports
statistics, supporting cases and probabilities, which are represented by a
histogram. The histogram indicates the share of cases in the node that fall
into each outcome. For example, if we travel from the root node to the node
DollarPerOunce < 543.445, the histogram is almost entirely green, with 906
cases and a probability of 92.65%, which implies that this node corresponds
to Baht/USD greater than 38.32.
Even though DJValue is split into greater than 10532 and less than 10532,
both nodes also support Baht/USD > 38.32. Apparently the only difference is
that the cases are grouped into two categories, and a case can fall into
either node.
The DJValue and Baht/USD relationship chart makes this clearer.
27
[Figure: 10 Years Dow Jones Index and Baht Per USD Currency Relationship Graph, From Jan-01-2000 To Dec-31-2010. Reference lines at Baht/USD = 38.32 and Dow Jones = 10532 mark the DJValue >= 10532 zone and the DJValue < 10532 zone.]
28
After the decision tree is processed, nodes with a low histogram have little influence
on the predicted value, so only the nodes with the purest color are interpreted.
As a result, the gold price is the most influential factor in determining the direction
of Baht/USD. If the gold price goes up, Baht/USD is likely to go down; conversely, if
the gold price goes down, Baht/USD goes up.
The dependency network helps to confirm that the gold price is the most important
factor in the tree algorithm. This can be verified by looking at the next level under
the gold price node 543.44-862.84, which splits into 3 Thai SET index nodes. Although
all three have high histograms, they are likely meaningless, because the process
29
[Figure: DollarPerOunce and Baht/USD graph, with reference lines at Baht/USD = 38.32 and Gold > 543.44]
30
repeats recursively for each child, so each child is given the whole range of
SET values and can land in any zone of the SET range. Nevertheless, for
Baht/USD under 38.32 with a gold price of 543.44 – 862.84, there are 3 SET
nodes that support this scenario.
The same observation applies to the node with a gold price below 543.44, as
explained by the figures on pages 27-28. For instance, if the gold price drops
below 543.44, Baht/USD is likely to go up for any range of the Dow Jones.
[Figure: SETValue and Baht/USD graph, with a reference line at Baht/USD = 38.32 and SET Zones 1, 2 and 3]
31
A decision tree can segment the dataset and point out which variable has the most impact
on the predicted value. Its disadvantage is that it is built from a univariate split at the
root and at each node, and the data is split recursively from the root node to the leaf
nodes, where usually very little data is left to make a decision. For instance, recall
from the previous figure that under the gold price node 543.44 – 862.84 there are 3 SET
value nodes. Those nodes cannot say exactly which SET data range applies; instead they
cover every possible zone, because their decisions are made recursively on top of their
parent node.
Naïve-Bayes is different: each attribute is evaluated independently and directly against
the predicted value, not recursively through other nodes. A classifier is made at the leaf
nodes. For instance: Are small companies with annual profits of more than $500K a bad
credit risk? Are large companies with negative annual profits still a good credit risk?
Naïve-Bayes does not consider combinations of attributes the way a decision tree does. So,
if the decision tree segments the data into the essential parts of the big picture, each
segment represented by a leaf can then be described through Naïve-Bayes.
It all depends on how the business problem is defined. If we are only looking for the big
picture of the data, a decision tree provides enough information. But if we need to focus
on, or explore, other attributes that do not depend on the big picture, we need
Naïve-Bayes for the task.
In this case, the gold price node is the big picture, since every path from the root
through the tree to a leaf node includes it. The leaf nodes, however, contain little data,
which might still be important if we process it with Naïve-Bayes at the leaf.
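The independence idea behind Naïve-Bayes can be sketched as follows. This is a minimal textbook version, not the Microsoft Naïve Bayes implementation: the rows, the "low"/"high" states and the Laplace smoothing over two states are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(rows, target):
    """Count class frequencies and, per (attribute, class), state frequencies."""
    class_counts = Counter(r[target] for r in rows)
    state_counts = defaultdict(Counter)
    for r in rows:
        for attr, state in r.items():
            if attr != target:
                state_counts[(attr, r[target])][state] += 1
    return class_counts, state_counts


def predict(class_counts, state_counts, case):
    """Posterior per class, treating every attribute as independent."""
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for attr, state in case.items():
            # Laplace smoothing, assuming two possible states (low / high).
            score *= (state_counts[(attr, cls)][state] + 1) / (n + 2)
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Hypothetical discretized rows, invented for illustration only.
rows = [
    {"Gold": "low", "CrudeOil": "low", "Baht": ">=38.32"},
    {"Gold": "low", "CrudeOil": "low", "Baht": ">=38.32"},
    {"Gold": "low", "CrudeOil": "high", "Baht": ">=38.32"},
    {"Gold": "high", "CrudeOil": "high", "Baht": "<38.32"},
    {"Gold": "high", "CrudeOil": "high", "Baht": "<38.32"},
    {"Gold": "high", "CrudeOil": "low", "Baht": "<38.32"},
]

class_counts, state_counts = train(rows, "Baht")
posterior = predict(class_counts, state_counts, {"Gold": "low", "CrudeOil": "low"})
```

Each attribute contributes its own factor to the score, with no reference to a parent node, which is the difference from the recursive splits of the decision tree.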
32
Dependency Network
After Naïve-Bayes is executed, the Dependency Network gives a different order of
attribute importance from the Decision Tree. Crude Oil is the second most important
attribute instead of Dow Jones. That is because Crude Oil, like every other
attribute, is classified independently and directly against Baht/USD. The gold
price, however, is still the most important attribute.
The attribute profiles on the next page show each attribute's states by data range,
represented by color.
Baht/USD is split into two cases, >= 38.32 and < 38.32. The case >= 38.32 appears
more reliable than the case < 38.32 because it has fewer segments. Therefore those
input attributes have a meaningful relationship to Baht/USD.
33
Attribute Profiles
The figure on the left shows each attribute against Baht/USD. A pure color
indicates the highest probability. For example, the gold price is a very
confident indicator: the blue state, values below 543.44, supports
Baht/USD >= 38.32 with 96% probability. In contrast, the same attribute and
data range falls in the case < 38.32 with only 0.83% probability; that case
instead splits in a 50:41 proportion among values greater than 543.44.
Analyzing the result
Notably, the gold price and crude oil move in the opposite direction to Baht/USD: when
the gold price and crude oil price drop, Baht/USD goes up. Dow Jones and SET, by
contrast, do not have a linear relationship with Baht/USD (figures on pages 28 and 30),
so they can fall either under or above the 38.32 zone. For instance, Dow Jones below
and above 1053.85 falls in the case >= 38.32 with probability 68:32, and in the case
< 38.32 with probability 34:66. Therefore Dow Jones and SET are not confident
indicators of the Baht/USD direction in the Naïve Bayes algorithm, which is why they
have low importance in the dependency network.
34
[Figure: CrudeOil and Baht/USD graph, with a reference line at Baht/USD = 38.32]
In this phase, I use tools to determine the accuracy of the models that were created,
and examine the models to determine the meaning of the discovered patterns and how to
apply them to the business. For example, a model may determine that Baht/USD drops if
the gold price or crude oil goes up.
Obviously, a dataset with a linear relationship is more meaningful than random data.
The 10-year gold price and crude oil historical datasets may be the most appropriate
input attributes for data mining, but the same attribute might not contain any useful
patterns over a different data range. For example, 1 year of the crude oil historical
dataset might contain
35
a non-linear dataset, while SET might contain a useful pattern instead. So the approach
depends on the business need. If the focus is only on the main scope, then algorithms
with discretized content over a large historical dataset would be the best fit for the
application. On the other hand, a small historical dataset with numeric content might be
the best solution for an application that focuses on calculating real numbers, such as
daily stock forecasting, because a large dataset takes a lot of time to process. Even on a
high-performance computer, producing a Neural Network result might take a whole month of
learning and searching for just a small pattern across multiple input attributes.
Therefore a good approach for a general result is to build several models using
different algorithms and then compare the accuracy of these models.
One year Baht/USD - Crude Oil Historical
One year Baht/USD - SET Historical
36
The accuracy of an algorithm depends on the nature of the data, the data range and the
choice of an appropriate algorithm. You may need to repeat the data cleaning and
transformation in order to derive more meaningful variables, and then determine the big
picture of the dataset with the chosen algorithms. However, if the relationships among
attributes are complicated, a neural network may perform better.
Classification Matrix
It is essential to work with business analysts who have the proper domain knowledge to
validate the discoveries before the patterns found by data mining are deployed to
production.
Likewise, in this experiment the big-picture pattern found by the Decision Tree and
Naïve-Bayes algorithms, with gold price and crude oil as the two input attributes, needs
to be validated before we move to the next step.
However, to complete this project I will assume those attributes are the most important
in determining the direction of Baht/USD as a big picture. In the next step, a Neural
Network learns and searches the dataset derived from the previous algorithms' output,
attempting to fit the found pattern into a linear relationship.
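A classification matrix, as named on this slide, counts how often each actual bucket was predicted as each bucket. The sketch below shows how such a matrix and the resulting accuracy could be computed. The label lists are hypothetical and the function names are illustrative, not part of SSAS.

```python
from collections import Counter


def classification_matrix(actual, predicted):
    """Count (actual, predicted) pairs: one cell per combination of classes."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of cases where the predicted class equals the actual class."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total


# Hypothetical test-set labels, invented for illustration only.
actual = [">=38.32", ">=38.32", "<38.32", "<38.32", "<38.32"]
predicted = [">=38.32", "<38.32", "<38.32", "<38.32", ">=38.32"]

matrix = classification_matrix(actual, predicted)
model_accuracy = accuracy(matrix)
```

Computing the same matrix for each candidate model gives a simple basis for comparing their accuracy on the held-out test data.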
37
Recall from the beginning of this presentation that an unknown dataset pattern can be
solved with the bottom-up technique. A Neural Network is a good approach for complicated
data, as long as the input attributes are the right ones.
CONCEPT
A neural network (NN) is an algorithm based on the operation of biological neural
networks; in other words, it emulates the human brain. It is designed to work like a
human brain: it learns from problems and later solves other, similar problems.
In the human brain, action potentials are the electric signals that neurons use to convey
information; they travel through the net across what is called the synapse. Because these
signals are identical, the brain determines what type of information is being received
from the path the signal took. By analyzing the patterns of the signals being sent, the
brain can interpret the type of information being received.
To emulate that behavior, an artificial neural network has several components: the node
plays the role of the neuron, and the weights are the links between nodes, playing the
role of the synapse in the biological net. The input signals are modified by the weights
and summed to obtain the total input value for a specific node (diagram next page).
There are three kinds of layer in a NN: the input layer, which holds one node for each
input variable; the hidden layer, of which there can be several; and the output layer,
which holds the result set. An activation function is applied to the total input to
obtain the value of a particular node.
38
Neuron scheme
Node scheme
The diagram illustrates the neuron scheme: a neuron receives information from other
neurons as input via synapses, and the connections between neurons form branches, or a
network. Once the input is larger than a certain threshold, the neuron fires according
to the information received.
The node scheme works similarly. The perceptron takes a weighted sum of its inputs and
sends the output to other nodes if the sum is greater than an adjustable threshold
value. The inputs x1, x2, x3, ..., xm and the connection weights w1, w2, w3, ..., wm are
typically real values. If feature xi tends to cause the perceptron to fire, the weight
wi will be positive; if feature xi inhibits the perceptron, the weight wi will be
negative.
[Diagram: Perceptron with inputs labeled In]
The perceptron consists of the weights, the summation processor and an adjustable
threshold processor, or bias input. A bias input may carry more weight than the regular
inputs, in which case it ends up
39
affecting whether the activation function fires. Several learning algorithms are used in
neural networks. Backpropagation, one of the most popular, is used in this project.
Backpropagation propagates the error obtained in the output layer backwards, comparing
the value calculated in the nodes to the real or desired value. The error is distributed
by modifying the weights, or links, between the previous and the present nodes. Going
backwards, the values of the nodes in the hidden layer change, and so do the weights
between the input and hidden layers, but not the values of the nodes in the input layer,
since those are the values of the variables we are using. Once the algorithm reaches the
input layer, it goes forward again with the modified weights and recalculates the results
in the output layer. This process is repeated until a minimum error is reached.
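The backward and forward passes described above can be sketched on a very small network. This is a minimal illustration under stated assumptions, not the project's implementation: two inputs, two hidden nodes, one output, a bipolar sigmoid with slope 1, fixed starting weights instead of random ones, a learning rate of 0.2, and four invented, already-scaled samples.

```python
import math


def bipolar_sigmoid(x):
    """Squash x into the range (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def derivative_from_output(y):
    """Derivative of the bipolar sigmoid, expressed through its output y."""
    return (1.0 - y * y) / 2.0


def forward(x, w_hidden, w_out):
    """Hidden activations and network output; each node's last weight is its bias."""
    hidden = [bipolar_sigmoid(sum(w * v for w, v in zip(ws, x + [1.0])))
              for ws in w_hidden]
    out = bipolar_sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, out


def train_epoch(samples, w_hidden, w_out, rate=0.2):
    """One pass of online backpropagation; returns the summed squared error."""
    total = 0.0
    for x, target in samples:
        hidden, out = forward(x, w_hidden, w_out)
        error = target - out
        total += error * error
        delta_out = error * derivative_from_output(out)
        # Distribute the output error to the hidden nodes before w_out changes.
        delta_hidden = [delta_out * w_out[j] * derivative_from_output(h)
                        for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            w_out[j] += rate * delta_out * v
        for j, ws in enumerate(w_hidden):
            for i, v in enumerate(x + [1.0]):
                ws[i] += rate * delta_hidden[j] * v
    return total


# Hypothetical scaled samples: ([input 1, input 2], target).
samples = [([-1.0, -1.0], 0.6), ([-0.5, 0.5], 0.3),
           ([0.5, -0.5], -0.3), ([1.0, 1.0], -0.6)]
w_hidden = [[0.1, -0.2, 0.05], [-0.15, 0.1, 0.0]]  # fixed start, two hidden nodes
w_out = [0.2, -0.1, 0.05]

errors = [train_epoch(samples, w_hidden, w_out) for _ in range(2000)]
```

The list `errors` plays the role of a learning error history: it should shrink toward zero as the weights settle.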
[Diagram: One node scheme. The inputs GOLD, SET and BIAS, with weights w1, w2 and w3, feed the perceptron; its sum passes through the BipolarSigmoid function f to produce the Output.]
As shown in the diagram, the first layer has two input attributes and one
bias. They pass forward through their weights to the perceptron, which sums
the inputs and sends the result to the output layer.
The output layer fires through the activation function. The whole network
uses 20 nodes in the first layer to produce one output layer. The following
steps describe how it works.
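The one node scheme can be written out as a short calculation. This is a sketch under stated assumptions: the weight values are hypothetical, and the scaling of raw values into [-1, 1] is an assumed preprocessing step that the slides do not describe. The data ranges used for scaling (gold 1062 to 1413, SET 684 to 1047) are the ones listed under Implementation.

```python
import math


def bipolar_sigmoid(x):
    """Activation function with output in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def scale(value, low, high):
    """Map a raw value from [low, high] to [-1, 1] (assumed preprocessing)."""
    return 2.0 * (value - low) / (high - low) - 1.0


def node_output(gold, set_value, w1, w2, w3, bias=1.0):
    """Weighted sum of GOLD, SET and BIAS, passed through the activation."""
    total = gold * w1 + set_value * w2 + bias * w3
    return bipolar_sigmoid(total)


# Hypothetical weights; in the project they are found by learning.
w1, w2, w3 = -0.7, 0.3, 0.1
gold = scale(1091.40, 1062.0, 1413.0)
set_value = scale(784.38, 684.0, 1047.0)
output = node_output(gold, set_value, w1, w2, w3)
```

The output is itself in (-1, 1), so a real system would also map it back to a Baht/USD value; that step is omitted here.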
40
Learning Process
•Split the data into 2 sets: 85% for training and 15% for validation.
•Randomly select 20 values each for the gold price and SET weights from the training set.
•Generate the weights between the nodes.
•Compare the accuracy of the outputs against the actual data (validation set).
•Calculate the learning errors.
•Adjust for the output errors to improve the results.
•Feed in a new batch from the training set and repeat the process
until a minimum learning error is reached.
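The overall loop of these steps (split, learn, check against the validation set, repeat until the error is small enough) can be sketched as below. To stay short, a single linear node trained with the delta rule stands in for the neural network, which is an assumption of this sketch and not the project's method. The data, the tolerance of 0.01 and the epoch limit are likewise hypothetical.

```python
import random


def split(rows, train_share=0.85, seed=0):
    """Shuffle a copy of the rows, then cut it into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def predict(weights, x):
    """Linear stand-in model: slope * x + bias."""
    return weights[0] * x + weights[1]


def average_error(weights, rows):
    """Mean absolute difference between predicted and actual values."""
    return sum(abs(predict(weights, x) - y) for x, y in rows) / len(rows)


def learn(training, validation, rate=0.1, tolerance=0.01, max_epochs=1000):
    """Repeat weight updates until the validation error drops below tolerance."""
    weights = [0.0, 0.0]
    for epoch in range(1, max_epochs + 1):
        for x, y in training:
            error = y - predict(weights, x)
            weights[0] += rate * error * x
            weights[1] += rate * error
        if average_error(weights, validation) < tolerance:
            break
    return weights, epoch


# Hypothetical data following y = 0.5x + 0.1, invented for illustration only.
rows = [(i / 10 - 1.0, 0.5 * (i / 10 - 1.0) + 0.1) for i in range(20)]
training, validation = split(rows)
weights, epochs = learn(training, validation)
```

For a time series such as daily exchange rates, a chronological split may be preferable to a shuffled one; the shuffle here is only the simplest way to show an 85:15 split.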
Implementation
•Gold price data range : 1062 – 1413
•SET data range : 684 – 1047
•1 year data range Jan-01-2010 to Dec-31-2010
•24 Hours total learning process time.
•Query statement from SQL Server
The video shows how the learning process works: it keeps trying to
recognize the pattern against the actual values and to solve the
problem with an equation.
(Internet connection required) Or just follow this link
http://www.youtube.com/watch?v=7ghfX6kK5bo
41
Performance
Because the learning process takes so long, this experiment was capped at 24
hours, which gave a total error of 33.43 and an average error of 0.14. It would
take only a few minutes to generate a result if the data range were a month or
10 days, but the accuracy would drop as a result.
One year Baht/USD – Gold Price Result One year Baht/USD – SET Result
This validation gives a predicted Baht/USD of 33.01, an error of 0.16
compared to the actual value of 33.17.
42
Even though the 2009 values, gold price 1091.50 and SET 681.91, were not included
in the learning data range, the NN still recognizes the similar pattern that
occurs in 2010 and tries to generate a similar output.
The pattern does not rely on the gold price alone; SET also helps the NN classify
it. For instance, 2009 and 2010 both had the same gold price of 1091.40 but
different SET values, 686.41 and 784.38. So the Baht/USD result also varies with
the SET input.
Predicted Result VS Actual
The learning error history shows that the closer the error gets to zero, the more
accurate the NN result. The NN algorithm goes back and forth to find the weights
that allow it to predict the output variable, so the weights change from their
initial, randomly generated values to the final ones, which give a total error of
33.43. Across the pairs of predicted and actual values in the learning history,
the average error is 0.14, the minimum 0.0002 and the maximum 0.58.
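The error figures quoted above (total, average, minimum, maximum) can be computed from pairs of predicted and actual values as sketched below. The slides do not say whether the errors are absolute or squared differences, so absolute differences are assumed here. Only the pair 33.01 and 33.17 comes from the slides; the other pairs are invented for illustration.

```python
def error_summary(predicted, actual):
    """Total, average, minimum and maximum absolute error over all pairs."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "total": sum(errors),
        "average": sum(errors) / len(errors),
        "min": min(errors),
        "max": max(errors),
    }


# First pair from the validation slide; the other two pairs are hypothetical.
predicted = [33.01, 30.00, 31.25]
actual = [33.17, 30.50, 31.25]

summary = error_summary(predicted, actual)
```

If the average is the total divided by the number of pairs, the reported 33.43 and 0.14 would correspond to roughly 240 pairs, which is in line with about one year of daily values.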
43
Implement Neural Network learning video (Internet connection Required)
Or follow this link http://www.youtube.com/watch?v=VRiMbG6XIpk
44
Summary
To answer the finance department's inquiry about predicting the Thai Baht / USD
exchange rate, the Neural Network is the final step of this experiment. It takes
the input attributes classified by the Decision Trees and Naïve Bayes, analyzed
through the SQL database and SSAS, to predict the movement of Baht/USD as numeric
data. The experiment also covers data pattern recognition with several
approaches, i.e. classification, segmentation, approximation and back
propagation.

Weitere ähnliche Inhalte

Was ist angesagt?

Graduation Thesis Sample
Graduation Thesis SampleGraduation Thesis Sample
Graduation Thesis SampleGraduate Thesis
 
Data science and data analytics major similarities and distinctions (1)
Data science and data analytics  major similarities and distinctions (1)Data science and data analytics  major similarities and distinctions (1)
Data science and data analytics major similarities and distinctions (1)Robert Smith
 
Predictive Analytics: Business Perspective & Use Cases
Predictive Analytics: Business Perspective & Use CasesPredictive Analytics: Business Perspective & Use Cases
Predictive Analytics: Business Perspective & Use CasesCagri Sarigoz
 
Idiro Analytics - Analytics & Big Data
Idiro Analytics - Analytics & Big DataIdiro Analytics - Analytics & Big Data
Idiro Analytics - Analytics & Big DataIdiro Analytics
 
Barga, roger. predictive analytics with microsoft azure machine learning
Barga, roger. predictive analytics with microsoft azure machine learningBarga, roger. predictive analytics with microsoft azure machine learning
Barga, roger. predictive analytics with microsoft azure machine learningmaldonadojorge
 
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014Daniel Westzaan
 
The Impact of Data Science on Finance
The Impact of Data Science on FinanceThe Impact of Data Science on Finance
The Impact of Data Science on FinanceRoger Fried
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Mining

  • 1. Name: KRIENGSAK CHANINCHOMPOONUT. Date: December 10th, 2010
  • 2. As technology spreads into virtually every area of data mining research, good decision making has become a key to a successful organizational strategy. Data mining gives you access to the information you need to make intelligent decisions about difficult business problems: it identifies rules and patterns in data, so that you can determine why things happen and predict what will happen in the future. The Top-Down technique can be used when the data behave as a function that can be calculated from an equation. In real-world scenarios, however, complex data do not always yield an accurate outcome this way, because many cases cannot be solved with a mathematical formula that attempts to map the unknown factors into an algorithm. The Bottom-Up technique offers another solution, and the results of the two approaches, Top-Down and Bottom-Up, can be cross-validated against each other.
  • 3. Top-Down technique and Bottom-Up technique. With the top-down technique, the next numbers of the dataset are likely to be 0, 4, 7 and so on, because we can map the known factors into an equation. The dataset at the bottom is different: its unknown factors have to be learned from the bottom up, because the data show no linear proportion that an equation could solve. Instead, the points spread out over the graph in no known direction. If we still used an equation on this dataset, we would hardly ever detect a pattern or relationship. That is why the bottom-up technique is the efficient way here: it learns from the data and recognizes a similar pattern when it appears again in the dataset.
  • 4. To answer various types of business questions, data mining helps you find patterns and relations in data that are not apparent to the human eye. It analyzes the datasets with mathematical algorithms such as decision trees, segmentation, clustering, association and time series through Microsoft SQL Server technologies, and then uses the discovered patterns to make predictions based on the historical data. The valuable information found can be used in applications such as finance, marketing and sales forecasting, CRM and ERP. The main topic of this project is using the database as the foundation for an appropriate model and algorithms, based on pattern recognition or detection in the historical data.
  • 5. The following development tools are used in this project. Application: Microsoft SQL Database Server (MSSQL); Microsoft SQL Server Analysis Services (SSAS); Microsoft SQL Server Integration Services (SSIS); Microsoft Visual Studio C#; Microsoft Decision Tree algorithm; Microsoft Naïve-Bayes algorithm; Neural Network algorithm. Hardware: a server running the SQL database engine and Analysis Services; a PC for gathering the data sources daily and supplying them to MSSQL; a server running SSIS for updating the SSAS server daily; a PC for C# coding, database, SSAS and data mining design.
  • 6. The project is implemented in 5 phases. Phase I: Identify the business problems. Phase II: Data source collection. Phase III: Database transformation. Phase IV: Data mining model building. Phase V: Model assessment.
  • 7. [Architecture diagram: the data source is converted and supplied to the MSSQL database server (data converting, SSIS); the SSAS database server queries data from the database and produces the data mining; the Neural Network also produces data mining.]
  • 8. To identify the business need, the experiment in this project involves a financial application. To help the financial department manage a currency swap, it asks the following questions: Which factors most affect the US Dollar and Thai Baht exchange rate? And what is the next day's exchange rate likely to be? For the rest of this presentation, this inquiry is referred to by the following term. Fundamental: the questions asked by the financial department.
  • 9. To answer the questions from the first phase, the appropriate data need to be collected in this phase. Ideas can come from people with experience in the field, who help narrow the huge volume of raw data down to meaningful data instead of gathering everything, meaningless data included. However, data mining techniques tend to require more historical data than standard models and, in the case of neural networks, can be difficult to interpret.
  • 10. Contents and their data sources. Economic statistical indicators: Bank of Thailand. Daily Thai stock index: The Stock Exchange of Thailand. Daily Thai bank interest rate: Bank of Thailand. Daily exchange rates: Bank of Thailand. Daily gold trading price: Bloomberg, Thai Gold Trader. Daily crude oil prices: Bloomberg. Daily world stock index: Bloomberg.
  • 11. Database Tables Once we got all expected data source, the data transformation is begin. I wrote the scripts using C# grabbing all those data from the raw source and then feeding into the MSSQL database server which will be auto daily updating. 32 Tables The only selected appropriated tables will be include in this project. Create views table as usdVSVariables responding to selected appropriated Fundamental Database 11
  • 12. SQL Code
      SELECT DISTINCT TOP (100) PERCENT
          dbo.ExchangeRates.DateKey,
          dbo.GoldMarket.DollarPerOunce,
          dbo.Energy.Value AS CrudeOil,
          dbo.ExchangeRates.BuyingSightBill,
          StockValue.SETValue,
          StockValue.DJValue,
          InterestMRR.MRR,
          DepositRate.OneYearMax
      FROM dbo.ExchangeRates
      INNER JOIN dbo.Energy
          ON dbo.ExchangeRates.DateKey = dbo.Energy.DateKey
      INNER JOIN dbo.GoldMarket
          ON dbo.Energy.DateKey = dbo.GoldMarket.DateKey
      INNER JOIN (
          SELECT T.DateKey, T.Value AS SETValue, D.Value AS DJValue
          FROM dbo.StockMarket AS T
          INNER JOIN dbo.StockMarket AS D ON T.DateKey = D.DateKey
          WHERE (T.Symbol = 'SET') AND (D.Symbol = 'DowJones')
      ) AS StockValue
          ON dbo.GoldMarket.DateKey = StockValue.DateKey
      INNER JOIN (
          SELECT DateKey, BankName, MRR
          FROM dbo.LoanInterestRate
          WHERE (BankName = 'Bangkok Bank')
      ) AS InterestMRR
          ON StockValue.DateKey = InterestMRR.DateKey
      INNER JOIN (
          SELECT DateKey, BankName, OneYearMax
          FROM dbo.DepositInterestRate
          WHERE (BankName = 'Bangkok Bank')
      ) AS DepositRate
          ON InterestMRR.DateKey = DepositRate.DateKey
      WHERE (dbo.ExchangeRates.DateKey > 19991231)
        AND (dbo.ExchangeRates.Currency = 'USD')
        AND (dbo.GoldMarket.DollarPerOunce > 0)
  • 15. SSAS Sample (Internet connection required), or follow this link: http://www.youtube.com/watch?v=xjEy-zNE9P8
  • 16. At this point, I will divide the demonstration into two sections. Fundamental: predict the USD-Thai Baht exchange rate. Customers: identify prospective customers with potential. Let us start with the Fundamental data mining implementation. The standard approach to modeling how the fundamental factors drive the exchange rate is to use all the associated attributes as input variables and predict Thai Baht per dollar as the result, analyzing which factors are the most influential. Mining Structure: data source from the SSAS server; data for training and testing split 70:30; data type discretized; key: DateKey.
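The 70:30 division of cases between training and testing can be illustrated outside SSAS. The following is a minimal Python sketch, not the SSAS mechanism itself: the function name `split_cases`, the fixed seed and the sample `DateKey` values are assumptions made for illustration only.

```python
import random

def split_cases(date_keys, train_fraction=0.7, seed=0):
    """Randomly assign each case key to a training set or a testing set."""
    keys = list(date_keys)
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    rng.shuffle(keys)
    cut = round(len(keys) * train_fraction)
    return sorted(keys[:cut]), sorted(keys[cut:])

# Hypothetical DateKey values in the yyyymmdd form used by the view.
keys = [20100101 + day for day in range(10)]
train, test = split_cases(keys)
print(len(train), len(test))
```

Every case lands in exactly one of the two sets, so the model is assessed only on dates it did not see during training.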
  • 17. To show which variables are most important for predicting Thai Baht per dollar, I use a hybrid approach that exploits the advantages of each algorithm: a Decision Tree and Naïve Bayes classify which variables to use as input to the Neural Network algorithm. The decision tree is capable of detecting rules like “if A then B”. It does not handle continuous values well, so it cannot express “if A then 2.5”; instead it splits the node as “if A is > 20 then B”. That is why the Neural Network takes over: it produces numeric outcomes, and its results can be compared against the Decision Tree. In this way the forecast of Thai Baht per dollar, based on the associated variables, should be more accurate and should predict the approximate next-day rate more efficiently. [Diagram: Variables 1 to 6 are the input to the Decision Tree and Naïve Bayes, which classify Variable 2 and Variable 6 as the input to the Neural Network.]
  • 18. All associated variables can be retrieved by survey, through external data research, or by discussion with people who have experience in the field. The advantage of forecasting with several factors instead of depending on only one is that they cross-validate the result, which gives a more precise, higher-quality interpretation of the data. Variables (description, usage): SETValue (Thai stock index (SET), input); DJValue (Dow Jones index, input); CrudeOil (crude oil, dollars per barrel, input); DollarPerOunce (gold price, dollars per ounce, input); BuyingSightBill (Thai Baht per USD currency rate, output, predicted); DateKey (date dimension, key column).
  • 19. To get the whole picture of how each attribute relates to the predicted value, we typically need to retrieve the full history of those attributes from the database. This shows the main pattern of the big cycle and determines the ceiling and floor of the data range. We can then narrow the data range down to look for a pattern in a small cycle within the big cycle. [Figure: 10-year data range and 1-year data range.]
  • 20. [Chart: DollarPerOunce. 10-year relationship between the gold price (dollars per ounce) and Baht per USD, from Jan-01-2000 to Dec-31-2010.]
  • 21. [Chart: CrudeOil. 10-year relationship between crude oil (USD per barrel) and Baht per USD, from Jan-01-2000 to Dec-31-2010.]
  • 22. [Chart: SETValue. 10-year relationship between the Thai SET index and Baht per USD, from Jan-01-2000 to Dec-31-2010.]
  • 23. [Chart: DJValue. 10-year relationship between the Dow Jones index and Baht per USD, from Jan-01-2000 to Dec-31-2010.]
  • 24. A decision tree can help identify which factors to consider and how each factor has historically been associated with different outcomes of a decision. Concept: the Decision Tree is a classification algorithm that makes predictions based on the relationships between input columns in a dataset by creating a series of splits, or nodes, in the tree. The algorithm adds a node to the model every time an input column is found to be significantly correlated with the predictable column. To capture the big cycle of the data range, in this scenario the algorithm discretizes the attribute Baht per USD into 2 buckets: Bucket 1, < 38.32, and Bucket 2, >= 38.32. After processing, the decision tree helps determine which variable has the most effect on whether the value is under or above 38.32.
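How a tree decides which input column to split on first can be sketched with a generic entropy-based information gain. This is a stand-in for illustration and is not necessarily the scoring method SSAS applies; the four rows, the "low"/"high" labels and the function names are assumptions, and only the 38.32 threshold comes from the slide.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def bucket(baht_per_usd, threshold=38.32):
    """Discretize the predicted attribute into the two buckets from the slide."""
    return "< 38.32" if baht_per_usd < threshold else ">= 38.32"

# Made-up, already discretized cases; not taken from the project database.
rows = [
    {"Gold": "low", "SET": "low", "Rate": bucket(41.0)},
    {"Gold": "low", "SET": "high", "Rate": bucket(40.2)},
    {"Gold": "high", "SET": "low", "Rate": bucket(33.1)},
    {"Gold": "high", "SET": "high", "Rate": bucket(31.5)},
]
gains = {a: information_gain(rows, a, "Rate") for a in ("Gold", "SET")}
best = max(gains, key=gains.get)
print(best, gains)
```

In this toy data the gold attribute separates the two buckets perfectly while the SET attribute tells us nothing, so the gold attribute would be chosen for the first split.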
  • 25. Dependency Network. Displays the relationships between the attributes, from the least to the most important factors contributing to the predictable attribute. The center node of the chart represents the predictable attribute, and the nodes around it represent the input attributes. In the diagram, number 1 is the most important factor and 4 is the least. The SET Value is the least influential factor, so it is the first to disappear when the strength of the displayed links is adjusted, followed by Crude Oil, DJ Value and Dollar Per Ounce, in that order. As a result, the decision tree automatically creates its tree nodes in order from the most important factor to the least.
  • 26. Tree Nodes. Typically, the decision tree is a classification model that contains all cases at the root node and then splits on the most influential factor into what we call child nodes (here, Value – vEnergy). Each child node then splits on the second most important factor, and so on, until no more cases can be split; these final, least important nodes are called leaf nodes, as in the diagram below. In the diagram, the pink histogram represents values < 38.32 and the green histogram represents values >= 38.32. Each node splits into 3 DollarPerOunce nodes, shown with their data range and a color that indicates the category.
  • 27. Histogram. Each node may contain a single pure factor or several factors, and it reports statistics, supporting cases and probability, represented by a histogram. The histogram indicates the percentage of cases in the node that fall into each outcome. For example, travelling from the root node to the node DollarPerOunce < 543.445, the histogram is dominated by the green stripe, with 906 cases and a probability of 92.65%, which implies that this node determines a Baht/USD value greater than 38.32. Even though DJValue is split into greater than 10532 and less than 10532, both nodes also support Baht/USD > 38.32. The only difference is that the cases are grouped into two categories, and a case can fall into either node. The DJValue and Baht/USD relationship chart makes this clearer.
  • 28. [Chart: Dow Jones. 10-year relationship between the Dow Jones index and Baht per USD, from Jan-01-2000 to Dec-31-2010, with the values 38.32 and 10532 marked and the chart divided into a DJValue >= 10532 zone and a DJValue < 10532 zone.]
  • 29. After processing the decision tree, nodes with a low histogram do not influence the predicted value; only the nodes with the purest color are included in the interpretation. As a result, the gold price is the most influential factor in determining the direction of Baht/USD. If the gold price goes up, Baht/USD seems likely to go down, and if the gold price goes down, Baht/USD goes up. The dependency network helps confirm that the gold price is the most important factor in the tree algorithm, which can also be seen at the next level below the gold price node 543.44-862.84: it splits into 3 nodes of the Thai SET index. Although all of them have high histograms, they appear to be meaningless, because the process repeats recursively for each child and yields the whole range of SET values, which can be any zone of the SET range. [Chart annotations: 38.32; DollarPerOunce; Gold > 543.44.]
  • 30. However, under Baht/USD 38.32 with a gold price of 543.44-862.84, there are 3 SET nodes supporting the possibility of this scenario. The same observation applies to the nodes where the gold price is below 543.44, as explained by the figures on pages 27-28. For instance, if the gold price drops below 543.44, then with any range of the Dow Jones, Baht/USD is likely to go up. [Chart annotations: 38.32; SETValue; Zone 1, Zone 2, Zone 3.]
  • 31. A decision tree can classify a dataset into segments and point out which variable has the most impact on the predicted value. Its disadvantage is that the tree is built with a univariate split at the root and at each node. Because the data is split recursively from the root node to the leaf nodes, there is usually very little data left at a leaf to make a decision. For instance, recall from the previous figure that the gold price 543.44-862.84 node splits into 3 SET value nodes, but those nodes cannot specify an exact SET data range; instead they cover every possible zone, because their decisions are made recursively on the basis of their parent node. In Naïve-Bayes, by contrast, each attribute makes its decision independently, based directly on the predicted value and not recursively on any other node. A classifier is made at the leaf nodes. For instance: are small companies with annual profits of more than $500K a bad credit risk? Are large companies with negative annual profits still a good credit risk? Naïve-Bayes does not consider combinations of attributes the way a decision tree does. So, if the decision tree segments the data, which is an essential part of the big picture, then each segment of data represented by a leaf can be described through Naïve-Bayes. The choice depends on the business problem defined. If we are only looking for the big picture of the data, a decision tree provides enough information. But if we need to focus on, or explore, other attributes that do not depend on the big picture, then we need Naïve-Bayes for the task. In this case, the gold price node is the big picture, since every path from the root to a leaf passes through it. The leaf nodes, however, contain little data, which might still be important if we process it with Naïve-Bayes at the leaf.
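To make the contrast concrete, here is a minimal Naïve-Bayes sketch in which every attribute is scored independently against the predicted value. The attribute names, the "low"/"high" states, the six rows, and the add-one smoothing with two states per attribute are all assumptions for illustration; they are not the project's data or the tool's exact method.

```python
from collections import defaultdict

def train_naive_bayes(rows, target):
    """Count classes and (class, attribute, value) co-occurrences."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)
    for row in rows:
        label = row[target]
        class_counts[label] += 1
        for name, value in row.items():
            if name != target:
                feature_counts[(label, name, value)] += 1
    return class_counts, feature_counts

def posterior(class_counts, feature_counts, observation, states_per_attribute=2):
    """P(class | observation), treating the attributes as independent."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for name, value in observation.items():
            # Add-one smoothing; states_per_attribute is an assumed constant.
            seen = feature_counts[(label, name, value)]
            score *= (seen + 1) / (count + states_per_attribute)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical discretized rows: "baht" is the predicted attribute.
rows = [
    {"gold": "low", "oil": "low", "baht": "high"},
    {"gold": "low", "oil": "low", "baht": "high"},
    {"gold": "low", "oil": "high", "baht": "high"},
    {"gold": "high", "oil": "high", "baht": "low"},
    {"gold": "high", "oil": "high", "baht": "low"},
    {"gold": "high", "oil": "low", "baht": "low"},
]
class_counts, feature_counts = train_naive_bayes(rows, "baht")
result = posterior(class_counts, feature_counts, {"gold": "low", "oil": "low"})
```

Note that no attribute is evaluated inside a branch created by another attribute, which is why the ranking of attributes can differ from the one a decision tree produces.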
  • 32. [Figure: Dependency Network, with nodes numbered 1 to 4.] After running Naïve-Bayes, the Dependency Network gives an order of attribute importance that differs from the Decision Tree. Crude Oil is the second most important attribute instead of Dow Jones. That is because Crude Oil is classified independently and directly against Baht/USD, as are the other attributes. The gold price is still the most important. The attribute profiles on the next page show the states of each attribute by data range, represented by color. Baht/USD is split into two cases, >= 38.32 and < 38.32, and the >= 38.32 case appears more reliable than the < 38.32 case because it has less segmentation. Therefore, those input attributes have a meaningful relationship to Baht/USD.
  • 33. Attribute Profiles The figure on the left shows each attribute against Baht/USD. A pure color indicates the highest probability. For example, the gold price is a very confident predictor: the blue state, which contains values below 543.44, supports Baht/USD >= 38.32 with 96% probability. In contrast, the same attribute and data range falls in the < 38.32 case with only 0.83% probability, where values greater than 543.44 appear in a 50:41 proportion instead. Analyzing the result Significantly, the gold price and crude oil move in the opposite direction to Baht/USD: when the gold price and crude oil price drop, Baht/USD goes up. Dow Jones and SET, by contrast, do not have a linear relationship with it (figures on pages 28 and 30), so they can fall either under or above the 38.32 zone. For instance, Dow Jones below and above 1053.85 falls in the >= 38.32 case with probability 68:32, and in the < 38.32 case with probability 34:66. Therefore, Dow Jones and SET are not confident predictors of the Baht/USD direction in the Naïve Bayes algorithm, which is why they have low importance in the dependency network.
  • 34. [Figure: CrudeOil against Baht/USD (38.32).] In this phase, I use tools to determine the accuracy of the models that were created, and examine the models to determine the meaning of the discovered patterns and how to apply them to the business. For example, a model may determine that Baht/USD drops if the gold price or crude oil goes up. Obviously, a dataset with a linear relationship is more meaningful than random data. Although the 10-year gold price and crude oil historical datasets may be the most appropriate input attributes for data mining, the same attribute might not contain any useful patterns over a different data range. For example, 1 year of crude oil historical data might contain a non-linear dataset, while SET might contain useful patterns instead.
  • 35. [Figure: One year Baht/USD - Crude Oil Historical.] [Figure: One year Baht/USD - SET Historical.] So it depends on what the business needs to approach. If the focus is only on the main scope, then algorithms with discretized content over a large historical dataset would be the best fit for this application. On the other hand, a small historical dataset with numeric content might be the best solution for an application that focuses on real numeric calculation, such as daily stock forecasting, because a large dataset takes a lot of time to produce a result. Even on a high-performance computer, a Neural Network in particular might take a whole month to learn and search for just a small pattern under multi-attribute input. Therefore, a good approach for a generic result is to build several models using different algorithms and then compare the accuracy of these models.
  • 36. The accuracy of an algorithm depends on the nature of the data, the data range, and the choice of an appropriate algorithm. You may need to repeat the data cleaning and transformation in order to derive more meaningful variables, and then determine the big picture of the dataset with the created algorithms. [Figure: Classification Matrix.] However, if the relationships among attributes are complicated, a neural network may perform better. It is very important to work with business analysts who have the proper domain knowledge to validate the discoveries before deploying the patterns discovered by data mining to production use. In this experiment, likewise, the big-picture pattern found by the Decision Tree and Naïve-Bayes algorithms, with gold price and crude oil as input attributes, needs to be validated before we move to the next step. To complete this project, however, I will assume those attributes are the most important for determining the Baht/USD direction as the big picture. For the next step, a Neural Network is used to learn and search a dataset derived from the output of the previous algorithms, attempting to fit the patterns found into a linear relationship.
  • 37. Recall from the beginning of this presentation that an unknown dataset pattern can be solved by the bottom-up technique. A Neural Network is a good approach for solving complicated data as long as the input attributes are the right ones. CONCEPT Basically, a neural network (NN) is an algorithm based on the operation of biological neural networks; in other words, it is an emulation of the human brain. It is designed to think like a human brain by learning from problems and later solving other, similar problems. In the human brain, action potentials are the electric signals that neurons use to convey information, and they travel through the net across what is called the synapse. As these signals are identical, the brain determines what type of information is being received based on the path that the signal took. The brain analyzes the patterns of signals being sent, and from that information it can interpret the type of information being received. To emulate that behavior, the artificial neural network has several components: the node plays the role of the neuron, and the weights are the links between the different nodes, so they are what the synapse is in the biological net. The input signal is modified by the weights and summed to obtain the total input value for a specific node (diagram next page). There are three layers in a NN: the input layer, which holds one node for each input variable; the hidden layer, where there could be several internal layers; and the output layer, which holds the result set. An activation function is applied to that total input to obtain the value of a particular node.
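The computation inside a single node, as described above, can be sketched as follows. The function names, the choice of `math.tanh` as a default activation, and the input and weight values are illustrative assumptions, not taken from the project.

```python
import math

def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    total_input = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total_input)

def step(total_input, threshold=0.0):
    """Fire (1.0) only when the total input exceeds the threshold."""
    return 1.0 if total_input > threshold else 0.0

# Two hypothetical, already-scaled inputs: one negative and one positive weight.
smooth = node_output([0.5, 0.2], [-0.8, 0.3], bias=0.1)
fired = node_output([0.5, 0.2], [-0.8, 0.3], bias=0.1, activation=step)
```

With these values the total input is negative, so the smooth activation returns a negative number and the step activation does not fire. A layer is simply many such nodes sharing the same inputs, each with its own weights.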
  • 38. [Figure: Neuron scheme and node scheme.] The diagram illustrates a neuron scheme: a neuron receives information from other neurons as input via synapses, while the connections between neurons form a branching network. Once the input is larger than a certain threshold, the neuron fires according to the information received. A node scheme works similarly: the perceptron takes a weighted sum of its inputs and sends the output to other nodes if the sum is greater than some adjustable threshold value. The inputs x1, x2, x3, ..., xm and connection weights w1, w2, w3, ..., wm are typically real values. If the feature xi tends to cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be negative. The perceptron consists of the weights, the summation processor, and an adjustable threshold processor, or bias input. A bias input may get more weight than the regular inputs, in which case it affects the firing of the activation function.
  • 39. There are several algorithms used in neural networks. Backpropagation is one of the most popular, and it is the one used in this project. Typically, what the backpropagation algorithm does is propagate backwards the error obtained in the output layer, found by comparing the calculated value in the nodes to the real or desired value. This propagation is made by distributing the error and modifying the weights, or links, between the previous and present nodes. Going backwards, the values of the nodes in the hidden layer can be modified, and so can the weights between the input and hidden layers, but not the values of the nodes in the input layer, as they are the values of the variables we are using. Once the algorithm gets to the input layer, it goes forward again with the new modified weights and calculates the results in the output layer again. This process is repeated until a minimum error is reached. [Figure: One node scheme (perceptron) with inputs GOLD (w1), SET (w2), and BIAS (w3), a BipolarSigmoid function f, and an output.] As the figure shows, there are two input attributes and one bias in the first layer. They pass forward with their weights to the perceptron, which sums the inputs and sends the result to the output layer. The output layer fires through the activation function. The whole process runs with 20 nodes in the first layer to produce one output layer, and the following steps describe how it works.
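As a rough sketch of the error-correction idea for the one-node scheme, the code below trains a single node with two inputs and a bias using a bipolar sigmoid and a gradient-style weight update. It is a simplification of backpropagation to one node, not the project's 20-node network. The steepness `alpha=2.0`, the learning rate, the epoch count, and the four training samples are all assumptions chosen for illustration.

```python
import math

def bipolar_sigmoid(x, alpha=2.0):
    """Squash x into (-1, 1); alpha (assumed 2.0) controls the steepness."""
    return 2.0 / (1.0 + math.exp(-alpha * x)) - 1.0

def bipolar_sigmoid_slope(y, alpha=2.0):
    """Derivative of bipolar_sigmoid, written in terms of its output y."""
    return alpha * (1.0 - y * y) / 2.0

def forward(inputs, weights, bias):
    """Weighted sum plus bias, passed through the bipolar sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return bipolar_sigmoid(net)

def total_squared_error(samples, weights, bias):
    return sum((t - forward(x, weights, bias)) ** 2 for x, t in samples)

def train_node(samples, learning_rate=0.1, epochs=200):
    """Repeatedly push the output error back into the weights and the bias."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = forward(inputs, weights, bias)
            delta = (target - out) * bipolar_sigmoid_slope(out)
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
    return weights, bias

# Hypothetical scaled samples: the first input lowers the target, the second raises it.
samples = [((1.0, 0.0), -0.5), ((0.0, 1.0), 0.5),
           ((1.0, 1.0), 0.0), ((0.0, 0.0), 0.0)]
error_before = total_squared_error(samples, [0.0, 0.0], 0.0)
weights, bias = train_node(samples)
error_after = total_squared_error(samples, weights, bias)
```

After training, the weight of the inhibiting input ends up negative and the weight of the exciting input ends up positive, which matches the sign convention described for the perceptron weights.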
  • 40. Learning Process
      • Split the data into 2 sets: 85% for training and 15% for validation.
      • Randomly select 20 values each for the gold price and SET weights from the training set.
      • Generate the weights between the nodes.
      • Compare the accuracy of the outputs against the actual data (validation set).
      • Calculate the learning errors.
      • Adjust for the output errors to improve the results.
      • Feed in a new batch of the training set and repeat the process until a minimum learning error is reached.
    Implementation
      • Gold price data range: 1062 – 1413
      • SET data range: 684 – 1047
      • 1 year data range: Jan-01-2010 to Dec-31-2010
      • 24 hours total learning process time.
      • Query statement from SQL Server
    The video shows how the learning process works as it keeps trying to recognize the pattern against the actual value and solve the problem with an equation (Internet connection required), or just follow this link: http://www.youtube.com/watch?v=7ghfX6kK5bo
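The first step of the learning process, the 85/15 split, might look like the sketch below. The function name, the fixed seed, and the decision to shuffle before splitting are assumptions; the slides do not say whether the rows were shuffled or split chronologically.

```python
import random

def split_dataset(rows, train_fraction=0.85, seed=0):
    """Return (training, validation) lists holding 85% and 15% of the rows."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # assumed: random rather than chronological
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical usage with 100 placeholder rows standing in for daily records.
training, validation = split_dataset(range(100))
```

For time-ordered data such as daily prices, a chronological split (training on the earlier rows, validating on the later ones) is a common alternative, since it avoids validating on days that sit between training days.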
  • 41. Performance Because the learning process takes a long time, this experiment was run for 24 hours, which gave a total error of 33.43 and an average error of 0.14. It would take only a few minutes to generate a result if the data range were a month or 10 days, but the prediction performance would go down as a result. [Figure: One year Baht/USD - Gold Price Result.] [Figure: One year Baht/USD - SET Result.] This validation gave a predicted Baht/USD of 33.01, an error of 0.16 compared with the actual Baht/USD of 33.17.
  • 42. Even though the 2009 values, a gold price of 1091.50 and a SET of 681.91, were not included in the data range used for learning, the NN still recognizes the similar pattern that occurred in 2010 and tries to generate a similar output. The pattern does not rely only on the gold price; SET also helps the NN classify it. For instance, 2009 and 2010 both had the same gold price of 1091.40 but different SET values of 686.41 and 784.38, so the Baht/USD result varies depending on the SET input too. [Figure: Predicted result vs. actual.] The learning error history demonstrates that the closer the error gets to zero, the more accurate the NN result. As the NN algorithm goes back and forth to find the weights that allow it to predict the output variable, the weights vary from the initial randomly generated values to the final ones, which give a total error of 33.43. Across the pairs of predicted and actual values in the learning history, the average error is 0.14, the minimum is 0.0002, and the maximum is 0.58.
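The error figures quoted above (total, average, minimum, maximum) can be reproduced from pairs of predicted and actual values with a helper like the one below. It assumes the errors are absolute differences, which the slides do not state explicitly. Only the first pair (33.01 predicted, 33.17 actual) comes from the presentation; the other two pairs are hypothetical.

```python
def error_summary(predicted, actual):
    """Summarize absolute errors between predicted and actual values."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "total": sum(errors),
        "average": sum(errors) / len(errors),
        "min": min(errors),
        "max": max(errors),
    }

# First pair is from the slides; the remaining pairs are made up for illustration.
summary = error_summary([33.01, 32.50, 31.90], [33.17, 32.48, 32.30])
```

If the reported average is the total divided by the number of pairs, a total of 33.43 and an average of 0.14 would imply roughly 240 pairs, which is about one year of daily values.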
  • 43. Neural Network implementation learning video (Internet connection required), or follow this link: http://www.youtube.com/watch?v=VRiMbG6XIpk
  • 44. Summary To answer the financial department's request to predict the Thai Baht against USD exchange rate, a Neural Network is the final stage of this experiment. It takes the classified input attributes derived from Decision Trees and Naïve Bayes, with the analysis carried out using a SQL database and SSAS, to reach the goal of predicting Baht/USD movement as numeric data. The experiment also covers data pattern recognition with several approaches, i.e. classification, segmentation, approximation, and backpropagation. References 3. Neural Networks on C#, by Andrew Kirillov 4. Delivering Business Intelligence, by Brian Larson 5. Neural Network, from Wikipedia 6. Back Propagation, from Wikipedia 7. Decision Tree, from Wikipedia 8. Naïve Bayes, from Wikipedia