Data Mining

Name: KRIENGSAK CHANINCHOMPOONUT
Date: December 10th, 2010
1

With the increased use of technology in virtually all areas of data
mining research, good decision making is a key to an organization's
strategic success. Data mining gives you access to the information
you need to make intelligent decisions about difficult business
problems. It identifies rules and patterns in data, so that you can
determine why things happen and predict what will happen in the
future. The top-down technique can be used when the data behave as a
function that can be calculated with an equation. In real-world
scenarios, however, complex data do not always give an accurate
outcome, because many cases cannot be solved with a mathematical
formula that tries to map unknown factors into an algorithm. The
bottom-up technique offers another solution, and the results from
the two approaches, top-down and bottom-up, can be cross-validated
against each other.
2

Top-Down technique
Because the known factors can be mapped into an equation, the next
numbers in this dataset are likely to be 0, 4, 7 and so on.
Bottom-Up technique
The dataset at the bottom is different: its unknown factors have to
be learned from the bottom up, because the data show no linear
proportion that an equation could solve. Instead, the points are
spread over the graph with no apparent direction. If we still used
an equation on this dataset, we would rarely or never detect any
pattern or relationship. That is why the bottom-up technique is the
efficient way: it learns the data and recognizes a pattern when a
similar one appears again in the dataset.
3

To answer various types of business questions, data mining helps you
find patterns and relationships in data that are not apparent to the
human eye. It analyzes the dataset with mathematical algorithms such
as decision trees, segmentation, clustering, association and time
series, here through Microsoft SQL Server technologies, and confirms
the discovered patterns so that predictions can be based on patterns
in the historical data. The valuable information found can be used in
applications such as financial applications, marketing and sales
forecasting, CRM and ERP.
The main topic of this project is using the database as the
foundation for appropriate models and algorithms based on pattern
recognition or detection in the historical data.
4

To carry out the project, the following development tools are
used:
Application
Microsoft SQL Database Server (MSSQL)
Microsoft SQL Server Analysis Services (SSAS)
Microsoft SQL Server Integration Services (SSIS)
Microsoft Visual Studio C#
Microsoft Decision Tree Algorithm
Microsoft Naïve-Bayes Algorithm
Neural Network Algorithm
Hardware
Server running the SQL Database engine and Analysis Services
PC for gathering source data daily and supplying it to MSSQL
Server running SSIS to update the SSAS server daily
PC for C# coding, database, SSAS and data mining design
5

This project is implemented in 5 phases:
Phase I : Identify the business problems
Phase II : Data source collection
Phase III : Database transformation
Phase IV : Data mining model building
Phase V : Model Assessment
6

[Diagram: system data flow]
Data source: data converting, which converts and supplies data to the MSSQL Database Server
MSSQL Database Server, via SSIS, to the SSAS Database Server: produces data mining
MSSQL Database Server to the Neural Network: queries data from the database; the NN produces data mining
7

To identify the business need, the experiment in this project is a
financial application that asks the following questions:
To help the finance department manage a currency swap, which
factors most affect the US Dollar to Thai Baht exchange rate?
And what is the next day's exchange rate likely to be?
For the rest of this presentation, this inquiry is defined as
follows:
Fundamental : The questions above, as asked by the finance department.
8

To answer the questions from the first phase, the appropriate data
need to be collected in this phase. Ideas can come from people with
relevant experience, who help to narrow the huge volume of raw data
down to meaningful data instead of gathering everything, including
data that carry no meaning.
However, data mining techniques tend to require more historical
data than standard models and, in the case of neural networks, can
be difficult to interpret.
9

Contents                           Data Source
Economic statistical indicators    Bank of Thailand
Daily Thai stock index             The Stock Exchange of Thailand
Daily Thai bank interest rate      Bank of Thailand
Daily exchange rates               Bank of Thailand
Daily gold trading price           Bloomberg; Thai Gold Trader
Daily crude oil prices             Bloomberg
Daily world stock index            Bloomberg
10

Database Tables
Once all the expected source data had been obtained, the data
transformation began. I wrote C# scripts that grab the data from
the raw sources and feed it into the MSSQL database server, which
is updated automatically every day.
32 Tables
Only the appropriate tables are included in this project.
A view named usdVSVariables was created over the selected tables
of the Fundamental database.
11

12
SQL Code
-- Exchange rate joined to gold, crude oil, stock index and
-- interest rate data by DateKey
SELECT DISTINCT TOP (100) PERCENT
    dbo.ExchangeRates.DateKey,
    dbo.GoldMarket.DollarPerOunce,
    dbo.Energy.Value AS CrudeOil,
    dbo.ExchangeRates.BuyingSightBill,
    StockValue.SETValue,
    StockValue.DJValue,
    InterestMRR.MRR,
    DepositRate.OneYearMax
FROM dbo.ExchangeRates
INNER JOIN dbo.Energy
    ON dbo.ExchangeRates.DateKey = dbo.Energy.DateKey
INNER JOIN dbo.GoldMarket
    ON dbo.Energy.DateKey = dbo.GoldMarket.DateKey
INNER JOIN (
    -- SET and Dow Jones index values side by side for each date
    SELECT T.DateKey, T.Value AS SETValue, D.Value AS DJValue
    FROM dbo.StockMarket AS T
    INNER JOIN dbo.StockMarket AS D ON T.DateKey = D.DateKey
    WHERE (T.Symbol = 'SET') AND (D.Symbol = 'DowJones')
) AS StockValue
    ON dbo.GoldMarket.DateKey = StockValue.DateKey
INNER JOIN (
    SELECT DateKey, BankName, MRR
    FROM dbo.LoanInterestRate
    WHERE (BankName = 'Bangkok Bank')
) AS InterestMRR
    ON StockValue.DateKey = InterestMRR.DateKey
INNER JOIN (
    SELECT DateKey, BankName, OneYearMax
    FROM dbo.DepositInterestRate
    WHERE (BankName = 'Bangkok Bank')
) AS DepositRate
    ON InterestMRR.DateKey = DepositRate.DateKey
WHERE (dbo.ExchangeRates.DateKey > 19991231)
  AND (dbo.ExchangeRates.Currency = 'USD')
  AND (dbo.GoldMarket.DollarPerOunce > 0)

15
SSAS Sample (Internet connection required) Or follow this link
http://www.youtube.com/watch?v=xjEy-zNE9P8

At this point, I divide the demonstration into two sections:
Fundamental : Predicting the USD-Thai Baht exchange rate
Customers : Identifying prospective (potential) customers
Let us start with the Fundamental data mining implementation. The
standard approach to modeling how the fundamental factors drive the
exchange rate is to use all associated attributes as input
variables, predict Thai Baht per dollar as the result, and analyze
which factors are most influential.
Mining Structure
Data source from the SSAS server
Training and testing data split 70:30
Data type : discretized
Key : DateKey
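
The 70:30 split and the discretized data type listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_cases` and `discretize`, the seed and the sample values are hypothetical, and the sketch does not claim to reproduce how SSAS holds out test cases or chooses bucket boundaries.

```python
import bisect
import random

def split_cases(cases, train_fraction=0.7, seed=0):
    """Shuffle the cases and split them into training and testing sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def discretize(value, boundaries):
    """Return the index of the bucket that a continuous value falls into.

    boundaries must be sorted; a value equal to a boundary goes to the
    upper bucket.
    """
    return bisect.bisect_right(boundaries, value)

# Hypothetical cases keyed by DateKey (values invented for illustration).
date_keys = [20100101 + i for i in range(10)]
training, testing = split_cases(date_keys)
print(len(training), len(testing))
print(discretize(37.5, [38.32]), discretize(38.32, [38.32]))
```

With one boundary the continuous value is reduced to two states, which is the form the classification algorithms in the following slides work with.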
16

To show which variables matter most for predicting Thai Baht per dollar, I use a
hybrid approach that exploits the strengths of each algorithm: a Decision Tree and
Naïve Bayes classify which variables to use as input to the Neural Network
algorithm. The decision tree can detect rules such as “if A then B”. It does not
handle continuous values well, however: it cannot express “if A then 2.5” and
instead splits the node as “if A > 20 then B”. That is why the Neural Network
takes over the numeric outcomes, so that its results can be compared against the
Decision Tree. In this way the forecast of Thai Baht per dollar is based on the
associated variables and should approximate the next day's rate more accurately.
[Diagram: the Decision Tree and Naïve Bayes take Variables 1 to 6 as input and classify them; the selected variables (here Variable 2 and Variable 6) become the input to the Neural Network]
17

The associated variables can be found by survey, by external
research, or by discussion with people who have relevant
experience. The advantage of forecasting with several factors
instead of depending on only one is that they cross-validate
each other, which gives a more reliable and precise
interpretation of the outcome.
Variable           Description                        Usage
SETValue           Thai stock index (SET)             Input
DJValue            Dow Jones index                    Input
CrudeOil           Crude oil, dollars per barrel      Input
DollarPerOunce     Gold price, dollars per ounce      Input
BuyingSightBill    Thai Baht per USD currency rate    Output (predicted)
DateKey            Date dimension                     Key column
18

To get the whole picture of how each attribute relates to the predicted
value, we typically need to retrieve the full history of those attributes
from the database. This gives an idea of the main pattern in the big cycle
and determines the ceiling and floor of the data range. Later we can narrow
the data range to look for a pattern in a small cycle based on the big cycle.
[Charts: 10-year data range; 1-year data range]
19

[Chart: 10 Years Gold Price (DollarPerOunce) and Baht Per USD Currency Relationship Graph, from Jan-01-2000 to Dec-31-2010]
20

[Chart: 10 Years Crude Oil USD/Barrel (CrudeOil) and Baht Per USD, from Jan-01-2000 to Dec-31-2010]
21

[Chart: 10 Years Thai SET Index (SETValue) and Baht Per USD, from Jan-01-2000 to Dec-31-2010]
22

[Chart: 10 Years Dow Jones Index (DJValue) and Baht Per USD, from Jan-01-2000 to Dec-31-2010]
23

A decision tree can help identify which factors to consider
and how each factor has historically been associated with
different outcomes.
Concept : A decision tree is a classification algorithm that makes
predictions based on the relationships between input columns in a
dataset by creating a series of splits, or nodes, in the tree. The
algorithm adds a node to the model every time an input column is
found to be significantly correlated with the predictable column.
To capture the big cycle of the data range, in this scenario the
algorithm discretizes the predictable attribute into 2 buckets.
After processing, the decision tree helps to determine which
variables most affect whether the value falls under 38.32 or at
and above 38.32.
Attribute : Baht per USD
Bucket 1 : < 38.32
Bucket 2 : >= 38.32
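
To make the idea of splitting on the most strongly correlated input column concrete, here is a minimal Python sketch. It scores each discretized attribute by entropy-based information gain, which is one common split criterion and is used here as an assumption; it is not necessarily the scoring method that the Microsoft Decision Trees algorithm applies. The four sample rows are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after the split."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def best_split(rows, attributes, target):
    """Pick the attribute whose split gives the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Invented toy cases: the target is the Baht/USD bucket.
rows = [
    {"Gold": "low",  "SET": "low",  "Baht": ">= 38.32"},
    {"Gold": "low",  "SET": "high", "Baht": ">= 38.32"},
    {"Gold": "high", "SET": "low",  "Baht": "< 38.32"},
    {"Gold": "high", "SET": "high", "Baht": "< 38.32"},
]
print(best_split(rows, ["Gold", "SET"], "Baht"))
```

In this toy data the gold bucket separates the two outcomes completely while the SET bucket tells nothing, so the first split is made on gold.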
24

Dependency Network
Displays the relationships between the attributes, from the least to the
most important contributor to the predictable attribute. The center node of
the chart represents the predictable attribute, and the surrounding nodes
represent the input attributes. Number 1 is the most important factor and
4 is the least.
As the diagram shows, SET Value is the least influential factor, so it
disappears first when the network is filtered toward the strongest links,
followed by Crude Oil, DJ Value and Dollar Per Ounce, in that order.
Accordingly, the decision tree automatically creates tree nodes in order
from most to least important.
[Diagram: dependency network with input nodes ranked 1 to 4]
25

Trees Nodes
A decision tree is a classification model that holds all cases at the root node.
The root splits on the most influential factor into child nodes (here Value – vEnergy),
each child node splits again on the next most important factor, and so on until no
more cases can be split. The final, least important nodes are called leaf nodes, as
in the diagram below.
In the diagram, the pink part of each histogram represents value < 38.32 and the
green part represents value >= 38.32. Each node splits into 3 DollarPerOunce nodes,
shown with their data range and a color that indicates the category.
26

Histogram
Each node may contain a single factor or several factors, and carries
statistics, supporting cases and a probability, represented by a histogram.
The histogram shows the share of the node's cases that fall in each bucket.
For example, traveling from the root to the node DollarPerOunce < 543.445,
the histogram is mostly green, with 906 cases and a probability of 92.65%,
which implies that this node determines a Baht/USD value greater than 38.32.
Even though DJValue was split into greater than 10532 and less than 10532,
both nodes also support Baht/USD > 38.32. The only difference is that the
cases were grouped into two categories, and a case can fall into either node.
The DJValue and Baht/USD relationship chart helps to make this clearer.
27

[Chart: 10 Years Dow Jones Index (DJValue) and Baht Per USD, from Jan-01-2000 to Dec-31-2010, marked at Baht/USD 38.32 and DJValue 10532 to show the DJValue >= 10532 zone and the DJValue < 10532 zone]
28

After the decision tree is processed, nodes with a weak histogram do not influence the
predicted value; only the nodes with the purest color are used for interpretation.
As a result, gold price is the most influential factor in determining the Baht/USD
direction. If the gold price goes up, Baht/USD is likely to go down; if the gold price
goes down, Baht/USD is likely to go up.
The dependency network helps to confirm that gold price is the most important factor
in the tree algorithm. This can be checked by looking at the next level below the gold
price node 543.44-862.84, which splits into 3 nodes of the Thai SET index. Although all
three have strong histograms, they carry little meaning, because the process repeats
recursively for each child and the children cover the whole SET range, so the value can
fall in any SET zone.
29
[Chart: DollarPerOunce against Baht/USD, marked at 38.32 and Gold > 543.44]

30
However, for Baht/USD under 38.32 with a gold price of 543.44 to 862.84, the 3 SET
nodes show that this scenario can occur.
The same observation applies to the node with a gold price below 543.44, as explained
by the figures on pages 27-28. For instance, if the gold price drops below 543.44,
Baht/USD is likely to go up whatever the Dow Jones range.
[Chart: SETValue against Baht/USD, marked at 38.32, with SET zones 1 to 3]

31
A decision tree can classify a dataset into segments and point out which variable has the
greatest impact on the predicted value. Its disadvantage is that it is built from univariate
splits, starting at the root and continuing at each node. Because every split partitions
the data recursively from the root toward the leaves, there is usually very little data left
at a leaf to base a decision on. For instance, recall from the previous figure that the gold
price node 543.44 to 862.84 splits into 3 SET nodes, but those nodes cannot specify an exact
SET range; they cover every zone, because their decisions depend recursively on their
parent node.
Naïve-Bayes is different: each attribute is evaluated independently and directly against
the predicted value, not recursively through other nodes. A classifier can be built at the
leaf nodes. For instance: Are small companies with annual profits of more than $500K a
bad credit risk? Are large companies with negative annual profits still a good credit
risk? Unlike a decision tree, Naïve-Bayes does not consider combinations of attributes. So
if the decision tree segments the data into the essential parts of the big picture, each
segment represented by a leaf can then be described by a Naïve-Bayes model.
It depends on how the business problem is defined. If we only want the big picture of the
data, a decision tree provides enough information. If we need to focus on, or explore,
other attributes that do not depend on the big picture, we need Naïve-Bayes.
In this case the gold price node is the big picture, since every path from the root to a
leaf passes through it. The leaf nodes contain little data, but that data may still be
important if we process it with Naïve-Bayes at the leaf.
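
The independence described above, where each attribute is weighed directly against the predicted value, can be sketched as follows. This is a simplified Python illustration under stated assumptions: the function names, the add-one smoothing and the four toy rows are all hypothetical, and the sketch is not the Microsoft Naïve-Bayes implementation.

```python
from collections import Counter, defaultdict

def train(rows, attributes, target):
    """Count classes and, per class, the values seen for each attribute."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)   # (attribute, class) -> value counts
    states = defaultdict(set)             # attribute -> distinct values seen
    for row in rows:
        for attr in attributes:
            value_counts[(attr, row[target])][row[attr]] += 1
            states[attr].add(row[attr])
    return class_counts, value_counts, states

def predict(model, case, smoothing=1.0):
    """Return the normalized probability of each class for one case."""
    class_counts, value_counts, states = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total                 # prior probability of the class
        for attr, value in case.items():
            # Each attribute contributes its own factor, independent of the others.
            seen = value_counts[(attr, cls)][value]
            score *= (seen + smoothing) / (n + smoothing * len(states[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Invented toy cases: the target is the Baht/USD bucket.
rows = [
    {"Gold": "low",  "SET": "low",  "Baht": ">= 38.32"},
    {"Gold": "low",  "SET": "high", "Baht": ">= 38.32"},
    {"Gold": "high", "SET": "low",  "Baht": "< 38.32"},
    {"Gold": "high", "SET": "high", "Baht": "< 38.32"},
]
model = train(rows, ["Gold", "SET"], "Baht")
print(predict(model, {"Gold": "low", "SET": "low"}))
```

Because the factors are simply multiplied, an attribute that a tree would only reach deep in one branch still contributes to every prediction here.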

32
Dependency Network
[Diagram: Naïve-Bayes dependency network with input nodes ranked 1 to 4]
After running Naïve-Bayes, the Dependency Network ranks the attributes differently
from the Decision Tree: Crude Oil, not Dow Jones, is the second most important
attribute. This is because Crude Oil, like every other attribute, is classified
independently and directly against Baht/USD. Gold price is still the most
important attribute.
The attribute profiles on the next page show the states of each attribute by data
range, represented by color.
Baht/USD is split into two cases, >= 38.32 and < 38.32. The case >= 38.32 appears
more reliable than the case < 38.32 because it has less segmentation. The input
attributes therefore have a meaningful relationship to Baht/USD.

33
Attribute Profiles
The figure on the left shows each attribute against Baht/USD. A pure color
indicates the highest probability. Gold price is a very confident indicator:
the blue state, values below 543.44, supports Baht/USD >= 38.32 with 96%
probability. In contrast, the same state falls in the case < 38.32 with only
0.83% probability; there the values greater than 543.44 are split in a 50:41
proportion instead.
Analyzing the result
Gold price and crude oil clearly move in the opposite direction to Baht/USD: when the
gold price and crude oil price drop, Baht/USD goes up. Dow Jones and SET, by contrast,
do not have a clearly linear relationship with Baht/USD (figures on pages 28 and 30),
so they can fall in either the zone under or the zone above 38.32. For instance, Dow
Jones below and above 1053.85 falls in the case >= 38.32 with probability 68:32, and
in the case < 38.32 with probability 34:66. Dow Jones and SET are therefore not
confident indicators of the Baht/USD direction in the Naïve Bayes algorithm, which is
why they have low importance in the dependency network.

34
[Chart: CrudeOil against Baht/USD, marked at 38.32]
In this phase, I use tools to determine the accuracy of the models that were created,
and examine the models to determine the meaning of the discovered patterns and how to
apply them to the business. For example, a model may determine that Baht/USD drops if
the gold price or crude oil price goes up.
A dataset with a linear relationship is more meaningful than random data. The 10-year
gold price and crude oil history may be the most appropriate input attributes for data
mining, but the same attribute may contain no useful patterns over a different data
range. For example, 1 year of crude oil history might contain a nonlinear dataset,
while SET might contain useful patterns instead.

35
[Charts: One year Baht/USD - Crude Oil Historical; One year Baht/USD - SET Historical]
So it depends on what the business needs. If the focus is only on the main scope,
algorithms with discretized content over a large historical dataset are the best fit.
On the other hand, a small historical dataset with numeric content may be the best
solution for an application that focuses on actual numeric calculation, such as daily
stock forecasting, because a large dataset takes a long time to process, even on a
high-performance computer. This is especially true for a Neural Network, which might
take a whole month to learn and search for just a small pattern across multiple input
attributes.
A good approach for a general result is therefore to build several models using
different algorithms and then compare their accuracy.

36
The accuracy of an algorithm depends on the nature of the data, the data range and the
choice of an appropriate algorithm. You may need to repeat the data cleaning and
transformation in order to derive more meaningful variables, then determine the big
picture of the dataset with the algorithms created. However, if the relationships
among attributes are complicated, a neural network may perform better.
[Figure: Classification Matrix]
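
A classification matrix counts how often each actual bucket was predicted as each bucket, and accuracy is the share of cases on its diagonal. The following Python sketch shows the calculation; the function names and the four sample outcomes are hypothetical and are not results from this project.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count the cases for every (actual, predicted) pair of buckets."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Share of cases where the predicted bucket equals the actual one."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())

# Invented outcomes for four test cases.
actual    = [">= 38.32", ">= 38.32", "< 38.32", "< 38.32"]
predicted = [">= 38.32", "< 38.32",  "< 38.32", "< 38.32"]
matrix = classification_matrix(actual, predicted)
print(matrix)
print(accuracy(matrix))
```

Computing the same matrix for each candidate model on the same test cases gives a direct way to compare their accuracy.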
It is essential to work with business analysts who have the proper domain knowledge to
validate the discoveries before the patterns found by data mining are deployed to
production.
In this experiment, likewise, the big-picture pattern found by the Decision Tree and
Naïve-Bayes algorithms, with gold price and crude oil as the input attributes, needs to
be validated before we move to the next step.
To complete this project, however, I assume that those attributes are the most important
in determining the big-picture Baht/USD direction. In the next step, a Neural Network
learns and searches the dataset derived from the output of the previous algorithms,
attempting to form the patterns found into a linear relationship.

37
Recall from the beginning of this presentation that an unknown dataset pattern can be
solved with the bottom-up technique. A Neural Network is a good approach for
complicated data, as long as the input attributes are the right ones.
CONCEPT
A neural network (NN) is an algorithm based on the operation of biological neurons; in
other words, it emulates the human brain. It is designed to learn from problems and
later solve other, similar problems.
In the human brain, action potentials are the electric signals that neurons use to convey
information. They travel through the net across what are called synapses. Because these
signals are identical, the brain determines what type of information is being received
from the path the signal took. The brain analyzes the patterns of signals being sent, and
from that information it interprets the type of information received.
To emulate that behavior, the artificial neural network has several components. The node
plays the role of the neuron, and the weights are the links between nodes, corresponding
to the synapses in the biological net. The input signals are multiplied by the weights
and summed to obtain the total input value for a node (diagram on the next page).
There are three kinds of layer in a NN: the input layer, which holds one node for each
input variable; the hidden layer, of which there can be several; and the output layer,
which holds the result set. An activation function is applied to the total input to
obtain the value of a particular node.

38
[Diagrams: neuron scheme; node scheme (perceptron with inputs)]
The neuron scheme shows a neuron receiving information from other neurons as input via
synapses, while the connections between neurons form branches, or a network. Once the
input exceeds a certain threshold, the neuron fires according to the information
received.
The node scheme works similarly. The perceptron takes a weighted sum of its inputs and
sends an output to the other nodes if the sum is greater than an adjustable threshold
value. The inputs x1, x2, x3, ..., xm and connection weights w1, w2, w3, ..., wm are
typically real values. If feature xi tends to cause the perceptron to fire, the weight
wi is positive; if feature xi inhibits the perceptron, the weight wi is negative.
The perceptron consists of the weights, the summation processor, and an adjustable
threshold processor, or bias input. A bias input may carry more weight than the regular
inputs, in which case it affects the firing of the activation function.

39
There are several learning algorithms for neural networks. Backpropagation, one of the
most popular, is the one used in this project.
The backpropagation algorithm propagates the error obtained at the output layer
backwards, comparing the calculated value at the nodes with the real or desired value.
The error is distributed and used to modify the weights, or links, between the previous
and present nodes. Going backwards, the values of the nodes in the hidden layers can be
modified, and so can the weights between the input and hidden layers, but not the values
of the input nodes, since they are the values of the variables we are using. Once the
algorithm reaches the input layer, it goes forward again with the modified weights and
recalculates the results at the output layer. This process is repeated until a minimum
error is reached.
[Diagram: one node scheme. Inputs GOLD (w1), SET (w2) and BIAS (w3) feed a perceptron with a BipolarSigmoid activation function f, which produces the output]
As the diagram shows, there are two input attributes and one bias in the first layer.
They pass forward through their weights to the perceptron, which sums the inputs and
sends the result to the output layer.
The output layer fires through the activation function. The whole process runs 20 nodes
in the first layer to produce one output layer, and the following steps show how it
works.
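
The one-node scheme on this slide (two inputs, one bias, a bipolar sigmoid activation) can be written out as a short Python sketch. The bipolar sigmoid is taken here in its common form f(x) = 2 / (1 + e^(-alpha x)) - 1, with outputs between -1 and 1. The steepness alpha = 2, the sample weights and the assumption that gold and SET have already been scaled to small numbers are all illustrative choices, not values from the project.

```python
import math

def bipolar_sigmoid(x, alpha=2.0):
    """Squash any real input into the open interval (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-alpha * x)) - 1.0

def node_output(inputs, weights, bias_weight, bias_input=1.0):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    total = sum(x * w for x, w in zip(inputs, weights))
    total += bias_input * bias_weight
    return bipolar_sigmoid(total)

# Hypothetical scaled inputs (gold, SET) and weights w1, w2, w3.
gold, set_index = 0.4, -0.2
w1, w2, w3 = 0.5, -0.3, 0.1
print(node_output([gold, set_index], [w1, w2], w3))
```

A positive weight pushes the node toward firing (output near 1) and a negative weight inhibits it (output near -1), matching the description of the perceptron weights.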

40
Learning Process
•Split the data into 2 sets: 85% for training and 15% for validation.
•Randomly draw 20 values each for the gold price and SET weights from the training set.
•Generate the weights between the nodes.
•Compare the outputs with the actual data (validation set) to see how accurate they are.
•Calculate the learning errors.
•Adjust for the output errors to improve the results.
•Supply a new batch from the training set and repeat the process
until a minimum learning error is reached.
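
The learning steps above can be sketched as a small backpropagation loop in Python. This is a simplified illustration, not the project's C# implementation: the data are synthetic (an invented linear target over two scaled inputs), the learning rate, seed and number of passes are arbitrary, and every function name is hypothetical. Only the 85:15 split, the 20 first-layer nodes, the single output and the bipolar sigmoid come from the slides.

```python
import math
import random

ALPHA = 2.0  # assumed steepness of the bipolar sigmoid

def f(x):
    return 2.0 / (1.0 + math.exp(-ALPHA * x)) - 1.0

def f_prime(y):
    """Derivative of f, written in terms of the node output y = f(x)."""
    return ALPHA * (1.0 - y * y) / 2.0

def init_network(n_inputs, n_hidden, rng):
    """Random starting weights; the last weight of each node is its bias weight."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(net, x):
    hidden, output = net
    h = [f(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in hidden]
    y = f(sum(w * v for w, v in zip(output, h + [1.0])))
    return h, y

def train_step(net, x, target, rate):
    """One forward pass, then propagate the output error back to the weights."""
    hidden, output = net
    h, y = forward(net, x)
    delta_out = (target - y) * f_prime(y)
    deltas_h = [delta_out * output[j] * f_prime(h[j]) for j in range(len(hidden))]
    for j, hj in enumerate(h + [1.0]):
        output[j] += rate * delta_out * hj
    for j, ws in enumerate(hidden):
        for i, xi in enumerate(x + [1.0]):
            ws[i] += rate * deltas_h[j] * xi

def mean_error(net, cases):
    return sum(abs(t - forward(net, x)[1]) for x, t in cases) / len(cases)

rng = random.Random(1)
cases = []
for _ in range(100):                       # synthetic (gold, SET) -> rate cases
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    cases.append((x, 0.5 * x[0] - 0.3 * x[1]))
cut = int(len(cases) * 0.85)               # 85% training, 15% validation
training, validation = cases[:cut], cases[cut:]

net = init_network(2, 20, rng)
error_before = mean_error(net, validation)
for _ in range(200):                       # repeat until the error is small
    for x, t in training:
        train_step(net, x, t, rate=0.05)
error_after = mean_error(net, validation)
print(error_before, error_after)
```

The validation error is measured on cases the network never trained on, which is the role of the 15% set in the list above.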
Implementation
•Gold price data range : 1062 – 1413
•SET data range : 684 – 1047
•1 year data range Jan-01-2010 to Dec-31-2010
•24 Hours total learning process time.
•Query statement from SQL Server
Here is how the learning process works: it keeps trying to recognize the
pattern against the actual values and to solve the problem with an equation
(Internet connection required). Or just follow this link:
http://www.youtube.com/watch?v=7ghfX6kK5bo

41
Performance
Because the learning process takes a long time, it was limited to 24 hours for
this experiment, which gave a total error of 33.43 and an average error of 0.14.
With a data range of a month or 10 days, the result would be generated in only
a few minutes, but the accuracy would drop.
[Charts: One year Baht/USD – Gold Price Result; One year Baht/USD – SET Result]
This validation gave a predicted Baht/USD of 33.01, an error of 0.16 compared
with the actual value of 33.17.

42
Even though the 2009 values, gold price 1091.50 and SET 681.91, were not included
in the data range used for learning, the NN still recognized the similar pattern
that occurred in 2010 and tried to generate a similar output.
The pattern does not rely on gold price alone; SET also helps the NN classify it.
For instance, 2009 and 2010 both had a gold price of 1091.40 but different SET
values, 686.41 and 784.38, so the Baht/USD result varies with the SET input too.
[Figure: predicted result vs actual]
The learning error history shows that the closer the error gets to zero, the more
accurate the NN result. The NN algorithm goes back and forth to find the weights
that allow it to predict the output variable, so the weights vary from the initial
randomly generated values to the final ones, which correspond to a total error of
33.43. Across the pairs of predicted and actual values in the learning history, the
average error is 0.14, the minimum 0.0002 and the maximum 0.58.
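
The error figures quoted above (total, average, minimum and maximum) can be computed from pairs of predicted and actual values as in the Python sketch below. It assumes that "error" means the absolute difference between predicted and actual Baht/USD, which the slides do not state explicitly; the function name is hypothetical, and the only pair used is the 33.01 versus 33.17 validation example from the Performance slide.

```python
def error_summary(predicted, actual):
    """Summarize the absolute differences between predicted and actual values."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "total": sum(errors),
        "average": sum(errors) / len(errors),
        "min": min(errors),
        "max": max(errors),
    }

# The single validation pair quoted on the Performance slide.
summary = error_summary([33.01], [33.17])
print(round(summary["total"], 2))
```

Under this reading, a total error of 33.43 with an average of 0.14 would correspond to roughly 240 compared pairs, which is in line with one year of daily data.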

43
Implement Neural Network learning video (Internet connection Required)
Or follow this link http://www.youtube.com/watch?v=VRiMbG6XIpk

44
Summary
To answer the finance department's inquiry about predicting the Thai Baht
against USD exchange rate, a Neural Network is the final step of this
experiment. It takes the input attributes classified by the Decision Tree and
Naïve Bayes, which were analyzed using the SQL database and SSAS, and predicts
the Baht/USD movement as numeric data. The experiment also covers data pattern
recognition with several approaches, i.e. classification, segmentation,
approximation and backpropagation.
