1. Bridging the gap between data and knowledge with The Unscrambler X
Discover how data mining can benefit you.
Marion Cuny
CAMO Software AS
www.camo.com
2. 2
Content
1. Improve your work time efficiency
2. Combine data from many sources for enhanced
understanding of complex systems
3. Understand the structure of your data and locate the root
cause of process/product deviations
4. Design more efficient processes and products
5. Predict quality at an early stage and classify raw
material/batch attributes
6. Conclusions
4. 4
Organized and annotated projects
and audit trail
Project Navigator
Know the project progression by looking at the:
• Project organization,
• Audit trail, and
• Information and notes displayed for each object (Info and Notes boxes).
5. 5
Preview the results of your pretreatment
Save time in optimizing the parameters of your pretreatments before performing them.
6. 6
Conclusion
• Organized data save you a lot of time!
What did I/my colleague do last month with this
dataset?
What was the plot that was showing the results?
• Preview of results: don’t do things that don’t give good results.
8. 8
Import data from various sources
• Unscrambler matrices
• ASCII Text
• Excel (copy-paste and drag and drop are also possible)
• Matlab
• Spectral formats
• Database (Oracle, SQL, ...)
10. 10
System Integration Partners
• Integration for online monitoring and control:
– Siemens SiPAT
– Optimal SynTQ
– Symbion
– ABB XPAT & FTSW integration
– GE Fanuc
13. 13
Combine them in the analysis
• X and Y matrices can be in separate datasets
• Aggregate matrices
14. 14
Conclusion
• See relationships and create models between any kind of data:
– Different types
– Different stages of the process
and get a clear understanding of what is going on.
16. Fundamentals of Multivariate Statistical Process Control
• The ellipse is known as the Hotelling's T² ellipse and represents a 95% confidence region.
• There are regions in the multivariate control chart that are forbidden in the univariate charts.
• There are also regions that are in control in the univariate sense but out of control in a multivariate sense.
[Figure: bivariate control chart of Variable 1 vs. Variable 2 with the Hotelling's T² ellipse]
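The last point can be checked numerically. The following is a minimal sketch, assuming two correlated variables and made-up reference data; `hotelling_t2` is a hypothetical helper written for this illustration and is not part of The Unscrambler X. It computes T² = dᵀ S⁻¹ d for a new observation, where d is the deviation from the reference mean and S is the sample covariance matrix. Comparing T² against a 95% limit would also require an F-distribution quantile, which is left out here.

```python
def hotelling_t2(points, new_point):
    """Return Hotelling's T^2 of new_point relative to 2-variable reference data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = new_point[0] - mx, new_point[1] - my
    # d^T S^{-1} d with the 2x2 inverse written out explicitly.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Made-up reference observations in which the two variables rise together.
reference = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8),
             (5, 5.1), (6, 5.9), (7, 7.2), (8, 7.8)]

# Both points lie inside the range seen for each variable on its own.
t2_on_trend = hotelling_t2(reference, (5.0, 5.2))   # follows the correlation
t2_off_trend = hotelling_t2(reference, (2.0, 7.0))  # breaks the correlation

print(f"on trend:  T2 = {t2_on_trend:.1f}")
print(f"off trend: T2 = {t2_off_trend:.1f}")
```

The second point would pass two separate univariate range checks, yet its T² is far larger because it violates the correlation between the variables.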
17. 17
Design Space: As defined by ICH Q8
The multidimensional combination and interaction of input variables and process parameters that have been demonstrated to provide assurance of quality
[Figure labels: Design Space, Desired State, Undesired State]
18. 18
NIR Spectroscopy for monitoring the
granulation process
• Acquire NIR spectra during the process
• Goal: Understand batch behavior, and follow process
trajectories with PCA
High Shear Granulator (Glatt TMG) with diffuse reflectance probe and NIR spectrometer collecting spectra at 2-second intervals
19. 19
High Shear Wet Granulation
• Granulation process is important to:
• increase particle size
• enhance compressibility
• improve hydrophilicity
• improve product homogeneity
• The process has three stages:
• Dry mix phase - lactose & starch (2 minutes)
• Liquid addition phase – PVP and water (1-2 minutes)
• Granulation (3-5 minutes)
20. 20
Granulation batches studied
• Diffuse reflection NIR spectra collected at 2-3 second
intervals for 15 batches, giving 130-180 spectra per batch
• Each spectrum 1100-2200 nm (1101 variables)
• First three batches run at target conditions
– Some process changes in terms of addition rates,
impeller speeds, granulation time in other batches
• PCA model to find patterns and groupings, and model the
granulation process
21. 21
First derivative NIR spectra of HSG process
Color coded to highlight the stages of the process:
Mixing of lactose & starch
Liquid Addition – water & PVP
Granulation
OH peaks increase on addition
Change in CH bands due to binders
22. 22
PCA analysis: line plot of PC score 1
Batches 4 & 5 differ: no PVP was added during the liquid addition phase
Batch 6: target conditions with longer granulation time
23. 23
PCA score plots of 3 batches run under
target conditions
Granulation – end point
Dry mixing phase
Liquid addition phase
25. 25
Conclusion
• The structure of a data set is revealed by PCA.
• Note: sometimes you need pretreatment to reveal the structure accurately.
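The effect of pretreatment on a PCA can be sketched with NumPy. This is an illustrative example on synthetic spectra, not the granulation data: the band position, the noise levels, and the `pca_scores` helper are all assumptions made for the sketch, and a plain first difference stands in for a smoothed first derivative. Each synthetic spectrum carries a random constant baseline offset; differencing along the wavelength axis cancels that offset before the PCA is computed.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Mean-center X (samples x variables); return scores, loadings, explained variance ratio."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]


rng = np.random.default_rng(0)
wavelengths = np.linspace(1100, 2200, 111)            # nm, 10 nm steps
band = np.exp(-((wavelengths - 1940) / 40.0) ** 2)    # synthetic absorption band
level = np.linspace(0.0, 1.0, 30)                     # simulated process progression
offset = rng.normal(0.0, 0.5, size=(30, 1))           # constant baseline shift per spectrum
noise = rng.normal(0.0, 0.001, size=(30, 111))
X = level[:, None] * band + offset + noise

# Pretreatment: first difference along the wavelength axis removes the constant offset.
X_diff = np.diff(X, axis=1)

scores, loadings, explained = pca_scores(X_diff)
corr = np.corrcoef(scores[:, 0], level)[0, 1]
print(f"PC1 explains {explained[0]:.1%}; |corr(PC1 score, level)| = {abs(corr):.3f}")
```

Plotting `scores[:, 0]` against the sample index would give a line plot comparable in spirit to a PC1 score trajectory for one batch.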
27. 27
Principle of DoE
• Perform the smallest number of experiments needed to cover the design space efficiently.
[Figure: two panels showing experiments placed over factors X1 and X2, each axis running from min to max]
28. 28
Why do we use DoE compared to the “scientific approach”?
• One variable at a time approach:
In order to establish a relationship between cause and effect, each cause must be investigated separately, all other conditions being fixed.
• The limit of the one variable at a time approach:
[Figure: two X1 vs. X2 plots, with the actual optimum marked]
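The limit of the one-variable-at-a-time approach can be reproduced with a small sketch. The response function below is hypothetical and was chosen to contain a strong interaction between X1 and X2; the helper names are illustrative only. A two-level full factorial design visits all four corners, while the one-variable-at-a-time search changes one factor at a time from a starting point.

```python
from itertools import product


def full_factorial(levels_per_factor):
    """Return every combination of the given factor levels."""
    return list(product(*levels_per_factor))


def response(x1, x2):
    """Hypothetical response with a strong X1*X2 interaction (coded units)."""
    return x1 + x2 + 2 * x1 * x2


def one_variable_at_a_time(start, levels_per_factor):
    """Optimize each factor in turn while holding the others fixed."""
    best = list(start)
    for i, levels in enumerate(levels_per_factor):
        trials = []
        for level in levels:
            trial = best.copy()
            trial[i] = level
            trials.append((response(*trial), trial))
        best = max(trials)[1]  # keep the best setting found for this factor
    return tuple(best)


levels = [(-1, 1), (-1, 1)]  # min and max of X1 and X2 in coded units

ovat_best = one_variable_at_a_time((-1, -1), levels)
design = full_factorial(levels)
doe_best = max(design, key=lambda run: response(*run))

print("OVAT:", ovat_best, "response", response(*ovat_best))
print("DoE: ", doe_best, "response", response(*doe_best))
```

Starting from (-1, -1), each single-factor change lowers the response, so the one-variable-at-a-time search stays put, whereas the factorial design includes the corner where both factors are high.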
29. 29
The logical approach
1. Set the goal of the experimentation (model type). Ex: Maximize the quality of our cookies: Quadratic model
2. Select the variables to include in the design (X). Ex: Cooking time, temperature, chocolate content
3. Select the response variables (Y). Ex: Stability, preference, cost
4. Select the appropriate design. Ex: BBD, CCD
31. 31
Define variables tab
All the variables are defined in the same table.
Easy definition thanks to the tick box menu and radio buttons.
32. 32
Choose the design tab
Automatic selection of the best-suited design
Designs stated as actions
Information on the selected design
33. 33
Design details
Select the resolution of the design depending on your goal and the number of experiments to run.
36. 36
Summary
The calculation of the power for the two response variables shows that this design is not appropriate for detecting a difference of 0.6 in the preference, as the power is below 0.8.
We can instead look at the least significant difference (LSD) that the design can detect.
47. 47
Visualizing groups
• PCA score plot
• Clustering
Make a model to predict the group:
SIMCA, PLS-DA, SVM and LDA
48. 48
SIMCA Classification
• Soft Independent Modeling of Class Analogies:
– Make a PCA model for each class;
– Project new samples onto the model.
[Figure: separate PCA models (PC1, PC2) for samples from groups A, B, and C, annotated with the center of the model, the maximum distance to the model (Si), and the maximum leverage for the model (Hi)]
49. 49
SIMCA Classification
• Soft Independent Modeling of Class Analogies:
– Make a PCA model for each class;
– Project new samples onto the model.
[Figure: separate PCA models (PC1, PC2) for samples from groups A, B, and C]
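The two SIMCA steps can be sketched in NumPy. This is a simplified illustration, not the implementation in The Unscrambler X: the data are synthetic, the helper names are hypothetical, and the acceptance limit (three times the root mean square training residual) is an assumption standing in for a statistically derived limit on the distance to the model. The leverage limit (Hi) is not checked here.

```python
import numpy as np


def residual_distance(X, mean, P):
    """Distance from each sample to the PCA model plane (norm of the residual)."""
    Xc = np.atleast_2d(X) - mean
    E = Xc - Xc @ P @ P.T          # part of the sample not explained by the PCs
    return np.linalg.norm(E, axis=1)


def fit_class_model(X, n_components=1):
    """Fit one PCA model to the samples of a single class."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components].T
    resid = residual_distance(X, mean, P)
    # Assumed acceptance limit: 3 x RMS residual of the training samples.
    limit = 3.0 * np.sqrt(np.mean(resid ** 2))
    return {"mean": mean, "P": P, "limit": limit}


def classify(x, models):
    """Return the names of all class models that accept sample x."""
    return [name for name, m in models.items()
            if residual_distance(x, m["mean"], m["P"])[0] <= m["limit"]]


rng = np.random.default_rng(1)
# Synthetic classes: each varies along one direction, plus small noise.
A = rng.normal(size=(20, 1)) * np.array([1.0, 1.0, 0.0]) + rng.normal(0, 0.05, (20, 3))
B = (np.array([10.0, 0.0, 0.0])
     + rng.normal(size=(20, 1)) * np.array([0.0, 0.0, 1.0])
     + rng.normal(0, 0.05, (20, 3)))

models = {"A": fit_class_model(A), "B": fit_class_model(B)}

like_a = classify(np.array([0.5, 0.5, 0.0]), models)    # lies on class A's direction
foreign = classify(np.array([5.0, -5.0, 5.0]), models)  # far from both models

print("sample like A accepted by:", like_a)
print("foreign sample accepted by:", foreign)
```

Because each class has its own model, a sample can be accepted by one model, by several, or by none, which is what makes the classification "soft".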
50. 50
Example dataset
NIR data of:
• 83 samples: 67 calibration and 16 test
• 2600 variables
• 5 groups but only 4 for creating the models
52. 52
Classification
• PCA model on independent classes
53. 53
Classification of the new samples
All the foreign samples are rejected by all models.
The MCC sample is not recognized by its own model.
54. 54
The MCC sample is detected as an outlier because its leverage is too high.
55. 55
PLS Discriminant Analysis
• Each class is represented by a 0 / 1 variable:
– Build a regression model with those variables as responses (PLS1 for 1 or 2 classes, else PLS2);
– Make predictions for new samples: close to 1 means “member”, close to 0 “non-member”.

                        A   B   C
Samples from group A    1   0   0
Samples from group B    0   1   0
Samples from group C    0   0   1

[Figure: predicted vs. measured plots (axes from 0 to 1) for Model A, Model B, and Model C, used for classification]
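For the two-class case, the idea can be sketched with a small PLS1 (NIPALS) regression on a 0/1 response. This is a minimal illustration on synthetic data; `pls1_fit` and `pls1_predict` are hypothetical helpers written for this example, not functions of The Unscrambler X, and the 0.5 cut-off for membership is an assumption.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; return (x_mean, y_mean, regression vector b)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)      # weight vector
        t = E @ w                   # scores
        tt = t @ t
        p = E.T @ t / tt            # X loadings
        qa = f @ t / tt             # y loading
        E = E - np.outer(t, p)      # deflate X
        f = f - qa * t              # deflate y
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, b


def pls1_predict(model, X):
    x_mean, y_mean, b = model
    return (np.atleast_2d(X) - x_mean) @ b + y_mean


rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
base = np.sin(2 * np.pi * grid)                  # shared synthetic "spectrum"
bump = np.exp(-((grid - 0.5) / 0.1) ** 2)        # feature present only in the class

y = np.repeat([0.0, 1.0], 10)                    # 0 = non-member, 1 = member
X = base + y[:, None] * bump + rng.normal(0, 0.05, (20, 50))

model = pls1_fit(X, y, n_components=2)

new_member = base + bump + rng.normal(0, 0.05, 50)
new_non_member = base + rng.normal(0, 0.05, 50)
pred_member = pls1_predict(model, new_member)[0]
pred_non_member = pls1_predict(model, new_non_member)[0]

print(f"member-like sample:     predicted {pred_member:.2f}")
print(f"non-member-like sample: predicted {pred_non_member:.2f}")
```

With three or more classes, one 0/1 column per class would be stacked into a response matrix and modelled together with PLS2, as the slide describes.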
56. 56
Example data set
Spectra
Category variables: 2 values: 0 & 1
62. 62
Conclusions
• MVA can be used for classification / characterization as well as quantification purposes
• Samples are either assigned to a group (or not) or given a specific predicted value, and you get diagnostic tools to understand the results
• Diagnostics made at an early stage enable you to correct for deviations and decrease the cost of waste and reprocessing.
64. 64
Objectives and Tools
Objective
• Process Understanding
• Identification and understanding of raw materials
• Product and Process Development
• Root Cause Analysis
• Prediction of Quality
The Unscrambler X
• Design of Experiments (DoE)
• Statistical Hypothesis Tests
• Exploratory Data Analysis
• Regression modelling
• Classification
• Prediction
Workflow: Define, Design, Analyze, Implement, Improve
65. 65
General Conclusions
• Multivariate analysis:
– gives you a global picture.
– is an understanding tool.
– is an improving tool.
66. 66
Benefits
• Multivariate analysis in The Unscrambler X benefits:
– Team work (project architecture, notes, info)
– Reporting work (informative plots, report generator)