The Wisconsin Breast Cancer Database (WBCD) is a widely studied and publicly available data set from the field of breast cancer diagnostics. The creators of this database, Wolberg, Street, Heisey and Mangasarian, made an important contribution with their research towards automating diagnostics with image processing and machine learning.
Beyond the medical field, many statisticians and computer scientists have proposed a wide range of classification models based on WBCD. Such new methods have continuously raised the benchmark in terms of diagnostic performance.
Our white paper reevaluates the Wisconsin Breast Cancer Database within the framework of Bayesian networks, which, to our knowledge, has not been done before. We demonstrate how the BayesiaLab software can quickly and simply create a Bayesian network model whose performance is on par with virtually all existing models that have been developed from WBCD over the last 15 years.
Breast Cancer Diagnostics with Bayesian Networks
Interpreting the Wisconsin Breast Cancer Database with BayesiaLab
Stefan Conrady, stefan.conrady@conradyscience.com
Dr. Lionel Jouffe, jouffe@bayesia.com
May 20, 2013
Table of Contents
Case Study & Tutorial
Introduction 4
Background 6
Wisconsin Breast Cancer Database 6
Notation 7
Model Development 8
Data Import 8
Unsupervised Learning 13
Model 1: Markov Blanket 16
Model 1: Performance 21
K-Folds Cross-Validation 23
Model 2: Augmented Markov Blanket 25
Model 2a: Performance 28
Structural Coefficient 32
Model 2b: Augmented Markov Blanket (SC=0.3) 38
Model 2b: Performance 39
Conclusion 40
Model Inference 41
Interactive Inference 42
Adaptive Questionnaire 43
Target Interpretation Tree 46
Summary 52
Appendix
Framework: The Bayesian Network Paradigm 53
Acyclic Graphs & Bayes’s Rule 53
Compact Representation of the Joint Probability Distribution 54
References 55
Contact Information 56
Bayesia USA 56
Breast Cancer Diagnostics with Bayesian Networks
ii
www.bayesia.us | www.bayesia.sg | www.bayesia.com
Bayesia Singapore Pte. Ltd. 56
Bayesia S.A.S. 56
Copyright 56
Case Study & Tutorial
Introduction
Data classification is one of the most common tasks in the field of statistical analysis and countless methods
have been developed for this purpose over time. A common approach is to develop a model based on
known historical data, i.e. where the class membership of a record is known, and to use this generalization
to predict the class membership for a new set of observations.
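As a minimal illustration of this train-then-predict pattern, the sketch below implements a nearest-neighbor classifier in plain Python. The data points are invented for illustration, and the method is deliberately simple; it is not the approach used later in this paper.

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Classify x by the label of its closest training record (1-NN)."""
    best_label, best_dist = None, math.inf
    for xi, yi in zip(train_X, train_y):
        d = math.dist(xi, x)
        if d < best_dist:
            best_dist, best_label = d, yi
    return best_label

# Known historical records: features and known class membership
train_X = [[1, 1], [1, 2], [8, 9], [9, 8]]
train_y = ["benign", "benign", "malignant", "malignant"]

# Predict the class membership of a new observation
print(nearest_neighbor_predict(train_X, train_y, [8, 8]))  # malignant
```

The generalization here is trivial (proximity in feature space), but the workflow, fitting on records with known labels and applying the fitted model to unseen records, is the same one followed throughout this study.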
Applications of data classification permeate virtually all fields of study, including the social sciences, engineering, biology, etc. In the medical field, classification problems often appear in the context of disease identification, i.e. making a diagnosis about a patient's condition. The medical sciences have a long history of developing a large body of knowledge that links observable symptoms with known types of illnesses. It is the physician's task to use the available medical knowledge to make inferences based on the patient's symptoms, i.e. to classify the medical condition in order to enable appropriate treatment.
Over the last two decades, so-called medical expert systems have emerged, which are meant to support phy-
sicians in their diagnostic work. Given the sheer amount of medical knowledge in existence today, it should
not be surprising that significant benefits are expected from such machine-based support in terms of medical
reasoning and inference.
In this context, several papers by Wolberg, Street, Heisey and Mangasarian became much-cited examples. They proposed an automated method for the classification of Fine Needle Aspirates1 through image processing and machine learning with the objective of achieving a greater accuracy in distinguishing between
malignant and benign cells for the diagnosis of breast cancer. At the time of their study, the practice of vis-
ual inspection of FNA yielded inconsistent diagnostic accuracy. The proposed new approach would increase
this accuracy reliably to over 95%. This research was quickly translated into clinical practice and has since
been applied with continued success.
As part of their studies in the late 1980s and 1990s, the research team generated what became known as the
Wisconsin Breast Cancer Database, which contains measurements of hundreds of FNA samples and the as-
sociated diagnoses. This database has been extensively studied, even outside the medical field. Statisticians
and computer scientists have proposed a wide range of techniques for this classification problem and have
continuously raised the benchmark for predictive performance.
Our objective with this paper is to present Bayesian networks as a highly practical framework for working
with this kind of classification problem. We intend to demonstrate how the BayesiaLab software can extremely quickly, and relatively simply, create Bayesian network models that achieve the performance of the best custom-developed models, while only requiring a fraction of the development time.

1 Fine needle aspiration (FNA) is a percutaneous (“through the skin”) procedure that uses a fine gauge needle (22 or 25 gauge) and a syringe to sample fluid from a breast cyst or remove clusters of cells from a solid mass. With FNA, the cellular material taken from the breast is usually sent to the pathology laboratory for analysis.
Furthermore, we wish to illustrate how Bayesian networks can help researchers and practitioners generate a
deeper understanding of the underlying problem domain. Beyond merely producing predictions, we can use
Bayesian networks to precisely quantify the importance of individual variables and employ BayesiaLab to
help identify the most efficient path towards a diagnosis.
BayesiaLab’s speed of model building, its excellent classification performance, plus the ease of interpretation
provide researchers with a powerful new tool. Bayesian networks and BayesiaLab have thus become a driver
in accelerating research.
Background
To provide context for this study, we quote Mangasarian, Street and Wolberg (1994), who conducted the original research on breast cancer diagnosis with digital image processing and machine learning:
Most breast cancers are detected by the patient as a lump in the breast. The majority of breast
lumps are benign, so it is the physician’s responsibility to diagnose breast cancer, that is, to distin-
guish benign lumps from malignant ones. There are three available methods for diagnosing breast
cancer: mammography, FNA with visual interpretation and surgical biopsy. The reported sensitivity, i.e. the ability to correctly diagnose cancer when the disease is present, of mammography varies from 68% to 79%, of FNA with visual interpretation from 65% to 98%, and of surgical biopsy close to 100%.
Therefore mammography lacks sensitivity, FNA sensitivity varies widely, and surgical biopsy, al-
though accurate, is invasive, time consuming and costly. The goal of the diagnostic aspect of our
research is to develop a relatively objective system that diagnoses FNAs with an accuracy that ap-
proaches the best achieved visually.
Wisconsin Breast Cancer Database
This breast cancer database was created through the clinical work of Dr. William H. Wolberg at the Univer-
sity of Wisconsin Hospitals in Madison. As of 1992, Dr. Wolberg had collected 699 instances of patient
diagnoses in this database, consisting of two classes: 458 benign cases (65.5%) and 241 malignant cases
(34.5%).
The following eleven attributes2 are included in the database:
1. Sample code number
2. Clump Thickness (1 - 10)
3. Uniformity of Cell Size (1 - 10)
4. Uniformity of Cell Shape (1 - 10)
5. Marginal Adhesion (1 - 10)
6. Single Epithelial Cell Size (1 - 10)
7. Bare Nuclei (1 - 10)
8. Bland Chromatin (1 - 10)
9. Normal Nucleoli (1 - 10)
10. Mitoses (1 - 10)
11. Class (benign/malignant)
2 “Attribute” and “variable” are used interchangeably throughout the paper.
Attributes #2 through #10 were computed from digital images of fine needle aspirates (FNA) of breast masses. These features describe the characteristics of the cell nuclei in the image. Attribute #11, Class, was established via subsequent biopsies or via long-term monitoring of the tumor.
We will not go into detail here regarding the definition of the attributes and their measurement. Rather, we
refer the reader to papers referenced in the bibliography.
The Wisconsin Breast Cancer Database is available to any interested researcher from the UC Irvine Machine
Learning Repository.3 We use this database in its original format without any further transformation, so
our results can be directly compared to dozens of methods that have been developed since the original
study.
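For readers who wish to work with the raw file outside BayesiaLab, the sketch below shows one way to parse it in Python. The column names follow the attribute list above; in the raw file, missing values are marked with "?" and Class is coded 2 (benign) and 4 (malignant). The two sample rows are included only so the sketch is self-contained.

```python
import csv
import io

# Column names taken from the attribute list; the raw file has no header row.
COLUMNS = ["Sample code number", "Clump Thickness", "Uniformity of Cell Size",
           "Uniformity of Cell Shape", "Marginal Adhesion",
           "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
           "Normal Nucleoli", "Mitoses", "Class"]

def parse_wbcd(text):
    """Parse WBCD rows; '?' marks a missing value (mapped to None)."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        values = [None if v == "?" else int(v) for v in row]
        records.append(dict(zip(COLUMNS, values)))
    return records

# Two illustrative rows; the second has a missing Bare Nuclei value.
sample = "1000025,5,1,1,1,2,1,3,1,1,2\n1057013,8,4,5,1,2,?,7,3,1,4\n"
rows = parse_wbcd(sample)
print(rows[1]["Bare Nuclei"], rows[1]["Class"])  # None 4
```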
Notation
To clearly distinguish between natural language, software-specific functions and study-specific variable
names, the following notation is used:
• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and printed in bold type. You
can look up such terms in the BayesiaLab Library (library.bayesia.com) for more details.
• The names of variables, attributes, nodes, and node states are capitalized and italicized.
3 UC Irvine Machine Learning Repository website: http://archive.ics.uci.edu/ml/
Model Development
Data Import
Our modeling process begins with importing the database,4 which is formatted as a text file with comma-
separated values. Therefore, we start with Data | Open Data Source | Text File.
The Data Import Wizard then guides us through the required steps. In the first dialogue box of the Data
Import Wizard, we click on Define Typing and specify that we wish to set aside a Test Set from the data-
base.
4 If we exclude the variable Sample code number, this database can also be used with the publicly-available evaluation
version of BayesiaLab, which is limited to a maximum of ten nodes. Deleting this variable does not affect the workflow
or the results of the analysis.
Following common practice, we will randomly select 20% of the 699 records as the Test Set and, consequently, the remaining 80% will serve as our Learning Set.5
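The split logic can be sketched in a few lines of Python. The seed value is arbitrary; fixing it mirrors BayesiaLab's Fixed Seed option and makes the split repeatable. Note that, depending on rounding, 20% of 699 records yields 139 or 140 test cases.

```python
import random

def split_learning_test(n_records, test_fraction=0.2, seed=42):
    """Randomly partition record indices into a learning and a test set.
    A fixed seed makes the random split repeatable."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]  # learning, test

learning, test = split_learning_test(699)
print(len(learning), len(test))  # 559 140
```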
In the next step, the Data Import Wizard will suggest the data format for each variable. Attributes 2 through 10 are identified as continuous variables and Class is read as a discrete variable. Only for the first variable, Sample code number, do we have to specify Row Identifier, so it is not mistaken for a continuous predictor variable.
In the next step, the Information Panel reports that we have a total of 16 missing values in the entire dataset. We can also see that the column Bare Nuclei is labeled with a small question mark, indicating the presence of missing values in this particular column.
5 “Learning/Test Set” and “Learning/Test Sample” are used interchangeably in this paper.
We now need to specify the type of Missing Values Imputation. Given the small size of the dataset, and the
small number of missing values, we will choose the Structural EM method.6
A critical element of the data import process is the discretization of all continuous variables. On the next
screen we click Select All Continuous to apply the same discretization algorithm across all continuous vari-
ables. Alternatively, we could choose the type of discretization individually by variable. However, we will
not discuss this option any further in this paper.
As the objective of this exercise is classification, we choose the Decision Tree algorithm from the drop-down
menu in the Multiple Discretization panel. This discretizes each variable for a maximum information gain
with respect to the Target Class.
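BayesiaLab's discretization algorithm is built into the software, but the underlying principle, choosing cut points that maximize information gain with respect to the class, can be sketched in plain Python. The function below finds a single best threshold; a tree-based discretizer applies the same search recursively to produce several intervals. The toy data are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Cut point on a continuous variable that maximizes information gain
    about the class label, as a decision-tree discretizer would choose it."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best_cut = gain, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut, best_gain

# Low measurement values mostly benign, high values mostly malignant
values = [1, 2, 2, 3, 7, 8, 9, 10]
labels = ["b", "b", "b", "b", "m", "m", "m", "m"]
print(best_threshold(values, labels))  # (5.0, 1.0)
```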
6 For more details on missing values imputation with Bayesian network, see Conrady and Jouffe (2012).
Bayesian networks are entirely non-parametric, probabilistic models, and for their estimation they require a
certain minimum number of observations. To help us with the selection of the number of discretization lev-
els (or Intervals), we use the heuristic of five observations per parameter and probability cell. Given that we
have a relatively small database with only 560 observations,7 three discretization intervals for each variable
appear to be an appropriate choice. If we used a higher number of Intervals, we would need more observa-
tions for a reliable estimation of the parameters.
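The heuristic is simple arithmetic and can be checked directly. With three intervals per node, a node with one three-state parent has a conditional probability table of 3 × 3 = 9 cells, requiring about 45 observations; a second such parent triples the requirement. The function below is only a back-of-the-envelope sketch of this rule of thumb.

```python
def min_observations(states_child, states_parents, obs_per_cell=5):
    """Heuristic minimum sample size: five observations per probability
    cell of the node's conditional probability table (CPT)."""
    cells = states_child
    for s in states_parents:
        cells *= s  # CPT size grows multiplicatively with each parent
    return cells * obs_per_cell

# Three intervals per node, one parent with three intervals:
print(min_observations(3, [3]))     # 45
# Two such parents already require 135 observations:
print(min_observations(3, [3, 3]))  # 135
```

With only 560 learning records, the requirement escalates quickly as intervals or parents are added, which is why three intervals per variable is a sensible choice here.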
Upon clicking Finish, we will immediately see a representation of the newly imported database in the form
of a fully unconnected Bayesian network in the Graph Panel. Each variable is now represented as a blue
node in the graph panel of BayesiaLab.
7 560 cases are in the training set (80%) and 139 are in the test set (20%).
The question mark symbol, which is associated with the Bare Nuclei node, indicates that there are missing
values for this variable. Hovering over the question mark with the mouse pointer while pressing the “i” key
will show the number of missing values.
Optionally, BayesiaLab can display an import report summarizing the obtained discretizations for all vari-
ables.
Unsupervised Learning
When exploring a new domain, we generally recommend performing Unsupervised Learning on the newly
imported database. This is also the case here, even though our principal objective is predictive modeling, for
which Supervised Learning will later be the main tool.
Learning | Unsupervised Structural Learning | EQ initiates the EQ Algorithm, which is suitable for the initial
review of the database. For larger databases with significantly more variables, the Maximum Weight Span-
ning Tree is a very fast algorithm and can be used instead.
Upon learning, the initial Bayesian network looks like this:
In its “raw” form, the crossing arcs make this network somewhat tricky to read. BayesiaLab has a number
of layout algorithms that can quickly “disentangle” such a network and produce a much more user-friendly
format.
We can select View | Automatic Layout or alternatively use the shortcut “P”.
Now we can visually review the learned network structure and compare it to our own domain knowledge.
This allows for a “sanity check” of the database and the variables, and it may highlight any inconsistencies.
Beyond visually inspecting the network structure, BayesiaLab allows us to visualize the quantitative part of
this network. To do this, we first need to switch into the Validation Mode by clicking on the highlighted
button in the lower left-hand corner of the Graph Panel, or alternatively by using the “F5” key as a shortcut.
We can now display the Pearson Correlation between the nodes that are directly linked in the graph by se-
lecting Analysis | Visual | Pearson’s Correlation from the menu.
Each arc’s thickness is now proportional to the Pearson Correlation between the connected nodes. Also, the
blue and red colors indicate positive and negative correlations respectively. Any unexpected sign of correla-
tions would thus become apparent very quickly. In our example, we only have positive correlations and thus
all arcs are blue.
Additionally, callouts indicate that further information can be displayed. We can opt to display this
numerical information via View | Display Arc Comments.
This function is also available via a button in the menu:
Model 1: Markov Blanket
Now that we have performed an initial review of the dataset with the Unsupervised Learning step, we can
return to the Modeling Mode by clicking on the corresponding button in the lower left-hand corner of the screen or by using the shortcut “F4”.8
This allows us to proceed to the modeling stage. Given our objective of predicting the state of the variable
Class, i.e. benign versus malignant, we will define Class as the Target Variable by right-clicking on the node
and selecting Set as Target Variable from the contextual menu. Alternatively, we can double-click on Class
while holding the shortcut “T” pressed. We need to specify this explicitly, so the subsequent Supervised
Learning algorithm can use Class as the dependent variable.
This setting is confirmed by the “bullseye” appearance of the new Target Node.
8 We will mostly omit further references to switching between Modeling Mode (F4) and Validation Mode (F5). The
required modes can generally be inferred from the context.
Upon this selection, all Supervised Learning algorithms become available under Learning | Supervised Learning.
In many cases, the Markov Blanket algorithm is a good starting point for a predictive model. This algorithm
is extremely fast and can even be applied to databases with thousands of variables and millions of records,
even though database size is not a concern in this particular study.
Upon learning the Markov Blanket for Class, and once again applying the Automatic Layout, the resulting
Bayesian network looks as follows:
Markov Blanket Definition
The Markov Blanket for a node A is the set of nodes composed of A’s parents, its children, and its children’s other parents (spouses). The Markov Blanket of the node A contains all the variables which, if we know their states, will shield the node A from the rest of the network. This means that the Markov Blanket of a node is the only knowledge needed to predict the behavior of that node. Learning a Markov Blanket selects relevant predictor variables, which is particularly helpful when there is a large number of variables in the database. In fact, this can also serve as a highly efficient variable selection method in preparation for other types of modeling, e.g. neural networks.
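The definition above (parents, children, spouses) is purely structural and can be sketched directly in Python. The graph below is a hypothetical example, not a network from this study; the DAG is encoded as a mapping from each node to the list of its parents.

```python
def markov_blanket(dag, node):
    """Return the Markov Blanket of `node`: its parents, its children, and
    its children's other parents (spouses). `dag` maps node -> parent list."""
    parents = set(dag.get(node, []))
    children = {c for c, ps in dag.items() if node in ps}
    spouses = {p for c in children for p in dag[c]} - {node}
    return parents | children | spouses

# Hypothetical structure: E -> A, A -> C <- B, C -> D
dag = {"A": ["E"], "B": [], "C": ["A", "B"], "D": ["C"], "E": []}
print(sorted(markov_blanket(dag, "A")))  # ['B', 'C', 'E']
```

In this toy graph, the blanket of A consists of its parent E, its child C, and the spouse B; given their states, D carries no additional information about A.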
This network suggests that Class has a direct probabilistic relationship with all variables except Marginal Adhesion and Single Epithelial Cell Size, which are both disconnected. The lack of a connection with the Target indicates that these nodes are independent of the Target given the nodes in the Markov Blanket.
Beyond distinguishing between predictors (connected nodes) and non-predictors (disconnected nodes), we
can further examine the relationship versus the Target Node Class by highlighting the Mutual Information
of the arcs connecting the nodes. This function is accessible within the Validation Mode via Analysis | Vis-
ual | Arcs’ Mutual Information.
Note
We can see on the graph learned earlier with the EQ algorithm that Uniformity of Cell Shape is the
node that makes these two nodes conditionally independent of Class.
We will also go ahead and immediately select View | Display Arc Comments.
The thickness of the arcs is now proportional to the Mu-
tual Information, i.e. the strength of the relationship be-
tween the nodes. Intuitively, Mutual Information measures
the information that X and Y share: it measures how much
knowing one of these variables reduces our uncertainty
about the other. For example, if X and Y are independent,
then knowing X does not provide any information about Y
and vice versa, so their Mutual Information is zero. At the other extreme, if X and Y are identical then all
information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa.
Formal Definition of Mutual Information

I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) · log( p(x,y) / (p(x) p(y)) )
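The definition can be computed directly from empirical frequencies. The sketch below uses base-2 logarithms (bits); the two test cases reproduce the extremes described above, zero Mutual Information for independent variables and full shared information for identical ones.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired samples via empirical
    probabilities p(x), p(y) and p(x,y)."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
               for (x, y), c in p_xy.items())

# Identical variables: MI equals the entropy of X (1 bit for a fair coin)
print(round(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]), 3))  # 1.0
# Independent variables: MI is zero
print(round(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]), 3))  # 0.0
```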
In the top part of the comment box attached to each arc, the Mutual Information of the arc is
shown. Expressed as a percentage and highlighted in blue, we see the relative Mutual Informa-
tion in the direction of the arc (parent node ➔ child node). And, at the bottom, we have the
relative Mutual Information in the opposite direction of the arc (child node ➔ parent node).
Model 1: Performance
As we are not equipped with specific domain knowledge about the variables, we will not further interpret
these relationships but rather run an initial test regarding the Network Performance. We want to know how
well this Markov Blanket model can predict the states of the Class variable, i.e. Benign versus Malignant.
This test is available via Analysis | Network Performance | Target.
Using our previously defined Test Set for validating our model, we obtain the following, rather encouraging
results:
Of the 88 Benign cases of the test set, 3 were incorrectly identified, which corresponds to a false positive rate of 3.41%. More importantly though, of the 51 Malignant cases, all were identified correctly (true positives) with no false negatives. The overall performance can be expressed as the Total Precision, which is computed as the total number of correct predictions (true positives + true negatives) divided by the total number of cases in the Test Set, i.e. (85 + 51) ÷ 139 = 97.84%.
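This computation can be reproduced from the confusion matrix. The sketch below encodes the counts reported above, with `confusion[predicted][actual]` holding the number of cases in each cell.

```python
def total_precision(confusion):
    """Share of correct predictions: the diagonal of the confusion matrix
    divided by the total case count. confusion[predicted][actual] = count."""
    correct = sum(confusion[c][c] for c in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

# Test-set results reported above: 3 of 88 Benign cases misclassified,
# all 51 Malignant cases identified correctly.
confusion = {"Benign":    {"Benign": 85, "Malignant": 0},
             "Malignant": {"Benign": 3,  "Malignant": 51}}
print(round(100 * total_precision(confusion), 2))  # 97.84
```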
As the selection of the Learning Set and the Test Set during the data import process is random, BayesiaLab
may learn slightly different networks based on different Learning Sets after each data import. Hence, your
own network performance evaluation could deviate from what is shown above, unless you chose the same
Fixed Seed for the random number generator when you defined Data Typing during the data import proc-
ess.
K-Folds Cross-Validation
To mitigate the sampling artifacts that may occur in a one-off test, we can systematically learn networks on
a sequence of different subsets and then aggregate the test results. Analogous to the original papers on this
topic, we will perform K-Folds Cross Validation, which will iteratively select K different Learning Sets and
Test Sets and then, based on those, learn the networks and test their performance.
The Cross Validation can then be started via Tools | Cross Validation | Targeted Evaluation | K-Folds.
We use the same learning algorithm as before, i.e. the Markov Blanket, and we choose 10 as the number of
sub-samples to be analyzed. Of the total dataset of 699 cases, each of the ten iterations will create a Test Set
of 69 randomly drawn samples, and use the remaining 630 as the Learning Set. This means that BayesiaLab
learns one network per Learning Set and then tests the performance on the respective Test Set.
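BayesiaLab constructs the folds internally; as a generic sketch of the partitioning logic, each record is assigned to exactly one test fold, and the remaining folds form the corresponding Learning Set. With 699 records and K = 10, the fold sizes necessarily differ by one record.

```python
import random

def k_fold_indices(n_records, k, seed=42):
    """Yield (learning, test) index lists for K-Folds cross-validation.
    Every record appears in exactly one test fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i, test in enumerate(folds):
        learning = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield learning, test

sizes = [(len(l), len(t)) for l, t in k_fold_indices(699, 10)]
print(sizes[0], sizes[-1])  # (629, 70) (630, 69)
```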
The summary, including the synthesized results, is shown below.
These results confirm the good performance of this model. The Total Precision is 97%, with a false negative rate of 2%. This means 2% of the cases were predicted as Benign while they were actually Malignant.
Clicking Comprehensive Report produces a summary, which can also be saved in HTML format. This is
convenient for subsequent editing, as the generated HTML file can be opened and edited as a spreadsheet.
Sampling Method: K-Folds
Learning Algorithm: Markov Blanket
Target: Class

Value                 Benign    Malignant
Gini Index            33.95%    64.59%
Relative Gini Index   98.50%    98.55%
Mean Lift             1.42      2.04
Relative Lift Index   99.74%    99%

Occurrences           Benign (458)   Malignant (241)
Benign (446)          441            5
Malignant (253)       17             236

Reliability           Benign (458)   Malignant (241)
Benign (446)          98.88%         1.12%
Malignant (253)       6.72%          93.28%

Precision             Benign (458)   Malignant (241)
Benign (446)          96.29%         2.07%
Malignant (253)       3.71%          97.93%

R: 0.93817485358
R2: 0.88017205588

Relative Gini Index Mean: 98.53%
Relative Lift Index Mean: 99.37%
Total Precision: 96.85%
As our Markov Blanket modeling is already performing at a level comparable to the models that have been
published in the literature, we might be tempted to conclude our analysis at this point. However, we will
attempt to see whether further performance improvements are possible.
Model 2: Augmented Markov Blanket
BayesiaLab offers an extension to the Markov Blanket algorithm, namely the Augmented Markov Blanket,
which performs an Unsupervised Learning Algorithm on the nodes in the Markov Blanket. This allows
identifying influence paths between the predictor variables and can potentially help improve the prediction
performance.
This algorithm can be started via Learning | Supervised Learning | Augmented Markov Blanket.
As expected, the resulting network is somewhat more complex than the standard Markov Blanket.
If we save the original Markov Blanket and the new Augmented Markov Blanket under different file names,
we can use Tools | Compare | Structure to highlight the differences between both. Given that the addition of
three arcs is immediately visible, this function may appear as overkill for our particular example. However, in more complex situations, Structure Comparison can be rather helpful, and so we will spell out the details.
We choose the original network and the newly learned network as the Reference Network and the Com-
parison Network respectively.
Upon selection, a table provides a list of common arcs and those arcs that have been added in the Compari-
son Network, which was learned with the Augmented Markov Blanket algorithm:
Clicking Charts provides a visual representation of these differences. The additional arcs, compared to the
original Markov Blanket network, are now highlighted in blue. Conversely, had any arcs been deleted, those
would be shown in red.
Model 2a: Performance
We now proceed to performance evaluation with this new Augmented Markov Blanket network, analogous
to the Markov Blanket model: Analysis | Network Performance | Target
Given that we had originally split the dataset into a Learning Set and a Test Set, the Network Performance
evaluation is once again carried out separately on both subsets.
Interestingly, the performance on the Test Set is better than on the Learning Set. This indicates that overfitting is not a problem here.
A summary for either subset can be saved by clicking Comprehensive Report. The out-of-sample Test Set report is generally the more important one. It is shown below.
Target: Class

Value                 Benign    Malignant
Gini Index            36.52%    63.01%
Relative Gini Index   99.53%    99.53%
Mean Lift             1.45      1.99
Relative Lift Index   99.92%    99.79%

Occurrences           Benign (88)   Malignant (51)
Benign (86)           86            0
Malignant (53)        2             51

Reliability           Benign (88)   Malignant (51)
Benign (86)           100%          0%
Malignant (53)        3.77%         96.23%

Precision             Benign (88)   Malignant (51)
Benign (86)           97.73%        0%
Malignant (53)        2.27%         100%

Relative Gini Index Mean: 99.53%
Relative Lift Index Mean: 99.85%
Total Precision: 98.56%
R: 0.97499525394
R2: 0.95061574521
As with the earlier model, we repeat K-Folds Cross Validation for the Augmented Markov Blanket. The results are shown below, first as a screenshot and then as a spreadsheet generated via Comprehensive Report.
Sampling Method: K-Folds
Learning Algorithm: Augmented Markov Blanket
Target: Class

Value                 Benign    Malignant
Gini Index            33.95%    64.58%
Relative Gini Index   98.50%    98.55%
Mean Lift             1.42      2.04
Relative Lift Index   99.75%    98.99%

Occurrences           Benign (458)   Malignant (241)
Benign (448)          442            6
Malignant (251)       16             235

Reliability           Benign (458)   Malignant (241)
Benign (448)          98.66%         1.34%
Malignant (251)       6.37%          93.63%

Precision             Benign (458)   Malignant (241)
Benign (448)          96.51%         2.49%
Malignant (251)       3.49%          97.51%

R: 0.93877413371
R2: 0.88129687412

Relative Gini Index Mean: 98.52%
Relative Lift Index Mean: 99.37%
Total Precision: 96.85%
Despite the greater complexity of this new network, we do not see an improvement in any of the perform-
ance measures.
Structural Coefficient
Up to this point, the difference in network complexity was only a function of the choice of learning algorithm. We will now address the Structural Coefficient (SC), which is the only parameter adjustable across all
the learning algorithms in BayesiaLab. In essence, this parameter determines a kind of significance thresh-
old, and thus it influences the degree of complexity of the induced networks.
By default, this Structural Coefficient is set to 1, which reliably prevents the learning algorithms from over-
fitting the model to the data. In studies with relatively few observations, the analyst’s judgment is needed for
determining a potential downward adjustment of this parameter. On the other hand, when data sets are
very large, increasing the parameter to values higher than 1 will help manage the network complexity.
Given the fairly simple network structure of the Markov Blanket model, complexity was of no concern.
Augmented Markov Blanket is more complex, but still very manageable. The question is, could a more
complex network provide greater precision without overfitting? To answer this question, we will perform a
Structural Coefficient Analysis, which generates several metrics that help in making the trade-off between
complexity and precision: Tools | Cross Validation | Structural Coefficient Analysis
BayesiaLab prompts us to specify the range of the Structural Coefficient to be examined and the number of
iterations to be performed. It is worth noting that the Minimum Structural Coefficient should not be set to
0, or even close to 0. A value of 0 would imply a fully connected network, which can take a very long time
to learn depending on the number of variables, or even exceed the memory capacity of the computer run-
ning BayesiaLab.
Number of Iterations determines the interval steps to be taken within the specified range of the Structural Coefficient. Given the relatively light computational load, we choose 25 iterations. With more complex models, we might be more conservative, as each iteration re-learns and re-evaluates the network. Furthermore, we choose to compute all metrics.
The resulting report shows how the network changes as a function of the Structural Coefficient. This can be interpreted as the degree of confidence the analyst should have in any particular arc in the structure.
Clicking Graphs will show a synthesized network, consisting of all structures generated during the iterative learning process.
The reference structure is represented by black arcs, which show the original network learned immediately prior to the start of the Structural Coefficient Analysis. The blue-colored arcs are not contained in the reference structure, but they appear in networks that have been learned as a function of the different Structural Coefficients (SC). The thickness of each arc is proportional to the frequency with which it appears in the learned networks.
More important for us, however, is determining the correct level of network complexity for reliable and accurate prediction performance while avoiding overfitting the data. We can plot several different metrics in this context by clicking Curve.
Typically, the "elbow" of the L-shaped curve above identifies a suitable value for the Structural Coefficient (SC). More formally, we would look for the point on the curve where the second derivative is maximized. By visual inspection, an SC value of around 0.3 appears to be a good candidate for that point. The portion of the curve where SC values approach 0 shows the characteristic pattern of overfitting, which is to be avoided.
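The elbow criterion can be sketched programmatically. The SC grid and metric values below are illustrative only, not the values from the report above.

```python
# Sketch of the elbow criterion: pick the point where the discrete second
# derivative of the (decreasing, L-shaped) metric curve is largest.
# The SC grid and metric values below are illustrative only.

def elbow_index(ys):
    best_i, best_curv = None, float("-inf")
    for i in range(1, len(ys) - 1):
        curv = ys[i - 1] - 2 * ys[i] + ys[i + 1]  # discrete second derivative
        if curv > best_curv:
            best_i, best_curv = i, curv
    return best_i

sc_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
metric = [12.0, 7.0, 3.0, 2.6, 2.4, 2.3]  # steep drop, then nearly flat

i = elbow_index(metric)
print(f"elbow at SC={sc_grid[i]}")  # SC=0.3 for these illustrative values
```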
We will also plot the Target’s Precision alone as a function of the SC. On the surface, the curve for the
Learning Set resembles an L-shape too, but the curve moves only within roughly 2 percentage points, i.e.
between 97% and 99%. For practical purposes, this means that the curve is virtually flat.
As a result, the Structure/Target's Precision Ratio, i.e. (Structure / Target's Precision), is primarily a function of the numerator, i.e. the Structure, as the denominator, Target's Precision, is nearly constant across a wide range of SC values, as per the graph above.
If both Learning and Test Sets are available, a Validation Measure γ can be computed to help choose the most appropriate Structural Coefficient.
This measure is based on the Test Set's mean negative log-likelihood (returned by the network learned from the Learning Set) and on the variances of the negative log-likelihood of the Test Set and Learning Set (returned by the network learned from the Learning Set).
γ = µLL,Test × max(1, σ²LL,Test / σ²LL,Learning)
The range between roughly 0.3 and 0.6, i.e. the section around the minimum of the curve, suggests suitable
values for the Structural Coefficient.
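A minimal sketch of computing γ from per-case negative log-likelihoods follows. The NLL values are made up, and the population-variance convention is an assumption, as the formula above does not specify it.

```python
# Sketch of the validation measure: the Test Set's mean negative log-likelihood,
# inflated when the Test Set's NLL variance exceeds the Learning Set's.
# Per-case NLL values are made up; population variance is an assumption.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def validation_measure(nll_test, nll_learning):
    mean_test = sum(nll_test) / len(nll_test)
    ratio = variance(nll_test) / variance(nll_learning)
    return mean_test * max(1.0, ratio)

nll_test = [2.1, 2.4, 3.0, 2.2]
nll_learning = [2.0, 2.1, 2.2, 2.1]
print(f"gamma = {validation_measure(nll_test, nll_learning):.3f}")
```

A more erratic Test Set (higher NLL variance) inflates γ, which is why the minimum of the γ curve marks Structural Coefficients that generalize well.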
Model 2b: Augmented Markov Blanket (SC=0.3)
Given the results from the Structural Coefficient Analysis, we now wish to relearn the network with an SC value of 0.3. The SC value can be set by right-clicking on the background of the Graph Panel and then selecting Edit Structural Coefficient from the Contextual Menu, or alternatively via the menu, i.e. Edit | Edit Structural Coefficient.
Once we relearn the network, using the same Augmented Markov Blanket algorithm as before, we obtain a
more complex network. The key question is, will this increase in complexity improve the performance or
perhaps be counterproductive?
Model 2b: Performance
We repeat the Network Performance Analysis and generate the Comprehensive Report for the Test Set.
Target: Class

                      Benign     Malignant
Gini Index            36.60%     63.15%
Relative Gini Index   99.75%     99.75%
Mean Lift             1.45       1.99
Relative Lift Index   99.96%     99.90%

Occurrences           Benign (88)   Malignant (51)
Benign (86)           86            0
Malignant (53)        2             51

Reliability           Benign (88)   Malignant (51)
Benign (86)           100%          0%
Malignant (53)        3.77%         96.23%

Precision             Benign (88)   Malignant (51)
Benign (86)           97.73%        0%
Malignant (53)        2.27%         100%

Relative Gini Index Mean: 99.75%
Relative Lift Index Mean: 99.93%
Total Precision: 98.56%
R: 0.97908818201
R2: 0.95861366815
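As a reader's sanity check (not a BayesiaLab feature), the Total Precision reported above can be recomputed from the occurrence table: the diagonal cells are the correctly classified cases.

```python
# A reader's sanity check: recompute Total Precision from the Test Set
# occurrence table. Diagonal cells are correct classifications.

occurrences = [
    [86, 0],   # Benign row:    86 correct, 0 misclassified
    [2, 51],   # Malignant row: 2 misclassified, 51 correct
]

correct = occurrences[0][0] + occurrences[1][1]
total = sum(map(sum, occurrences))
print(f"Total Precision = {correct / total:.2%}")  # -> 98.56%
```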
Second, we perform K-Folds Cross-Validation:
Sampling Method: K-Folds
Learning Algorithm: Augmented Markov Blanket
Target: Class

                      Benign     Malignant
Gini Index            33.86%     64.42%
Relative Gini Index   98.28%     98.28%
Mean Lift             1.42       2.04
Relative Lift Index   99.69%     99.05%

Occurrences           Benign (458)  Malignant (241)
Benign (447)          441           6
Malignant (252)       17            235

Reliability           Benign (458)  Malignant (241)
Benign (447)          98.66%        1.34%
Malignant (252)       6.75%         93.25%

Precision             Benign (458)  Malignant (241)
Benign (447)          96.29%        2.49%
Malignant (252)       3.71%         97.51%

Relative Gini Index Mean: 98.28%
Relative Lift Index Mean: 99.37%
Total Precision: 96.71%
R: 0.94052337963
R2: 0.88458422762
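The same sanity check applies to the 10-fold cross-validation occurrence table (n=699).

```python
# Recompute Total Precision from the K-Folds occurrence table (n=699).
occurrences = [
    [441, 6],   # Benign row
    [17, 235],  # Malignant row
]
correct = occurrences[0][0] + occurrences[1][1]
total = sum(map(sum, occurrences))
print(f"Total Precision = {correct / total:.2%}")  # -> 96.71%
```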
Conclusion
All models reviewed, i.e. Model 1 (Markov Blanket), Model 2a (Augmented Markov Blanket, SC=1), and Model 2b (Augmented Markov Blanket, SC=0.3), performed at very similar levels in terms of classification performance. Total Precision and false positives/negatives are shown as the key metrics in the summary table below.
Test Set (n=139)
                                     Total Precision  False Positives  False Negatives
Markov Blanket (SC=1)                97.84%           3                0
Augmented Markov Blanket (SC=1)      98.56%           2                0
Augmented Markov Blanket (SC=0.3)    98.56%           2                0

10-Fold Cross-Validation (n=699)
                                     Total Precision  False Positives  False Negatives
Markov Blanket (SC=1)                96.85%           17               5
Augmented Markov Blanket (SC=1)      96.85%           16               6
Augmented Markov Blanket (SC=0.3)    96.71%           17               6
Reestimating these models with more observations could potentially change the results and might more clearly differentiate the classification performance. For now, we select the Augmented Markov Blanket (SC=1), which will serve as the basis for the next section of this paper, Model Inference.
Model Inference
Without further discussion of the merits of each model specification, we will now show how the learned Augmented Markov Blanket model can be applied in practice and used for inference. First, we need to go to Validation Mode (F5). We can now bring up all the Monitors in the Monitor Panel by selecting all the nodes (Ctrl+A) and double-clicking on any one of them. More conveniently, the Monitors can be displayed by right-clicking inside the Monitor Panel and selecting Sort | Target Correlation from the Contextual Menu.
Alternatively, we can do the same via Monitor | Sort | Target Correlation.
Monitors are then automatically created for all the nodes correlated with the Target Node. The Monitor of the Target Node is placed first in the Monitor Panel, followed by the other Monitors in order of their correlation with the Target Node, from highest to lowest.
Interactive Inference
For instance, we can now use BayesiaLab to review the individual predictions made based on the model. This feature is called Interactive Inference, which can be accessed from the menu via Inference | Interactive Inference.
Also, we have a choice of using either the Learning Set or the Test Set for inference. For our purposes, we
choose the Test Set.
The Navigation Bar allows scrolling through each record of the test set. Record #0 can be seen below with all the associated observations highlighted in green. Given the observations shown, the model predicts a 99.97% probability that Class is Benign (the Monitor of the Target Node is highlighted in red).
Most cases are rather clear-cut, as above, with probabilities for either diagnosis around 99% or higher. However, there are a number of exceptions, such as case #11. Here, the probability of malignancy is approximately 75%.
Adaptive Questionnaire
In situations when only individual cases are under review, rather than a batch of cases from a database, BayesiaLab can provide case-by-case diagnosis support with the Adaptive Questionnaire.
For a Target Node with more than two states, the Adaptive Questionnaire requires that we define a Target State. Setting the Target State allows BayesiaLab to compute Binary Mutual Information and then focus on the defined Target State. Technically, setting the Target State is not necessary in our particular example, as the Target Node is binary.
The Adaptive Questionnaire can be started from the menu via Inference | Adaptive Questionnaire.
We can set Based on a Target State to Malignant, as we want to highlight this particular state.
Furthermore, we can set the cost of collecting observations via the Cost Editor, which can be started via the Edit Costs button. This is helpful when certain observations are more costly to obtain than others.9
Unfortunately, our example is not ideally suited to illustrate this feature, as the FNA attributes are all collected at the same time, rather than consecutively. However, one can imagine that in other contexts a physician will start the diagnosis process by collecting easy-to-obtain data, such as blood pressure, before proceeding to more elaborate (and more expensive) diagnostic techniques, such as performing an angiogram.
9 Beyond monetary measures, “cost” could reflect, for instance, the degree of pain associated with a surgical procedure.
Once the Adaptive Questionnaire is started, BayesiaLab presents the Monitor of the Target Node (red) and its marginal probability, with the Target State highlighted. Again, as shown below, the Monitors are automatically ordered in the sequence of their importance, from high to low, with regard to diagnosing the Target State of the Target Node.
This means that the ideal first piece of evidence is Uniformity of Cell Size. Let us suppose this metric is equal to 3 (<=4.5) for the case under investigation. Upon setting this first observation, BayesiaLab will compute the new probability distribution of the Target Node, given the evidence. We see that the probability of Class=Malignant has increased to 58.53%. Given the evidence, BayesiaLab also recomputes the ideal new order of questions and now presents Bare Nuclei as the next most relevant question.
Let us now assume that Bare Nuclei is not available for observation. We instead set the node Clump Thickness to Clump Thickness<=4.5.
Given this latest piece of evidence, the probability distribution of Class is once again updated, as is the array of questions. The small gray arrows inside the Monitors indicate how the probabilities have changed compared to the prior iteration.
It is important to point out that not only the Target Node is updated as we set evidence. Rather, all nodes are updated upon setting evidence, reflecting the omnidirectional nature of inference within a Bayesian network.
We can continue this process of updating until we have exhausted all available evidence, or until we have
reached an acceptable level of certainty regarding the diagnosis.
Target Interpretation Tree
Although its tree structure is not displayed, the Adaptive Questionnaire is a dynamic tree for seeking evidence. More specifically, it is a tree that applies to one specific case, given its observed evidence. The Target Interpretation Tree, by contrast, is a static tree that is induced from all cases. As such, it provides a more general approach to searching for the optimum sequence of gathering evidence.
The Target Interpretation Tree can be started from the menu via Analysis | Target Interpretation Tree.
Upon starting this function, we need to set several options. We define the Search Stop Criteria, and set the Maximum Size of Evidence to 3 and the Minimum Joint Probability to 1 (percent). Furthermore, we check the Center on State box and select Malignant from the drop-down menu. This way, Malignant will be highlighted in each node of the to-be-generated tree.
By default, the tree is presented in a top-down format.
Often, it may be more convenient to change the layout to a left-to-right format via the Switch Position button in the upper left-hand corner of the window that contains the tree.
The following tree is presented in the left-to-right layout.
This tree prescribes the sequence in which evidence should be sought to gain the maximum amount of information towards a diagnosis. Going from left to right, we see how the probability distribution for Class changes given the evidence set thus far.
The leftmost node in the tree, without any evidence set, shows the marginal probability distribution of Class. The bottom panel of this node shows Uniformity of Cell Size as the most important evidence to seek.
The three branches that emerge from the node represent the possible states of Uniformity of Cell Size, i.e. the hard evidence we can observe. If we set evidence analogously to what we did in the Adaptive Questionnaire, we will choose the middle branch with the value Uniformity of Cell Size<=4.5 (2/3).
This evidence updates the probabilities of the Target State, now predicting a 58.53% probability of Class=Malignant. At the same time, we can see the next best piece of evidence to seek. Here, it is Bare Nuclei, which will provide the greatest information gain towards the diagnosis of Class. The information gain is quantified with the Score displayed at the bottom of the node.
The Score is the Conditional Mutual Information of the node Bare Nuclei with regard to the Target Node, divided by the cost of observing the evidence if the option Utilize Evidence Cost was checked. In our case, as we did not check this option, the Score is equal to the Conditional Mutual Information.
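A minimal sketch of this computation follows, using an illustrative joint distribution rather than the actual WBCD figures; BayesiaLab's internal units and normalization may differ.

```python
import math

# Sketch of the Score as (conditional) mutual information between a candidate
# node X and the Target, computed in the distribution obtained after setting
# the evidence. The joint table is illustrative, not the WBCD distribution.

def mutual_information(joint):
    """joint[i][j] = P(X=i, Target=j); returns mutual information in bits."""
    px = [sum(row) for row in joint]
    pt = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * pt[j]))
    return mi

# Hypothetical P(X, Target | evidence) for binary X and binary Target:
joint_given_evidence = [
    [0.35, 0.05],
    [0.10, 0.50],
]
print(f"Score = {mutual_information(joint_given_evidence):.4f} bits")
```

A score of 0 would mean X carries no further information about the Target once the evidence is set, which is exactly why a node already observed scores 0, as noted below.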
We can quickly verify the Score of 7.1% by running the Mapping function. First, we set the evidence on
Uniformity of Cell Size (<=4.5) and then run Analysis | Visual | Mapping.
The Mapping window features drop-down menus for Node Analysis and Arc Analysis. However, we are only interested in Node Analysis, and we select Mutual Information with the Target Node as the metric to be displayed.
The size of the nodes, beyond a fixed minimum size,10 is now proportional to the Mutual Information with the Target Node. To see the specific values, we right-click on the background of the window and select Display Scores on Nodes from the Contextual Menu.
10 The minimum and maximum sizes can be changed via Edit Sizes from the Contextual Menu in the Mapping Window.
This shows us that, given Uniformity of Cell Size<=4.5, the Mutual Information of Bare Nuclei with the Target Node is 0.0711, or 7.1%. Note that the node on which evidence has already been set, i.e. Uniformity of Cell Size, shows a Conditional Mutual Information of 0.
So, observing Bare Nuclei will bring the highest information gain among the remaining variables. For instance, if we now observed Bare Nuclei>5.5 (3/3), the probability of Class=Malignant would reach 98.33%.
Finally, BayesiaLab also reports the joint probability of each tree node, i.e. the probability that all pieces of evidence in a branch, up to and including that tree node, would occur.
This says that the joint probability of Uniformity of Cell Size<=4.5 and Bare Nuclei>5.5 is 5.32%.
In contrast to this somewhat artificial illustration of a Target Interpretation Tree in the context of FNA-based diagnosis, Target Interpretation Trees are often prepared for emergency situations, such as triage classification, in which rapid diagnosis with constrained resources is essential. We believe that our example still conveys the idea of "optimum escalation" in obtaining evidence towards a diagnosis.
Summary
By using Bayesian networks as the framework and BayesiaLab as the tool, we have shown a practical new modeling and analysis approach based on the widely studied Wisconsin Breast Cancer Database. BayesiaLab can rapidly machine-learn reliable models, even without prior domain knowledge and without a prior hypothesis. The classification performance of the BayesiaLab-generated Bayesian network models is on par with all studies on this topic published to date. Beyond the predictive performance, BayesiaLab enables a range of analysis and interpretation functions, which can help the researcher gain deeper domain knowledge and perform inference more efficiently.
Appendix
Framework: The Bayesian Network Paradigm11
Acyclic Graphs & Bayes’s Rule
Probabilistic models based on directed acyclic graphs have a long and rich tradition, beginning with the
work of geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such
models are known as directed graphical models; within cognitive science and artificial intelligence, such
models are known as Bayesian networks. The name honors the Rev. Thomas Bayes (1702-1761), whose
rule for updating probabilities in the light of new evidence is the foundation of the approach.
Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated
case of continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and
marginal probabilities of events A and B, provided that the probability of B does not equal zero:
P(A|B) = P(B|A) P(A) / P(B)
In Bayes’ theorem, each probability has a conventional name:
P(A) is the prior probability (or "unconditional" or "marginal" probability) of A. It is "prior" in the sense that it does not take into account any information about B; however, the event B need not occur after event A. In the nineteenth century, the unconditional probability P(A) in Bayes's rule was called the "antecedent" probability; in deductive logic, the antecedent set of propositions and the inference rule imply consequences. The unconditional probability P(A) was called "a priori" by Ronald A. Fisher.
P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called the likelihood.
P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
Bayes' theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.
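To make the rule concrete, here is a small numeric example; the prior, likelihood, and false-positive rate are made-up illustrative numbers, not WBCD statistics.

```python
# Numeric illustration of Bayes' rule with made-up diagnostic numbers:
# A = "tumor is malignant", B = "a cytology feature tests positive".

p_a = 0.35              # prior P(A): base rate of malignancy (illustrative)
p_b_given_a = 0.90      # likelihood P(B|A)
p_b_given_not_a = 0.05  # false-positive rate P(B|not A)

# Normalizing constant P(B) via the law of total probability:
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) = P(B|A) * P(A) / P(B):
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.4f}")  # roughly 0.9065
```

Observing the positive feature raises the probability of malignancy from the 35% prior to roughly 91%, which is the kind of updating a Bayesian network performs at every node.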
The initial development of Bayesian networks in the late 1970s was motivated by the need to model the top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc rule-based schemes.
11 Adapted from Pearl (2000), used with permission.
The nodes in a Bayesian network represent variables of interest (e.g. the temperature of a device, the gender of a patient, a feature of an object, the occurrence of an event) and the links represent statistical (informational) or causal dependencies among the variables. The dependencies are quantified by conditional probabilities for each node given its parents in the network. The network supports the computation of the posterior probabilities of any subset of variables given evidence about any other subset.
Compact Representation of the Joint Probability Distribution
"The central paradigm of probabilistic reasoning is to identify all relevant variables x1, . . . , xN in the environment [i.e. the domain under study], and make a probabilistic model p(x1, . . . , xN) of their interaction [i.e. represent the variables' joint probability distribution]."
Bayesian networks are very attractive for this purpose as they can, by means of factorization, compactly represent the joint probability distribution of all variables.
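The compactness argument can be made concrete with a small parameter count, assuming all variables are binary; the example parent structure is hypothetical.

```python
# Parameter-count sketch of the compactness argument, assuming binary
# variables; the example parent structure is hypothetical.

def full_joint_params(n):
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n - 1

def network_params(parent_counts):
    """One free parameter per parent configuration of each binary node."""
    return sum(2 ** k for k in parent_counts)

n = 10
parents = [0, 1, 1, 2, 2, 2, 1, 2, 2, 1]  # at most two parents per node

print(full_joint_params(n))     # 1023
print(network_params(parents))  # 29
```

For ten binary variables, the unrestricted joint table needs 1,023 free parameters, while this sparse network needs only 29, which is why factorization scales to realistic domains.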
“Reasoning (inference) is then performed by introducing evidence that sets variables in known states, and
subsequently computing probabilities of interest, conditioned on this evidence. The rules of probability,
combined with Bayes’ rule make for a complete reasoning system, one which includes traditional deductive
logic as a special case.” (Barber, 2012)
References
Abdrabou, E. A. M. L., and A. E. B. M. Salem. "A Breast Cancer Classifier Based on a Combination of Case-Based Reasoning and Ontology Approach" (n.d.).
Conrady, Stefan, and Lionel Jouffe. "Missing Values Imputation - A New Approach to Missing Values Processing with Bayesian Networks," January 4, 2012. http://bayesia.us/index.php/missingvalues.
El-Sebakhy, E. A., K. A. Faisal, T. Helmy, F. Azzedin, and A. Al-Suhaim. "Evaluation of Breast Cancer Tumor Classification with Unconstrained Functional Networks Classifier." In The 4th ACS/IEEE International Conference on Computer Systems and Applications, 281–287, 2006.
Hung, M. S., M. Shanker, and M. Y. Hu. "Estimating Breast Cancer Risks Using Neural Networks." Journal of the Operational Research Society 53, no. 2 (2002): 222–231.
Karabatak, M., and M. C. Ince. "An Expert System for Detection of Breast Cancer Based on Association Rules and Neural Network." Expert Systems with Applications 36, no. 2 (2009): 3465–3469.
Mangasarian, Olvi L., W. Nick Street, and William H. Wolberg. "Breast Cancer Diagnosis and Prognosis via Linear Programming." Operations Research 43 (1995): 570–577.
Mu, T., and A. K. Nandi. "Breast Cancer Diagnosis from Fine-Needle Aspiration Using Supervised Compact Hyperspheres and Establishment of Confidence of Malignancy" (n.d.).
Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.
Pearl, Judea, and Stuart Russell. Bayesian Networks. UCLA Cognitive Systems Laboratory, November 2000. http://bayes.cs.ucla.edu/csl_papers.html.
Wolberg, W. H., W. N. Street, D. M. Heisey, and O. L. Mangasarian. "Computer-Derived Nuclear Features Distinguish Malignant from Benign Breast Cytology." Human Pathology 26, no. 7 (1995): 792–796.
Wolberg, William H., W. Nick Street, and O. L. Mangasarian. "Machine Learning Techniques to Diagnose Breast Cancer from Image-Processed Nuclear Features of Fine Needle Aspirates" (n.d.). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.2109.
Wolberg, William H., W. Nick Street, and Olvi L. Mangasarian. "Breast Cytology Diagnosis Via Digital Image Analysis" (1993). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.9894.