Don Eppes, an FBI agent, is investigating a serial killer in Los Angeles but has no leads. He asks his brother Charlie, a mathematician, for help. Charlie examines a map with the crime locations marked but sees no obvious pattern. While pondering the problem, Charlie realizes mathematics may be able to reveal a pattern and predict future crime locations.
THE NUMBERS BEHIND NUMB3RS
SOLVING CRIME WITH MATHEMATICS
KEITH DEVLIN, NPR's "Math Guy," and GARY LORDEN, the Math Consultant on NUMB3RS, the hit CBS television series
A COMPANION TO THE HIT CBS CRIME SERIES NUMB3RS PRESENTS THE FASCINATING WAYS MATHEMATICS IS USED TO FIGHT REAL-LIFE CRIME
Using the popular CBS prime-time TV crime series NUMB3RS as a springboard, Keith Devlin (known to millions of NPR listeners as "the Math Guy" on NPR's Weekend Edition with Scott Simon) and Gary Lorden (the math consultant to NUMB3RS) explain real-life mathematical techniques used by the FBI and other law enforcement agencies to catch and convict criminals. From forensics to counterterrorism, the Riemann hypothesis to image enhancement, solving murders to beating casino odds, Devlin and Lorden present compelling cases that illustrate how advanced mathematics can be used in state-of-the-art criminal investigations.
Praise for the television series:
"NUMB3RS LOOKS LIKE A WINN3R."
—USA Today
A PLUME BOOK
THE NUMBERS BEHIND NUMB3RS
DR. KEITH DEVLIN is executive director of Stanford University's Center for the Study of Language and Information and a consulting professor of mathematics at Stanford. Devlin has a B.Sc. degree in Mathematics from King's College London (1968) and a Ph.D. in Mathematics from the University of Bristol (1971). He is a fellow of the American Association for the Advancement of Science, a World Economic Forum fellow, and a former member of the Mathematical Sciences Education Board of the U.S. National Academy of Sciences. The author of twenty-five books, Devlin has been a regular contributor to National Public Radio's popular program Weekend Edition, where he is known as "the Math Guy" in his on-air conversations with host Scott Simon. His monthly column, "Devlin's Angle," appears on the Mathematical Association of America's web journal MAA Online.
DR. GARY LORDEN is a professor in the mathematics department of the California Institute of Technology in Pasadena. He graduated from Caltech with a B.S. in mathematics in 1962, received his Ph.D. in mathematics from Cornell University in 1966, and taught at Northwestern University before returning to Caltech in 1968. A fellow of the Institute of Mathematical Statistics, Lorden has taught statistics, probability, and other mathematics at all levels from freshman to doctoral. Lorden has also been active as a consultant and expert witness in mathematics and statistics for government agencies and laboratories, private companies, and law firms. For many years he consulted for Caltech's Jet Propulsion Laboratory for their space exploration programs. He has participated in highly classified research projects aimed at enhancing the ability of government agencies (such as the NSA) to protect national security. Lorden is the chief mathematics consultant for the CBS TV series NUMB3RS.
Acknowledgments
The authors want to thank NUMB3RS creators Cheryl Heuton and Nick Falacci for creating Charlie Eppes, television's first mathematics superhero, and succeeding brilliantly in putting math on television in prime time. Their efforts have been joined by a stellar team of other writers, actors, producers, directors, and specialists whose work has inspired us to write this book. The gifted actor David Krumholtz has earned the undying love of mathematicians everywhere for bringing Charlie to life in a way that has led millions of people to see mathematics in a completely new light. Thanks also to NUMB3RS researchers Andy Black and Matt Kolokoff for being wonderful to work with in coming up with endless applications of mathematics to make the writers' dreams come true.
We wish to express our particular thanks to mathematician Dr. Lenny Rudin of Cognitech, one of the world's foremost experts on image enhancement, for considerable help with Chapter 5 and for providing the images we show in that chapter.
Finally, Ted Weinstein, our agent, found us an excellent publisher in David Cashion of Plume, and both worked tirelessly to turn a manuscript that we felt was as reader-friendly as possible, given that this is a math book, into one that, we have to acknowledge, is now a lot more so!
Keith Devlin, Palo Alto, CA
Gary Lorden, Pasadena, CA
Contents
Introduction: The Hero Is a Mathematician?
1 Finding the Hot Zone: Criminal Geographic Profiling
2 Fighting Crime with Statistics 101
3 Data Mining: Finding Meaningful Patterns in Masses of Information
4 When Does the Writing First Appear on the Wall? Changepoint Detection
5 Image Enhancement and Reconstruction
6 Predicting the Future: Bayesian Inference
7 DNA Profiling
8 Secrets—Making and Breaking Codes
9 How Reliable Is the Evidence? Doubts about Fingerprints
10 Connecting the Dots: The Math of Networks
11 The Prisoner's Dilemma, Risk Analysis, and Counterterrorism
12 Mathematics in the Courtroom
13 Crime in the Casino: Using Math to Beat the System
Appendix: Mathematical Synopses of the Episodes in the First Three Seasons of NUMB3RS
Index
INTRODUCTION
The Hero Is a Mathematician?
On January 23, 2005, a new television crime series called NUMB3RS debuted. Created by the husband-and-wife team Nick Falacci and Cheryl Heuton, the series was produced by Paramount Network Television and acclaimed Hollywood veterans Ridley and Tony Scott, whose movie credits include Alien, Top Gun, and Gladiator. Throughout its run, NUMB3RS has regularly beat out the competition to be the most watched series in its time slot on Friday nights.
What has surprised many is that one of the show's two heroes is a mathematician, and much of the action revolves around mathematics, as professor Charlie Eppes uses his powerful skills to help his older brother, Don, an FBI agent, identify and catch criminals. Many viewers, and several critics, have commented that the stories are entertaining, but the basic premise is far-fetched: You simply can't use math to solve crimes, they say. As this book proves, they are wrong. You can use math to solve crimes, and law enforcement agencies do—not in every instance to be sure, but often enough to make math a powerful weapon in the never-ending fight against crime. In fact, the very first episode of the series was closely based on a real-life case, as we will discuss in the next chapter.
Our book sets out to describe, in a nontechnical fashion, some of the major mathematical techniques currently available to the police, CIA, and FBI. Most of these methods have been mentioned during episodes of NUMB3RS, and while we frequently link our explanations to what was depicted on the air, our focus is on the mathematical techniques and how they can be used in law enforcement. In addition we describe some real-life cases where mathematics played a role in solving a crime that have not been used in the TV series—at least not directly.
In many ways, NUMB3RS is similar to good science fiction, which is based on correct physics or chemistry. Each week, NUMB3RS presents a dramatic story in which realistic mathematics plays a key role in the narrative. The producers of NUMB3RS go to great lengths to ensure that the mathematics used in the scripts is correct and that the applications shown are possible. Although some of the cases viewers see are fictional, they certainly could have happened, and in some cases very well may. Though the TV series takes some dramatic license, this book does not. In The Numbers Behind NUMB3RS, you will discover the mathematics that can be, and is, used in fighting real crime and catching actual criminals.
CHAPTER 1
Finding the Hot Zone
Criminal Geographic Profiling
FBI Special Agent Don Eppes looks again at the large street map of Los Angeles spread across the dining-room table of his father's house. The crosses inked on the map show the locations where, over a period of several months, a brutal serial killer has struck, raping and then murdering a number of young women. Don's job is to catch the killer before he strikes again. But the investigation has stalled. Don is out of clues, and doesn't know what to do next.
"Can I help?" The voice is that of Don's younger brother, Charlie, a brilliant young professor of mathematics at the nearby university CalSci. Don has always been in awe of his brother's incredible ability at math, and frankly would welcome any help he can get. But . . . help from a mathematician?
"This case isn't about numbers, Charlie." The edge in Don's voice is caused more by frustration than anger, but Charlie seems not to notice, and his reply is totally matter-of-fact but insistent: "Everything is numbers."
Don is not convinced. Sure, he has often heard Charlie say that mathematics is all about patterns—identifying them, analyzing them, making predictions about them. But it didn't take a math genius to see that the crosses on the map were scattered haphazardly. There was no pattern, no way anyone could predict where the next cross would go—the exact location where the next young girl would be attacked. Maybe it would occur that very evening. If only there were some regularity to the arrangement of the crosses, a pattern that could be captured with a mathematical equation, the way Don remembers from his schooldays that the equation $x^2 + y^2 = 9$ describes a circle.
Looking at the map, even Charlie has to agree there is no way to use math to predict where the killer would strike next. He strolls over to the window and stares out across the garden, the silence of the evening broken only by the continual flick-flick-flick-flick of the automatic sprinkler watering the lawn. Charlie's eyes see the sprinkler but his mind is far away. He had to admit that Don was probably right. Mathematics could be used to do lots of things, far more than most people realized. But in order to use math, there had to be some sort of pattern.
Flick-flick-flick-flick. The sprinkler continued to do its job. There was the brilliant mathematician in New York who used mathematics to study the way the heart works, helping doctors spot tiny irregularities in a heartbeat before the person has a heart attack.
Flick-flick-flick-flick. There were all those mathematics-based computer programs the banks utilized to track credit card purchases, looking for a sudden change in the pattern that might indicate identity theft or a stolen card.
Flick-flick-flick-flick. Without clever mathematical algorithms, the cell phone in Charlie's pocket would have been twice as big and a lot heavier.
Flick-flick-flick-flick. In fact, there was scarcely any area of modern life that did not depend, often in a crucial way, on mathematics. But there had to be a pattern, otherwise the math can't get started.
Flick-flick-flick-flick. For the first time, Charlie notices the sprinkler, and suddenly he knows what to do. He has his answer. He could help solve Don's case, and the solution has been staring him in the face all along. He just had not realized it.
He drags Don over to the window. "We've been asking the wrong question," he says. "From what you know, there's no way you can predict where the killer will strike next." He points to the sprinkler. "Just like, no matter how much you study where each drop of water hits the grass, there's no way you can predict where the next drop will land. There's too much uncertainty." He glances at Don to make sure his older brother is listening. "But suppose you could not see the sprinkler, and all you had to go on was the pattern of where all the drops landed. Then, using math, you could work out exactly where the sprinkler must be. You can't use the pattern of drops to predict forward to the next drop, but you can use it to work backward to the source. It's the same with your killer."
Don finds it difficult to accept what his brother seems to be suggesting. "Charlie, are you telling me you can figure out where the killer lives?"
Charlie's answer is simple: "Yes."
Don is still skeptical that Charlie's idea can really work, but he's impressed by his brother's confidence and passion, and so he agrees to let him assist with the investigation.
Charlie's first step is to learn some basic facts from the science of criminology: First, how do serial killers behave? Here, his years of experience as a mathematician have taught him how to recognize the key factors and ignore all the others, so that a seemingly complex problem can be reduced to one with just a few key variables. Talking with Don and the other agents at the FBI office where his elder brother works, he learns, for instance, that violent serial criminals exhibit certain tendencies in selecting locations. They tend to strike close to their home, but not too close; they always set a "buffer zone" around their residence where they will not strike, an area that is too close for comfort; outside that comfort zone, the frequency of crime locations decreases as the distance from home increases.
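As a rough illustration of the tendencies just described, the following Python sketch assigns a relative weight to a crime site at a given distance from the offender's home: zero inside the buffer zone, then falling off as the distance grows. The function name `location_weight`, the buffer radius, and the decay exponent are assumptions made purely for illustration; they are not taken from the show or from any real profiling software.

```python
def location_weight(distance, buffer_radius=1.0, decay=1.0):
    """Relative likelihood of a crime at `distance` from the offender's home.

    Hypothetical toy model: no crimes inside the buffer zone, and a
    power-law fall-off with distance outside it.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance < buffer_radius:
        # Too close for comfort: the offender does not strike here.
        return 0.0
    # At the edge of the buffer the weight is 1; it shrinks farther out.
    return (buffer_radius / distance) ** decay


# Weights at increasing distances from home (arbitrary map units).
weights = [location_weight(d) for d in (0.5, 1.0, 2.0, 4.0)]
```

With the default values, a site half a unit from home gets weight 0, a site on the edge of the buffer gets the largest weight, and sites farther out get steadily smaller ones.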
Then, back in his office in the CalSci mathematics department, Charlie gets to work in earnest, feverishly covering his blackboards with mathematical equations and formulas. His goal: to find the mathematical key to determine a "hot zone"—an area on the map, derived from the crime locations, where the perpetrator is most likely to live.
As always when he works on a difficult mathematical problem, the hours fly by as Charlie tries out many unsuccessful approaches. Then, finally, he has an idea he thinks should work. He erases his previous chalk scribbles one more time and writes this complicated-looking formula on the board:*
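$$p_{ij} = k \sum_{n=1}^{c} \left[ \frac{\phi}{\left(|X_i - x_n| + |Y_j - y_n|\right)^{f}} + \frac{(1-\phi)\,B^{g-f}}{\left(2B - |X_i - x_n| - |Y_j - y_n|\right)^{g}} \right]$$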
*We'll take a closer look at this formula in a moment.
"That should do the trick," he says to himself.
The next step is to fine-tune his formula by checking it against examples of past serial crimes Don provides him with. When he inputs the crime locations from those previous cases into his formula, does it accurately predict where the criminals lived? This is the moment of truth, when Charlie will discover whether his mathematics reflects reality. Sometimes it doesn't, and he learns that when he first decided which factors to take into account and which to ignore, he must have got it wrong. But this time, after Charlie makes a few minor adjustments, the formula seems to work.
The next day, bursting with energy and conviction, Charlie shows up at the FBI offices with a printout of the crime-location map with the "hot zone" prominently displayed. Just as the equation $x^2 + y^2 = 9$ that Don remembered from his schooldays describes a circle, so that when the equation is fed into a suitably programmed computer it will draw the circle, so too when Charlie fed his new equation into his computer, it also produced a picture. Not a circle this time—Charlie's equation is much more complicated. What it gave was a series of concentric colored regions drawn on Don's crime map of Los Angeles, regions that homed in on the hot zone where the killer lives.
Having this map will still leave a lot of work for Don and his colleagues, but finding the killer is no longer like looking for a needle in a haystack. Thanks to Charlie's mathematics, the haystack has suddenly dwindled to a mere sackful of hay.
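To make the idea of a computer turning an equation into a hot-zone map a little more concrete, here is a minimal Python sketch. It assumes a Rossmo-style scoring rule in its commonly published form (Manhattan distance, a buffer radius `B`, and exponents `f` and `g`, with the buffer term used only at or inside the buffer radius). The function name `hot_zone_scores`, the grid, and all parameter values are hypothetical choices for illustration; this is not the Rigel program.

```python
def hot_zone_scores(crimes, grid_x, grid_y, B=2.0, f=1.2, g=1.2):
    """Score every grid cell as a possible residence, given crime sites.

    Assumed model: each crime adds a distance-decay term when the cell is
    farther than B from it, and a buffer-zone term otherwise.
    """
    raw = {}
    for X in grid_x:
        for Y in grid_y:
            total = 0.0
            for x, y in crimes:
                d = abs(X - x) + abs(Y - y)  # Manhattan (street-grid) distance
                if d > B:
                    total += 1.0 / d ** f  # decay with distance
                else:
                    total += B ** (g - f) / (2 * B - d) ** g  # buffer zone
            raw[(X, Y)] = total
    k = 1.0 / sum(raw.values())  # normalizing constant: scores sum to 1
    return {cell: k * score for cell, score in raw.items()}


# One crime at the origin on an 11-by-11 grid of map cells.
crimes = [(0, 0)]
scores = hot_zone_scores(crimes, range(-5, 6), range(-5, 6))
hottest = max(scores, key=scores.get)
```

With a single crime at the origin and `B = 2`, the highest-scoring cells sit at distance 2 from the crime rather than on top of it, which is the buffer-zone effect. Coloring cells by score would give the kind of banded map described above.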
Charlie explains to Don and the other FBI agents working the case that the serial criminal has tried not to reveal where he lives, picking victims in what he thinks is a random pattern of locations, but that the mathematical formula nevertheless reveals the truth: a hot zone in which the criminal's residence is located, to a very high probability. Don and the team decide to investigate men within a certain range of ages, who live in the hot zone, and use surveillance and stealth tactics to obtain DNA evidence from the suspects' discarded cigarette butts, drinking straws, and the like, which can be matched with DNA from the crime-scene investigations.
Within a few days—and a few heart-stopping moments—they have their man. The case is solved. Don tells his younger brother, "That's some formula you've got there, Charlie."
FACT OR FICTION?
Leaving out a few dramatic twists, the above is what the TV audience saw in the very first episode of NUMB3RS, broadcast on January 23, 2005. Many viewers could not believe that mathematics could help capture a criminal in this way. In fact, that entire first episode was based fairly closely on a real case in which a single mathematical equation was used to identify the hot zone where a criminal lived. It was the very equation, reproduced above, that viewers saw Charlie write on his blackboard.
The real-life mathematician who produced that formula is named Kim Rossmo. The technique of using mathematics to predict where a serial criminal lives, which Rossmo helped to establish, is called geographic profiling.
In the 1980s Rossmo was a young constable on the police force in Vancouver, Canada. What made him unusual for a police officer was his talent for mathematics. Throughout school he had been a "math whiz," the kind of student who makes fellow students, and often teachers, a little nervous. The story is told that early in the twelfth grade, bored with the slow pace of his mathematics course, he asked to take the final exam in the second week of the semester. After scoring one hundred percent, he was excused from the remainder of the course.
Similarly bored with the typical slow progress of police investigations involving violent serial criminals, Rossmo decided to go back to school, ending up with a Ph.D. in criminology from Simon Fraser University, the first cop in Canada to get one. His thesis advisers, Paul and Patricia Brantingham, were pioneers in the development of mathematical models (essentially sets of equations that describe a situation) of criminal behavior, particularly those that describe where crimes are most likely to occur based on where a criminal lives, works, and plays. (It was the Brantinghams who noticed the location patterns of serial criminals that TV viewers saw Charlie learning about from Don and his FBI colleagues.)
Rossmo's interest was a little different from the Brantinghams'. He did not want to study patterns of criminal behavior. As a police officer, he wanted to use actual data about the locations of crimes linked to a single unknown perpetrator as an investigative tool to help the police find the criminal.
Rossmo had some initial successes in re-analyzing old cases, and after receiving his Ph.D. and being promoted to detective inspector, he pursued his interest in developing better mathematical methods to do what he came to call criminal geographic targeting (CGT). Others called the method "geographic profiling," since it complemented the well-known technique of "psychological profiling" used by investigators to find criminals based on their motivations and psychological characteristics. Geographic profiling attempts to locate a likely base of operation for a criminal by analyzing the locations of their crimes.
Rossmo hit upon the key idea behind his seemingly magic formula while riding on a bullet train in Japan one day in 1991. Finding himself without a notepad to write on, he scribbled it on a napkin. With later refinements, the formula became the principal element of a computer program Rossmo wrote, called Rigel (pronounced RYE-gel, and named after the star in the constellation Orion, the Hunter). Today, Rossmo sells Rigel, along with training and consultancy, to police and other investigative agencies around the world to help them find criminals.
When Rossmo describes how Rigel works to a law enforcement agency interested in the program, he offers his favorite metaphor—that of determining the location of a rotating lawn sprinkler by analyzing the pattern of the water drops it sprays on the ground. When NUMB3RS cocreators Cheryl Heuton and Nick Falacci were working on their pilot episode, they took Rossmo's own metaphor as the way Charlie would hit upon the formula and explain the idea to his brother.
Rossmo had some early successes dealing with serial crime investigations in Canada, but what really made him a household name among law enforcement agencies all over North America was the case of the South Side Rapist in Lafayette, Louisiana.
For more than ten years, an unknown assailant, his face wrapped bandit-style in a scarf, had been stalking women in the town and assaulting them. In 1998 the local police, snowed under by thousands of tips and a corresponding number of suspects, brought Rossmo in to help. Using Rigel, Rossmo analyzed the crime-location data and produced a map much like the one Charlie displayed in NUMB3RS, with bands of color indicating the hot zone and its increasingly hot interior rings. The map enabled police to narrow down the hunt to half a square mile and about a dozen suspects. Undercover officers combed the hot zone using the same techniques portrayed in NUMB3RS, to obtain DNA samples of all males of the right age range in the area.
Frustration set in when each of the suspects in the hot zone was cleared by DNA evidence. But then they got lucky. The lead investigator, McCullan "Mac" Gallien, received an anonymous tip pointing to a very unlikely suspect—a sheriff's deputy from a nearby department. As just one more tip on top of the mountain he already had, Mac was inclined to just file it, but on a whim he decided to check the deputy's address. Not even close to the hot zone. Still something niggled him, and he dug a little deeper. And then he hit the jackpot. The deputy had previously lived at another address—right in the hot zone! DNA evidence was collected from a cigarette butt, and it matched that taken from the crime scenes. The deputy was arrested, and Rossmo became an instant celebrity in the crime-fighting world.
Interestingly, when Heuton and Falacci were writing the pilot episode of NUMB3RS, based on this real-life case, they could not resist incorporating the same dramatic twist at the end. When Charlie first applies his formula, no DNA matches are found among the suspects in the hot zone, as happened with Rossmo's formula in Lafayette. Charlie's belief in his mathematical analysis is so strong that when Don tells him the search has drawn a blank, he initially refuses to accept this outcome. "You must have missed him," he says.
Frustrated and upset, Charlie huddles with Don at their father Alan's house, and Alan says, "I know the problem can't be the math, Charlie. It must be something else." This remark spurs Don to realize that finding the killer's residence may be the wrong goal. "If you tried to find me where I live, you would probably fail because I'm almost never there," he notes. "I'm usually at work." Charlie seizes on this notion to pursue a different line of attack, modifying his calculations to look for two hot zones, one that might contain the killer's residence and the other his place of work. This time Charlie's math works. Don manages to identify and catch the criminal just before he kills another victim.
These days, Rossmo's company ECRI (Environmental Criminology Research, Inc.) offers the patented computer package Rigel along with training in how to use it effectively to solve crimes. Rossmo himself travels around the world, to Asia, Africa, Europe, and the Middle East, assisting in criminal investigations and giving lectures to police and criminologists. Two years of training, by Rossmo or one of his assistants, is required to learn to adapt the use of the program to the idiosyncrasies of a particular criminal's behavior.
Rigel does not score a big win every time. For example, Rossmo was called in on the notorious Beltway Sniper case when, during a three-week period in October 2002, ten people were killed and three others critically injured by what turned out to be a pair of serial killers operating in and around the Washington, D.C., area. Rossmo concluded that the sniper's base was somewhere in the suburbs to the north of Washington, but it turned out that the two killers did not live in the area and moved too often to be located by geographic profiling.
The fact that Rigel does not always work will not come as a surprise to anyone familiar with what happens when you try to apply mathematics to the messy real world of people. Many people come away from their high school experience with mathematics thinking that there is a right way and a wrong way to use math to solve a problem—in too many cases with the teacher's way being the right one and their own attempts being the wrong one. But this is rarely the case. Mathematics will always give you the correct answer (if you do the math right) when you apply it to very well-defined physical situations, such as calculating how much fuel a jet needs to fly from Los Angeles to New York. (That is, the math will give you the right answer provided you start with accurate data about the total weight of the plane, passengers, and cargo, the prevailing winds, and so forth. Missing a key piece of input data to incorporate into the mathematical equations will almost always result in an inaccurate answer.) But when you apply math to a social problem, such as a crime, things are rarely so clear-cut.
Setting up equations that capture elements of some real-life activity is called constructing a "mathematical model." In constructing a physical model of something, say an aircraft to study in a wind tunnel, the important thing is to get everything right, apart from the size and the materials used. In constructing a mathematical model, the idea is to get the appropriate behavior right. For example, to be useful, a mathematical model of the weather should predict rain for days when it rains and predict sunshine on sunny days. Constructing the model in the first place is usually the hard part. "Doing the math" with the model—i.e., solving the equations that make up the model—is generally much easier, especially when using computers. Mathematical models of the weather often fail because the weather is simply far too complicated (in everyday language, it's "too unpredictable") to be captured by mathematics with great accuracy.
As we shall see in later chapters, there is usually no such thing as "one correct way" to use mathematics to solve problems in the real world, particularly problems involving people. To try to meet the challenges that confront Charlie in NUMB3RS—locating criminals, tracing the spread of a disease or of counterfeit money, predicting the target selection of terrorists, and so on—a mathematician cannot merely write down an equation and solve it. There is a considerable art to the process of assembling information and data, selecting mathematical variables that describe a situation, and then modeling it with a set of equations. And once a mathematician has constructed a model, there is still the matter of solving it in some way, by approximations or calculations or computer simulations. Every step in the process requires judgment and creativity. No two mathematicians working independently, however brilliant, are likely to produce identical results, if indeed they can produce useful results at all.
It is not surprising, then, that in the field of geographic profiling, Rossmo has competitors. Dr. Grover M. Godwin of the Justice Center at the University of Alaska, author of the book Hunting Serial Predators, has developed a computer package called Predator that uses a branch of mathematical statistics called multivariate analysis to pinpoint a serial killer's home base by analyzing the locations of crimes, where the victims were last seen, and where the bodies were discovered. Ned Levine, a Houston-based urban planner, developed a program called Crimestat for the National Institute of Justice, a research branch of the U.S. Justice Department. It uses something called spatial statistics to analyze serial-crime data, and it can also be applied to help agents understand such things as patterns of auto accidents or disease outbreaks. And David Canter, a professor of psychology at the University of Liverpool in England, and the director of the Centre for Investigative Psychology there, has developed his own computer program, Dragnet, which he has sometimes offered free to researchers. Canter has pointed out that so far no one has performed a head-to-head comparison of the various math/computer systems for locating serial criminals based on applying them in the same cases, and he has claimed in interviews that in the long run, his program and others will prove to be at least as accurate as Rigel.
ROSSMO'S FORMULA
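Finally, let's take a closer look at the formula Rossmo scribbled down
on that paper napkin on the bullet train in Japan back in 1991.
$$p_{ij} = k \sum_{n=1}^{c} \left[ \frac{\varphi}{\left(|x_i - x_n| + |y_j - y_n|\right)^{f}} + \frac{(1 - \varphi)\, B^{\,g-f}}{\left(2B - |x_i - x_n| - |y_j - y_n|\right)^{g}} \right]$$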
To understand what it means, imagine a grid of little squares
superimposed on the map, each square having two numbers that locate it:
what row it's in and what column it's in, "i" and "j". The probability, $p_{ij}$,
that the killer's residence is in that square is written on the left side of
the equation, and the right side shows how to calculate it. The crime
locations are represented by map coordinates, $(x_1, y_1)$ for the first crime,
$(x_2, y_2)$ for the second crime, and so on. What the formula says is this:
To get the probability $p_{ij}$ for the square in row "i", column "j" of the
grid, first calculate how far you have to go to get from the center point
$(x_i, y_j)$ of that square to each crime location $(x_n, y_n)$. The little "n" here
stands for any one of the crime locations—n = 1 means "first crime,"
n = 2 means "second crime," and so on. The answer to the question of
how far you have to go is:
$$|x_i - x_n| + |y_j - y_n|$$
and this is used in two ways.
Reading from left to right in the formula, the first way is to put that
distance in the denominator, with $\varphi$ in the numerator. The distance is
raised to the power $f$. The choice of what number to use for this $f$ will be
based on what works best when the formula is checked against data on
past crime patterns. (If you take $f = 2$, for example, then that part of the
formula will resemble the "inverse square law" that describes the force
of gravity.) This part of the formula expresses the idea that the
probability of crime locations decreases as the distance increases, once outside of
the buffer zone.
The second way the formula uses the "traveling distance" of each
crime involves the buffer zone. In the second fraction, you subtract the
distance from $2B$, where $B$ is a number that will be chosen to describe
the size of the buffer zone, and you use that subtraction result in
the second fraction. The subtraction produces smaller answers as the
distance increases, so that after raising those answers to another power,
$g$, in the denominator of the second part of the formula, you get larger
results.
Together, the first and second parts of the formula perform a sort of
"balancing act," expressing the fact that as you move away from the
criminal's base, the probability of crimes first increases (as you move
through the buffer zone) and then decreases. The two parts of the
formula are combined using a fancy mathematical notation, the Greek
letter $\Sigma$ standing for "sum (add up) the contributions from each of the
crimes to the evaluation of the probability for the 'ij' grid square." The
Greek letter $\varphi$ is used in the two parts as a way of placing more "weight"
on one part or the other. A larger choice of $\varphi$ puts more weight on the
phenomenon of "decreasing probability as distance increases," whereas
a smaller $\varphi$ emphasizes the effect of the buffer zone.
Once the formula is used to calculate the probabilities, $p_{ij}$, of all of
the little squares in the grid, it's easy to make a hot zone map. You just
color the squares, with the highest probabilities bright yellow, slightly
smaller probabilities orange, then red, and so on, leaving the squares
with low probability uncolored.
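The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rossmo's own software: the crime coordinates, the grid size, and the values of B, f, and g are hypothetical, and the function names are invented for this example. The sketch also assumes the piecewise convention often used when the formula is stated, in which the distance-decay term applies outside the buffer zone (distance greater than B) and the buffer term applies inside it; that choice avoids dividing by zero when a grid square coincides with a crime location.

```python
def rossmo_probabilities(crimes, rows, cols, B=3.0, f=1.2, g=1.2):
    """Return a rows x cols grid of probabilities that sum to 1.

    B (buffer size), f and g (exponents) are hypothetical defaults;
    in practice they would be tuned against data on past crimes.
    """
    scores = []
    for i in range(rows):
        row_scores = []
        for j in range(cols):
            # Center of the unit grid square in row i, column j.
            x, y = j + 0.5, i + 0.5
            score = 0.0
            for crime_x, crime_y in crimes:
                # "Traveling distance": |x_i - x_n| + |y_j - y_n|.
                d = abs(x - crime_x) + abs(y - crime_y)
                if d > B:
                    # Outside the buffer zone: decreases with distance.
                    score += 1.0 / d ** f
                else:
                    # Inside the buffer zone: increases with distance.
                    score += B ** (g - f) / (2 * B - d) ** g
            row_scores.append(score)
        scores.append(row_scores)
    # Normalize so the grid sums to 1 (the role of a scaling constant).
    total = sum(sum(r) for r in scores)
    return [[s / total for s in r] for r in scores]


def hottest_cells(prob, count=5):
    """Return the (row, column) pairs of the highest-probability squares."""
    cells = [(p, i, j) for i, row in enumerate(prob) for j, p in enumerate(row)]
    return [(i, j) for p, i, j in sorted(cells, reverse=True)[:count]]


# Hypothetical crime locations on a 12 x 12 grid.
crimes = [(2.5, 3.5), (8.5, 4.5), (5.5, 9.5), (6.5, 2.5)]
grid = rossmo_probabilities(crimes, rows=12, cols=12)
print(hottest_cells(grid))
```

The squares returned by `hottest_cells` are the ones that would be colored bright yellow on a hot zone map; lower-ranked squares would get the cooler colors.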
Rossmo's formula is a good example of the art of using mathematics
to describe incomplete knowledge of real-world phenomena. Unlike
the law of gravity, which through careful measurements can be observed
to operate the same way every time, descriptions of the behavior of
individual human beings are at best approximate and uncertain. When
Rossmo checked out his formula on past crimes, he had to find the
best fit of his formula to those data by choosing different possible values
of $f$ and $g$, and of $B$ and $\varphi$. He then used those findings in analyzing
future crime patterns, still allowing for further fine-tuning in each new
investigation.
Rossmo's method is definitely not rocket science—space travel
depends crucially on always getting the right answer with great
accuracy. But it is nevertheless science. It does not work every time, and the
answers it gives are probabilities. But in crime detection and other
domains involving human behavior, knowing those probabilities can
sometimes make all the difference.
CHAPTER 2
Fighting Crime with Statistics 101
THE ANGEL OF DEATH
By 1996, Kristen Gilbert, a thirty-three-year-old divorced mother of two
sons, ages seven and ten, and a nurse in Ward C at the Veterans Affairs
Medical Center in Northampton, Massachusetts, had built up quite a
reputation among her colleagues at the hospital. On several occasions she
was the first one to notice that a patient was going into cardiac arrest and
to sound a "code blue" to bring the emergency resuscitation team. She
always stayed calm, and was competent and efficient in administering to
the patient. Sometimes she would give the patient an injection of the
heart-stimulant drug epinephrine to attempt to restart the heart before
the emergency team arrived, occasionally saving the patient's life in this
way. The other nurses had given her the nickname "Angel of Death."
But that same year, three nurses approached the authorities to express
their growing suspicions that something was not quite right. There had
been just too many deaths from cardiac arrest in that particular ward, they
felt. There had also been several unexplained shortages of epinephrine. The
nurses were starting to fear that Gilbert was giving the patients large doses
of the drug to bring on the heart attacks in the first place, so that she could
play the heroic role of trying to save them. The "Angel of Death" nickname
was beginning to sound more apt than they had first intended.
The hospital launched an investigation, but found nothing untoward. In
particular, the number of cardiac deaths at the unit was broadly in line with
the rates at other VA hospitals, they said. Despite the findings of the initial
investigation, however, the staff at the hospital remained suspicious, and
eventually a second investigation was begun. This included bringing in a
professional statistician, Stephen Gehlbach of the University of
Massachusetts, to take a closer look at the unit's cardiac arrest and mortality figures.
Largely as a result of Gehlbach's analysis, in 1998 the U.S. Attorney's Office
decided to convene a grand jury to hear the evidence against Gilbert.
Part of the evidence was her alleged motivation. In addition to
seeking the excitement of the code blue alarm and the resuscitation process,
plus the recognition for having struggled valiantly to save the patient, it
was suggested that she sought to impress her boyfriend, who also
worked at the hospital. Moreover, she had access to the epinephrine.
But since no one had seen her administer any fatal injections, the case
against her, while suggestive, was purely circumstantial. Although the
patients involved were mostly middle-aged men not regarded as
potential heart attack victims, it was possible that their attacks had occurred
naturally. What tipped the balance, and led to a decision to indict Gilbert
for multiple murder, was Gehlbach's statistical analysis.
THE SCIENCE OF STATE
Statistics is widely used in law enforcement in many ways and for many
purposes. In NUMB3RS, Charlie often carries out a statistical analysis,
and the use of statistical techniques will appear in many chapters in this
book, often without our making explicit mention of the fact. But what
exactly does statistics entail? And why was the word in the singular in
that last sentence?
The word "statistics" comes from the Latin term statisticum collegium,
meaning "council of state," and the Italian word statista, meaning
"statesman," which reflects the initial uses of the technique. The German
word Statistik likewise originally meant the analysis of data about the
state. Until the nineteenth century, the equivalent English term was
"political arithmetic," after which the word "statistics" was introduced
to refer to any collection and classification of data.
Today, "statistics" really has two connected meanings. The first is the
collection and tabulation of data; the second is the use of mathematical
and other methods to draw meaningful and useful conclusions from
tabulated data. Some statisticians refer to the former activity as "little-s
statistics" and the latter activity as "big-S Statistics". Spelled with a
lower-case s, the word is treated as plural when it refers to a collection
of numbers. But it is singular when used to refer to the activity of
collecting and tabulating those numbers. "Statistics" (with a capital S)
refers to an activity, and hence is singular.
Though many sports fans and other kinds of people enjoy collecting
and tabulating numerical data, the real value of little-s statistics is to
provide the data for big-S Statistics. Many of the mathematical
techniques used in big-S Statistics involve the branch of mathematics known
as probability theory, which began in the sixteenth and seventeenth
centuries as an attempt to understand the likely outcomes of games
of chance, in order to increase the likelihood of winning. But whereas
probability theory is a definite branch of mathematics, Statistics is
essentially an applied science that uses mathematical methods.
While the law enforcement profession collects a large quantity of
little-s statistics, it is the use of big-S Statistics as a tool in fighting crime that we
shall focus on. (From now on we shall drop the "big S", "little s"
terminology and use the word "statistics" the way statisticians do, to mean both,
leaving the reader to determine the intended meaning from the context.)
Although some applications of statistics in law enforcement use
sophisticated methods, the basic techniques covered in a first-semester
college statistics course are often enough to crack a case.
This was certainly true for United States v. Kristen Gilbert. In that case,
a crucial question for the grand jury was whether there were significantly
more deaths in the unit when Kristen Gilbert was on duty than at other
times. The key word here is "significantly". One or two extra deaths on
her watch could be coincidence. How many deaths would it take to reach
the level of "significance" sufficient to indict Gilbert? This is a question
that only statistics can answer. Accordingly, Stephen Gehlbach was asked
to provide the grand jury with a summary of his findings.
HYPOTHESIS TESTING
Gehlbach's testimony was based on a fundamental statistical technique
known as hypothesis testing. This method uses probability theory to
determine whether an observed outcome is so unusual that it is highly
unlikely to have occurred naturally.
One of the first things Gehlbach did was plot the annual number of
deaths at the hospital from 1988 through 1997, broken down by shifts—
midnight to 8:00 AM, 8:00 AM to 4:00 PM, and 4:00 PM to midnight. The
resulting graph is shown in Figure 1. Each vertical bar shows the total
number of deaths in the year during that particular shift.
[Figure 1: bar chart. Horizontal axis: Year (1988 through 1997). Bars: Night (12 A.M.-8 A.M.), Day (8 A.M.-4 P.M.), and Evening (4 P.M.-12 A.M.) shifts.]
Figure 1. Total deaths at the hospital, by shift and year.
The graph shows a definite pattern. For the first two years, there were
around ten deaths per year on each shift. Then, for each of the years 1990
through 1995, one of the three shifts shows between 25 and 35 deaths per
year. Finally, for the last two years, the figures drop back to roughly ten
deaths on each of the three shifts. When the investigators examined
Kristen Gilbert's work record, they discovered that she started work in
Ward C in March 1990 and stopped working at the hospital in February
1996. Moreover, for each of the years she worked at the VA, the shift that
showed the dramatically increased number of deaths was the one she
worked. To a layperson, this might suggest that Gilbert was clearly
responsible for the deaths, but on its own it would not be sufficient to secure a
conviction—indeed, it might not be enough to justify even an indictment.
The problem is that it may be just a coincidence. The job of the statistician
in this situation is to determine just how unlikely such a coincidence
would be. If the answer is that the likelihood of such a coincidence is, say,
1 in 100, then Gilbert might well be innocent; and even 1 in 1,000 leaves
some doubt as to her guilt; but with a likelihood of, say, 1 in 100,000, most
people would find the evidence against her to be pretty compelling.
To see how hypothesis testing works, let's start with the simple
example of tossing a coin. If the coin is perfectly balanced (i.e., unbiased
or fair), then the probability of getting heads is 0.5.* Suppose we toss the
coin ten times in a row to see if it is biased in favor of heads. Then we
can get a range of different outcomes, and it is possible to compute the
likelihood of different results. For example, the probability of getting at
least six heads is about 0.38. (The calculation is straightforward but a bit
intricate, because there are many possible ways you can get six or more
heads in ten tosses, and you have to take account of all of them.) The
figure of 0.38 puts a precise numerical value on the fact that, on an
intuitive level, we would not be surprised if ten coin tosses gave six or
more heads. For at least seven heads, the probability works out at 0.17,
a figure that corresponds to our intuition that seven or more heads is
somewhat unusual but certainly not a cause for suspicion that the coin
was biased. What would surprise us is nine or ten heads, and for that the
probability works out at about 0.01, or 1 in 100. The probability of
getting ten heads is about 0.001, or 1 in 1,000, and if that happened we
would definitely suspect an unfair coin. Thus, by tossing the coin ten
times, we can form a reliable, precise judgment, based on mathematics,
of the hypothesis that the coin is unbiased.
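The coin-toss probabilities quoted above can be checked with a short calculation. The sketch below simply adds up the binomial probabilities for every way of getting at least a given number of heads; the function name is chosen for this illustration only.

```python
from math import comb


def prob_at_least(k, n=10, p=0.5):
    """Probability of k or more heads in n tosses with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Tail probabilities for ten tosses of a fair coin.
for heads in (6, 7, 9, 10):
    print(f"at least {heads} heads: {prob_at_least(heads):.3f}")
```

For a fair coin each of the 1,024 possible sequences of ten tosses is equally likely, so each probability is just a count of favorable sequences divided by 1,024 (for example, 386/1,024 for six or more heads).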
In the case of the suspicious deaths at the Veterans Affairs Medical
Center, the investigators wanted to know if the number of deaths that
occurred when Kristen Gilbert was on duty was so unlikely that it could
not be merely happenstance. The math is a bit more complicated than
for the coin tossing, but the idea is the same. Table 1 gives the data the
investigators had at their disposal. It gives numbers of shifts, classified in
different ways, and covers the eighteen-month period ending in February
1996, the month when the three nurses told their supervisor of their
concerns, shortly after which Gilbert took a medical leave.
*Actually, this is not entirely accurate. Because of inertial properties of a physical
coin, there is a slight tendency for it to resist turning, with the result that, if a perfectly
balanced coin is given a random initial flip, the probability that it will land the same
way up as it started is about 0.51. But we will ignore this caveat in what follows.
                     DEATH ON SHIFT
GILBERT PRESENT      YES       NO      TOTAL
YES                   40      217        257
NO                    34    1,350      1,384
TOTAL                 74    1,567      1,641
Table 1. The data for the statistical analysis in the Gilbert case.
Altogether, there were 74 deaths, spread over a total of 1,641 shifts.
If the deaths are assumed to have occurred randomly, these figures
suggest that the probability of a death on any one shift is about 74
out of 1,641, or 0.045. Focusing now on the shifts when Gilbert was on
duty, there were 257 of them. If Gilbert was not killing any of the patients,
we would expect there to be around 0.045 × 257 = 11.6 deaths on her
shifts, i.e., around 11 or 12 deaths. In fact there were more—40 to be
precise. How likely is this? Using mathematical methods similar to those for
the coin tosses, statistician Gehlbach calculated that the probability of
having 40 or more of the 74 deaths occur on Gilbert's shifts was less than
1 in 100 million. In other words, it is unlikely in the extreme that Gilbert's
shifts were merely "unlucky" for the patients.
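The arithmetic in this paragraph can be reproduced from Table 1. The text does not say exactly which test Gehlbach used, so the sketch below makes an assumption: it treats the 74 shifts with a death as if they had been scattered at random among the 1,641 shifts and asks how often 40 or more of them would land on Gilbert's 257 shifts (a hypergeometric model). The function and constant names are chosen for this illustration only.

```python
from math import comb

# Counts taken from Table 1.
TOTAL_SHIFTS = 1641
DEATH_SHIFTS = 74            # shifts on which a death occurred
GILBERT_SHIFTS = 257         # shifts when Gilbert was on duty
GILBERT_DEATH_SHIFTS = 40    # death shifts when Gilbert was on duty


def expected_deaths_on_gilbert_shifts():
    """Deaths expected on Gilbert's shifts if deaths struck shifts at random."""
    return DEATH_SHIFTS / TOTAL_SHIFTS * GILBERT_SHIFTS


def tail_probability(k_min=GILBERT_DEATH_SHIFTS):
    """P(k_min or more of the death shifts fall on Gilbert's shifts).

    Assumed model: Gilbert's 257 shifts are a random draw from all 1,641
    shifts, 74 of which are death shifts (hypergeometric distribution).
    """
    other_shifts = TOTAL_SHIFTS - DEATH_SHIFTS
    favorable = sum(
        comb(DEATH_SHIFTS, k) * comb(other_shifts, GILBERT_SHIFTS - k)
        for k in range(k_min, DEATH_SHIFTS + 1)
    )
    # Exact integer counts; Python's true division handles the huge values.
    return favorable / comb(TOTAL_SHIFTS, GILBERT_SHIFTS)


print(f"expected deaths: {expected_deaths_on_gilbert_shifts():.1f}")
print(f"chance of 40 or more: {tail_probability():.2e}")
```

Under this assumed model the chance comes out far below 1 in 100 million, in line with the figure reported in the text.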
The grand jury decided there was sufficient evidence to indict
Gilbert—presumably the statistical analysis was the most compelling
evidence, but we cannot know for sure, as a grand jury's deliberations
are not public knowledge. She was accused of four specific murders and
three attempted murders. Because the VA is a federal facility, the trial
would be in a federal court rather than a state court, and subject to
federal laws. A significant consequence of this fact for Gilbert was that
although Massachusetts does not have a death penalty, federal law does,
and that is what the prosecutor asked for.
STATISTICS IN THE COURTROOM?
An interesting feature of this case is that the federal trial judge ruled
in pretrial deliberations that the statistical evidence should not be
presented in court. In making his ruling, the judge took note of a
submission by a second statistician brought into the case, George Cobb
of Mount Holyoke College.
Cobb and Gehlbach did not disagree on any of the statistical analysis.
(In fact, they ended up writing a joint article about the case.) Rather, their
roles were different, and they were addressing different issues. Gehlbach's
task was to use statistics to determine if there were reasonable grounds to
suspect Gilbert of multiple murder. More specifically, he carried out an
analysis that showed that the increased numbers of deaths at the hospital
during the shifts when Gilbert was on duty could not plausibly have arisen
from chance variation. That was sufficient to cast suspicion on Gilbert as the
cause of the increase, but not at all enough to prove that she did cause the
increase. What Cobb argued was that the establishment of a statistical
relationship does not explain the cause of that relationship. The judge in
the case accepted this argument, since the purpose of the trial was not to
decide if there were grounds to make Gilbert a suspect—the grand jury
and the U.S. Attorney's Office had done that. Rather, the job before the
court was to determine whether or not Gilbert caused the deaths in
question. His reason for excluding the statistical evidence was that, as
experiences in previous court cases had demonstrated, jurors not well versed in
statistical reasoning—and that would be almost all jurors—typically have
great difficulty appreciating why odds of 1 in 100 million against the
suspicious deaths occurring by chance do not imply that the odds that Gilbert
did not kill the patients are likewise 1 in 100 million. The original odds
could be caused by something else.
Cobb illustrated the distinction by means of a famous example from the
long struggle physicians and scientists had in overcoming the powerful
tobacco lobby to convince governments and the public that cigarette
smoking causes lung cancer. Table 2 shows the mortality rates for three categories
of people: nonsmokers, cigarette smokers, and cigar and pipe smokers.
Nonsmokers 20.2
Cigarette smokers 20.5
Cigar and pipe smokers 35.3
Table 2. Mortality rates per 1,000 people per year.
At first glance, the figures in Table 2 seem to indicate that cigarette
smoking is not dangerous but pipe and cigar smoking are. However, this
is not the case. There is a crucial variable lurking behind the data that the
numbers themselves do not indicate: age. The average age of the
nonsmokers was 54.9, the average age of the cigarette smokers was 50.5, and
the average age of the cigar and pipe smokers was 65.9. Using statistical
techniques to make allowance for the age differences, statisticians were
able to adjust the figures to produce Table 3.
Nonsmokers 20.3
Cigarette smokers 28.3
Cigar and pipe smokers 21.2
Table 3. Mortality rates per 1,000 people per year, adjusted for age.
Now a very different pattern emerges, indicating that cigarette smoking
is highly dangerous.
Whenever a calculation of probabilities is made based on
observational data, the most that can generally be concluded is that there is a
correlation between two or more factors. That can be enough to
spur further investigation, but on its own it does not establish causation.
There is always the possibility of a hidden variable that lies behind the
correlation.
When a study is made of, say, the effectiveness or safety of a new
drug or medical procedure, statisticians handle the problem of hidden
parameters by relying not on observational data, but instead by
conducting a randomized, double-blind trial. In such a study, the target
population is divided into two groups by an entirely random procedure,
with the group allocation unknown to both the experimental subjects
and the caregivers administering the drug or treatment (hence the term
"double-blind"). One group is given the new drug or treatment, the
other is given a placebo or dummy treatment. With such an experiment,
the random allocation into groups overrides the possible effect of
hidden parameters, so that in this case a low probability that a positive
result is simply chance variation can indeed be taken as conclusive
evidence that the drug or treatment is what caused the result.
In trying to solve a crime, there is of course no choice but to
work with the data available. Hence, use of the hypothesis-testing
procedure, as in the Gilbert case, can be highly effective in the
identification of a suspect, but other means are generally required to secure a
conviction.
In United States v. Kristen Gilbert, the jury was not presented with
Gehlbach's statistical analysis, but they did find sufficient evidence to
convict her on three counts of first-degree murder, one count of
second-degree murder, and two counts of attempted murder. Although the
prosecution asked for the death sentence, the jury split 8-4 on that issue,
and accordingly Gilbert was sentenced to life imprisonment with no
possibility of parole.
POLICING THE POLICE
Another use of basic statistical techniques in law enforcement concerns
the important matter of ensuring that the police themselves obey the law.
Law enforcement officers are given a considerable amount of
power over their fellow citizens, and one of the duties of society is to
make certain that they do not abuse that power. In particular, police
officers are supposed to treat everyone equally and fairly, free of any
bias based on gender, race, ethnicity, economic status, age, dress, or
religion.
But determining bias is a tricky business and, as we saw in our
previous discussion of cigarette smoking, a superficial glance at the statistics
can sometimes lead to a completely false conclusion. This is illustrated
in a particularly dramatic fashion by the following example, which,
while not related to police activity, clearly indicates the need to approach
statistics with some mathematical sophistication.
In the 1970s, somebody noticed that 44 percent of male applicants to
the graduate school of the University of California at Berkeley were
accepted, but only 35 percent of female applicants were accepted. On
the face of it, this looked like a clear case of gender discrimination, and,
not surprisingly (particularly at Berkeley, long acknowledged as home
to many leading advocates for gender equality), there was a lawsuit over
gender bias in admissions policies.
It turns out that Berkeley applicants do not apply to the graduate
school, but to individual programs of study—such as engineering,
physics, or English—so if there is any admissions bias, it will occur within
one or more particular programs. Table 4 gives the admission data
program by program:
Major    Male apps    % admit    Female apps    % admit
A           825          62          108           82
B           560          63           25           68
C           325          37          593           34
D           417          33          375           35
E           191          28          393           24
F           373           6          341            7
Table 4. Admission figures from the University of California at Berkeley
on a program-by-program basis.
If you look at each program individually, however, there doesn't
appear to be an advantage in admission for male applicants. Indeed, the
percentage of female applicants admitted to heavily subscribed program
A is considerably higher than for males, and in all other programs the
percentages are fairly close. So how can there appear to be an advantage
for male applicants overall?
To answer this question, you need to look at what programs males
and females applied to. Males applied heavily to programs A and B,
females applied primarily to programs C, D, E, and F. The programs
that females applied to were more difficult to get into than those for
males (the percentages admitted are low for both genders), and this is
why it appears that males had an admission advantage when looking at
the aggregate data.
There was indeed a gender factor at work here, but it had nothing to
do with the university's admissions procedures. Rather, it was one of
self-selection by the applying students, where female applicants avoided
programs A and B.
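The way the aggregate figures can reverse the program-by-program picture can be checked directly from Table 4. The sketch below pools the six programs listed there; the variable and function names are chosen for this illustration, and because the table gives admission percentages rather than counts, the pooled rates are approximate.

```python
# Table 4: program -> (male applicants, male % admitted,
#                      female applicants, female % admitted)
PROGRAMS = {
    "A": (825, 62, 108, 82),
    "B": (560, 63, 25, 68),
    "C": (325, 37, 593, 34),
    "D": (417, 33, 375, 35),
    "E": (191, 28, 393, 24),
    "F": (373, 6, 341, 7),
}


def pooled_rate(applicants_and_percents):
    """Overall admission rate from (applicants, % admitted) pairs."""
    pairs = list(applicants_and_percents)
    admitted = sum(apps * pct / 100 for apps, pct in pairs)
    return admitted / sum(apps for apps, _ in pairs)


male_rate = pooled_rate((m, mp) for m, mp, _, _ in PROGRAMS.values())
female_rate = pooled_rate((f, fp) for _, _, f, fp in PROGRAMS.values())

# Programs in which women were admitted at a higher rate than men.
female_ahead = [name for name, (_, mp, _, fp) in PROGRAMS.items() if fp > mp]

print(f"pooled male rate:   {male_rate:.1%}")
print(f"pooled female rate: {female_rate:.1%}")
print("female rate higher in programs:", ", ".join(female_ahead))
```

For these six programs alone the pooled rates come out at roughly 45 percent for men and 30 percent for women, even though women have the higher admission rate in four of the six programs. The gap comes entirely from where the applications went: most male applications were to A and B, which admit a large share of applicants, while most female applications were to C through F, which admit far fewer.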
The Berkeley case was an example of a phenomenon known as
Simpson's paradox, named for E. H. Simpson, who studied this curious
phenomenon in a famous 1951 paper.*
HOW DO YOU DETERMINE BIAS?
With the above cautionary example in mind, what should we make of
the study carried out in Oakland, California, in 2003 (by the RAND
Corporation, at the request of the Oakland Police Department's Racial
Profiling Task Force), to determine if there was systematic racial bias in
the way police stopped motorists?
The RAND researchers analyzed 7,607 vehicle stops recorded by
Oakland police officers between June and December 2003, using
various statistical tools to examine a number of variables to uncover any
evidence that suggested racial profiling. One figure they found was that
blacks were involved in 56 percent of all traffic stops studied, although
they make up just 35 percent of Oakland's residential population. Does
this finding indicate racial profiling? Well, it might, but as soon as you
look more closely at what other factors could be reflected in those
numbers, the issue is by no means clear cut.
For instance, like many inner cities, Oakland has some areas with
much higher crime rates than others, and the police patrol those higher
crime areas at a much greater rate than they do areas having less crime.
As a result, they make more traffic stops in those areas. Since the higher
crime areas typically have greater concentrations of minority groups,
the higher rate of traffic stops in those areas manifests itself as a higher
rate of traffic stops of minority drivers.
To overcome these uncertainties, the RAND researchers devised a
particularly ingenious way to look for possible racial bias. If racial
profiling was occurring, they reasoned, stops of minority drivers would be
higher when the officers could determine the driver's race prior to
making the stop. Therefore, they compared the stops made during a period
just before nightfall with those made after dark—when the officers
would be less likely to be able to determine the driver's race. The figures
showed that 50 percent of drivers stopped during the daylight period
were black, compared with 54 percent when it was dark. Based on that
finding, there does not appear to be systematic racial bias in traffic
stops.
*E. H. Simpson, "The Interpretation of Interaction in Contingency Tables,"
Journal of the Royal Statistical Society, Ser. B, 13 (1951) 238-241.
But the researchers dug a little further, and looked at the officers'
own reports as to whether they could determine the driver's race prior
to making the stop. When officers reported knowing the race in advance
of the stop, 66 percent of drivers stopped were black, compared with
only 44 percent when the police reported not knowing the driver's race
in advance. This is a fairly strong indicator of racial bias.*
*Sadly, despite many efforts to eliminate the problem, racial bias by police
seems to be a persistent issue throughout the country. To cite just one recent report,
An Analysis of Traffic Stop Data in Riverside, California, by Larry K. Gaines of the
California State University in San Bernardino, published in Police Quarterly, 9, 2,
June 2006, pp. 210-233: "The findings from racial profiling or traffic stop studies
have been fairly consistent: Minorities, especially African Americans, are stopped,
ticketed, and searched at a higher rate as compared to Whites. For example,
Lamberth (cited in State v. Pedro Soto, 1996) found that the Maryland State Police
stopped and searched African Americans at a higher rate as compared to their
rate of speeding violations. Harris (1999) examined court records in Akron, Dayton,
Toledo, and Columbus, Ohio, and found that African Americans were cited at a rate
that surpassed their representation in the driving population. Cordner, Williams, and
Zuniga (2000) and Cordner, Williams, and Velasco (2002) found similar trends in San
Diego, California. Zingraff and his colleagues (2000) examined stops by the North
Carolina Highway Patrol and found that African Americans were overrepresented in
stops and searches."
CHAPTER 3
Data Mining
Finding Meaningful Patterns in Masses of Information
BRUTUS
Charlie Eppes is sitting in front of a bank of computers and television
monitors. He is testing a computer program he is developing to help
police monitor large crowds, looking for unusual behavior that could
indicate a pending criminal or terrorist act. His idea is to use standard
mathematical equations that describe the flow of fluids (in rivers, lakes,
oceans, tanks, pipes, even blood vessels).* He is trying out the new
system at a fund-raising reception for one of the California state senators.
Overhead cameras monitor the diners as they move around the room,
and Charlie's computer program analyzes the "flow" of the people.
Suddenly the test takes on an unexpected aspect. The FBI receives a
telephone warning that a gunman is in the room, intending to kill the
senator.
The software works, and Charlie is able to identify the gunman, but
Don and his team are not able to get to the killer before he has shot the
senator and then turned the gun on himself.
*The idea is based on several real-life projects to use the equations that describe
fluid flows in order to analyze various kinds of crowd activity, including freeway
traffic flow, spectators entering and leaving a large sports stadium, and emergency
exits from burning buildings.
The dead assassin turns out to be a Vietnamese immigrant, a former
Vietcong member, who, despite having been in prison in California,
somehow managed to obtain U.S. citizenship and be the recipient of a
regular pension from the U.S. Army. He had also taken the illegal drug
speed on the evening of the assassination. When Don makes some
enquiries to find out just what is going on, he is visited by a CIA agent
who asks for help in trying to prevent too much information about the
case leaking out. Apparently the dead killer had been part of a covert
CIA behavior modification project carried out in California prisons
during the 1960s to turn inmates into trained assassins who, when activated,
would carry out their assigned task before killing themselves. (Sadly, this
idea is no less fanciful than that of Charlie using fluid flow equations to
study crowd behavior.)
But why had this particular individual suddenly become active and
murdered the state senator?
The picture becomes much clearer when a second murder occurs.
The victim this time is a prominent psychiatrist, the killer a Cuban
immigrant. The killer had also spent time in a California prison, and he too
was the recipient of regular Army pension checks. But on this occasion,
when the assassin tries to shoot himself after killing the victim, the gun
fails to go off and he has to flee the scene. A fingerprint identification
from the gun soon leads to his arrest.
When Don realizes that the dead senator had been urging a repeal of
the statewide ban on the use of behavior modification techniques on
prison inmates, and that the dead psychiatrist had been recommending
the re-adoption of such techniques to overcome criminal tendencies, he
quickly concludes that someone has started to turn the conditioned
assassins on the very people who were pressing for the reuse of the
techniques that had produced them. But who?
Don thinks his best line of investigation is to find out who supplied
the guns that the two killers had used. He knows that the weapons
originated with a dealer in Nevada. Charlie is able to provide the next step,
which leads to the identification of the individual behind the two
assassinations. He obtains data on all gun sales involving that particular
dealer and analyzes the relationships among all sales that originated
there. He explains that he is employing mathematical techniques similar
to those used to analyze calling patterns on the telephone network, an
approach used frequently in real-life law enforcement.
This is what viewers saw in the third-season episode of NUMB3RS
called "Brutus" (the code name for the fictitious CIA
conditioned-assassinator project), first aired on November 24, 2006. As usual, the
mathematics Charlie uses in the show is based on real life.
The method Charlie uses to track the gun distribution is generally
referred to as "link analysis," and is one among many that go under
the collective heading of "data mining." Data mining obtains useful
information among the mass of data that is available (often publicly)
in modern society.
FINDING MEANING IN INFORMATION
Data mining was initially developed by the retail industry to detect
customer purchasing patterns. (Ever wonder why supermarkets offer
customers those loyalty cards, sometimes called "club" cards, in exchange
for discounts? In part it's to encourage customers to keep shopping at
the same store, but low prices would do that. The significant factor for the
company is that it enables them to track detailed purchase patterns that
they can link to customers' home zip codes, information that they can
then analyze using data-mining techniques.)
Though much of the work in data mining is done by computers, for
the most part those computers do not run autonomously. Human
expertise also plays a significant role, and a typical data-mining
investigation will involve a constant back-and-forth interplay between human
expert and machine.
Many of the computer applications used in data mining fall under
the general area known as artificial intelligence, although that term can
be misleading, being suggestive of computers that think and act like
people. Although many people believed that was a possibility back in
the 1950s when AI first began to be developed, it eventually became
clear that this was not going to happen within the foreseeable future,
and may well never be the case. But that realization did not prevent the
development of many "automated reasoning" programs, some of which
eventually found a powerful and important use in data mining, where
the human expert often provides the "high-level intelligence" that guides
the computer programs that do the bulk of the work. In this way, data
mining provides an excellent example of the power that results when
human brains team up with computers.
Among the more prominent methods and tools used in data
mining are:
• Link analysis: looking for associations and other forms of
connection among, say, criminals or terrorists
• Geometric clustering: a specific form of link analysis
• Software agents: small, self-contained pieces of computer code
that can monitor, retrieve, analyze, and act on information
• Machine learning: algorithms that can extract profiles of
criminals and graphical maps of crimes
• Neural networks: special kinds of computer programs that can
predict the probability of crimes and terrorist attacks.
We'll take a brief look at each of these topics in turn.
LINK ANALYSIS
Newspapers often refer to link analysis as "connecting the dots." It's the
process of tracking connections between people, events, locations, and
organizations. Those connections could be family ties, business
relationships, criminal associations, financial transactions, in-person meetings,
e-mail exchanges, and a host of others. Link analysis can be particularly
powerful in fighting terrorism, organized crime, money laundering
("follow the money"), and telephone fraud.
Link analysis is primarily a human-expert driven process.
Mathematics and technology are used to provide a human expert with powerful,
flexible computer tools to uncover, examine, and track possible
connections. Those tools generally allow the analyst to represent linked data as
a network, displayed and examined (in whole or in part) on the
computer screen, with nodes representing the individuals or organizations
or locations of interest and the links between those nodes representing
relationships or transactions. The tools may also allow the analyst to
investigate and record details about each link, and to discover new nodes
that connect to existing ones or new links between existing nodes.
For example, in an investigation into a suspected crime ring, an
investigator might carry out a link analysis of telephone calls a suspect has
made or received, using telephone company call-log data, looking at
factors such as number called, time and duration of each call, or
number called next. The investigator might then decide to proceed further
along the call network, looking at calls made to or from one or more of
the individuals who had had phone conversations with the initial
suspect. This process can bring to the investigator's attention individuals
not previously known. Some may turn out to be totally innocent, but
others could prove to be criminal collaborators.
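Proceeding "further along the call network" can be pictured as a
breadth-first walk outward from the initial suspect. The following is only an
illustrative sketch: the call log, the labels, and the function names
build_network and contacts_within are hypothetical, not part of any real
investigative tool.

```python
from collections import defaultdict, deque

# Hypothetical call log: (caller, callee) pairs. Real call-log data would
# also carry the time and duration of each call.
calls = [
    ("suspect", "A"), ("B", "suspect"), ("A", "C"),
    ("C", "D"), ("E", "F"),
]

def build_network(call_log):
    """Treat each call as an undirected link between two phone users."""
    network = defaultdict(set)
    for caller, callee in call_log:
        network[caller].add(callee)
        network[callee].add(caller)
    return network

def contacts_within(network, start, max_hops):
    """Return {person: hops} for everyone reachable within max_hops calls."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the requested depth
        for other in network[person]:
            if other not in hops:
                hops[other] = hops[person] + 1
                queue.append(other)
    del hops[start]
    return hops

network = build_network(calls)
print(contacts_within(network, "suspect", 1))  # direct contacts only
print(contacts_within(network, "suspect", 2))  # contacts of contacts too
```

Raising max_hops widens the net quickly, which is one reason the choice of
how far to follow the network is left to the human analyst.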
Another line of investigation may be to track cash transactions to
and from domestic and international bank accounts.
Still another line may be to examine the network of places and
people visited by the suspect, using such data as train and airline ticket
purchases, points of entry or departure in a given country, car rental
records, credit card records of purchases, websites visited, and the like.
Given the difficulty nowadays of doing almost anything without
leaving an electronic trace, the challenge in link analysis is usually not
one of having insufficient data, but rather of deciding which of the
megabytes of available data to select for further analysis. Link analysis
works best when backed up by other kinds of information, such as tips
from police informants or from neighbors of possible suspects.
Once an initial link analysis has identified a possible criminal or terrorist
network, it may be possible to determine who the key players are by
examining which individuals have the most links to others in the network.
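Counting links per individual is straightforward once the network is written
down as a list of pairs. A minimal sketch, assuming a made-up list of links
and a hypothetical helper link_counts:

```python
from collections import Counter

# Hypothetical links (relationships or transactions) between individuals.
links = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]

def link_counts(link_list):
    """Count how many distinct individuals each person is linked to."""
    neighbors = {}
    for a, b in link_list:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return Counter({person: len(others) for person, others in neighbors.items()})

counts = link_counts(links)
# most_common lists the best-connected individuals first.
print(counts.most_common(2))
```

In this toy network P1 has the most links, so P1 would be the first candidate
for a "key player"; a real analysis would weigh other evidence as well.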
GEOMETRIC CLUSTERING
Because of resource limitations, law enforcement agencies generally focus
most of their attention on major crime, with the result that minor offenses
such as shoplifting or house burglaries get little attention. If, however, a
single person or an organized gang commits many such crimes on a
regular basis, the aggregate can constitute significant criminal activity that
deserves greater police attention. The problem facing the authorities,
then, is to identify, within the large numbers of minor crimes that take
place every day, clusters that are the work of a single individual or gang.
One example of a "minor" crime that is often carried out on a
regular basis by two (and occasionally three) individuals acting together is
the so-called bogus official burglary (or distraction burglary). This is where
two people turn up at the front door of a homeowner (elderly people
are often the preferred targets) posing as some form of officials (perhaps
telephone engineers, representatives of a utility company, or local
government agents) and, while one person secures the attention of the
homeowner, the other moves quickly through the house or apartment
taking any cash or valuables that are easily accessible.
Victims of bogus official burglaries often file a report to the police,
who will send an officer to the victim's home to take a statement. Since
the victim will have spent considerable time with one of the
perpetrators (the distracter), the statement will often include a fairly detailed
description (gender, race, height, body type, approximate age, general
facial appearance, eyes, hair color, hair length, hair style, accent,
identifying physical marks, mannerisms, shoes, clothing, unusual jewelry,
etc.) together with the number of accomplices and their genders. In
principle, this wealth of information makes crimes of this nature ideal
for data mining, and in particular for the technique known as geometric
clustering, to identify groups of crimes carried out by a single gang.
Application of the method is, however, fraught with difficulties, and to
date the method appears to have been restricted to one or two
experimental studies. We'll look at one such study, both to show how the
method works and to illustrate some of the problems often faced by the
data-mining practitioner.
The following study was carried out in England in 2000 and 2001 by
researchers at the University of Wolverhampton, together with the
West Midlands Police.* The study looked at victim statements from
bogus official burglaries in the police region over a three-year period.
During that period, there were 800 such burglaries recorded, involving
1,292 offenders. This proved to be too great a number for the resources
available for the study, so the analysis was restricted to those cases where
the distracter was female, a group comprising 89 crimes and 105 offender
descriptions.
*Ref. R. Adderley and P. B. Musgrove, General Review of Police Crime Recording
and Investigation Systems, Policing: An International Journal of Police Strategies and
Management, 24 (1), 2001, pp. 110-114.
The first problem encountered was that the descriptions of the
perpetrators were for the most part in narrative form, as written by the
investigating officer who took the statement from the victim. A data-mining
technique known as text mining had to be used to put the descriptions
into a structured form. Because of the limitations of the text-mining
software available, human input was required to handle many of the entries;
for instance, to cope with spelling mistakes, ad hoc or inconsistent
abbreviations (e.g., "Bham" or "B'ham" for "Birmingham"), and the use of
different ways of expressing the same thing (e.g., "Birmingham accent",
"Bham accent", "local accent", "accent: local", etc.).
After some initial analysis, the researchers decided to focus on eight
variables: age, height, hair color, hair length, build, accent, race, and
number of accomplices.
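To give a feel for the clean-up that a free-text variable such as accent
requires before it can be used, here is a minimal sketch of normalizing the
variants quoted earlier. The lookup table and the function normalize_accent
are hypothetical; the text does not describe the software the study actually
used.

```python
import re

# Hypothetical lookup table built by a human reviewer; the entries mirror
# the variants quoted in the text.
ABBREVIATIONS = {"bham": "birmingham", "b'ham": "birmingham"}

def normalize_accent(raw):
    """Map free-text accent notes such as 'accent: local' to one label."""
    text = raw.lower()
    text = re.sub(r"\baccent\b:?", " ", text)  # drop the word 'accent'
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)

for note in ["Birmingham accent", "Bham accent", "B'ham accent",
             "local accent", "accent: local"]:
    print(repr(note), "->", normalize_accent(note))
```

Whether "local" should be treated as the same accent as "birmingham" is
exactly the kind of judgment that still needs a human.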
Once the data had been processed into the appropriate structured
format, the next step was to use geometric clustering to group the
105 offender descriptions into collections that were likely to refer to the
same individual. To understand how this was done, let's first consider a
method that at first sight might appear to be feasible, but which soon
proves to have significant weaknesses. Then, by seeing how those
weaknesses may be overcome, we will arrive at the method used in the British
study.
First, you code each of the eight variables numerically. Age (often a
guess) is likely to be recorded either as a single figure or a range; if it is
a range, take the mean. Gender (not considered in the British Midlands
study because all the cases examined had a female distracter) can be
coded as 1 for male, 0 for female. Height may be given as a number
(inches), a range, or a term such as "tall", "medium", or "short"; again,
some method has to be chosen to convert each of these to a single
figure. Likewise, schemes have to be devised to represent each of the
other variables as a number.
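A minimal sketch of such a coding scheme follows. The inch values assigned
to "short", "medium", and "tall" are invented for illustration, as are the
function names; the text says only that some conversion has to be chosen.

```python
# Hypothetical conversion table for verbal height terms (inches). These
# values are made up for the example.
HEIGHT_TERMS = {"short": 62.0, "medium": 66.0, "tall": 70.0}

def encode_range_or_number(value):
    """Return a single figure: the value itself, or the mean of a range."""
    if isinstance(value, tuple):
        low, high = value
        return (low + high) / 2
    return float(value)

def encode_height(value):
    """Accept inches, a (low, high) range, or a verbal term."""
    if isinstance(value, str):
        return HEIGHT_TERMS[value.lower()]
    return encode_range_or_number(value)

def encode_gender(value):
    """1 for male, 0 for female, as in the text."""
    return 1 if value.lower() == "male" else 0

print(encode_range_or_number((30, 40)))  # age given as a range
print(encode_height("tall"))
print(encode_height((64, 68)))
print(encode_gender("female"))
```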
When the numerical coding has been completed, each perpetrator
description is then represented by an eight-vector, the coordinates of
a point in eight-dimensional geometric (Euclidean) space. The familiar
distance measure of Euclidean geometry (the Pythagorean metric) can
then be used to measure the geometric distance between each pair of
points. This gives the distance between two vectors $(x_1, \ldots, x_8)$ and
$(y_1, \ldots, y_8)$ as:
\[
\sqrt{(x_1 - y_1)^2 + \cdots + (x_8 - y_8)^2}
\]
Points that are close together under this metric are likely to correspond
to perpetrator descriptions that have several features in common; and
the closer the points, the more features the descriptions are likely to
have in common. (Remember, there are problems with this approach,
which we'll get to momentarily. For the time being, however, let's
suppose that things work more or less as just described.)
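The Pythagorean metric is simple to compute. In the sketch below the three
eight-vectors are invented codes in the order age, height, hair color, hair
length, build, accent, race, number of accomplices; they are not data from
the study.

```python
import math

def euclidean_distance(x, y):
    """Pythagorean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical coded descriptions (all values invented).
d1 = [35, 64, 2, 1, 1, 3, 1, 1]
d2 = [36, 65, 2, 1, 1, 3, 1, 1]
d3 = [60, 70, 4, 3, 2, 1, 2, 2]

print(euclidean_distance(d1, d2))  # similar descriptions: small distance
print(euclidean_distance(d1, d3))  # dissimilar descriptions: large distance
```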
The challenge now is to identify clusters of points that are close
together. If there were only two variables, this would be easy. All the
points could be plotted on a single x,y-graph and visual inspection
would indicate possible clusters. But human beings are totally unable to
visualize eight-dimensional space, no matter what assistance the
software system designers provide by way of data visualization tools. The
way around this difficulty is to reduce the eight-dimensional array of
points (descriptions) to a two-dimensional array (i.e., a matrix or table).
The idea is to arrange the data points (that is, the vector representatives
of the offender descriptions) in a two-dimensional grid in such a
way that:
1. pairs of points that are extremely close together in the
eight-dimensional space are put into the same grid entry;
2. pairs of points that are neighbors in the grid are close together in
the eight-dimensional space; and
3. points that are farther apart in the grid are farther apart in the
space.
This can be done using a special kind of computer program known as a
neural net, in particular, a Kohonen self-organizing map (or SOM).
Neural nets (including SOMs) are described later in the chapter. For
now, all we need to know is that these systems, which work iteratively,
are extremely good at homing in (over the course of many iterations) on
patterns, such as geometric clusters of the kind we are interested in, and
thus can indeed take an eight-dimensional array of the kind described
above and place the points appropriately in a two-dimensional grid.
(Part of the skill required to use an SOM effectively in a case such as this
is deciding in advance, or by some initial trial and error, what are the
optimal dimensions of the final grid. The SOM needs that information
in order to start work.)
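For readers who want to see the iterative idea in action, here is a very
small self-organizing map written with NumPy. It is a textbook-style sketch,
not the software used in the study: the learning-rate and neighborhood
schedules, the 3-by-4 grid, and the two artificial groups of eight-dimensional
points are all assumptions made for the example.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Grid cell (row, col) whose weight vector is closest to x."""
    dist2 = np.sum((weights - x) ** 2, axis=-1)
    return np.unravel_index(np.argmin(dist2), dist2.shape)

def train_som(data, rows, cols, iterations=2000, seed=0):
    """Train a small Kohonen map; returns weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every cell, used for neighborhood distances.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    sigma0 = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = t / iterations
        lr = 0.5 * (1.0 - frac)               # learning rate decays toward 0
        sigma = sigma0 * (1.0 - frac) + 0.5   # neighborhood radius shrinks
        x = data[rng.integers(len(data))]     # pick one description at random
        bmu = best_matching_unit(weights, x)
        # Gaussian neighborhood centered on the winning cell.
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    return weights

# Two artificial groups of similar eight-dimensional "descriptions".
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(0.2, 0.03, size=(10, 8)),
    rng.normal(0.8, 0.03, size=(10, 8)),
])
weights = train_som(data, rows=3, cols=4)
cells = [best_matching_unit(weights, x) for x in data]
print(cells)
```

Each description ends up assigned to one grid cell; descriptions from the
same artificial group should land in the same or nearby cells, which is the
behavior the three conditions above ask for.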
Once the data has been put into the grid, law enforcement officers can
examine grid squares that contain several entries, which are highly likely
to come from a single gang responsible for a series of crimes, and can
visually identify clusters on the grid, where there is also a likelihood that
they represent gang activity. In either case, the officers can examine the
corresponding original crime statement entries, looking for indications
that those crimes are indeed the work of a single gang.
Now let's see what goes wrong with the method just described, and
how to correct it.
The first problem is that the original encoding of entries as numbers
is not systematic. This can lead to one variable dominating others when
the entries are clustered using geometric distance (the Pythagorean
metric) in eight-dimensional space. For example, a dimension that
measures height (which could be anything between 60 inches and 76 inches)
would dominate the entry for gender (0 or 1). So the first step is to scale
(in mathematical terminology, normalize) the eight numerical variables,
so that each one varies between 0 and 1.
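One naive way to bring a variable into the range 0 to 1 is linear (min-max)
scaling, sketched here for the height range quoted above. The function name
min_max_scale is hypothetical, and this is only the simplest possible choice,
not the procedure the study ended up using.

```python
def min_max_scale(values, low, high):
    """Scale each value linearly so that low maps to 0 and high maps to 1."""
    return [(v - low) / (high - low) for v in values]

heights = [60, 68, 76]  # inches, spanning the range quoted in the text
print(min_max_scale(heights, 60, 76))  # [0.0, 0.5, 1.0]
```

Before scaling, the shortest and tallest heights differ by 16 while two
gender codes differ by at most 1; after scaling, both variables span the same
unit interval.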
One way to do that would be to simply scale down each variable by a
multiplicative scaling factor appropriate for that particular feature
(height, age, etc.). But that will introduce further problems when the
separation distances are calculated; for example, if gender and height are
among the variables, then, all other variables being roughly the same, a
very tall woman would come out close to a very short man (because
female gives a 0 and male gives a 1, whereas tall comes out close to 1 and
short close to 0). Thus, a more sophisticated normalization procedure
has to be used.
The approach finally adopted in the British Midlands study was to
make every numerical entry binary (just 0 or 1). This meant splitting the
continuous variables (age and height) into overlapping ranges (a few
years and a few inches, respectively), with a 1 denoting an entry in a given
range and a 0 meaning outside that range, and using pairs of binary
variables to encode each factor of hair color, hair length, build, accent, and
race. The exact coding chosen was fairly specific to the data being
studied, so there is little to be gained from providing all the details here. (The
age and height ranges were taken to be overlapping to account for entries
toward the edges of the chosen ranges.) The normalization process
resulted in a set of 46 binary variables. Thus, the geometric clustering
was done over a geometric space of 46 dimensions.
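The effect of overlapping ranges is easiest to see in a small example. The
height bands below are invented for illustration (the study's actual ranges
are not given here), and encode_bands is a hypothetical helper.

```python
# Hypothetical overlapping height bands in inches.
HEIGHT_BANDS = [(58, 63), (61, 66), (64, 69), (67, 72), (70, 77)]

def encode_bands(value, bands):
    """Return one 0/1 flag per band; overlaps can set two flags at once."""
    return [1 if low <= value <= high else 0 for low, high in bands]

print(encode_bands(60, HEIGHT_BANDS))  # [1, 0, 0, 0, 0]
print(encode_bands(62, HEIGHT_BANDS))  # [1, 1, 0, 0, 0]
print(encode_bands(65, HEIGHT_BANDS))  # [0, 1, 1, 0, 0]
```

A height of 62 inches sits near the edge of the first band, so it also sets
the flag for the second band and therefore shares a flag with 65 inches. With
non-overlapping bands, two heights on either side of a boundary would share
no flags at all, however close they were.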
Another problem was how to handle missing data. For example,
what do you do if a victim's statement says nothing about the
perpetrator's accent? If you enter a 0, that would amount to assigning an accent.
But what will the clustering program do if you leave that entry blank?
(In the British Midlands study, the program would treat a missing entry
as 0.) Missing data points are in fact one of the major headaches for data
miners, and there really is no universally good solution. If there are only
a few such cases, you could either ignore them or else see what solutions
you get with different values entered.
As mentioned earlier, a key decision that has to be made before the
SOM can be run is the size of the resulting two-dimensional grid. It
needs to be small enough so that the SOM is forced to put some data
points into the same grid squares, and will also result in some
non-empty grid squares having non-empty neighbors. The investigators in
the British Midlands study eventually decided to opt for a five-by-seven
grid. With 105 offender descriptions, this forced the SOM to create
several multi-entry clusters.
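The arithmetic behind that remark is worth spelling out; it is a simple
counting observation, not a result quoted from the study:

```latex
\[
5 \times 7 = 35 \text{ grid squares}, \qquad \frac{105}{35} = 3 .
\]
```

With 105 descriptions spread over 35 squares, the average is three
descriptions per square, so at least one square must hold three or more, and
every square the SOM leaves empty pushes more descriptions into the others.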
The study itself concluded with experienced police officers
examining the results and comparing them with the original victim statements
and other relevant information (such as geographic proximity of crimes
over a short timespan, which would be another indicator of gang
activity, not used in the cluster analysis), to determine how well the
process performed. Though all parties involved in the study declared it to
be successful, the significant amount of person-hours required means