As cited in a 2012 TIME Magazine cover story (“One A Day”), suicide, particularly in the military, is a severe public health problem: veteran suicide rates are nearly double those of adults in the general U.S. population. To date, military efforts to understand and address the suicide crisis have had little success: “No program, outreach or initiative has worked against the surge in Army suicides, and no one knows why nothing works.” (Time) In this talk we describe how we built a real-time risk assessment framework with the U.S. Department of Veterans Affairs (VA), and how Hadoop and HBase are being used to build further systems, based on our new Bayesian Counters framework, to predict risk in real time. The Bayesian Counters framework was developed, in part, to predict military mental health risks. We are trying to help solve this complicated puzzle, toward the goal of reducing suicidality among those who have served the nation.
2. Speakers
PATTERNS AND PREDICTIONS
Chris
Principal Investigator, DARPA DCAPS
Poulin-Dartmouth Suicide Prediction Team
Former Co-Director, Dartmouth Metalearning Working Group (Theoretical Machine Learning)
Artificial Intelligence Instructor, US Naval War College
Principal, Patterns and Predictions (linguistics and prediction of financial events)
… and have now read many suicide notes.
Alex
Principal Solutions Architect at Cloudera
Ph.D. from Stanford University.
Data mining and statistical analysis at SGI, Hewlett-Packard
3. PATTERNS AND PREDICTIONS
Suicide is a hard societal problem,
but why?
Stigma: Victims are socially outcast (i.e. disconnected)
Negative Topic: Intense negative emotion, and not a 'sexy' research topic by any means.
Freedom of Choice: Ultimately you can't stop someone from risky behaviors, or from many other activities that risk self-harm. And suicide is the ultimate act of personal risk.
Logistics: Even if you know what to look for, there are not enough clinicians to help the number of people suffering. Data privacy issues are as intense as, or more intense than, those in, say, banking.
Prediction: Accuracy (proper identification), false positives (stigmatization), false negatives (malpractice)
Deeper issues?: Recent growth in suicide may be related to something more systemically wrong. Suicide may be the symptom of something else going on.
4. The project is named in honor of Emile Durkheim, a founding sociologist whose 1897 publication of Suicide defined early text analysis for suicide risk.
The team is multidisciplinary, made up of artificial intelligence experts (machine learning and computational linguistics) and medical experts (psychiatrists).
www.durkheimproject.org
Durkheim
5. PATTERNS AND PREDICTIONS
Social Problem:
Opt-In is critical
o Clear explanations for consent, no tricky EULAs
Technical Problem: How do we build a system that collects, stores, and analyzes data, and allows clinicians to react, at Internet scale?
Architecture:
1) Opt-In Interface Layer
2) Data Collection Layer
3) Storage Layer
4) Machine Learning, Phase I
5) Machine Learning, Phase II
6) Automated Intervention
Our Approach
6. PATTERNS AND PREDICTIONS
1) Opt-In Interface Layer
We can't overemphasize the role of simplified user participation, for both consent and privacy control, in our interface/interaction design.
7. PATTERNS AND PREDICTIONS
2) Data Collection Layer
The social media component is handled by a content aggregator (Gigya), which populates a Cassandra database.
8. PATTERNS AND PREDICTIONS
Data Collection Layer, Continued
The Cassandra instances were built and maintained (by Scale Unlimited) to handle high
throughput storage. However, this is not the final destination of the data.
9. PATTERNS AND PREDICTIONS
3) Storage Layer
Eventually, the data is moved to the medical center (behind a HIPAA-compliant firewall at Dartmouth). Here it persists for ongoing research.
10. PATTERNS AND PREDICTIONS
4) Machine Learning, Phase I
In 2011, we initiated a study with the U.S. Department of Veterans Affairs (VA) to study
3 cohorts of 100 subjects each (Non-Psychiatric, Psychiatric, and Suicide Positive).
We developed linguistics-driven prediction models to estimate the risk of suicide.
These models were generated from unstructured clinical notes.
From the clinical notes, we generated datasets of single keywords and multi-word phrases.
We were able to predict suicide with 65% accuracy on a small dataset.
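As a rough illustration of what "single keywords and multi-word phrases" can look like as features, here is a minimal Python sketch. It is not the study's actual pipeline: the function name `ngram_counts`, the tokenization rule, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count single keywords (n=1) and multi-word phrases (n=2..max_n) in one note.

    Hypothetical helper: lowercases the text and splits on anything that is
    not a letter or apostrophe. Phrases may span sentence boundaries.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented sample text, for illustration only.
features = ngram_counts("Patient reports poor sleep. Poor sleep persists.")
print(features.most_common(3))
```

Counts like these, keyed by keyword or phrase, are the kind of input a count-based classifier can consume.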
11. PATTERNS AND PREDICTIONS
5) Machine Learning, Phase II
In 2011, we also initiated a study with Cloudera (Alex Kozlov) on a lightweight machine
learning framework for detecting real-time risk at scale.
We wanted a clean statistical model for distributed inference (prediction).
We needed a more lightweight framework than Mahout.
We wanted to be able to trade off runtime vs. accuracy.
We wanted the prediction library to be eventually open sourced (Apache license) for the community.
‘Alpha’ Build @ http://durkheimproject.org/bcount/
By Alex Kozlov <alexvk@cloudera.com>
12. What is B-counts today? And Why?
Distributed aggregation of user events and correlations, sized to fit into the RAM of multiple machines
Smart client: moves a substantial amount of logic to clients
Time: an explicit time dimension to support ‘recency analysis’
Based on HBase
Previous analysis (Poulin) had indicated that words and correlations are good predictors of the target variable
Need a faster processing/response time (response time beats accuracy of the model)
http://www.slideshare.net/Hadoop_Summit/bayesian-counters
13. Time to Answer
Examples
Advertising: if you don’t figure out what the user wants in 5 minutes, you have lost them
Intrusion detection: the damage may be significantly greater a few minutes after a break-in
Mental health risk: you need to screen before negative actions occur
Value vs. time
http://cetas.net
http://www.woopra.com
http://www.wibidata.com/
14. Solution: Time Stamped Hadoop
•Key: subset of variables with their values + timestamp (variable length)
•Value: count (8 bytes)
[Diagram: an index over key/value pairs (Key 1 through Key 4, each with a Value), used to answer queries such as Pr(A|B, last 20 minutes)]
Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
What if we want to access more recent
data more often?
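The key/value design on this slide can be sketched in a few lines of Python. This is an in-memory stand-in under stated assumptions, not the HBase-backed implementation: the class `TimedCounters`, its methods, and the fixed-size time buckets are hypothetical names and choices made for illustration. Each key is a subset of variable assignments plus a time bucket, each value is a count, and a windowed conditional probability is a ratio of two summed counts.

```python
from collections import defaultdict
from itertools import combinations


class TimedCounters:
    """Hypothetical in-memory stand-in for the key -> count table (not HBase)."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        # (sorted (variable, value) pairs, time bucket) -> count
        self.counts = defaultdict(int)

    def _bucket(self, ts):
        return int(ts) // self.bucket_seconds

    def add_event(self, event, ts):
        """Increment the counter of every non-empty subset of the event's variables."""
        items = sorted(event.items())
        for r in range(1, len(items) + 1):
            for subset in combinations(items, r):
                self.counts[(subset, self._bucket(ts))] += 1

    def count(self, assignment, now, window_seconds):
        """Sum counts for an assignment over the buckets overlapping the window."""
        key = tuple(sorted(assignment.items()))
        first = self._bucket(now - window_seconds)
        last = self._bucket(now)
        return sum(self.counts.get((key, b), 0) for b in range(first, last + 1))

    def conditional(self, a, b, now, window_seconds):
        """Estimate Pr(a | b, last window_seconds) as count(a and b) / count(b)."""
        denom = self.count(b, now, window_seconds)
        if denom == 0:
            return None  # no evidence for b inside the window
        return self.count({**a, **b}, now, window_seconds) / denom


tc = TimedCounters(bucket_seconds=60)
tc.add_event({"A": 1, "B": 1}, ts=1000)
tc.add_event({"A": 0, "B": 1}, ts=1100)
print(tc.conditional({"A": 1}, {"B": 1}, now=1200, window_seconds=1200))
```

The window is only accurate to bucket granularity, which is one way to trade precision for speed; enumerating every subset of variables grows exponentially, so this sketch only suits events with a handful of variables.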
15. A Bayesian Counter, in detail
[Diagram: layout of one Bayesian counter, with labels Counter/Table, Region (divide between), Column family, Column qualifier, File, Version, and Value (data). Example entries shown: Iris; [sepal_width=2;class=0]; 15; 1321038671; 1321038998; 30 mins; 2 hours; …]
17. Syntax
nb iris class=2 sepal_length=5;petal_length=1.4 300
Target variable: class=2
Predictors: sepal_length=5;petal_length=1.4
Time (seconds from now): 300
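To make the command format concrete, here is a small Python sketch that splits such a line into its parts. The slide only labels the target variable, the predictors, and the time; reading the first two tokens (`nb`, `iris`) as a classifier name and a counter/table name is an assumption, and `Query`, `parse_assignments`, and `parse_query` are hypothetical names, not part of the B-counts tool.

```python
from dataclasses import dataclass


@dataclass
class Query:
    classifier: str      # assumed meaning of the first token, e.g. "nb"
    table: str           # assumed meaning of the second token, e.g. "iris"
    target: dict         # target variable assignment
    predictors: dict     # predictor assignments
    window_seconds: int  # time, in seconds from now


def parse_assignments(text):
    """Parse 'a=1;b=2' into {'a': '1', 'b': '2'} (values are kept as strings)."""
    result = {}
    for pair in text.split(";"):
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {pair!r}")
        result[name] = value
    return result


def parse_query(line):
    """Split a five-token command line into a Query."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError("expected: <classifier> <table> <target> <predictors> <seconds>")
    classifier, table, target, predictors, seconds = parts
    return Query(classifier, table, parse_assignments(target),
                 parse_assignments(predictors), int(seconds))


print(parse_query("nb iris class=2 sepal_length=5;petal_length=1.4 300"))
```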
18. Current Classifier Support (alpha release)
Naïve Bayes: Pr(C|F1, F2, ..., FN) = (1/Z) Pr(C) Πi Pr(Fi|C)
Association rules: Confidence (A -> B): count(A and B)/count(A), Lift (A -> B): N x count(A and B)/(count(A) x count(B)), where N is the total number of transactions
Nearest Neighbor: P(C) for k nearest neighbors, count(C|X) = ΣXi count(C|Xi), where X1, X2, ..., XN are in the vicinity of X
Clique ranking: I(X;Y) = ΣΣ p(x,y) log(p(x,y)/(p(x)p(y))), where x in X and y in Y. Using random projection can generalize on two abstract subsets of Z
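Because these classifiers are all ratios of counts, they are cheap to evaluate once the counters exist. The Python sketch below shows the Naïve Bayes posterior and the two association-rule measures computed from raw counts. It is an illustration under assumptions, not the library's code: the function names and the dictionary layout are hypothetical, no smoothing is applied (an unseen feature zeroes out a class), and lift is written in its standard normalized form, which includes the total number of transactions `n`.

```python
def naive_bayes_posterior(class_counts, feature_counts, features):
    """Pr(C | F1..FN) = (1/Z) Pr(C) * prod_i Pr(Fi | C), estimated from raw counts.

    class_counts:   {c: count(C=c)}
    feature_counts: {(feature, c): count(feature and C=c)}
    features:       observed features, e.g. ["sepal_length=5", "petal_length=1.4"]
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        if n_c == 0:
            scores[c] = 0.0
            continue
        score = n_c / total                               # Pr(C)
        for f in features:
            score *= feature_counts.get((f, c), 0) / n_c  # Pr(Fi | C)
        scores[c] = score
    z = sum(scores.values())                              # normalizer Z
    if z == 0:
        return None  # no class is consistent with the observed features
    return {c: s / z for c, s in scores.items()}


def confidence(count_ab, count_a):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    return count_ab / count_a


def lift(count_ab, count_a, count_b, n):
    """Lift(A -> B) = n * count(A and B) / (count(A) * count(B))."""
    return n * count_ab / (count_a * count_b)


posterior = naive_bayes_posterior(
    {0: 50, 1: 50},
    {("x=1", 0): 10, ("x=1", 1): 40},
    ["x=1"],
)
print(posterior)
```

A lift of 1 means A and B co-occur exactly as often as independence would predict; values above 1 indicate positive association.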
19. Performance
retail.dat example – 88K transactions over 14,246 items
o Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)
o 10 ms per pattern on a 5 node cluster
20. PATTERNS AND PREDICTIONS
6) Intervention
Automated systems are coming online for potential patients and families seeking
treatment, as well as passive intervention strategies (‘safety plans’).
21. PATTERNS AND PREDICTIONS
What's next?
In 2013, we plan a variety of initiatives, including launching our clinical observation study, deploying Bayesian Counters on live data, and seeking approval for an automated intervention study.
Launch Data Collection Study (CPHS #23781)… very soon
Deployment of B-Counts on live data for live monitoring
Intervention Research (Clinical Study Approval)
22. PATTERNS AND PREDICTIONS
Conclusion
What is Durkheim? And what is the Bayesian Counters library?
A near real-time classification library that, while under development, you’re free to use.
We hope that some help is coming to those in need…
23. Team
Chris Poulin, Director & Principal Investigator
Paul Thompson, Study Co-Principal Investigator
Thomas W. McAllister, M.D., Key Personnel
Ben Goertzel, Ph.D., Key Personnel
Brian Shiner, MD, Key Personnel
Craig J. Bryan, PsyD, Advisor
Linas Vepstas – Lead Machine Learning Programmer
Brian Nauheimer – Technical Project Manager
Chhean Saur – Lead Web/API Programmer
Kevin Watters – Principal Programmer, Middleware
Ken Krugler – Lead Distributed Systems Expert
Ann Marion – User Experience (UX) Design
Jane Nisselson – User Interface (UI) Design
Andrew Chen – Social Media Applications Developer
Alex Kozlov – Real-time/Distributed Classifier Development
Vivek Magotra – Cassandra Database Developer
CPU power has reached its limit (in the end, the speed of light is finite)
Combining storage or processing capabilities across a distributed system of machines is non-trivial
RAM is faster than disks (RAM: ns, disk: ms)
There are 1,832,160 feet in 347 miles
Disk moves at 50 m/s vs. 300,000,000 m/s
Can we do at least 1,000 feet (300 m)?
Network?
There is no “virtual memory”
If we had all the time (the universe is projected to last less than 1000 trillion years) we could (probably) get the exact answer.
Some analytical companies:
http://cetas.net/ (acquired by VMware)
http://www.woopra.com (analyzes traffic to a website in real time)
http://www.wibidata.com/ (our friends)
More recent column families are accessed more often