Durkheim Project: Social Media Risk & Bayesian Counters

The Durkheim Project: Social Media Risk & Bayesian Counters
Hadoop Summit: June 27, 2013
Chris Poulin: PATTERNS AND PREDICTIONS
Alex Kozlov: Cloudera
Disclaimers:
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and
Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. It is also supported by the Intelligence
Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center contract
number N10PC20221. The opinions, findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects
Agency (DARPA), the Space and Naval Warfare Systems Center Pacific, the IARPA, DOI/NBC, or the U.S.
Government.
© 2013 Patterns and Predictions

Speakers
PATTERNS AND PREDICTIONS
Chris Poulin
• Principal Investigator, DARPA DCAPS - Dartmouth Suicide Prediction Team
• Former Co-Director, Dartmouth Metalearning Working Group (Theoretical Machine Learning)
• Artificial Intelligence Instructor, US Naval War College
• Principal, Patterns and Predictions (linguistics and prediction of financial events)
… and have now read many suicide notes.
Alex Kozlov
• Principal Solutions Architect at Cloudera
• Ph.D. from Stanford University
• Data mining and statistical analysis at SGI, Hewlett-Packard

Suicide is a hard societal problem,
but why?
Stigma: Victims are social outcasts (i.e., disconnected).
Negative Topic: Intense negative emotion, and not a 'sexy'
research topic by any means.
Freedom of Choice: Ultimately you can't stop someone from
risky behaviors, or from many other activities that risk self-harm. And
suicide is the ultimate act of personal risk.
Logistics: Even if you know what to look for, there are not
enough clinicians to help the number of people suffering. Data
privacy issues are as intense as, or more intense than, those in, say, banking.
Prediction: Accuracy (proper identification), false positives
(stigmatization), false negatives (malpractice).
Deeper issues?: Recent growth in suicide may be related to
something more systemically wrong. Suicide may be the symptom of
something else going on.

Durkheim
• The project is named in honor of Emile Durkheim, a founding sociologist whose 1897 publication of Suicide defined early text analysis for suicide risk.
• The multidisciplinary team comprises artificial intelligence experts (machine learning and computational linguistics) and medical experts (psychiatrists).
• www.durkheimproject.org

Our Approach
Social Problem:
Opt-In is critical
o Clear explanations for consent, no tricky EULAs
Technical Problem: How do we build a system that collects, stores, and analyzes data,
and allows clinicians to react, at Internet scale?
Architecture:
1) Opt-In Interface Layer
2) Data Collection Layer
3) Storage Layer
4) Machine Learning, Phase I
5) Machine Learning, Phase II
6) Automated Intervention

1) Opt-In Interface Layer
We cannot overemphasize the role of simplified user participation, for both consent and privacy
control, in our interface/interaction design.

2) Data Collection Layer
The social media component is handled by a content aggregator (Gigya), which populates
a Cassandra database.

Data Collection Layer, Continued
The Cassandra instances were built and maintained (by Scale Unlimited) to handle high
throughput storage. However, this is not the final destination of the data.

3) Storage Layer
Eventually, the data is moved to the medical center (behind a HIPAA-compliant firewall
at Dartmouth), where it persists for ongoing research.

4) Machine Learning, Phase I
In 2011, we initiated a study with the U.S. Department of Veterans Affairs (VA) to study
three cohorts of 100 subjects each (Non-Psychiatric, Psychiatric, and Suicide Positive).
• We developed linguistics-driven prediction models to estimate the risk of suicide.
• These models were generated from unstructured clinical notes.
• From the clinical notes, we generated datasets of single keywords and multi-word phrases.
• We were able to predict suicide with 65% accuracy on a small dataset.
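
As a rough illustration of the keyword and phrase datasets mentioned above, the following Python sketch counts single words and two-word phrases in a piece of text. The tokenizer, the function names, and the sample sentence are assumptions made for illustration only; the slides do not describe the team's actual feature extraction.

```python
import re
from collections import Counter


def tokenize(note: str) -> list:
    """Lowercase a note and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", note.lower())


def ngram_counts(note: str, max_n: int = 2) -> Counter:
    """Count single keywords (n=1) and multi-word phrases (n=2..max_n).

    Phrases are taken over the whole token stream, so they may span
    sentence boundaries in this simplified version.
    """
    tokens = tokenize(note)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented example text, not taken from any clinical record.
features = ngram_counts("Patient reports poor sleep. Poor sleep persists.")
```

Each note then becomes a sparse vector of keyword and phrase counts, which is the kind of input a linguistics-driven classifier could be trained on.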

5) Machine Learning, Phase II
In 2011, we also initiated a study with Cloudera (Alex Kozlov) on a lightweight machine
learning framework for detecting real-time risk at scale.
• We wanted a clean statistical model for distributed inference (prediction).
• We needed a more lightweight framework than Mahout.
• We wanted to be able to trade off runtime vs. accuracy.
• We wanted the prediction library to be eventually open sourced (Apache license) for the community.
‘Alpha’ Build @ http://durkheimproject.org/bcount/
By Alex Kozlov <alexvk@cloudera.com>

What is B-counts today? And why?
• Distributed aggregation of user events and correlations to fit into the RAM of multiple machines
• Smart client: moves a substantial amount of logic to clients
• Time: an explicit time dimension to support ‘recency analysis’
• Based on HBase
• Previous analysis (Poulin) had indicated that words and correlations are good predictors of the target variable
• Need a faster processing/response time (response time beats accuracy of the model)
http://www.slideshare.net/Hadoop_Summit/bayesian-counters

Time to Answer
Examples
• Advertising: if you don’t figure out what the user wants within 5 minutes, you have lost them
• Intrusion detection: the damage may be significantly greater a few minutes after a break-in
• Mental health risk: you need to screen before negative actions occur
Value vs. time
http://cetas.net
http://www.woopra.com
http://www.wibidata.com/

Solution: Time Stamped Hadoop
•Key: subset of variables with their values + timestamp (variable length)
•Value: count (8 bytes)
[Figure: an index over four key/value pairs (Key 1 to Key 4, each with a Value), queried as Pr(A|B, last 20 minutes)]
Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
What if we want to access more recent data more often?
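
The key/value layout above can be sketched in a few lines of Python. This is only an in-memory stand-in under stated assumptions: a dictionary replaces HBase, counts are grouped into one-minute buckets, and every name here (`TimedCounter`, `observe`, `conditional`) is hypothetical rather than part of the Bayesian Counters library. It shows how a query such as Pr(A|B, last 20 minutes) reduces to a ratio of two windowed counts.

```python
from collections import defaultdict
from itertools import combinations


class TimedCounter:
    """In-memory stand-in for a table of time-stamped counters (hypothetical)."""

    def __init__(self, bucket_seconds: int = 60):
        self.bucket_seconds = bucket_seconds
        # key string -> {bucket start time -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def make_key(assignment: dict) -> str:
        """Canonical key, e.g. {'sepal_width': 2, 'class': 0} -> 'class=0;sepal_width=2'."""
        return ";".join(f"{k}={v}" for k, v in sorted(assignment.items()))

    def observe(self, event: dict, timestamp: int) -> None:
        """Increment the counter for every non-empty subset of the event's variables."""
        bucket = timestamp - timestamp % self.bucket_seconds
        items = sorted(event.items())
        for r in range(1, len(items) + 1):
            for subset in combinations(items, r):
                self.counts[self.make_key(dict(subset))][bucket] += 1

    def count(self, assignment: dict, now: int, window_seconds: int) -> int:
        """Sum counts over buckets that start within the last window_seconds."""
        buckets = self.counts.get(self.make_key(assignment), {})
        return sum(c for b, c in buckets.items() if now - window_seconds <= b <= now)

    def conditional(self, a: dict, b: dict, now: int, window_seconds: int) -> float:
        """Estimate Pr(A | B, last window) as count(A and B) / count(B)."""
        joint = self.count({**a, **b}, now, window_seconds)
        given = self.count(b, now, window_seconds)
        return joint / given if given else 0.0


tc = TimedCounter(bucket_seconds=60)
now = 1321038998
tc.observe({"sepal_width": 2, "class": 0}, now - 300)   # 5 minutes ago
tc.observe({"sepal_width": 2, "class": 1}, now - 600)   # 10 minutes ago
tc.observe({"sepal_width": 2, "class": 0}, now - 7200)  # 2 hours ago, outside the window
p = tc.conditional({"class": 0}, {"sepal_width": 2}, now, 20 * 60)
```

Storing a count for every subset of variables grows quickly with the number of variables, so a real deployment would presumably restrict which subsets are kept; the sketch ignores that concern.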

A Bayesian Counter, in detail
[Figure: layout of one Bayesian counter in HBase. Labels: Counter/Table, Region (divide between), Column family, File, Column qualifier, Version, Value (data). Example values shown: Iris; [sepal_width=2;class=0]; 30 mins, 2 hours, …; 1321038671; 1321038998; 15]

Syntax
nb iris class=2 sepal_length=5;petal_length=1.4 300
Target variable: class=2
Predictors: sepal_length=5;petal_length=1.4
Time (seconds from now): 300
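
To make the structure of that command line concrete, here is a small Python sketch that splits it into its parts. It assumes the five whitespace-separated fields are, in order, a classifier name, a table name, the target variable, the predictors, and the time window in seconds; the first two field meanings and all names in the code (`Query`, `parse_query`) are assumptions for illustration, not the library's own parser.

```python
from dataclasses import dataclass


@dataclass
class Query:
    classifier: str       # assumed: "nb" selects the classifier
    table: str            # assumed: "iris" names the counter table
    target: dict          # target variable assignment
    predictors: dict      # predictor assignments
    window_seconds: int   # time window, in seconds from now


def parse_assignments(text: str) -> dict:
    """Parse 'a=1;b=2' into {'a': '1', 'b': '2'} (values are kept as strings)."""
    return dict(pair.split("=", 1) for pair in text.split(";") if pair)


def parse_query(line: str) -> Query:
    """Split a query line into its five whitespace-separated fields."""
    classifier, table, target, predictors, seconds = line.split()
    return Query(classifier, table, parse_assignments(target),
                 parse_assignments(predictors), int(seconds))


q = parse_query("nb iris class=2 sepal_length=5;petal_length=1.4 300")
```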

Current Classifier Support (alpha release)
• Naïve Bayes: Pr(C|F1, F2, ..., FN) = (1/z) Pr(C) Πi Pr(Fi|C)
• Association rules: Confidence(A -> B) = count(A and B)/count(A); Lift(A -> B) = count(A and B)/(count(A) x count(B))
• Nearest Neighbor: P(C) for k nearest neighbors; count(C|X) = Σi count(C|Xi), where X1, X2, ..., XN are in the vicinity of X
• Clique ranking: I(X;Y) = Σx Σy p(x,y) log(p(x,y)/(p(x)p(y))), where x in X and y in Y. Using random projection, this can generalize to two abstract subsets of Z
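
Several of these quantities can be computed directly from counts, which is what makes a counter store a natural backend. The Python sketch below is an illustration under assumptions, not the library's code: the function names and the dictionary layouts are hypothetical, no smoothing is applied, and `lift` divides by the total number of observations `n` so that it is a ratio of probabilities (the slide's formula is written in raw counts).

```python
import math


def naive_bayes_posterior(class_counts, feature_counts, features):
    """Pr(C | F1..FN) = (1/z) Pr(C) * prod_i Pr(Fi | C), estimated from counts.

    class_counts:   {class: count(C)}
    feature_counts: {(feature, class): count(Fi and C)}
    Without smoothing, an unseen (feature, class) pair gives probability 0.
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for f in features:
            score *= feature_counts.get((f, c), 0) / n_c
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()} if z else scores


def confidence(count_ab, count_a):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    return count_ab / count_a


def lift(count_ab, count_a, count_b, n):
    """Lift(A -> B) as Pr(A and B) / (Pr(A) Pr(B)), with n total observations."""
    return (count_ab / n) / ((count_a / n) * (count_b / n))


def mutual_information(joint_counts):
    """I(X;Y) = sum_x sum_y p(x,y) log(p(x,y) / (p(x) p(y))), in nats."""
    n = sum(joint_counts.values())
    px, py = {}, {}
    for (x, y), c in joint_counts.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint_counts.items() if c > 0)


# Made-up counts for illustration.
posterior = naive_bayes_posterior(
    {"0": 6, "1": 4},
    {("sepal_width=2", "0"): 3, ("sepal_width=2", "1"): 1},
    ["sepal_width=2"],
)
```

Because each function takes only counts, a time-windowed count lookup could be substituted for any argument to obtain the same statistic over recent data.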

Performance
retail.dat example – 88K transactions over 14,246 items
o Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)
o 10 ms per pattern on a 5-node cluster
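
For a sense of scale, the two per-pattern figures above can be multiplied out over all 58,623 patterns. This short Python calculation assumes the per-pattern times simply add up across patterns, which the slide does not state; the variable names are illustrative.

```python
patterns = 58_623
mahout_seconds_per_pattern = 0.5      # Mahout FPGrowth figure from the slide
cluster_seconds_per_pattern = 0.010   # 10 ms figure on the 5-node cluster

# Total time if every pattern costs the quoted per-pattern time.
mahout_total = patterns * mahout_seconds_per_pattern
cluster_total = patterns * cluster_seconds_per_pattern

# Ratio of the two per-pattern times.
speedup = mahout_seconds_per_pattern / cluster_seconds_per_pattern
```

Under that assumption the totals come to roughly 8.1 hours versus just under 10 minutes, a ratio of about 50.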

6) Intervention
Automated systems are coming online for potential patients and families seeking
treatment, as well as passive intervention strategies (‘safety plans’).

What's next?
In 2013, we plan a variety of initiatives, including launching our clinical observation
study, deploying Bayesian Counters on live data, and seeking approval for an
automated intervention study.
• Launch Data Collection Study (CPHS #23781)… very soon
• Deployment of B-Counts on live data for live monitoring
• Intervention Research (Clinical Study Approval)

Conclusion
What is Durkheim? And what is the Bayesian Counters library?
A near-real-time classification library that, while under development, you’re
free to use.
We hope that some help is coming to
those in need…

Team
Chris Poulin, Director & Principal Investigator
Paul Thompson, Study Co-Principal Investigator
Thomas W. McAllister, M.D., Key Personnel
Ben Goertzel, Ph.D., Key Personnel
Brian Shiner, MD, Key Personnel
Craig J. Bryan, PsyD, Advisor
Linas Vepstas – Lead Machine Learning Programmer
Brian Nauheimer – Technical Project Manager
Chhean Saur – Lead Web/API Programmer
Kevin Watters – Principal Programmer, Middleware
Ken Krugler – Lead Distributed Systems Expert
Ann Marion – User Experience (UX) Design
Jane Nisselson – User Interface (UI) Design
Andrew Chen – Social Media Applications Developer
Alex Kozlov – Real-time/Distributed Classifier Development
Vivek Magotra – Cassandra Database Developer

THANK YOU
Chris Poulin, Managing Partner, Patterns and Predictions
chris@patternsandpredictions.net
Alex Kozlov, Principal Solutions Architect, Cloudera
alexvk@cloudera.com
Note: We hope that you have found this talk useful and encouraging. However, if you are
having thoughts of harming yourself, please call the Veterans Crisis Line at 1-800-273-8255
or 911.
