The Durkheim Project: Social
Media Risk & Bayesian Counters
Hadoop Summit: June 27, 2013
Chris Poulin: PATTERNS AND PREDICTIONS
Alex Kozlov: Cloudera
Disclaimers:
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. It was also supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center contract number N10PC20221. The opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the Space and Naval Warfare Systems Center Pacific, IARPA, DOI/NBC, or the U.S. Government.
© 2013 Patterns and Predictions
Speakers
PATTERNS AND PREDICTIONS
Chris Poulin
• Principal Investigator, DARPA DCAPS - Dartmouth Suicide Prediction Team
• Former Co-Director, Dartmouth Metalearning Working Group (Theoretical Machine Learning)
• Artificial Intelligence Instructor, US Naval War College
• Principal, Patterns and Predictions (linguistics and prediction of financial events)
… and have now read many suicide notes.
Alex Kozlov
• Principal Solutions Architect at Cloudera
• Ph.D. from Stanford University
• Data mining and statistical analysis at SGI, Hewlett-Packard
Suicide is a hard societal problem, but why?
Stigma: Victims are socially outcast (i.e., disconnected).
Negative Topic: Intense negative emotion, and not a 'sexy' research topic by any means.
Freedom of Choice: Ultimately you can't stop someone from risky behaviors, or from many other activities that risk self-harm. And suicide is the ultimate act of personal risk.
Logistics: Even if you know what to look for, there are not enough clinicians to help the number of people suffering. Data privacy issues are as intense as, or more so than, in banking.
Prediction: Accuracy (proper identification), false positives (stigmatization), false negatives (malpractice).
Deeper issues?: Recent growth in suicide may be related to something more systemically wrong. Suicide may be the symptom of something else going on.
Durkheim
• The project is named in honor of Emile Durkheim, a founding sociologist whose 1897 publication of Suicide defined early text analysis for suicide risk.
• The team is multidisciplinary, combining artificial intelligence experts (machine learning and computational linguistics) and medical experts (psychiatrists).
• www.durkheimproject.org
Our Approach
Social Problem: Opt-In is critical
o Clear explanations for consent, no tricky EULAs
Technical Problem: How to build a system that collects, stores, analyzes, and allows clinicians to react at Internet scale?
Architecture:
1) Opt-In Interface Layer
2) Data Collection Layer
3) Storage Layer
4) Machine Learning, Phase I
5) Machine Learning, Phase II
6) Automated Intervention
1) Opt-In Interface Layer
We can't overemphasize the role of simplified user participation, for both consent and privacy control, in our interface and interaction design.
2) Data Collection Layer
The social media component is handled by a content aggregator (Gigya), which populates a Cassandra database.
Data Collection Layer, Continued
The Cassandra instances were built and maintained (by Scale Unlimited) to handle high-throughput storage. However, this is not the final destination of the data.
3) Storage Layer
Eventually, the data is moved to the medical center (behind a HIPAA-compliant firewall at Dartmouth). Here it persists for ongoing research.
4) Machine Learning, Phase I
In 2011, we initiated a study with the U.S. Department of Veterans Affairs (VA) of 3 cohorts of 100 subjects each (Non-Psychiatric, Psychiatric, and Suicide Positive).
• We developed linguistics-driven prediction models to estimate the risk of suicide.
• These models were generated from unstructured clinical notes.
• From the clinical notes, we generated datasets of single keywords and multi-word phrases.
• We were able to predict suicide with 65% accuracy on a small dataset.
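The step of turning notes into single keywords and multi-word phrases can be pictured with a short sketch. This is an illustration only, not the project's pipeline: the tokenizer, the function name `ngram_counts`, and the sample sentence are all assumptions introduced here.

```python
import re
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count single keywords (n=1) and multi-word phrases (n=2..max_n).

    Hypothetical helper: lowercases the text and keeps runs of letters
    and apostrophes as tokens. Phrases may span sentence boundaries.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up sample text, used only to exercise the function.
features = ngram_counts("Patient reports poor sleep. Patient reports low appetite.")
```

Each resulting keyword or phrase count could then serve as one input feature to a classifier.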
5) Machine Learning, Phase II
In 2011, we also initiated a study with Cloudera (Alex Kozlov) on a lightweight machine learning framework for detecting real-time risk at scale.
• We wanted a clean statistical model for distributed inference (prediction).
• We needed a more lightweight framework than Mahout.
• We wanted to be able to trade off runtime vs. accuracy.
• We wanted the prediction library to be eventually open sourced (Apache license) for the community.
'Alpha' Build @ http://durkheimproject.org/bcount/
By Alex Kozlov <alexvk@cloudera.com>
What is B-counts today? And Why?
• Distributed aggregation of user events and correlations to fit into RAM of multiple machines
• Smart client: Moves a substantial amount of logic to clients
• Time: An explicit time dimension to support 'recency analysis'
• Based on HBase
• Previous analysis (Poulin) had indicated that words and correlations are good predictors of the target variable
• Need a faster processing/response time (response time beats accuracy of the model)
http://www.slideshare.net/Hadoop_Summit/bayesian-counters
Time to Answer
Examples
• Advertising: if you don't figure out what the user wants in 5 minutes, you have lost them
• Intrusion detection: the damage may be significantly bigger a few minutes after a break-in
• Mental health risk: you need to screen before negative actions occur
[Figure: Value vs. time]
http://cetas.net
http://www.woopra.com
http://www.wibidata.com/
Solution: Time Stamped Hadoop
• Key: subset of variables with their values + timestamp (variable length)
• Value: count (8 bytes)
[Figure: index over key/value pairs (Key 1 to Key 4, each with a Value); example query: Pr(A|B, last 20 minutes)]
Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
What if we want to access more recent data more often?
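A minimal in-memory sketch of the key/value idea follows, assuming one counter per subset of variables per time bucket. It is not the HBase implementation: the class `TimeStampedCounters`, the key format, and the 60-second bucket width are assumptions made for illustration, and window edges are only accurate to bucket granularity.

```python
import struct
from collections import defaultdict
from itertools import combinations

BUCKET_SECONDS = 60  # assumed bucket width, not taken from the slides


def make_key(assignment, bucket):
    """Key = sorted 'variable=value' pairs plus a time bucket (variable length)."""
    variables = ";".join(f"{name}={value}" for name, value in sorted(assignment.items()))
    return f"[{variables}]@{bucket}"


def encode_count(count):
    """Value = count packed as an 8-byte big-endian signed integer."""
    return struct.pack(">q", count)


class TimeStampedCounters:
    def __init__(self):
        self.counts = defaultdict(int)

    def record(self, event, timestamp):
        """Increment the counter of every non-empty subset of the event's variables."""
        bucket = int(timestamp) // BUCKET_SECONDS
        items = sorted(event.items())
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                self.counts[make_key(dict(subset), bucket)] += 1

    def count(self, assignment, now, window_seconds):
        """Sum counts over the buckets that overlap [now - window_seconds, now]."""
        first = (int(now) - window_seconds) // BUCKET_SECONDS
        last = int(now) // BUCKET_SECONDS
        return sum(self.counts.get(make_key(assignment, b), 0) for b in range(first, last + 1))

    def conditional(self, a, b, now, window_seconds):
        """Estimate Pr(a | b, last window_seconds) as count(a and b) / count(b)."""
        joint = self.count({**a, **b}, now, window_seconds)
        given = self.count(b, now, window_seconds)
        return joint / given if given else 0.0


counters = TimeStampedCounters()
now = 1321038998
counters.record({"class": 0, "sepal_width": 2}, now - 300)   # 5 minutes ago
counters.record({"class": 1, "sepal_width": 2}, now - 600)   # 10 minutes ago
counters.record({"class": 0, "sepal_width": 2}, now - 7200)  # 2 hours ago, outside the window
p = counters.conditional({"class": 0}, {"sepal_width": 2}, now, 20 * 60)
```

Here `p` estimates Pr(class=0 | sepal_width=2, last 20 minutes) from the two events inside the window. Counting every subset grows exponentially with the number of variables, so a real deployment would presumably cap the subset size.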
A Bayesian Counter, in detail
[Figure: HBase layout of a Bayesian counter. Labels: Counter/Table, Region (divide between), Column family, Column qualifier, File, Version, Value (data). Example values: Iris; [sepal_width=2;class=0]; 30 mins, 2 hours, …; 1321038671; 1321038998; 15.]
Command Line Implementation
Syntax
nb iris class=2 sepal_length=5;petal_length=1.4 300
• class=2: Target Variable
• sepal_length=5;petal_length=1.4: Predictors
• 300: Time (seconds from now)
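To make the argument layout concrete, here is a small sketch that splits the example command into its parts. The names `NbQuery`, `parse_assignments`, and `parse_nb` are hypothetical and are not part of the B-counts code; treating the second token (`iris`) as the counter table name is also an assumption.

```python
from typing import Dict, NamedTuple


class NbQuery(NamedTuple):
    table: str
    target: Dict[str, str]
    predictors: Dict[str, str]
    seconds: int


def parse_assignments(text):
    """Parse 'a=1;b=2' into {'a': '1', 'b': '2'}; values are kept as strings."""
    pairs = (item.split("=", 1) for item in text.split(";") if item)
    return {name: value for name, value in pairs}


def parse_nb(command):
    """Split an 'nb' command into table, target variable, predictors, and time window."""
    name, table, target, predictors, seconds = command.split()
    if name != "nb":
        raise ValueError(f"expected an 'nb' command, got {name!r}")
    return NbQuery(table, parse_assignments(target), parse_assignments(predictors), int(seconds))


query = parse_nb("nb iris class=2 sepal_length=5;petal_length=1.4 300")
```

When typed into a POSIX shell, the predictor list would need quoting, since an unquoted `;` ends the command.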
Current Classifier Support (alpha release)
• Naïve Bayes: Pr(C|F1, F2, ..., FN) = (1/z) Pr(C) Πi Pr(Fi|C)
• Association rules: Confidence (A -> B): count(A and B)/count(A); Lift (A -> B): N x count(A and B)/(count(A) x count(B)), where N is the total number of events
• Nearest Neighbor: P(C) for k nearest neighbors, count(C|X) = Σi count(C|Xi), where X1, X2, ..., XN are in the vicinity of X
• Clique ranking: I(X;Y) = Σx Σy p(x,y) log(p(x,y)/(p(x)p(y))), where x in X and y in Y. Using random projection, this can generalize to two abstract subsets of Z
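The first two classifiers reduce to arithmetic on counts, which the sketch below illustrates. It is not the library's code: the function names and the toy counts are assumptions, no smoothing is applied, and lift is computed in its standard form, which normalizes by the total number of events.

```python
def naive_bayes(class_counts, feature_counts, features):
    """Pr(C | F1..FN) = (1/z) * Pr(C) * prod_i Pr(Fi | C), estimated from raw counts.

    class_counts:   {class: count(C)}
    feature_counts: {(feature, class): count(Fi and C)}
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                                # Pr(C)
        for f in features:
            score *= feature_counts.get((f, c), 0) / n_c   # Pr(Fi | C)
        scores[c] = score
    z = sum(scores.values())                               # normalizing constant
    return {c: s / z for c, s in scores.items()} if z else scores


def confidence(count_ab, count_a):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    return count_ab / count_a


def lift(count_ab, count_a, count_b, total):
    """Lift(A -> B) = Pr(A and B) / (Pr(A) * Pr(B)), written in terms of counts."""
    return total * count_ab / (count_a * count_b)


# Toy counts, invented for illustration.
posterior = naive_bayes(
    class_counts={0: 50, 1: 50},
    feature_counts={("sepal_width=2", 0): 10, ("sepal_width=2", 1): 30},
    features=["sepal_width=2"],
)
rule_confidence = confidence(count_ab=20, count_a=40)
rule_lift = lift(count_ab=20, count_a=40, count_b=50, total=100)
```

With these toy counts the posterior favors class 1, and a lift of 1 indicates that A and B co-occur about as often as independence would predict.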
Performance
retail.dat example – 88K transactions over 14,246 items
o Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)
o 10 ms per pattern on a 5-node cluster
6) Intervention
Automated systems are coming online for potential patients and families seeking
treatment, as well as passive intervention strategies (‘safety plans’).
What's next?
In 2013, we plan a variety of initiatives, including launching our clinical observation study, deploying Bayesian Counters on live data, and seeking approval for an automated intervention study.
• Launch Data Collection Study (CPHS #23781)… very soon
• Deployment of B-Counts on live data for live monitoring
• Intervention Research (Clinical Study Approval)
Conclusion
What is Durkheim? And what is the Bayesian Counters library?
A near real-time classification library that, while under development, you're free to use.
Hope that some help is coming to those in need…
Team
Chris Poulin, Director & Principal Investigator
Paul Thompson, Study Co-Principal Investigator
Thomas W. McAllister, M.D., Key Personnel
Ben Goertzel, Ph.D., Key Personnel
Brian Shiner, MD, Key Personnel
Craig J. Bryan, PsyD, Advisor
Linas Vepstas – Lead Machine Learning Programmer
Brian Nauheimer – Technical Project Manager
Chhean Saur – Lead Web/API Programmer
Kevin Watters – Principal Programmer, Middleware
Ken Krugler – Lead Distributed Systems Expert
Ann Marion – User Experience (UX) Design
Jane Nisselson – User Interface (UI) Design
Andrew Chen – Social Media Applications Developer
Alex Kozlov – Real-time/Distributed Classifier Development
Vivek Magotra – Cassandra Database Developer
THANK YOU
Chris Poulin, Managing Partner, Patterns and Predictions
chris@patternsandpredictions.net
Alex Kozlov, Principal Solutions Architect, Cloudera
alexvk@cloudera.com
Note: We hope that you have found this talk useful and encouraging. However, if you are having thoughts of harming yourself, please call the Veterans Crisis Line at 1-800-273-8255 or 911.
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Kürzlich hochgeladen (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Durkheim Project: Social Media Risk & Bayesian Counters

  • 1. The Durkheim Project: Social Media Risk & Bayesian Counters
     Hadoop Summit: June 27, 2013
     Chris Poulin: PATTERNS AND PREDICTIONS
     Alex Kozlov: Cloudera
     Disclaimers: This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. It is also supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center contract number N10PC20221. The opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the Space and Naval Warfare Systems Center Pacific, IARPA, DOI/NBC, or the U.S. Government.
     © 2013 Patterns and Predictions
  • 2. Speakers
     Chris Poulin
       o Principal Investigator, DARPA DCAPS - Dartmouth Suicide Prediction Team
       o Former Co-Director, Dartmouth Metalearning Working Group (Theoretical Machine Learning)
       o Artificial Intelligence Instructor, US Naval War College
       o Principal, Patterns and Predictions (linguistics and prediction of financial events)
       o … and have now read many suicide notes.
     Alex Kozlov
       o Principal Solutions Architect at Cloudera
       o Ph.D. from Stanford University
       o Data mining and statistical analysis at SGI, Hewlett-Packard
  • 3. Suicide is a hard societal problem, but why?
       o Stigma: Victims are socially outcast (i.e. disconnected).
       o Negative Topic: Intense negative emotion. And not a 'sexy' research topic by any means.
       o Freedom of Choice: Ultimately you can't stop someone from risky behaviors, or from many other activities that risk self-harm. And suicide is the ultimate act of personal risk.
       o Logistics: Even if you know what to look for, there are not enough clinicians to help the number of people suffering. Data privacy issues are as intense as, or more intense than, in banking.
       o Prediction: Accuracy (proper identification), false positives (stigmatization), false negatives (malpractice).
       o Deeper issues?: Recent growth in suicide may be related to something more systemically wrong. Suicide may be the symptom of something else going on.
  • 4. Durkheim
       o The project is named in honor of Emile Durkheim, a founding sociologist whose 1897 publication of Suicide defined early text analysis for suicide risk.
       o The team is multidisciplinary, combining artificial intelligence experts (machine learning and computational linguistics) and medical experts (psychiatrists).
       o www.durkheimproject.org
  • 5. Our Approach
     Social Problem: Opt-In is critical
       o Clear explanations for consent, no tricky EULAs
     Technical Problem: How to build a system that collects, stores, analyzes, and allows clinicians to react at Internet scale?
     Architecture:
       1) Opt-In Interface Layer
       2) Data Collection Layer
       3) Storage Layer
       4) Machine Learning, Phase I
       5) Machine Learning, Phase II
       6) Automated Intervention
  • 6. 1) Opt-In Interface Layer: We can't overemphasize the role of simplified user participation for consent, and privacy control, in our interface/interaction design.
  • 7. 2) Data Collection Layer: The social media component is handled by a content aggregator (Gigya), and populates a Cassandra database.
  • 8. Data Collection Layer, Continued: The Cassandra instances were built and maintained (by Scale Unlimited) to handle high-throughput storage. However, this is not the final destination of the data.
  • 9. 3) Storage Layer: Eventually, the data is moved to the medical center (behind a HIPAA-compliant firewall at Dartmouth). Here it persists for ongoing research.
  • 10. 4) Machine Learning, Phase I
     In 2011, we initiated a study with the U.S. Department of Veterans Affairs (VA) to study 3 cohorts of 100 subjects each (Non-Psychiatric, Psychiatric, and Suicide Positive).
       o We developed linguistics-driven prediction models to estimate the risk of suicide.
       o These models were generated from unstructured clinical notes.
       o From the clinical notes, we generated datasets of single keywords and multi-word phrases.
       o We were able to predict suicide with 65% accuracy on a small dataset.
  • 11. 5) Machine Learning, Phase II
     In 2011, we also initiated a study with Cloudera (Alex Kozlov) on a lightweight machine learning framework for detecting real-time risk at scale.
       o We wanted a clean statistical model for distributed inference (prediction).
       o We needed a more lightweight framework than Mahout.
       o We wanted to be able to trade off runtime against accuracy.
       o We wanted the prediction library to be eventually open sourced (Apache license) for the community.
     'Alpha' build at http://durkheimproject.org/bcount/ by Alex Kozlov <alexvk@cloudera.com>
  • 12. What is B-counts today? And why?
       o Distributed aggregation of user events and correlations to fit into RAM of multiple machines
       o Smart client: Moves substantial amount of logic to clients
       o Time: An explicit time dimension to support 'recency analysis'
       o Based on HBase
       o Previous analysis (Poulin) had indicated that words and correlations are a good predictor of the target variable
       o Need a faster processing/response time (response time beats accuracy of the model)
     http://www.slideshare.net/Hadoop_Summit/bayesian-counters
  • 13. Time to Answer Examples
       o Advertising: if you don't figure out what the user wants in 5 minutes, you have lost him
       o Intrusion detection: the damage may be significantly greater a few minutes after a break-in
       o Mental health risk: you need to screen before negative actions occur
     [Chart: value vs. time]
     http://cetas.net
     http://www.woopra.com
     http://www.wibidata.com/
  • 14. Solution: Time Stamped Hadoop
       o Key: subset of variables with their values + timestamp (variable length)
       o Value: count (8 bytes)
     [Diagram: an index over Key 1 to Key 4, each key paired with a Value; example query: Pr(A|B, last 20 minutes)]
     Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
     What if we want to access more recent data more often?
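The key/value layout on slide 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the B-counts wire format: the `name=value;...@timestamp` key syntax, the window names, and the helpers `encode_key`, `encode_count`, `decode_count` and `family_for_age` are all hypothetical. Only the 8-byte count and the four window lengths come from the slide.

```python
import struct

# Window lengths in seconds, as listed on the slide: 30 min, 2 hours, 24 hours, 5 days.
# The short names are made up for this sketch.
WINDOWS = [("30min", 30 * 60), ("2h", 2 * 3600), ("24h", 24 * 3600), ("5d", 5 * 86400)]


def encode_key(assignment, timestamp):
    """Serialise a subset of variables with their values plus a timestamp.

    Variables are sorted by name so the same subset always yields the same key.
    The '@timestamp' suffix is an assumed convention, not taken from the slides.
    """
    pairs = ";".join(f"{name}={value}" for name, value in sorted(assignment.items()))
    return f"{pairs}@{timestamp}".encode("utf-8")


def encode_count(count):
    """Pack a count into exactly 8 bytes (big-endian signed 64-bit integer)."""
    return struct.pack(">q", count)


def decode_count(raw):
    """Inverse of encode_count."""
    return struct.unpack(">q", raw)[0]


def family_for_age(age_seconds):
    """Return the shortest window that still covers an event of the given age."""
    for name, length in WINDOWS:
        if age_seconds <= length:
            return name
    return None  # older than the longest window


key = encode_key({"sepal_width": 2, "class": 0}, 1321038671)
value = encode_count(15)
```

Keeping the shortest windows in their own files is one way to make recent data cheaper to read, which is the question the slide ends on.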
  • 15. A Bayesian Counter, in detail
     [Diagram: layout of one counter. Labels: Counter/Table, Region (divide between), Column family, Column qualifier, File, Value (data), Version. Example entries: Iris; [sepal_width=2;class=0]; 15; 1321038671; 1321038998; 30 mins; 2 hours; …]
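To make the idea of a counter with a time dimension concrete, here is a minimal in-memory sketch of a query such as Pr(A|B, last 20 minutes). It is a stand-in for illustration only and does not use HBase; the class `TimedCounter` and its methods are hypothetical names, and it scans raw records rather than reading pre-aggregated counts as a real deployment presumably would.

```python
class TimedCounter:
    """Toy time-stamped counter: stores observed records and counts them per window."""

    def __init__(self):
        # Each entry is (timestamp, frozenset of (variable, value) pairs).
        self._records = []

    def observe(self, record, timestamp):
        """Store one observed record, e.g. {"sepal_width": 2, "class": 0}."""
        self._records.append((timestamp, frozenset(record.items())))

    def count(self, assignment, now, window):
        """Count records in [now - window, now] that contain every pair in assignment."""
        wanted = frozenset(assignment.items())
        return sum(
            1
            for ts, items in self._records
            if now - window <= ts <= now and wanted <= items
        )

    def conditional(self, a, b, now, window):
        """Estimate Pr(a | b, last `window` seconds) as count(a and b) / count(b)."""
        denominator = self.count(b, now, window)
        if denominator == 0:
            return None  # no evidence for b in this window
        return self.count({**a, **b}, now, window) / denominator


counter = TimedCounter()
counter.observe({"sepal_width": 2, "class": 0}, 1321030000)  # too old for a 20-minute window
counter.observe({"sepal_width": 2, "class": 0}, 1321038671)
counter.observe({"sepal_width": 2, "class": 0}, 1321038998)
counter.observe({"sepal_width": 3, "class": 0}, 1321038998)

now = 1321039000
p_20min = counter.conditional({"sepal_width": 2}, {"class": 0}, now, 20 * 60)
p_10sec = counter.conditional({"sepal_width": 2}, {"class": 0}, now, 10)
```

Narrowing the window changes the estimate because older records drop out, which is the 'recency analysis' the time dimension is meant to support.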
  • 17. Syntax
     nb iris class=2 sepal_length=5;petal_length=1.4 300
       o Target Variable: class=2
       o Predictors: sepal_length=5;petal_length=1.4
       o Time (seconds from now): 300
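The command on slide 17 can be split into its parts with a small parser. This is a sketch under assumptions: the slide labels only the target variable, the predictors and the time, so reading `nb` as the classifier name and `iris` as the counter table is a guess, and `Query` and `parse_query` are hypothetical names that are not part of the B-counts client.

```python
from dataclasses import dataclass


@dataclass
class Query:
    classifier: str      # assumed meaning of the first token, e.g. "nb"
    table: str           # assumed meaning of the second token, e.g. "iris"
    target: tuple        # (variable, value), values kept as strings
    predictors: dict     # variable -> value, values kept as strings
    window_seconds: int  # time, in seconds from now


def parse_query(line):
    """Split a five-field command line into a Query."""
    classifier, table, target, predictors, seconds = line.split()
    target_name, target_value = target.split("=", 1)
    predictor_map = dict(item.split("=", 1) for item in predictors.split(";"))
    return Query(classifier, table, (target_name, target_value), predictor_map, int(seconds))


query = parse_query("nb iris class=2 sepal_length=5;petal_length=1.4 300")
```

Values are left as strings on purpose, since the slide does not say how the library types them.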
  • 18. Current Classifier Support (alpha release)
       o Naïve Bayes: Pr(C|F1, F2, ..., FN) = (1/z) Pr(C) Πi Pr(Fi|C)
       o Association rules: Confidence(A -> B) = count(A and B)/count(A); Lift(A -> B) = count(A and B)/(count(A) x count(B))
       o Nearest Neighbor: P(C) for k nearest neighbors, count(C|X) = ΣXi count(C|Xi), where X1, X2, ..., XN are in the vicinity of X
       o Clique ranking: I(X;Y) = ΣΣ p(x,y) log(p(x,y)/(p(x)p(y))), where x in X and y in Y. Using random projection, this can generalize to two abstract subsets of Z
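The Naïve Bayes and association-rule formulas on slide 18 need nothing more than counts, which is why they fit a counter store. The sketch below computes them from plain dictionaries; the function names and the input layout are assumptions made for this example, no smoothing is applied, and every class count is assumed to be positive.

```python
def naive_bayes(class_counts, feature_counts, features):
    """Return Pr(C | F1..FN) for each class, computed from raw counts.

    class_counts:   {c: count(C=c)}, every count assumed > 0
    feature_counts: {(feature, c): count(feature and C=c)}
    features:       the observed features F1..FN
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # Pr(C)
        for feature in features:
            score *= feature_counts.get((feature, c), 0) / n_c  # Pr(Fi | C)
        scores[c] = score
    z = sum(scores.values())  # normaliser, the z in the formula
    if z == 0:
        return scores  # no class is compatible with the observed features
    return {c: s / z for c, s in scores.items()}


def confidence(count_ab, count_a):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    return count_ab / count_a


def lift(count_ab, count_a, count_b):
    """Lift as written on the slide.

    The textbook definition additionally multiplies by the total number of
    transactions N; the slide's version omits that factor.
    """
    return count_ab / (count_a * count_b)


posterior = naive_bayes(
    class_counts={"0": 10, "1": 10},
    feature_counts={
        ("sepal_width=2", "0"): 5,
        ("sepal_width=2", "1"): 1,
        ("petal_length=1", "0"): 4,
        ("petal_length=1", "1"): 2,
    },
    features=["sepal_width=2", "petal_length=1"],
)
```

With these made-up counts the unnormalised scores are 0.1 and 0.01, so class "0" receives 10/11 of the probability mass.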
  • 19. Performance
     retail.dat example: 88K transactions over 14,246 items
       o Mahout FPGrowth: 0.5 sec per pattern (58,623 patterns with min support 2)
       o 10 ms per pattern on a 5-node cluster
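The two per-pattern figures on slide 19 are easier to compare as totals. The arithmetic below assumes that cost scales linearly with the number of patterns and that both figures refer to the same 58,623 patterns; the slide states neither, so treat the totals as a rough reading rather than a reported result.

```python
PATTERNS = 58_623                 # patterns with min support 2, from the slide
MAHOUT_SEC_PER_PATTERN = 0.5      # Mahout FPGrowth figure from the slide
CLUSTER_SEC_PER_PATTERN = 0.010   # 10 ms figure for the 5-node cluster


def total_seconds(seconds_per_pattern, patterns=PATTERNS):
    """Total time if every pattern costs the same (an assumption)."""
    return seconds_per_pattern * patterns


mahout_total = total_seconds(MAHOUT_SEC_PER_PATTERN)    # about 8.1 hours
cluster_total = total_seconds(CLUSTER_SEC_PER_PATTERN)  # just under 10 minutes
speedup = MAHOUT_SEC_PER_PATTERN / CLUSTER_SEC_PER_PATTERN

print(f"Mahout:  {mahout_total / 3600:.1f} h")
print(f"Cluster: {cluster_total / 60:.1f} min")
print(f"Ratio:   {speedup:.0f}x")
```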
  • 20. 6) Intervention: Automated systems are coming online for potential patients and families seeking treatment, as well as passive intervention strategies ('safety plans').
  • 21. What's next?
     In 2013, we plan a variety of initiatives, including launching our clinical observation study, deploying Bayesian Counters on live data, and seeking approval for an automated intervention study.
       o Launch Data Collection Study (CPHS #23781)… very soon
       o Deployment of B-Counts on live data for live monitoring
       o Intervention Research (Clinical Study Approval)
  • 22. Conclusion
     What is Durkheim? And what is the Bayesian Counters library? A near-real-time classification library that, while still under development, you're free to use.
     We hope that some help is coming to those in need…
  • 23. Team
       o Chris Poulin, Director & Principal Investigator
       o Paul Thompson, Study Co-Principal Investigator
       o Thomas W. McAllister, M.D., Key Personnel
       o Ben Goertzel, Ph.D., Key Personnel
       o Brian Shiner, MD, Key Personnel
       o Craig J. Bryan, PsyD, Advisor
       o Linas Vepstas, Lead Machine Learning Programmer
       o Brian Nauheimer, Technical Project Manager
       o Chhean Saur, Lead Web/API Programmer
       o Kevin Watters, Principal Programmer, Middleware
       o Ken Krugler, Lead Distributed Systems Expert
       o Ann Marion, User Experience (UX) Design
       o Jane Nisselson, User Interface (UI) Design
       o Andrew Chen, Social Media Applications Developer
       o Alex Kozlov, Real-time/Distributed Classifier Development
       o Vivek Magotra, Cassandra Database Developer
  • 24. THANK YOU
     Chris Poulin, Managing Partner, Patterns and Predictions, chris@patternsandpredictions.net
     Alex Kozlov, Principal Solutions Architect, Cloudera, alexvk@cloudera.com
     Note: We hope that you have found this talk useful and encouraging. However, if you are having thoughts of harming yourself, please call the Veterans Crisis Line at 1-800-273-8255 or 911.
     © 2013 Patterns and Predictions

Editor's Notes

  1. CPU power has reached its limit (in the end, the speed of light is finite). Combining storage or processing capabilities across a distributed system of machines is non-trivial. RAM is faster than disks (RAM: ns, disk: ms). There are 1,832,160 feet in 347 miles. A disk moves at 50 m/s vs. 300,000,000 m/s. Can we do at least 1,000 feet (300 m)? Network? There is no "virtual memory".
  2. If we had all the time (the universe is projected to last less than 1000 trillion years) we could (probably) get the exact answer. Some analytical companies: http://cetas.net/ (acquired by VMware); http://www.woopra.com (analyzes traffic to a website in real time); http://www.wibidata.com/ (our friends).
  3. More recent column families are accessed more often