Understanding users’ latent intents behind search queries is essential for satisfying their search needs. Search intent mining can help search engines enhance their ranking of search results and enable new search features such as instant answers, personalization, search result diversification, and the recommendation of more relevant ads. Consequently, there has been increasing attention on how to mine search intents effectively by analyzing search engine query logs. While state-of-the-art techniques can identify the domain of a query (e.g., sports, movies, health), identifying domain-specific intent is still an open problem. Among all the topics available on the Internet, health is one of the most important in terms of impact on the user, and it is one of the most frequently searched areas. This dissertation presents a knowledge-driven approach for domain-specific search intent mining with a focus on health-related search queries.
First, we identified 14 consumer-oriented health search intent classes based on inputs from focus group studies, analyses of popular health websites, literature surveys, and an empirical study of search queries. We defined the problem of classifying millions of health search queries into zero or more intent classes as a multi-label classification problem. Popular machine learning approaches for multi-label classification tasks (namely, problem transformation and algorithm adaptation methods) were not feasible because of the difficulty of creating labeled data and because of health domain constraints. Another challenge in solving the search intent identification problem was mapping terms used by laypeople to medical terms. To address these challenges, we developed a semantics-driven, rule-based search intent mining approach leveraging rich background knowledge encoded in the Unified Medical Language System (UMLS) and a crowd-sourced encyclopedia (Wikipedia). The approach can identify search intent in a disease-agnostic manner and has been evaluated on three major diseases.
While users often turn to search engines to learn about health conditions, a surprising amount of health information is also shared and consumed via social media, including public platforms such as Twitter. Although Twitter is an excellent information source, identifying informative tweets in the deluge of tweets is a major challenge. We used a hybrid approach consisting of supervised machine learning, rule-based classifiers, and biomedical domain knowledge to facilitate the retrieval of relevant and reliable health information shared on Twitter in real time. Furthermore, we extended our search intent mining algorithm to classify health-related tweets into health categories. Finally, we performed a large-scale study to compare health search intents, and the features that contribute to the expression of search intent, using 100+ million search queries from smart devices (smartphones/tablets) and personal computers (desktops/laptops).
6. Web Search Intent
6
Search intent is a significant object/topic that
represents an abstraction of users’ information needs.
Search Goals*
Search Topics
Why
What
7. 7
Search Intent
Mining
Search
Goals
Search Topics
Session data, Click-through, Query log
Manual
Unsupervised
Supervised
Ontology-based
Knowledge
driven
My Work
Related Work
Broder 2002
Beeferma 2003
Rose and
Levinson 2004
Baeza-Yates 2006
Hu et al. 2009
Sadikov 2010
Nanda 2014
Ustinovskiy 2013
White 2010
Joachims 2002
Lee 2005
Fujita 2010
Hu 2012
Broder 2007
Radlinski 2010
Celikyilmaz 2011
Shen 2006
Biomedical KB – UMLS
Crowd-sourced KB – Wikipedia
Dictionaries – Hunspell,
OpenMedspell
Techniques
8. 8
Search is shifting toward understanding intent
and serving objects
-Li et al., ACL, 2010
10. 10
Health Movie Sports Technology Physics
Health
Diseases
Symptoms
Causes
Medications
Treatments
Prevention
11.
Web Search for Health Information
Among all topics available on the
Internet, health is one of the most
important in terms of impact on the user
11
12. • Major Challenges
Ø Consumers’ lack of
medical knowledge to
formulate health search
queries
Ø Search engines’ failure to
understand users’ health
search intents
12
Challenges in Health Information Search
• Health information search is a “trial-and-error” process.
13. • Health search intent mining applications:
– Personalized health information interventions
– To get better understanding of consumers’ health
information needs
– Targeted advertisements
Motivation: Real-world Applications
13
Research Problem: Domain specific search intent mining
14. 14
Thesis Statement
Rich background knowledge from biomedical knowledge
bases and Wikipedia enables development of effective
methods for:
I. Intent mining from health-related search queries in a
disease-agnostic manner
II. Efficient browsing of informative health information
shared on social media.
15. • Focus: Consumer-oriented health search intent
• Challenge: No standardized list of consumer-oriented health
intent classes
• Approach:
– Qualitative study (published in JMIR, impact factor 4.7)
Health Search Intent
15
Three focus groups
Study questions:
• Motivation for using the internet for health information seeking
• What do they search? (search intent)
• How do they search?
• What are the challenges in the search?
17. • Focus: Consumer-oriented health search intent
• Challenge: No standardized list of consumer-oriented health
intent classes
• Approach:
– Qualitative study (published in JMIR, impact factor 4.7)
– Health categories on popular health websites
– Review of online health information seeking literature
– Empirical data analysis
The intent classes and the classification scheme were reviewed and
validated by Mayo Clinic clinicians and domain experts
Health Search Intent
17
Selection criteria:
• Google PageRank, Alexa ranking,
• Medical Library Association’s ranking (CAPHIS - Consumer
and Patient Health Information Section)
Selected websites:
Mayo Clinic, WebMD, MedlinePlus, CDC, HealthFinder.gov,
and Familydoctor.org.
No.  Intent class            No.  Intent class
1    Symptoms                8    Living with
2    Causes                  9    Prevention
3    Risks & Complications   10   Side effects
4    Drugs and Medications   11   Medical devices
5    Treatments              12   Diseases and conditions
6    Tests and Diagnosis     13   Age-group References
7    Food and Diet           14   Vital signs
21. • Allows the instances to be associated with more than one
class
• Problem transformation methods (fit data to algorithm)
– Transform the multi-label classification problem into one or
more single-label classification problems.
– e.g., Binary Relevance, Label Powerset, and RAkEL (RAndom k-labELsets)
• Algorithm adaptation methods (fit algorithm to data)
– Extend specific learning algorithms in order to handle multi-label
data directly.
– e.g., Tree-based boosting - AdaBoost.MR, ML-kNN, and Rank-SVM
21
Multi-label Classification
Both these methods follow underlying principles of the
supervised learning approach and depend on training data.
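To make the problem transformation idea concrete, here is a minimal sketch of Binary Relevance in plain Python. It only shows how one multi-label dataset is split into one binary dataset per label; the example queries, their labels, and the function name `binary_relevance` are illustrative assumptions, not data or code from this work.

```python
from typing import Dict, List, Set, Tuple


def binary_relevance(
    examples: List[Tuple[str, Set[str]]], labels: List[str]
) -> Dict[str, List[Tuple[str, int]]]:
    """Split one multi-label dataset into one binary dataset per label."""
    return {
        label: [(text, int(label in assigned)) for text, assigned in examples]
        for label in labels
    }


# Toy queries with hand-assigned intent classes (illustrative only).
examples = [
    ("medications for pulmonary hypertension",
     {"Drugs and Medications", "Diseases and conditions"}),
    ("chest pain after eating", {"Symptoms", "Food and Diet"}),
    ("weather tomorrow", set()),  # a query with zero intent classes
]
labels = ["Symptoms", "Drugs and Medications",
          "Diseases and conditions", "Food and Diet"]

per_label = binary_relevance(examples, labels)
```

Each per-label dataset would then be handed to an ordinary binary classifier, which is why this family of methods still depends on labeled training data for every one of the 14 classes.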
22. • Manual, time consuming and labor intensive process
• May require domain experts
• Limited coverage
– Training data should be a representative sample of the dataset
– Very difficult to create a training dataset that can cover all
aspects (discriminative features) of the dataset
• Generalization problem
– Poor performance on unseen data
Challenges with Training Data Creation
22
These challenges get amplified for multi-label
classification problems
23. In the context of health search intent mining problem
• Training data for 14 intent classes
• Need domain experts to label dataset
Supervised Classification Limitations
23
Domain constraint: A classifier trained for one
disease may not work for other diseases
These challenges make supervised learning-
based approaches infeasible for our problem
25. 25
Knowledge Driven Approach
Machine Processable Knowledge
Ontologies
Taxonomies
Dictionaries
Knowledge-
bases
Ontology
Timeframe: early 2000s
First patent on Semantic Web
More information at blog
28. Unified Medical Language System
• UMLS (Unified Medical Language System)
– Collection of over 100 controlled vocabularies such as
MeSH, SNOMED_CT, NCI, and RxNorm
Biomedical Knowledge Base
28
Metathesaurus
Collection of
concepts
Semantic Network
Semantic Types and
Semantic Relationships
SPECIALIST Lexicon
Biomedical terms and
their variants
29. • Concept identification consists of two primary tasks:
– Concept recognition and concept mapping
– Example : what are the medications for stomach pain?
Concepts: medication, stomach pain
Challenges
• Lexical or orthographic variants e.g., (diet, dieting), (ICD9, ICD-9)
• Misspelling, e.g., (pneumonia, neumonia)
• Synonyms, e.g., (heart attack, myocardial infarction)
• Abbreviations, e.g., (myocardial infarction, MI)
• Identifying concept boundary e.g., (pain in stomach, stomach pain)
• Contextual meanings, e.g., (discharge from hospital, discharge from
wound)
Concept Identification
29
30. • Medical concept identification tools
– UMLS MetaMap, cTAKES, MedLEE, NCBO Annotator
• UMLS MetaMap
– Identifies UMLS Metathesaurus concepts from text
– Semantic Type (e.g., disease or syndrome)
– UMLS Concept (e.g., blood pressure and heart rate)
• Example (UMLS Concept) [Semantic Type]
– Phrase query: red wine heart attack
• Red wine (Red wine) [Food]
• Heart Attack (Myocardial Infarction) [Disease or Syndrome]
30
Concept Identification
31. • Phrase query: water on the brain
– Water (Drinking Water) [Substance]
– Brain (Brain) [Body Part, Organ, or Organ Component]
• Actual Mapping should be
– Water on the brain (Hydrocephalus) [Disease or
Syndrome]
Concept Identification Challenges
31
32. Concept Identification Approach
32
• Advanced text analytics
– Word Sense Disambiguation (WSD)
• Process of identifying the meaning of a term in context
• With WSD, concepts are identified by
considering the surrounding text
– Maximal phrase detection
• Process each input record as a single phrase in order to
identify more complex Metathesaurus terms
• Consumer Health Vocabulary (CHV)
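The effect of maximal phrase detection can be sketched as a greedy longest-match lookup. This is only an illustration of the idea, not how MetaMap is implemented: the `CONCEPTS` table is a hypothetical stand-in for a Metathesaurus lookup, and `longest_match` is a name introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase-to-concept table; a real system would query the
# UMLS Metathesaurus instead of a hard-coded dictionary.
CONCEPTS: Dict[str, str] = {
    "water": "Drinking Water",
    "brain": "Brain",
    "water on the brain": "Hydrocephalus",
    "red wine": "Red wine",
    "heart attack": "Myocardial Infarction",
}


def longest_match(query: str,
                  concepts: Dict[str, str] = CONCEPTS) -> List[Tuple[str, str]]:
    """Map a query to concepts, preferring the longest matching phrase."""
    tokens = query.lower().split()
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at token i first, then shrink it.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in concepts:
                found.append((phrase, concepts[phrase]))
                i = j  # continue after the matched phrase
                break
        else:
            i += 1  # no concept starts at this token
    return found
```

With this table, `longest_match("water on the brain")` yields the single concept Hydrocephalus rather than the separate concepts Drinking Water and Brain, because the whole query is tried as one phrase before any shorter span.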
33. • Consumer Health Vocabulary (CHV)
– Maps terms used by laypeople to medical terms
– E.g. hair loss => Alopecia
• Problem: CHV in UMLS is incomplete
• Example: water on the knee
Water thick-knee (Burhinus vermiculatus) [Bird]
• Actual Mapping should be
– Water on the knee (Knee effusion) [Disease or
Syndrome]
Consumer Health Vocabulary
33
Major challenge for health search intent mining problem
35. • Traditional approach
– Identification of consumer-oriented terms from Medline search
log, PatientsLikeMe forum data
– Manual review by healthcare professionals
Approach: leverage knowledge from Wikipedia
• One of the most-used online health resources
• Continuously updated with emerging health terms
• Links consumer-oriented terms with health
professionals’ terms using semantic relationships
Consumer Health Vocabulary Generation
35
39. • Wikipedia: Crowd-sourced encyclopedia
Consumer Health Vocabulary Generation
39
Step 1: Identification of health-related Wikipedia articles
Health Category -> Candidate subcategories -> Articles tagged with candidate subcategories -> Health-related Wikipedia articles
40. Consumer Health Vocabulary Generation
Step 2: Extraction of candidate pairs
Snippet 1: Hair loss, also known as alopecia or baldness,
refers to a loss of hair from the head or body.
Snippet 2: Knee effusion or swelling of the knee (colloquially
known as water on the knee) occurs when excess synovial
fluid accumulates in or around the knee joint.
42. 42
Consumer Health Vocabulary Generation
Step 2: Extraction of candidate pairs
Pair  Term                  Semantic relationship   Term
1     hair loss             also known as           alopecia
2     hair loss             also known as           baldness
3     knee effusion         colloquially known as   water on the knee
4     swelling of the knee  colloquially known as   water on the knee
5     knee effusion         same as                 swelling of the knee
43. 43
Consumer Health Vocabulary Generation
Step 2: Extraction of candidate pairs
Wikipedia Patterns
also called              commonly called              colloquially known as
also known as            commonly known as            sometimes called
also referred to as      commonly termed              sometimes known as
also termed              previously known as          sometimes termed
commonly referred to as  colloquially referred to as  sometimes referred to as
Pattern-based information extractor
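A pattern-based extractor of this kind can be sketched with a regular expression. This is a minimal illustration under simplifying assumptions, not the extractor used in this work: it only handles sentences where the pattern follows the first comma or opening parenthesis, it treats "or" as the only term separator, and the names `extract_pairs` and `_split_terms` are introduced here.

```python
import re
from itertools import product
from typing import List, Tuple

PATTERNS = [
    "also called", "also known as", "also referred to as", "also termed",
    "commonly called", "commonly known as", "commonly termed",
    "commonly referred to as", "colloquially known as",
    "colloquially referred to as", "previously known as",
    "sometimes called", "sometimes known as", "sometimes termed",
    "sometimes referred to as",
]

# Longest patterns first so the alternation never stops at a shorter one.
_ALTERNATION = "|".join(
    re.escape(p) for p in sorted(PATTERNS, key=len, reverse=True)
)
_SENTENCE = re.compile(
    rf"^(?P<left>[^,(]+?)\s*[,(]\s*(?P<pattern>{_ALTERNATION})\s+"
    rf"(?P<right>[^,)]+)[,)]",
    re.IGNORECASE,
)


def _split_terms(text: str) -> List[str]:
    """Split 'a or b' into lower-cased terms."""
    return [t.strip().lower() for t in re.split(r"\s+or\s+", text) if t.strip()]


def extract_pairs(sentence: str) -> List[Tuple[str, str, str]]:
    """Return (term, relationship, term) triples found in one sentence."""
    match = _SENTENCE.match(sentence.strip())
    if not match:
        return []
    relationship = match.group("pattern").lower()
    lefts = _split_terms(match.group("left"))
    rights = _split_terms(match.group("right"))
    return [(left, relationship, right) for left, right in product(lefts, rights)]
```

Applied to the two snippets, this sketch reproduces pairs 1 to 4 of the candidate-pair table. Pair 5 ("same as" between the two terms joined by "or") would need an additional rule that is not shown here.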
44. 44
Consumer Health Vocabulary Generation
Step 3: Identification of CHV and medical terms from the
candidate pairs
Map terms from the candidate pairs to UMLS Metathesaurus
using MetaMap
• Scenario 1:
- Both terms are present in the UMLS Metathesaurus
- e.g., {hair loss, alopecia}
• Scenario 2:
- Neither term is present in the UMLS Metathesaurus
- e.g., {hospital trust, acute trust}
• Scenario 3:
- Only one term is present in the UMLS Metathesaurus
- e.g., {knee effusion, water on the knee}
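The three scenarios can be expressed as a small lookup. The sketch below is hedged: `IN_UMLS` is a hypothetical set standing in for "MetaMap found this term in the Metathesaurus", and treating "knee effusion" as the term that is present in Scenario 3 is an assumption based on the earlier "water on the knee" example.

```python
from typing import Set, Tuple

# Hypothetical stand-in for a Metathesaurus lookup through MetaMap.
IN_UMLS: Set[str] = {"hair loss", "alopecia", "knee effusion"}


def scenario(pair: Tuple[str, str], in_umls: Set[str] = IN_UMLS) -> int:
    """Return 1, 2, or 3 according to how many terms of the pair are found."""
    hits = sum(term in in_umls for term in pair)
    # 2 hits -> Scenario 1, 0 hits -> Scenario 2, 1 hit -> Scenario 3
    return {2: 1, 0: 2, 1: 3}[hits]
```

Scenario 3 is the case of interest for extending the vocabulary, since one term is already anchored to a concept while the other is not yet known.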
45. • Data:
– Cardiovascular disease (CVD) related search queries
– Limited to the United States
• Data timeframe:
– September 2011 to August 2013
• Data collection tool:
– IBM NetInsight On Demand
(Web Analytics tool)
• Dataset size:
– 10.4 million CVD related search queries
– A very large dataset for a
single class of diseases.
Dataset
46. • Preprocessing
– Stop word removal
– Misspelling correction (using Hunspell spell checker)
• Dictionaries: Hunspell dictionary, and its medical version,
OpenMedSpell
– Replace all CHV terms from the search queries with medical
terms
• UMLS MetaMap
– Usage challenge: very slow for millions of search queries
Data Processing
46
Solution: Developed a scalable MetaMap implementation
using a Hadoop-MapReduce framework
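The preprocessing steps listed above can be sketched as a per-query function. This is a simplified stand-in, not the pipeline used in this work: `difflib` from the Python standard library replaces the Hunspell spell checker, and the stop-word list, vocabulary, and CHV table are tiny hypothetical examples.

```python
import difflib
from typing import Dict, List

# Tiny illustrative resources; the real pipeline uses the Hunspell and
# OpenMedSpell dictionaries and a much larger CHV mapping.
STOP_WORDS = {"what", "are", "the", "for", "a", "of", "is"}
VOCABULARY: List[str] = ["pneumonia", "cholesterol", "symptoms",
                         "hair", "loss", "treatment"]
CHV_TO_MEDICAL: Dict[str, str] = {"hair loss": "alopecia"}


def correct_spelling(token: str, vocabulary: List[str] = VOCABULARY) -> str:
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token in vocabulary:
        return token
    close = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.85)
    return close[0] if close else token


def preprocess(query: str) -> str:
    """Stop-word removal, spelling correction, then CHV-to-medical replacement."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    text = " ".join(correct_spelling(t) for t in tokens)
    # Naive substring replacement; good enough for a sketch.
    for lay_term, medical_term in CHV_TO_MEDICAL.items():
        text = text.replace(lay_term, medical_term)
    return text
```

Because `preprocess` looks at one query at a time and keeps no state between calls, the same logic can be applied to queries in parallel, which is the property a MapReduce-style deployment relies on.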
48. • Gold standard dataset
– Two domain experts annotated randomly selected search queries
by labeling one search query with zero or more intent classes
– Gold standard dataset is further divided into training and testing
• Evaluation metrics
– Macro-averaged precision and recall
– Average of the precision and recall of the classification algorithm
on different classes
– Used to assess classification performance at the class level
48
Evaluation
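Macro-averaged precision and recall can be computed as in the sketch below. It assumes gold and predicted labels are given as one set per query, and it scores a class with no predictions (or no gold instances) as 0.0, which is one common convention rather than something stated in the slides. The function names are introduced here.

```python
from typing import List, Set, Tuple


def macro_precision_recall(
    gold: List[Set[str]], predicted: List[Set[str]], classes: List[str]
) -> Tuple[float, float]:
    """Average per-class precision and recall over all classes."""
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, predicted) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, predicted) if c in g and c not in p)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with two classes and three queries (illustrative only).
gold = [{"Symptoms"}, {"Symptoms", "Causes"}, set()]
predicted = [{"Symptoms"}, {"Causes"}, {"Causes"}]
macro_p, macro_r = macro_precision_recall(gold, predicted, ["Symptoms", "Causes"])
```

In the toy example, Symptoms has precision 1.0 and recall 0.5, Causes has precision 0.5 and recall 1.0, so both macro averages are 0.75. Averaging per class gives rare classes the same weight as frequent ones.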
50. 50
Classification : Evaluation Results
Rules Precision Recall F1 Score
ST (baseline approach) 0.5432 0.6203 0.5791
ST+SC 0.6534 0.6822 0.6674
ST+SC+KW 0.6722 0.6923 0.6821
ST+SC+KW-ST* 0.7383 0.7344 0.7363
ST+SC+KW-ST*-SC* 0.7601 0.7930 0.7762
ST+SC+KW-ST*-SC*+AdvTA 0.8539 0.8382 0.8459
ST+SC+KW-ST*-SC*+AdvTA+CHV 0.8842 0.8607 0.8723
ST = Semantic type; SC = Semantic (UMLS) concepts; KW = Keyword;
AdvTA = Advanced Text Analytics; CHV = Consumer Health Vocabulary
For Drug and medication Intent Class
Correctly classified Wrongly classified
• ibuprofen heart rate
• dextromethorphan blood
pressure
• medications for pulmonary
hypertension
• alcohol heart disease
• meds for acid reflux
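To illustrate how semantic-type (ST) and keyword (KW) rules can combine into a multi-label decision, here is a minimal sketch. The two rule tables are hypothetical examples and not the rule set that was evaluated; the semantic type names are real UMLS semantic types, and the annotations are assumed to come from a concept identification step such as MetaMap.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical rules for illustration only.
SEMANTIC_TYPE_TO_INTENT: Dict[str, str] = {
    "Disease or Syndrome": "Diseases and conditions",
    "Sign or Symptom": "Symptoms",
    "Pharmacologic Substance": "Drugs and Medications",
}
KEYWORD_TO_INTENT: Dict[str, str] = {
    "medications": "Drugs and Medications",
    "meds": "Drugs and Medications",
    "symptoms": "Symptoms",
    "prevent": "Prevention",
}


def classify(query: str, annotations: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return zero or more intent classes for one query.

    annotations: (concept, semantic type) pairs produced upstream.
    """
    intents: Set[str] = set()
    for _concept, semantic_type in annotations:
        intent = SEMANTIC_TYPE_TO_INTENT.get(semantic_type)
        if intent:
            intents.add(intent)
    for token in query.lower().split():
        intent = KEYWORD_TO_INTENT.get(token)
        if intent:
            intents.add(intent)
    return intents
```

Because the rules key on semantic types rather than on specific disease names, the same tables apply to any disease, which is the sense in which a rule-based approach can be disease-agnostic. A query that triggers no rule simply receives the empty set.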
52. 52
For Drug and medication Intent Class
Correctly classified:
• ibuprofen heart rate
• dextromethorphan blood pressure
• medications for pulmonary hypertension
• meds for acid reflux
Wrongly classified:
• alcohol heart disease
53. 53
For Drug and medication Intent Class
Correctly classified
• ibuprofen heart rate
• meds for acid reflux
• alcohol heart disease
• medications for pulmonary
hypertension
• dextromethorphan blood pressure
54. 54
• Phrase query: water on the brain
– Water (Drinking Water) [Substance]
– Brain (Brain) [Body Part, Organ, or Organ Component]
• Actual Mapping should be
– Water on the brain (Hydrocephalus) [Disease or Syndrome]
• Advanced Text Analytics
– Word sense disambiguation, maximal phrase detection, CHV from
UMLS
55. 55
• Generating CHV from Wikipedia
• Example: water on the knee
Water thick-knee (Burhinus vermiculatus) [Bird]
• Actual Mapping should be
– Water on the knee (Knee effusion) [Disease or Syndrome]
56. • Macro Average
– Precision: 0.8842, Recall: 0.8607, and F-Score: 0.8723
56
Classification : Evaluation Results
To check the performance of the classification approach for
individual intent classes
57. Classification: Results
No.  Intent class              Total queries   Percentage of queries
1    Diseases                  4,232,398       40.66
2    Vital signs               3,455,809       33.20
3    Symptoms                  1,422,826       13.67
4    Living with               1,178,756       11.32
5    Treatments                955,701         9.18
6    Food and Diet             779,949         7.49
7    Med Devices               665,484         6.39
8    Drugs and Medications     603,905         5.80
9    Causes                    599,895         5.76
10   Tests & Diagnosis         344,747         3.31
11   Risks and Complication    277,294         2.66
12   Prevention                136,428         1.31
13   Age-group References      87,929          0.84
14   Side effects              25,655          0.25
     Total                     14,766,776      141.87
The percentages sum to more than 100 because a query can be classified into more than one intent class.
57
Classification: Results
58. Distribution of search queries by number of intent
classes in which they are classified
0 classes: 8%
1 class: 48%
2 classes: 40%
3 classes: 4%
4 and more classes: 0%
58
Classification: Results
61. 61
• Hello,
For the past 10 hours I've been expierencing a semi sharp pain in
my upper right chest just below my armpit. This pain appears
anywhere from every two and a half minutes to ten or fifteen
minutes. I also have some stomach ache and dry mouth. I monitor
my blood pressure is averages 130/90 with a average heart rate of
80. My cardiologist has been treating me since 1 year for high
colesterol, gout and hypertension with great success. Also I have
diabetes and I am taking Metformin and mevacor. I have an
appointment with my cardiologist after 2 weeks. However I am
wondering should I go to ER? BTW I am 69 years old male.
Scenario in Clinical Decision Support System
Source: DailyStrength forum
62. 62
Misspellings
expierencing => experiencing
colesterol => cholesterol
Consumer Health Vocabulary
dry mouth => Xerostomia
Symptom
chest pain
stomach ache
Xerostomia (dry mouth)
Demographic Information
Age: 69
Gender: Male
Drugs and Medication
Metformin
Mevacor
Diseases and Conditions
Gout
Hypertension
Diabetes
Vital Signs
Blood pressure: 130/90
Heart rate: 80
63. • Primary Symptom
– Chest pain
• upper side
• Right side
• Other symptoms
– Stomach ache
– Dry mouth
• Current diseases
– Hypertension
– Gout
– Diabetes
• Vital Signs
– Blood pressure = normal
– Heart rate = normal
63
1. Digestion-Related Causes
2. Cardiovascular Problems
3. Viral Infections
4. Gallbladder Infection
5. Pancreas Inflammation
6. Liver Inflammation
7. Pleurisy
8. Lung Diseases
Symptoms for CVD
65. 65
Thesis Statement
Rich background knowledge from biomedical knowledge
bases and Wikipedia enables development of effective
methods for:
I. Intent mining from health-related search queries in a
disease-agnostic manner
II. Efficient browsing of informative health information
shared on social media.
66. • Intentional information seeking
– Web search
• Accidental information discovery
66
Information Acquisition
NASA’s Curiosity Rover on Mars
Accidentally bumping into (useful or
personal interest related) information
67. • In many cases, the phenomenon of accidental information
discovery is facilitated by users’ prior actions – serendipity
• Currently Twitter has thousands of health-centric accounts,
which are followed by millions of users to keep up with health
information
67
Health Information Acquisition
68. • Every day, millions of tweets are shared
• Most of these tweets are highly personal
and contextual
• Only around 12% of posts are informative
• Users have to identify informative
tweets manually
68
Research Problem: How to automate
the identification of signals (informative
tweets) from noise (Twitter stream)
Information Overload on Twitter
69. • Informativeness of a tweet
depends upon reader’s
– Intent
– Knowledge about the information in the
tweet or novelty in the information
– Interest in the subject
– Who is the author (expert in a domain,
personal connection)
69
Informativeness of a Tweet is Subjective
Objectively what makes a tweet informative?
76. 76
Search and Explore
X Controls Cancer
X = diet, treatment, exercise
(Pattern-based Approach
leveraging domain
semantics)
Top Health News
Faceted search (based on intent
classification algorithm)
Learn about disease
Source: Mayo Clinic
Navigation tabs: Home, Search & Explore, Top Health News, Tweet Traffic, Learn about Disease
79. 79
Twitris: Social Media Analytics Platform
• Core component of more than $6 million in research funding
(NSF, NIH, AFRL)
80. • NIH-R01 proposal (Mayo Clinic and Kno.e.sis, Wright State) ($2 Million)
– Modeling Social Behavior for Healthcare Utilization and Outcomes in Depression
• Air Force Research Lab (AFRL)
– Geo-Social mash-up for situational awareness in a disaster response situation
• Funded project: 2010-2011, Real-time Twitris
– Social media analysis for situational awareness (Funded: 2011-2012)
– WBI's Tec^Edge Innovation and Collaboration Center (Tec^Edge ICC)
• Funded project: Summer 2010, Summer 2011
• Mayo Clinic Meritorious Award
– Healthcare trend surveillance using social networks and health search queries
(funded 2013)
– What makes a health-related tweet informative (funded 2014)
Research Grants and Proposals
80
83. 83
Conclusion
Health Search Intent Mining
Identified consumer-
oriented intent classes
Multi-label Classification
Problem (L=14)
Supervised ML Knowledge-driven Approach
84. Semantics-based
Intent Classification
- Based on UMLS
semantic types and
concepts
- Advanced text analytics
- Consumer Health
Vocabulary
Consumer Health
Vocabulary
Generation
- Leveraged
Knowledge from
Wikipedia
- Maps CHV terms to
medical terms
84
Conclusion
Knowledge Driven Approach for Health Search Intent Mining
Concept
Identification
- UMLS MetaMap
- Advanced text
analytics
- Consumer Health
Vocabulary
Personalized eHealth Interventions
85. 85
Conclusion
Information overload
on Twitter
Subjectivity
Adapted search intent mining algorithm to
enable efficient browsing of the health
information on Social Health Signals
Objectively what makes a tweet informative?
86. Publications
• Analysis of Online Information Searching for Cardiovascular Diseases on a Consumer
Health Information Portal A Jadhav et al. AMIA Annual Symposium 2014
• Comparative Analysis of Online Health Queries Originating From Personal Computers and
Smart Devices on a Consumer Health Information Portal A Jadhav et al. Journal of
Medical Internet Research JMIR (Impact factor 4.7)
• Evaluating the Process of Online Health Information Searching: A Qualitative Approach to
Exploring Consumer Perspectives A Fiksdal, A Kumbamu, A Jadhav et al. Journal of
Medical Internet Research JMIR (Impact factor 4.7)
• Online Information Seeking for Cardiovascular Diseases: A Case Study from Mayo Clinic
A Jadhav et al. 25th European Medical Informatics Conference (MIE 2014)
• Empowering Personalized Medicine with Big Data and Semantic Web Technology:
Promises, Challenges, Pitfalls, and Use Cases M Panahiazar, V Taslimi, A Jadhav et al.
IEEE International Conference on Big Data (IEEE BigData 2014)
• Comparative Analysis of Online Health Information Search by Device Type A Jadhav et
al. AMIA TBI/CRI 2014
• An Analysis of Mayo Clinic Search Query Logs for Cardiovascular Diseases A Jadhav et
al. AMIA Annual Symposium 2014
• What Information about Cardiovascular Diseases do People Search Online? A Jadhav et
al. 25th European Medical Informatics Conference (MIE 2014)
86
87. Publications
87
• Twitris- a System for Collective Social Intelligence A Sheth, A Jadhav et al., Springer,
Encyclopedia of Social Network Analysis and Mining (ESNAM), 2014
• Twitris: Socially Influenced Browsing A Jadhav et al. Semantic Web Challenge,
International Semantic Web Conference ISWC 2009
• Twitris 2.0: Semantically Empowered System for Understanding Perceptions From
Social Data A Jadhav et al. Semantic Web Challenge, International Semantic Web
Conference ISWC 2010
• Spatio-Temporal-Thematic Analysis of Citizen-Sensor Data - Challenges and
Experiences M Nagarajan, K Gomadam, A Sheth, A Ranabahu, R Mutharaju, A Jadhav
Web Information Systems Engineering (WISE 2009)
• Understanding Events Through Analysis Of Social Media A Sheth, H Purohit, A Jadhav,
et al., Technical Report, Kno.e.sis Center, 2010
• Twitris+: Social Media Analytics Platform for Effective Coordination A. Smith, A.
Sheth, A. Jadhav, et al. NSF SoCS Symposium, 2012
• Patent on Context-Aware Information Recommendation, filed in January 2013
– Patent filed based on HP summer 2011 internship work
– Ashutosh Jadhav, Hamid Motahari, Susan Spence, Claudio Bartolini
88. • Shen, D., Pan, R., Sun, J.-T., Pan, J. J., Wu, K., Yin, J., and Yang, Q. 2006. Query enrichment for
web-query classification. ACM Transactions on Information Systems (TOIS) 24, 3, 320-352.
• Shen, D., Sun, J.-T., Yang, Q., and Chen, Z. 2006. Building bridges for web query classification.
In Proceedings of the 29th annual international ACM SIGIR conference on Research and
development in information retrieval. ACM, 131-138.
• Sadikov, E., Madhavan, J., Wang, L., and Halevy, A. 2010. Clustering query refinements by user
intent. In Proceedings of the 19th international conference on World wide web. ACM, 841-850.
• Radlinski, F., Szummer, M., and Craswell, N. 2010. Inferring query intent from reformulations and
clicks. In Proceedings of the 19th international conference on World wide web. ACM, 1171-1172.
• Rose, D. E. and Levinson, D. 2004. Understanding user goals in web search. In Proceedings of
the 13th international conference on World Wide Web. ACM, 13-19.
• Nanda, A., Omanwar, R., and Deshpande, B. 2014. Implicitly learning a user interest profile for
personalization of web search using collaborative filtering. In Web Intelligence (WI) and Intelligent
Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on. Vol. 2. IEEE
• Soni, S. 2015. Domain specific document retrieval framework on near real-time social health data.
Thesis, Wright State University
• Naaman, M., Boase, J., and Lai, C.-H. 2010. Is it really about me?: message content in social
awareness streams. In Proceedings of the 2010 ACM conference on Computer supported
cooperative work. ACM, 189-192.
• White, R. W. and Horvitz, E. 2014. From health search to healthcare: explorations of intention
and utilization via query logs and user surveys. JAMIA
• Celikyilmaz, A., Hakkani-Tür, D., and Tür, G. 2011. Leveraging web query logs to learn user
intent via bayesian discrete latent variable model. In Proceedings of ICML.
• Amit Sheth 15 years of Semantic Search and Ontology-enabled Semantic Applications
References
89. • Sheth A, Avant D, Bertram C, inventors; Taalee, Inc., assignee. System and method for creating a
semantic web and its applications in browsing, searching, profiling, personalization and
advertising. United States patent US 6,311,194. 2001 Oct 30.
• Lu, C.-J. 2012. Accidental discovery of information on the user-defined social web: A mixed-
method study. Ph.D. thesis, University of Pittsburgh.
• Li, X. 2010. Understanding the semantic structure of noun phrase queries. In Proceedings of the
48th Annual Meeting of the Association for Computational Linguistics. Association for
Computational Linguistics, 1337-1345.
• Keselman, A., Smith, C. A., Divita, G., Kim, H., Browne, A. C., Leroy, G., and Zeng- Treitler, Q.
2008. Consumer health concepts that do not map to the umls: where do they fit? Journal of the
American Medical Informatics Association 15, 4, 496-505.
• Hu, J., Wang, G., Lochovsky, F., Sun, J.-t., and Chen, Z. 2009. Understanding user's query intent
with wikipedia. In Proceedings of the 18th international conference on World wide web. ACM,
• Hu, Y., Qian, Y., Li, H., Jiang, D., Pei, J., and Zheng, Q. 2012. Mining query subtopics from
search log data. In Proceedings of the 35th international ACM SIGIR conference on Research
and development in information retrieval. ACM, 305-314
• Fox, S. 2014. Pew internet & american life project report. 2013. Pew Internet: Health URL:
http://www.pewinternet.org/fact-sheets/health-fact-sheet/
• Broder, A. Z., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., and Zhang, T. 2007. Robust
classification of rare queries using web knowledge. In Proceedings of the 30th annual
international ACM SIGIR conference on Research and development in information retrieval. ACM
• Broder, A. 2002. A taxonomy of web search. In ACM SIGIR Forum. Vol. 36. ACM, 3-10.
• Baeza-Yates, R., Calderon-Benavides, L., and Gonzalez-Caro, C. 2006. The intention behind
web queries. In String processing and information retrieval. Springer, 98-109.
References