Amit Sheth, "Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data,"
WSU & AFRL Window-on-Science Seminar on Data Mining, August 05, 2009.
http://wiki.knoesis.org/index.php/Seminar_on_Data_Mining#Semantics_empowered_Understanding.2C_Analysis_and_Mining_of_Nontraditional_and_Unstructured_Data
2. Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data. WSU & AFRL Window-on-Science Seminar on Data Mining. Amit P. Sheth, LexisNexis Ohio Eminent Scholar; Director, Kno.e.sis Center, Wright State University; knoesis.org. Thanks: K. Gomadam, M. Nagarajan, C. Thomas, C. Henson, C. Ramakrishnan, P. Jain and Kno.e.sis Researchers
3. Data & Knowledge Ecosystem: Situational Awareness; Decision Support; Insight; Knowledge Discovery; Analysis (e.g., patterns); Understanding & Perception; Data Mining; Integration; Search; Browsing. Data: Multimedia Data; Structured, Semistructured, Unstructured Data; Textual Data (Scientific Literature, Web Pages, News, Blogs, Reports, Wiki, Forums, Comments, Tweets); Experimental Data; Observational Data; Transactional Data
4. Some examples of R&D we have done: Semantic Search & Ranking of Stories and Reports – connecting-the-dots applications (insider threat, financial risk analysis); Mining of biomedical (scientific) literature (extraction of entities and relationships) – discovering hidden public knowledge; Semantic Integration, Analysis and Decision Support over Sensor Data; Extracting taxonomy/domain model from Wikipedia; Discovering Hidden Relationships (insights) in Community Created Content (Wikipedia)
8. Search; Integration; Analysis; Discovery; Question Answering; Situational Awareness; Domain Models; Patterns / Inference / Reasoning; RDB; Relationship Web; Metadata / Semantic Annotations; Metadata Extraction; Multimedia Content and Web data; Text; Sensor Data; Structured and Semi-structured data
11. What Knowledge Discovery is NOT. Search: keyword-in, document-out; keywords are fully specified features of the expected outcome; searching for prospective mining sites. Mining: know where to look; underspecified characteristics of what is sought are available; patterns. (Cartic Ramakrishnan)
12. What is knowledge discovery? “knowledge discovery is more like sifting through a warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts.” – James Caruther. “discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise” – James Buchanan. Opportunistic search over an ill-defined space leading to surprising but useful emergent knowledge. (Cartic Ramakrishnan)
13. Element of surprise – Swanson’s discoveries. [Figure: Swanson’s Discoveries. Labels: Magnesium, Migraine, Stress, Calcium Channel Blockers, Spreading Cortical Depression, and an unknown (?); 11 possible associations found; source: PubMed.] Associations were discovered based on keyword searches followed by manual analysis of text to establish possible relevant relationships.
14. Knowledge Discovery over text. Text; assigning interpretation to text; semantic metadata in the form of semi-structured data. Extraction of Semantics from text. Semantic Metadata Guided Knowledge Explorations; Semantic Metadata Guided Knowledge Discovery. Triple-based Semantic Search; Semantic browser; Subgraph discovery. (Cartic Ramakrishnan)
15. Information Extraction via Ontology assisted text mining – Relationship extraction. [Figure: UMLS Semantic Network classes (Biologically active substance, Lipid, Disease or Syndrome) linked by the relationships complicates, affects and causes; MeSH terms Fish Oils and Raynaud’s Disease attached by instance_of, with an unknown relationship (???????) between them; PubMed document counts shown: 4733, 9284 and 5.] (Cartic Ramakrishnan)
16. Background knowledge and Data used. UMLS: a high-level schema of the biomedical domain; 136 classes and 49 relationships; synonyms of all relationships obtained using variant lookup (tools from NLM); 49 relationships + their synonyms = ~350 verbs. MeSH: 22,000+ topics organized as a forest of 16 trees; used to query PubMed. PubMed: over 16 million abstracts; abstracts annotated with one or more MeSH terms.
20. Entities can also occur as composites of 2 or more other entities
21. “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”: (TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) ) (Cartic Ramakrishnan)
22. Method – Identify entities and relationships in Parse Tree. [Figure: parse tree of the sentence “An excessive endogenous or exogenous stimulation by estrogen induces adenomatous hyperplasia of the endometrium”, with nodes marked as Modifiers, Modified entities and Composite Entities; annotations shown: UMLS ID T147 and MeSH IDs D004967, D006965, D004717.]
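A minimal sketch of how a composite entity could be read off the bracketed parse shown on the previous slide. This is not the authors' implementation: the rule used here (an NP made of an NP followed by an "of" PP) is an assumption chosen for illustration, and the names `parse`, `leaves` and `composite_entities` are hypothetical.

```python
import re

# Bracketed parse copied from the slide.
TREE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, left to right."""
    words = []
    for child in tree[1]:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words


def composite_entities(tree):
    """Collect NPs of the assumed form NP + PP headed by 'of'."""
    label, children = tree
    subtrees = [c for c in children if not isinstance(c, str)]
    found = []
    if (label == "NP" and len(subtrees) == 2
            and subtrees[0][0] == "NP" and subtrees[1][0] == "PP"
            and leaves(subtrees[1])[0] == "of"):
        found.append(" ".join(leaves(tree)))
    for sub in subtrees:
        found.extend(composite_entities(sub))
    return found


print(composite_entities(parse(TREE)))
```

The subject NP ("... stimulation by estrogen") is skipped because its PP is headed by "by"; a real system would need a richer rule set than this single pattern.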
31. Magnesium can suppress platelet aggregability. Data sets generated using these entities (marked red above) as Boolean keyword queries against PubMed. Bidirectional breadth-first search used to find paths in the resulting RDF.
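As an illustration of the path-finding step, here is a small sketch of bidirectional breadth-first search over a set of triples. The triples are toy data invented for the example (only the magnesium / platelet aggregability statement echoes the slide), edges are traversed in both directions, and the function names are hypothetical rather than taken from the authors' system.

```python
from collections import deque

# Toy (subject, predicate, object) triples; illustrative only.
TRIPLES = [
    ("magnesium", "suppresses", "platelet_aggregability"),
    ("platelet_aggregability", "implicated_in", "migraine"),
    ("magnesium", "hasPart", "magnesium_ion"),
]


def _expand(frontier, seen, other_seen, adj):
    """Expand one BFS level; return a node where the two searches meet."""
    for _ in range(len(frontier)):
        node = frontier.popleft()
        for pred, nxt in adj.get(node, []):
            if nxt not in seen:
                seen[nxt] = (node, pred)
                if nxt in other_seen:
                    return nxt
                frontier.append(nxt)
    return None


def _trace(seen, node):
    """Follow parent links from node back to that search's root."""
    steps = []
    while seen[node] is not None:
        prev, pred = seen[node]
        steps.append((prev, pred, node))
        node = prev
    return steps


def find_path(triples, source, target):
    """Return a connecting path as (node, predicate, node) steps, or None."""
    adj = {}
    for s, p, o in triples:          # treat edges as undirected
        adj.setdefault(s, []).append((p, o))
        adj.setdefault(o, []).append((p, s))
    if source == target:
        return []
    seen_f, seen_b = {source: None}, {target: None}
    front_f, front_b = deque([source]), deque([target])
    while front_f and front_b:
        # Always grow the smaller frontier.
        if len(front_f) <= len(front_b):
            meet = _expand(front_f, seen_f, seen_b, adj)
        else:
            meet = _expand(front_b, seen_b, seen_f, adj)
        if meet is not None:
            forward = list(reversed(_trace(seen_f, meet)))
            backward = [(b, p, a) for (a, p, b) in _trace(seen_b, meet)]
            return forward + backward
    return None


print(find_path(TRIPLES, "magnesium", "migraine"))
```

Searching from both ends keeps each frontier small, which matters when the graph extracted from many abstracts is large. Note that this sketch discards edge direction in the returned steps.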
32. Paths between Migraine and Magnesium. Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifiers. (Cartic Ramakrishnan)
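The interestingness criterion can be sketched as a simple filter. The path representation (a list of `(node, predicate, node)` steps) and the example paths are assumptions made for illustration; only the two excluded predicate names come from the slide.

```python
# Structural predicates that do not count as "named relationships" here.
STRUCTURAL = {"hasPart", "hasModifiers"}


def is_interesting(path):
    """True if at least one step uses a predicate other than the structural ones."""
    return any(pred not in STRUCTURAL for _, pred, _ in path)


# Hypothetical paths for demonstration.
named = [("magnesium", "suppresses", "platelet_aggregability"),
         ("platelet_aggregability", "hasPart", "platelet")]
structural_only = [("a", "hasPart", "b"), ("b", "hasModifiers", "c")]

print(is_interesting(named), is_interesting(structural_only))
```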
34. Our definition of compound and modified entities is critical for identifying both implicit and explicit relationships
36. Unsupervised Joint Extraction of Compound Entities and Relationship. Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth, "Unsupervised Discovery of Compound Entities for Relationship Extraction," EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management: Knowledge Patterns
56. A powerful new era in Information dissemination had taken firm ground
57. Making it possible for us to create a global network of citizens. Citizen Sensors – citizens observing, processing, transmitting, reporting
58. Geocoder (Reverse Geo-coding): address to location database; 18 Hormusji Street, Colaba; Vasant Vihar. Image Metadata: latitude: 18° 54′ 59.46″ N, longitude: 72° 49′ 39.65″ E. Structured Meta Extraction: Nariman House; Income Tax Office. Identify and extract information from tweets. Spatio-Temporal Analysis.
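The image-metadata coordinates above are in degrees, minutes and seconds; spatial queries usually work on decimal degrees. A small sketch of the standard conversion follows (the function name is hypothetical):

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (degrees + minutes / 60 + seconds / 3600)


lat = dms_to_decimal(18, 54, 59.46, "N")
lon = dms_to_decimal(72, 49, 39.65, "E")
print(round(lat, 6), round(lon, 6))
```

The latitude works out to roughly 18.9165 and the longitude to roughly 72.8277, in line with the decimal coordinates used in the query on a later slide.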
59. Research Challenge #1: Spatio-Temporal and Thematic analysis. What else happened “near” this event location? What events occurred “before” and “after” this event? Any message about “causes” for this event?
62. Giving us: Tweets originated from an address near 18.916517°N, 72.827682°E during the time interval 27th Nov 2008, between 11PM and 12PM?
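A query of this kind can be sketched as a distance filter plus a time-window filter. Everything below is illustrative: the `Tweet` record, the sample tweets, the 1 km radius and the one-hour window starting at 11 PM are assumptions, not values from the talk. Distance uses the haversine great-circle formula.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


@dataclass
class Tweet:
    text: str
    lat: float
    lon: float
    posted: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def tweets_near(tweets, lat, lon, radius_km, start, end):
    """Tweets posted within radius_km of (lat, lon) between start and end."""
    return [
        t for t in tweets
        if start <= t.posted <= end
        and haversine_km(lat, lon, t.lat, t.lon) <= radius_km
    ]


# Hypothetical sample data.
TWEETS = [
    Tweet("near and in window", 18.9166, 72.8277, datetime(2008, 11, 27, 23, 30)),
    Tweet("far away", 19.0760, 72.8777, datetime(2008, 11, 27, 23, 30)),
    Tweet("near but too early", 18.9166, 72.8277, datetime(2008, 11, 27, 21, 0)),
]

hits = tweets_near(
    TWEETS, 18.916517, 72.827682, radius_km=1.0,
    start=datetime(2008, 11, 27, 23, 0), end=datetime(2008, 11, 28, 0, 0),
)
print([t.text for t in hits])
```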
63. Research Challenge #2: Understanding and Analyzing Casual Text. Casual text: microblogs are often written in SMS-style language; slang, abbreviations
64. Understanding Casual Text. Not the same as news articles or scientific literature. Grammatical errors: implications on NL parser results. Inconsistent writing style: implications on learning algorithms that generalize from a corpus
65. Nature of Microblogs. Additional constraint of limited context: max. of x chars in a microblog; context often provided by the discourse. Entity identification and disambiguation: pre-requisite to other sophisticated information analytics
66. NL understanding is hard to begin with. Not so hard: “commando raid appears to be nigh at Oberoi now” (Oberoi = Oberoi Hotel, Nigh = high). Challenging: “new wing, live fire @ taj 2nd floor on iDesi TV stream” (fire on the second floor of the Taj hotel, not on iDesi TV)
67. Research Opportunities. NER and disambiguation in casual, informal text is a budding area of research. Another important area of focus: combining information of varied quality from a corpus (statistical NLP), domain knowledge (tags, folksonomies, taxonomies, ontologies), and social context (explicit and implicit communities)
68. Social Context surrounding content. The social context in which a message appears is also a valuable added resource. Post 1: “Hareemane House hostages said by eyewitnesses to be Jews. 7 Gunshots heard by reporters at Taj”. Follow-up post: “that is Nariman House, not (Hareemane)”
69. Understanding content … informal text. I say: “Your music is wicked”. What I really mean: “Your music is good”
70. Informal Text (Social Network chatter): “Your smile rocks Lil”. Urban Dictionary: sentiment expression “Rocks” transliterates to cool, good. MusicBrainz Taxonomy: Artist: Lilly Allen; Track: Smile. Semantic Metadata: Smile is a Track; Lil transliterates to Lilly Allen; Lilly Allen is an Artist. Also shown: Structured text (biomedical literature); Multimedia Content and Web data; Web Services
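The annotation idea on this slide can be sketched with plain dictionary lookups. The three tables below are tiny hand-made stand-ins for Urban Dictionary, an alias resolver and a MusicBrainz-style taxonomy; they and the `annotate` function are hypothetical and are not the system described in the talk.

```python
# Hand-made stand-ins for external knowledge sources (illustrative only).
SLANG_SENTIMENT = {"rocks": "good", "wicked": "good"}
ARTIST_ALIASES = {"lil": "Lilly Allen"}
TRACK_TO_ARTIST = {"smile": "Lilly Allen"}


def annotate(comment):
    """Attach artist, track and sentiment annotations to a short comment."""
    tokens = [tok.strip(".,!?").lower() for tok in comment.split()]
    artist = next((ARTIST_ALIASES[t] for t in tokens if t in ARTIST_ALIASES), None)
    # Only accept a track mention if the taxonomy links it to the resolved artist.
    track = next(
        (t.title() for t in tokens if TRACK_TO_ARTIST.get(t) == artist and artist),
        None,
    )
    sentiment = next((SLANG_SENTIMENT[t] for t in tokens if t in SLANG_SENTIMENT), None)
    return {"artist": artist, "track": track, "sentiment": sentiment}


print(annotate("Your smile rocks Lil"))
```

Using the artist to validate the track is what keeps the ordinary word "smile" from being tagged as a track in comments that mention no matching artist.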
71. Example: Pulse of a Community. Imagine millions of such informal opinions: individual expressions to mass opinions. “Popular artists” lists from MySpace comments: Lilly Allen, Lady Sovereign, Amy Winehouse, Gorillaz, Coldplay, Placebo, Sting, Kean, Joss Stone
72. What Drives the Spatio-Temporal-Thematic Analysis and Casual Text Understanding? Semantics with the help of Domain Models (ontologies, folksonomies)
73. Domain Knowledge: A key driver. Places that are nearby ‘Nariman house’: spatial query. Messages originated around this place: temporal analysis. Messages about related events / places: thematic analysis
74. Research Challenge #3: But where does the Domain Knowledge come from? Expert and committee based ontology creation … works in some domains (e.g., biomedicine, health care, …). Community driven knowledge extraction: How to create models that are “socially scalable”? How to organically grow and maintain this model?
77. Games with a purpose. Get humans to give their solitaire time. Solve real hard computational problems: image tagging, identifying part of an image. Tag a tune, Squigl, Verbosity, and Matchin. Pioneered by Luis von Ahn
83. Semantic Sensor ML – Adding Ontological Metadata. Domain Ontology: Person, Company. Spatial Ontology: Coordinates, Coordinate System. Temporal Ontology: Time Units, Timezone. Source: Mike Botts, "SensorML and Sensor Web Enablement," Earth System Science Center, UAB Huntsville
84. Semantic Query: Semantic Temporal Query. Model-references from SML to OWL-Time ontology concepts provide the ability to perform semantic temporal queries. Supported semantic query operators include: contains: user-specified interval falls wholly within a sensor reading interval (also called inside); within: sensor reading interval falls wholly within the user-specified interval (inverse of contains or inside); overlaps: user-specified interval overlaps the sensor reading interval. Example SPARQL query defining the temporal operator ‘within’
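The SPARQL example itself is not part of this transcript, so the following is only a plain-Python sketch of the three operators as the slide defines them. The `Interval` type, the inclusive endpoints and the sample times are assumptions.

```python
from datetime import datetime
from typing import NamedTuple


class Interval(NamedTuple):
    start: datetime
    end: datetime


def contains(user, reading):
    """User-specified interval falls wholly within the sensor reading interval."""
    return reading.start <= user.start and user.end <= reading.end


def within(user, reading):
    """Sensor reading interval falls wholly within the user-specified interval."""
    return user.start <= reading.start and reading.end <= user.end


def overlaps(user, reading):
    """The two intervals share at least one instant (endpoints inclusive)."""
    return user.start <= reading.end and reading.start <= user.end


def at(hour, minute=0):
    """Helper: a time on an arbitrary fixed day, for the demo only."""
    return datetime(2009, 8, 5, hour, minute)


reading = Interval(at(10), at(12))
user_small = Interval(at(10, 30), at(11))
user_big = Interval(at(9), at(13))
user_partial = Interval(at(11, 30), at(12, 30))

print(contains(user_small, reading), within(user_big, reading),
      overlaps(user_partial, reading))
```

As defined on the slide, `within` is the inverse of `contains`: swapping the two arguments of one gives the other.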
92. Extracting Social Signals: what are the important topics of discussion and concern in different parts of the world on a particular day; how different cultures or countries are reacting to the same event or situation (e.g., Mumbai Attack); how a situation such as the financial crisis is evolving over a period of time in terms of key topics of discussion and issues of concern (e.g., subprime mortgages and foreclosures, followed by troubled banks and credit freeze, followed by massive government intervention and borrowing, and so on). Twitris Demo
93. A few more things. Use of background knowledge. Event extraction from text: time and location extraction; such information may not be present; someone from Washington DC can tweet about Mumbai. Scalable semantic analytics: subgraph and pattern discovery; meaningful subgraphs like relevant and interesting paths; ranking paths
94. The Sum of the Parts: Spatio-Temporal analysis (find out where and when) + Thematic (what and how) + Semantic Extraction from text, multimedia and sensor data (tags, time, location, concepts, events) + Semantic models & background knowledge (making better sense of STT; integration) + Semantic Sensor Web (the platform) = Situational Awareness
95. KNO.E.SIS as a case study of a world-class, research-based higher education environment. http://knoesis.org
97. Exceptional students. Six of the senior PhD students: 84 papers, 43 program committees, contributed to winning NIH and NSF grants. Successfully competed with two Stanford PhDs, 1000+ citations in 2 years of his graduation. “BTW, Meena is an absolute find. If all of your other students are as talented, you are very lucky. … I’d definitely like to work with more interns of her caliber, ... ” [Dr. Kevin Haas, Director of Search at Yahoo!] “It has been a few years since I visited Dayton (Wright AFB). However, it is clear that Wright State has transformed itself. Congratulations on your success with the Knoesis Center.” [Dr. Alpers Caglayan – looking to hire Kno.e.sis grads]
98. Funding, Collaboration, etc. UGA, Stanford, CCHMC, SAIC, HP, IBM, Yahoo! NIH, NSF, AFRL-HE, AFRL-Sensor, HP, IBM, Microsoft, Google. 70% Federal, 19% State, 11% Industry. Students intern at the best industry labs & national labs. Graduates very successful
99. Interested in more background? Semantics-Empowered Social Computing; Semantic Sensor Web; Traveling the Semantic Web through Space, Theme and Time; Relationship Web: Blazing Semantic Trails between Web Resources; Text Mining, Workflow Management, Semantic Web Services, Cloud Computing with application to healthcare, biomedicine, defense/intelligence, energy. Contact/more details: amit @ knoesis.org. Special thanks: Karthik Gomadam, Meena Nagarajan, Christopher Thomas. Partial Funding: NSF (Semantic Discovery: IIS: 071441, Spatio Temporal Thematic: IIS-0842129), AFRL and DAGSI (Semantic Sensor Web), Microsoft Research and IBM Research (Analysis of Social Media Content), and HP Research (Knowledge Extraction from Community-Generated Content).
Editor's notes
Microblogs are one of the most powerful ways of talking of CSD
Implicit social context is created by people responding to other messages. In this example we are showing how the system can identify that it is Nariman and not Hareemane
In the scenario, what techniques and technologies are being brought together? Semantic + Social Computing + Mobile Web
Users are shown two images along with labels. Labels are obtained from GI or a similar data source. Users add relationships. When 2 users agree, the labels are tagged with this relationship. Given multiple relationships, the system will learn using ML techniques.
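The agreement rule in this note can be sketched as follows. The vote format, the label names and the threshold parameter are assumptions made for illustration; the note only states that a relationship is accepted when 2 users agree.

```python
from collections import defaultdict


def agreed_relationships(votes, min_agree=2):
    """Keep (label_a, label_b, relationship) proposals backed by enough distinct users.

    votes: iterable of (user, label_a, label_b, relationship) tuples.
    """
    supporters = defaultdict(set)
    for user, label_a, label_b, relationship in votes:
        supporters[(label_a, label_b, relationship)].add(user)
    return {key for key, users in supporters.items() if len(users) >= min_agree}


# Hypothetical votes from three players.
VOTES = [
    ("u1", "wheel", "car", "partOf"),
    ("u2", "wheel", "car", "partOf"),
    ("u3", "wheel", "car", "madeBy"),
    ("u1", "wheel", "car", "partOf"),  # repeat vote by the same user
]

print(agreed_relationships(VOTES))
```

Counting distinct users rather than raw votes prevents a single player from confirming their own proposal.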