From Web Data to Knowledge: on the Complementarity of Human and Artificial Intelligence

Inaugural lecture at Heinrich-Heine-University Düsseldorf on 28 May 2019.

Abstract:
When searching the Web for information, human knowledge and artificial intelligence are in constant interplay. On the one hand, human online interactions such as click streams, crowd-sourced knowledge graphs, semi-structured web markup or distributional semantic models built from billions of Web documents are informing machine learning and information retrieval models, for instance, as part of the Google search engine. On the other hand, the very same search engines help users in finding relevant documents, facts, or data for particular information needs, thereby helping users to gain knowledge. This talk will give an overview of recent work in both of the aforementioned areas. This includes 1) research on mining structured knowledge graphs of factual knowledge, claims and opinions from heterogeneous Web documents as well as 2) recent work in the field of interactive information retrieval, where supervised models are trained to predict the knowledge (gain) of users during Web search sessions in order to personalise rankings. Both streams of research are converging as part of online platforms and applications to facilitate access to data(sets), information and knowledge.

From Web Data to Knowledge: on the Complementarity of Human and Artificial Intelligence

  1. From (Web) Data to Knowledge: on the Complementarity of Human and Artificial Intelligence
     Prof. Dr. Stefan Dietze
     Inaugural Lecture, 28 May 2019, Heinrich-Heine-Universität Düsseldorf
     Slide footer: Backup, 29/05/19, Stefan Dietze
  2. Finding “things” on the Web
     • Resources
     • Facts
     • Claims
     • Opinions
  3. Finding “things” on the Web
     • Resources
     • Facts
     • Claims
     • Opinions
  4. Finding “things” on the Web
     • Resources
     • Facts
     • Claims
     • Opinions
  5. Finding “things” on the Web
     • Resources
     • Facts
     • Claims
     • Opinions
     We'll try to use AI to “answer” that question at the end of the talk.
  6. Finding social sciences research data on the Web
  7. Artificial vs human intelligence: a simplistic Web search perspective
     Human/Crowd Intelligence and Artificial Intelligence
     “Supervising AI” with user-generated data & knowledge (“making machines smarter”)
     • Information retrieval (crawling, indexing, ranking etc.)
     • Natural language processing
     • (Hyperlink) graph analysis (e.g. PageRank et al.)
     • Statistics and (deep) learning from user interactions
       o Query interpretation & intent prediction
       o Classification of users, documents, queries
       o Reranking & personalisation
       o …
     Facilitating search, retrieval & knowledge gain of users (“making humans smarter”)
  8. Overview
     Part I: Symbolic & subsymbolic AI on the Web – a brief introduction
     Part II: Extracting machine-interpretable knowledge (“making machines smarter”)
     Part III: Facilitating search, retrieval & knowledge gain of users (“making humans smarter”)
  9. Symbols, data & knowledge on the Web
     • Unstructured data, e.g. web pages, user interactions/behavior, clickstreams, sensor data
     • Machine-interpretable knowledge, e.g. knowledge graphs, Web markup
     • DBpedia (English): 200 million facts; Google KG: 18 billion facts
     Figure (example graph), node labels: dbr:Tim_Berners-Lee, dbo:Person, “Tim Berners-Lee”@en, 1955-06-08^^xsd:date, dbr:MIT, dbr:Washington_DC, dbr:WWW_Foundation, dbo:Organisation, dbr:Jakarta, yago:LegalActor, dbo:Scientist
     Figure (example graph), edge labels: dbo:keyPersonOf, rdf:type, rdfs:subClassOf, foaf:name, dbo:birthDate, dbo:workplaces, dbo:location
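The subject-predicate-object statements named on slide 9 can be illustrated with a minimal sketch that stores a few of them as plain Python tuples and follows `rdfs:subClassOf` links. Which nodes each edge connects is an assumption read off the flattened slide text (for example, that dbo:Scientist is a subclass of dbo:Person), and `objects` and `types_of` are hypothetical helpers, not functions of any RDF library.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# The concrete edges are assumptions reconstructed from the flattened slide text.
TRIPLES = {
    ("dbr:Tim_Berners-Lee", "rdf:type", "dbo:Scientist"),
    ("dbo:Scientist", "rdfs:subClassOf", "dbo:Person"),
    ("dbr:Tim_Berners-Lee", "foaf:name", '"Tim Berners-Lee"@en'),
    ("dbr:Tim_Berners-Lee", "dbo:birthDate", '"1955-06-08"^^xsd:date'),
    ("dbr:Tim_Berners-Lee", "dbo:keyPersonOf", "dbr:WWW_Foundation"),
    ("dbr:WWW_Foundation", "rdf:type", "dbo:Organisation"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects o such that (subject, predicate, o) is a known fact."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def types_of(entity, triples=TRIPLES):
    """Hypothetical helper: asserted types plus superclasses via rdfs:subClassOf."""
    found = set()
    frontier = list(objects(entity, "rdf:type", triples))
    while frontier:
        cls = frontier.pop()
        if cls not in found:
            found.add(cls)
            frontier.extend(objects(cls, "rdfs:subClassOf", triples))
    return found
```

Calling `types_of("dbr:Tim_Berners-Lee")` returns both the asserted class and its superclass, which is the kind of simple inference that machine-interpretable knowledge allows and unstructured text does not.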
  10. Symbolic vs subsymbolic AI
      Symbolic AI
      • AI = manipulation and interpretation of symbols (eventually: “knowledge”)
      • Top-down: knowledge representation, logics, inference, knowledge graphs
      • “Strong AI hypothesis” or “Physical Symbol System Hypothesis” (Newell & Simon, 1976), “GOFAI”
      Subsymbolic AI
      • AI = emulating/engineering human intelligence, e.g. through cognitive computing (“perceptron”, Frank Rosenblatt 1957)
      • Bottom-up: neural networks, machine/deep learning, distributional semantics
      • Also called: “weak AI hypothesis” (Russell & Norvig, 1995)
      Figure labels: Knowledge, Information, Data, Symbols
      Example axiom: Horse ⊓ ¬RockingHorse ⊑ Animal ⊓ ∀(=4)hasLegs
      “Intelligence is ten million rules” (Douglas Lenat, founder of Cyc)
  11. Subsymbolic AI & deep learning for language understanding
      • Distributional semantics & embeddings: predicting low-dimensional vector representations of words & text, e.g. Word2Vec [Mikolov et al., 2013]
      • Efficient RNN/CNN architectures in encoder/decoder settings (e.g. for machine translation) [Vaswani et al., 2017]
      • Pretraining language models for task-specific transfer learning, e.g. BERT (Bidirectional Encoder Representations from Transformers) [Devlin et al., 2018]
      Figure: percentage of deep learning papers in major NLP conferences (source: Young et al., Recent Trends in Deep Learning Based Natural Language Processing)
      References
      • T. Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, NIPS (2013)
      • J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
      • A. Vaswani et al., Attention Is All You Need, NIPS (2017)
  12. Learning without semantics
      • Biases in human interactions can be learned and amplified by ML models
      • Meaning / semantics are crucial to facilitate interpretation by/of machines & ML models
      Figure text (redacted): [N-word]
      Source: https://techcrunch.com/2016/03/24/microsoft-silences-its-new-a-i-bot-tay-after-twitter-users-teach-it-racism/
  13. Semantics and knowledge: a brief (and incomplete) history
      • Deductive reasoning, syllogism & categorisation (Aristotle, 384 BC – 322 BC)
      • Formal logic & calculus ratiocinator (reasoning, symbol manipulation) (G.W. Leibniz, 1646 – 1716)
      • “Begriffsschrift”, technically: predicate logic (Gottlob Frege, 1848 – 1925)
      • Frames for representing stereotyped situations (Marvin Minsky, 1974)
      • Rules & expert systems
      • Ontologies (Leibniz, Kant, Gruber 1994)
      • Description Logics (Baader & Hollunder, 1991 et al.)
      • Semantic Web (Berners-Lee, Hendler, Lassila, 2001) & Linked Data & Knowledge Graphs
  14. Symbolic & subsymbolic AI: e.g. linking Web documents & KGs
      • Robust methods for named entity disambiguation (NED), e.g. Ambiverse [Hoffart et al., 2011], Babelfy [Moro et al., 2014], TagMe [Ferragina et al., 2010]
      • Time- and corpus-specific entity relatedness; prior probabilities and meaning of entities change over time, e.g. “Deutschland” during World Cup [DL4KGS 2018]
      • Meta-EL: supervised ensemble learner exploiting results of different NED systems [SAC19, CIKM19]
        o Considers features of terms, mentions/occurrences, dynamics/temporal drift etc.
        o Outperforms individual NED systems across diverse documents/corpora
      • Problem: “completeness” & coverage of KGs?
      Figure label: dbr:Tim_Berners-Lee
      References
      • Fafalios, P., Joao, R.S., Dietze, S., Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty, ACM SAC19
      • Mohapatra, N., Iosifidis, V., Ekbal, A., Dietze, S., Fafalios, P., Time-Aware and Corpus-Specific Entity Relatedness, DL4KGS at ESWC2018
  15. Overview
      Part I: Symbolic & subsymbolic AI on the Web – a brief introduction
      Part II: Extracting machine-interpretable knowledge (“making machines smarter”)
      Part III: Facilitating search, retrieval & knowledge gain of users (“making humans smarter”)
  16. Knowledge about: facts, claims, stances & opinions on the Web
      • Facts & claims
      • Stances, opinions, interactions
      Example: <“Tim Berners-Lee” s:founderOf “Solid”>
  17. Mining (long-tail) facts from the Web?
      Example: <“Tim Berners-Lee” s:founderOf “Solid”>
      • Obtaining verified facts (or knowledge graph) for a given entity?
      • Application of NLP (e.g. NER, relation extraction) at Web scale (Google index: 50 trillion pages)?
      • Exploiting entity-centric embedded Web page markup (schema.org), prevalent in roughly 40% of Web pages (44 bn “facts” in Common Crawl 2016 / 3.2 bn Web pages)
      • Challenges
        o Errors: factual errors, annotation errors (see also [Meusel et al., ESWC2015])
        o Ambiguity & coreferences: e.g. 18,000 entity descriptions of “iPhone 6” in Common Crawl 2016 & ambiguous literals (e.g. “Apple”)
        o Redundancies & conflicts: vast amounts of equivalent or conflicting statements
  18. KnowMore: data fusion on markup
      Pipeline
      • 0. Noise: data cleansing (node URIs, deduplication etc.)
      • 1.a) Scale: blocking (BM25 entity retrieval) on markup index
      • 1.b) Relevance: supervised coreference resolution
      • 2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB), diverse feature set (authority, relevance etc.), considering source-level (e.g. PageRank), entity-level & fact-level features
      Example (figure)
      • Input: Web page markup from a Web crawl (Common Crawl, 44 bn facts)
      • Query entity: Brideshead Revisited, type:(Book)
      • 1. Blocking & coreference resolution, yielding candidate facts (approx. 5,000 facts for “Brideshead Revisited”; compare: 125,000 facts for “iPhone6”):
        o node1 publisher Chapman & Hall
        o node1 releaseDate 1945
        o node1 publishDate 1961
        o node2 country UK
        o node2 publisher Black Bay Books
        o node3 country US
        o node3 copyrightHolder Evelyn Waugh
        o …
      • 2. Fusion / fact selection (supervised), yielding the entity description (20 correct/non-redundant facts for “Brideshead Rev.”):
        o author Evelyn Waugh
        o priorWork Put Out More Flags
        o ISBN 978031874803074
        o copyrightHolder Evelyn Waugh
        o releaseDate 1945
        o …
      • New query entities: BBC Audio, type:(Organization); Chapman & Hall, type:(Publisher); Put Out More Flags, type:(Book)
      Fusion performance
      • Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally et al., ACM SIGMOD 2014], strong variance across types
      Knowledge graph augmentation
      • Experiments on books, movies, products
      • New facts (with respect to DBpedia, Wikidata, Freebase):
        o On average 60% - 70% of all facts for books & movies new (across KBs)
        o 100% new facts for long-tail entities (e.g. products)
      • Additional experiments on learning new categorical features (e.g. product categories or movie genres) [WWW2018]
      References
      • Yu, R., [..], Dietze, S., KnowMore-Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2019 (SWJ2019)
      • Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conf. 2018 (WWW2018)
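The blocking step on slide 18 (BM25 entity retrieval over a markup index) can be sketched as follows. This is a minimal pure-Python Okapi BM25 ranker, not the KnowMore implementation: the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenisation, the function name `bm25_rank`, and the toy node descriptions are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (dict: id -> text) against query with Okapi BM25, best first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy markup node descriptions loosely based on the slide's example;
# the strings themselves are invented for illustration.
nodes = {
    "node1": "Brideshead Revisited book publisher Chapman & Hall releaseDate 1945",
    "node2": "Brideshead Revisited book country UK publisher Black Bay Books",
    "node3": "Put Out More Flags book author Evelyn Waugh",
    "node4": "iPhone 6 product smartphone",
}
ranking = bm25_rank("Brideshead Revisited book", nodes)
candidates = ranking[:2]  # keep only the top-ranked nodes as the block
```

Blocking of this kind only narrows the candidate set cheaply; deciding which candidates truly describe the query entity is left to the supervised coreference resolution step that the slide lists next.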
  19. Beyond facts: claims, opinions and misinformation on the Web
      • Investigations into misinformation and opinion forming have received massive attention across a wide range of disciplines and industries (e.g. [Vosoughi et al. 2018])
      • Insights, mostly from (computational) social sciences, e.g.
        o Spreading of claims and misinformation
        o Effect of biased and fake news on public opinions
        o Reinforcement of biases and echo chambers
      • Methods, mostly in computer science, e.g. for
        o Claim/fact detection and verification (“fake news detection”), e.g. CLEF 2018 Fact Checking Lab (http://alt.qcri.org/clef2018-factcheck/)
        o Stance detection, e.g. Fake News Challenge (FNC) http://www.fakenewschallenge.org/
      • Some recent work
        o Large-scale public research corpora for replicating/improving methods/insights
        o TweetsKB: 9 bn annotated tweets
        o ClaimsKG: 30 K annotated claims & truth ratings
        o ML models for stance detection of Web documents (towards given claims)
  20. Stance detection of Web documents
      Motivation
      • Problem: detecting stance of documents (Web pages) towards a given claim (unbalanced class distribution)
      • Motivation: stance of documents (in particular disagreement) useful (a) as signal for fake news detection and (b) for Website classification
      Approach
      • Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
      • Features, e.g. textual similarity (Word2Vec etc.), sentiments, LIWC, etc.
      • Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
      • Experiments on FNC-1 dataset (and FNC baselines)
      Results
      • Minor overall performance improvement
      • Improvement on disagree class by 27% (but still far from robust)
      Reference: A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Step-by-Step: A Three-Stage Pipeline for Stance Classification of Documents towards Claims, CIKM19 (under review)
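The cascading binary classifiers on slide 20 can be pictured with the sketch below, which chains three yes/no decisions into one of the four FNC-1 labels (unrelated, discuss, agree, disagree). The order of the stages is an assumption, since the slide only says the pipeline has three stages, and the keyword rules are hypothetical stand-ins for the trained SVM and CNN models.

```python
from typing import Callable

Classifier = Callable[[str, str], bool]  # (claim, document) -> binary decision


def cascade(claim, document, is_related: Classifier,
            takes_stance: Classifier, agrees: Classifier):
    """Apply three binary decisions in sequence and return one FNC-1 label."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_stance(claim, document):
        return "discuss"
    return "agree" if agrees(claim, document) else "disagree"


# Trivial keyword rules standing in for trained models (assumption, illustration only).
def is_related(claim, doc):
    return bool(set(claim.lower().split()) & set(doc.lower().split()))


def takes_stance(claim, doc):
    return any(w in doc.lower() for w in ("true", "false", "hoax", "confirmed"))


def agrees(claim, doc):
    return not any(w in doc.lower() for w in ("false", "hoax"))


label = cascade("moon landing happened", "the moon landing is a hoax",
                is_related, takes_stance, agrees)
```

Splitting the four-way problem this way lets each stage use its own model and its own misclassification costs, which matters for a rare class such as disagree.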
  21. Mining opinions & interactions (the case of Twitter)
      • Heterogeneity: multimodal, multilingual, informal, “noisy” language
      • Context dependence: interpretation of tweets/posts (entities, sentiments) requires consideration of context (e.g. time, linked content), e.g. “Dusseldorf” => city or football team
      • Dynamics & scale: e.g. 6,000 tweets per second, plus interactions (retweets etc.) and context (e.g. 25% of tweets contain URLs)
      • Evolution and temporal aspects: evolution of interactions over time crucial for many social sciences questions
      • Representativeness and bias: demographic distributions not known a priori in archived data collections
      Figure labels: http://dbpedia.org/resource/Tim_Berners-Lee, wna:positive-emotion, onyx:hasEmotionIntensity "0.75", onyx:hasEmotionIntensity "0.0", http://dbpedia.org/resource/Solid, wna:negative-emotion
      Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  22. Mining knowledge about opinions & interactions: TweetsKB (http://l3s.de/tweetsKB)
      • Harvesting & archiving of 9 bn tweets over 5 years (permanent collection from Twitter 1% sample since 2013)
      • Information extraction pipeline (distributed via Hadoop Map/Reduce)
        o Entity linking with knowledge graph/DBpedia (Yahoo's FEL [Blanco et al. 2015]), e.g. “president”/“potus”/“trump” => dbp:DonaldTrump, to disambiguate text and use background knowledge (e.g. US politicians? Republicans?); high precision (.85), low recall (.39)
        o Sentiment analysis/annotation using SentiStrength [Thelwall et al., 2012], F1 approx. .80
        o Extraction of metadata and lifting into established schemas (SIOC, schema.org), publication using W3C standards (RDF/SPARQL)
      Use cases
      • Aggregating sentiments towards topics/entities, e.g. about CDU vs SPD politicians in a particular time period
      • Temporal analytics: evolution of popularity of entities/topics over time (e.g. for detecting events or trends, such as rise of populist parties)
      • Twitter archives as general corpus for understanding temporal entity relatedness (e.g. “austerity” & “Greece” 2010-2015)
      Limitations
      • Bias & representativeness: demographic distributions of users (not known a priori and not representative)
      • Cf. use case at the end of the talk
      Figure: sentiment scores for Cologne and Düsseldorf (axis from -0.4 to 0.4)
      Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  23. Overview
      Part I: Symbolic & subsymbolic AI on the Web – a brief introduction
      Part II: Extracting machine-interpretable knowledge (“making machines smarter”)
      Part III: Facilitating search, retrieval & knowledge gain of users (“making humans smarter”)
  24. Knowledge (gain) while searching the Web (“Search As Learning”)?
      Challenges & results
      • Detecting coherent search missions?
      • Detecting learning throughout search? Detecting “informational” search missions (as opposed to “transactional” or “navigational” missions [Broder, 2002])
        o Search mission classification with average F1 score 75%
      • How competent is the user? Predict/understand knowledge state of users based on in-session behavior/interactions
      • How well does a user achieve his/her learning goal/information need? Predict knowledge gain throughout search missions
        o Correlation of user behavior (queries, browsing, mouse traces, etc.) & user knowledge gain/state in search [CHIIR18]
        o Prediction of knowledge gain/state through supervised models [SIGIR18]
  25. Understanding knowledge gain/state of user during search?
      Data collection
      • Crowdsourced collection of search session data
      • 10 search topics (e.g. “Altitude sickness”, “Tornados”), incl. pre- and post-tests
      • Approx. 1,000 distinct crowd workers & 100 sessions per topic
      • Tracking of user behavior through 76 features in 5 categories (session, query, SERP (search engine result page), browsing, mouse traces)
      Some results
      • 70% of users exhibited a knowledge gain (KG)
      • Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = -.87)
      • Amount of time users actively spent on web pages explains 7% of the variance in their KG
      • Query complexity explains 25% of the variance in the KG of users
      • Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG/KS
      Reference: Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
  26. Predicting knowledge gain/state of user during search?
      • Stratification into classes: user knowledge state (KS) and knowledge gain (KG) into {low, moderate, high} using (low < (mean ± 0.5 SD) < high)
      • Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
      • KG prediction performance results (after 10-fold cross-validation)
      • Feature importance (KG prediction)
      Reference: Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
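The stratification rule on slide 26 (low < mean ± 0.5 SD < high) can be sketched in a few lines. The function name `stratify`, the use of the population standard deviation, and the example scores are assumptions; the slide does not say which SD estimator was used or how scores exactly on a boundary were handled (here they count as moderate).

```python
from statistics import mean, pstdev


def stratify(scores, width=0.5):
    """Label each score low/moderate/high relative to mean ± width * SD."""
    mu = mean(scores)
    sd = pstdev(scores)  # population SD: an assumption, not stated on the slide
    lower, upper = mu - width * sd, mu + width * sd
    labels = []
    for s in scores:
        if s < lower:
            labels.append("low")
        elif s > upper:
            labels.append("high")
        else:
            labels.append("moderate")  # boundary values fall into this class
    return labels


# Invented knowledge-gain scores (e.g. post-test minus pre-test).
gains = [0, 1, 2, 3, 4, 5, 6]
labels = stratify(gains)
```

For these invented scores the mean is 3 and the SD is 2, so the moderate band runs from 2 to 4; the resulting labels would then serve as the targets of the multiclass classifiers listed on the slide.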
  27. Predicting knowledge gain/state of user during search?
      • Stratification into classes: user knowledge state (KS) and knowledge gain (KG) into {low, moderate, high} using (low < (mean ± 0.5 SD) < high)
      • Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
      • KG prediction performance results (after 10-fold cross-validation)
      • Feature importance (KG prediction)
      Shortcomings & future work
      • Lab studies to obtain more reliable data (controlled environment, longer sessions) & additional features (eye-tracking)
      • Resource features (complexity, analytic/emotional language, multimodality etc.) as additional signals [CIKM2019, under review]
      • Improving ranking/retrieval in Web search or other archives (SALIENT project, Leibniz Cooperative Excellence)
      Reference: Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
  28. Applications: social sciences research data on the Web
      • Improving findability of (social science) research data
      • Mining novel (social science) research data from the Web
      Links: http://l3s.de/tweetsKB, https://data.gesis.org/claimskg
  29. Finally: can we use AI & the Web to answer THE question?
  30. Recap: “Web-mined opinions” in TweetsKB (http://l3s.de/tweetsKB)
      Total # tweets mentioning (K, D) in 1.5 bn tweets:
      • # dbp:Cologne: 89,564
      • # dbp:Dusseldorf: 4,723
      Opinions in terms of expressed sentiments?
      • “Happiness(X) = mean of sentiment score delta (positive - negative) of all tweets mentioning X”
      Figure labels: http://dbpedia.org/resource/Tim_Berners-Lee, wna:positive-emotion, onyx:hasEmotionIntensity "0.75", onyx:hasEmotionIntensity "0.0", http://dbpedia.org/resource/Solid, wna:negative-emotion
      Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
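The Happiness(X) definition on slide 30 can be written down as a short sketch: for every entity, average the difference between positive and negative sentiment scores over all tweets that mention it. The record layout with the keys `entities`, `positive` and `negative` is hypothetical and is not the TweetsKB RDF schema, and the three tweets are invented.

```python
from collections import defaultdict
from statistics import mean


def happiness(tweets):
    """Mean (positive - negative) sentiment score per mentioned entity."""
    deltas = defaultdict(list)
    for tweet in tweets:
        delta = tweet["positive"] - tweet["negative"]
        for entity in tweet["entities"]:
            deltas[entity].append(delta)
    return {entity: mean(values) for entity, values in deltas.items()}


# Invented records; the field names are hypothetical, not the TweetsKB schema.
tweets = [
    {"entities": ["dbp:Cologne"], "positive": 0.75, "negative": 0.0},
    {"entities": ["dbp:Cologne", "dbp:Dusseldorf"], "positive": 0.25, "negative": 0.5},
    {"entities": ["dbp:Dusseldorf"], "positive": 0.5, "negative": 0.0},
]
scores = happiness(tweets)
```

A tweet that mentions both cities contributes its delta to both averages, which is one reason such scores are only a coarse proxy for opinions about either city.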
  31. Cologne vs Dusseldorf: a pseudoscientific “answer” using TweetsKB
      Mean sentiment scores (2013-2017)
      • Happiness(Cologne) = 0.09281
      • Happiness(Dusseldorf) = 0.04056
      • Positive(Cologne) = 0.17297
      • Positive(Dusseldorf) = 0.1245
      • Negative(Cologne) = 0.07948
      • Negative(Dusseldorf) = 0.09030
      Key findings
      • Cologne happier (no significance testing yet)
      • Cologne & Dusseldorf happy overall (positive sentiments)
      Limitations
      • Bias: Twitter users not representative
      • Bias: Cologne cathedral => distribution of tourists & residents among Twitter users likely different for both cities
      Figure: Happiness(dbp:Cologne) and Happiness(dbp:Dusseldorf) over time (axis from -0.4 to 0.4); annotated events: January 2016, Cologne NYE 2015/2016 aftermath; March 2017, axe attack in D?
      Image source: https://theculturetrip.com/europe/germany/articles/8-fascinating-things-didnt-know-colognes-cathedral/ © freedom100m
  32. Acknowledgements
      Co-authors
      • Katarina Boland (GESIS, Germany)
      • Elena Demidova (L3S, Germany)
      • Asif Ekbal (IIT Patna, India)
      • Pavlos Fafalios (L3S, Germany)
      • Ujwal Gadiraju (L3S, Germany)
      • Peter Holtz (IWM, Germany)
      • Eirini Ntoutsi (LUH, Germany)
      • Vasilis Iosifidis (L3S, Germany)
      • Markus Rokicki (L3S, Germany)
      • Arjun Roy (IIT Patna, India)
      • Renato Stoffalette Joao (L3S, Germany)
      • Davide Taibi (CNR, ITD, Italy)
      • Nicolas Tempelmeier (L3S, Germany)
      • Konstantin Todorov (LIRMM, France)
      • Ran Yu (GESIS, Germany)
      • Benjamin Zapilko (GESIS, Germany)
  33. From (Web) Data to Knowledge: on the Complementarity of Human and Artificial Intelligence
      Prof. Dr. Stefan Dietze
      Heinrich-Heine-Universität Düsseldorf
      GESIS Leibniz Institute for the Social Sciences
