Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science Methods and Research Questions

Research colloquium at Tübingen University held on 24 September 2020, introducing methods and research facilitated by the Web.

  1. Human-in-the-Loop: the Web as Foundation for interdisciplinary Data Science Methods and Research Questions
     Stefan Dietze
     GESIS - Leibniz Institute for the Social Sciences, Heinrich-Heine-University Düsseldorf, L3S Research Center
  2. Interdisciplinary research facilitated by the Web
     • Rapidly growing interdisciplinary research exploits the Web to investigate online behavior, e.g. with respect to knowledge construction and exchange, network effects, or virality of disinformation (e.g. Vosoughi et al. 2018)
     • Focused on gaining insights (e.g. in the social sciences, psychology) by understanding Web data with the help of computational methods
     Machine & representation learning, information retrieval, NLP and knowledge-based approaches for:
     Understanding & interpreting user behaviour & interactions
     • Behaviour and interactions with online platforms (e.g. Web search engines and social media platforms) & online content (e.g. tweets)
     • Signals: click-through data, queries, shares, likes, behavioral traces (mouse movements, navigation, eye tracking etc.)
     Understanding & interpreting (user-generated) Web content
     • Content: web pages, social media posts, comments etc.
     • Extraction, verification, disambiguation of topics, entities, stances, opinions, sentiments (semantics)
     • Understanding language complexity, structure or modality of online resources
  3. Overview
     Part I: Understanding & interpreting (user-generated) Web content
     • Content: web pages, social media posts, comments etc.
     • Extraction, verification, disambiguation of topics, entities, stances, opinions, sentiments (semantics)
     • Understanding language complexity, structure or modality of online resources
     Topics:
     • Extraction & verification of factual knowledge & claims
     • Stance detection of websites
     • Understanding discourse/opinions/trends (Twitter)
     Part II: Understanding & interpreting user behaviour & interactions
     • Behaviour and interactions with online platforms (e.g. Web search engines and social media platforms) & online content (e.g. tweets)
     • Signals: click-through data, queries, shares, likes, behavioral traces (mouse movements, navigation, eye tracking etc.)
     Topics:
     • Understanding competence, information needs, knowledge gain of users from behavioral traces
     • Scenarios: Web search, microtask crowdsourcing
  4. Extraction of "long-tail" factual knowledge on the Web
     Example fact: <"Tim Berners-Lee" s:founderOf "Solid">
     • How can entity-centric factual knowledge be extracted from websites?
     • Application of NLP/information extraction methods on 60 billion Web pages (Google index)?
     • Widespread adoption of embedded web markup (Microdata/RDFa, schema.org): about 40% of all Common Crawl web pages (3.2 billion Web pages) contain markup (about 44 billion "facts")
     • Challenges
       o Errors: annotation errors and factual errors [Meusel et al., ESWC2015]
       o Ambiguity and co-references: e.g. 18,000 markup instances of "iPhone 6" in Common Crawl 2016 & ambiguous literals (e.g. "Apple")
       o Redundancies & conflicts: large proportion of equivalent or directly conflicting statements
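To make the notion of markup "facts" concrete, here is a minimal sketch of pulling (type, property, value) triples out of flat Microdata using only the Python standard library. The class name `MicrodataFacts`, the helper `extract_facts` and the sample HTML are hypothetical and chosen for illustration. The sketch ignores nested items and RDFa, so it is not the extraction pipeline used in the talk.

```python
from html.parser import HTMLParser


class MicrodataFacts(HTMLParser):
    """Collect (item type, property, value) triples from flat Microdata."""

    def __init__(self):
        super().__init__()
        self.facts = []          # extracted (type, property, value) triples
        self._item_type = None   # itemtype of the most recent itemscope
        self._prop = None        # itemprop whose text is currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self._item_type = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # Elements such as <meta> carry the value in an attribute.
            self.facts.append((self._item_type, prop, attrs["content"]))
        else:
            self._prop, self._text = prop, []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None:
            value = "".join(self._text).strip()
            if value:
                self.facts.append((self._item_type, self._prop, value))
            self._prop = None


def extract_facts(html):
    """Parse an HTML string and return the list of extracted triples."""
    parser = MicrodataFacts()
    parser.feed(html)
    parser.close()
    return parser.facts


# Hypothetical sample page fragment.
SAMPLE = """
<div itemscope itemtype="https://schema.org/Book">
  <span itemprop="name">Brideshead Revisited</span>
  <span itemprop="author">Evelyn Waugh</span>
  <meta itemprop="datePublished" content="1945">
</div>
"""

facts = extract_facts(SAMPLE)
```

Each triple corresponds to one "fact" in the sense of the slide. The errors, ambiguity and redundancy listed above arise once such triples are collected across billions of pages.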
  5. KnowMore: data fusion on Web markup
     Input: Web page markup from a Web crawl (Common Crawl, 44 bn facts)
     • 0. Noise: data cleansing (URIs, deduplication etc.)
     • 1.a) Scale: blocking with BM25 entity retrieval on a Lucene index of markup data
     • 1.b) Relevance: supervised resolution of coreferences
     • 2.) Quality & redundancy: data fusion with a supervised classifier for all facts (SVM, kNN, CNN, RF, LR, NB), using various feature sets (authority, relevance etc.) of the source (e.g. PageRank), the entity description or the facts
     Pipeline stages: 1. Blocking & coreference resolution; 2. Fusion / fact selection (supervised)
     Example
     • Query entity: Brideshead Revisited, type:(Book)
     • Candidate facts: node1 publisher Chapman & Hall; node1 releaseDate 1945; node1 publishDate 1961; node2 country UK; node2 publisher Black Bay Books; node3 country US; node3 copyrightHolder Evelyn Waugh; …
     • Entity description: author Evelyn Waugh; priorWork Put Out More Flags; ISBN 978031874803074; copyrightHolder Evelyn Waugh; releaseDate 1945; …
     • New query entities: BBC Audio, type:(Organization); Chapman & Hall, type:(Publisher); Put Out More Flags, type:(Book)
     • About 5,000 facts for "Brideshead Revisited" (125,000 facts for "iPhone6"); 20 correct & non-redundant facts for "Brideshead Revisited"
     References
     • Yu, R., [..], Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2019 (SWJ2019)
     • Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conf. 2018 (WWW2018)
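The blocking step (1.a) can be illustrated with a small pure-Python BM25 ranker. This is a sketch under stated assumptions, not the KnowMore implementation, which per the slide runs on a Lucene index. The function names, the parameter defaults `k1=1.2` and `b=0.75`, the whitespace tokenizer and the three toy entity descriptions are all illustrative choices.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc index, score) pairs sorted by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for i, doc in enumerate(tokenized):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def block(query, docs, top_k=2):
    """Keep the indices of the top_k descriptions with a non-zero score."""
    return [i for i, s in bm25_rank(query, docs)[:top_k] if s > 0]


# Toy entity descriptions standing in for indexed markup.
descriptions = [
    "Brideshead Revisited book Evelyn Waugh Chapman & Hall",
    "iPhone 6 smartphone Apple",
    "Brideshead Revisited audio book BBC Audio",
]
candidates = block("Brideshead Revisited", descriptions)
```

Blocking only narrows the candidate set cheaply. Deciding which candidates truly co-refer, and which of their facts to keep, is left to the supervised steps 1.b and 2.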
  6. KnowMore: data fusion on Web markup (results)
     Data fusion performance
     • Experiments for books, films, products
     • Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally et al., ACM SIGMOD 2014]; results vary widely between types
     Enriching knowledge graphs / finding new facts?
     • On average 60% - 70% of all facts are new (compared to knowledge graphs like Wikidata, Freebase, Wikipedia/DBpedia)
     • Experiments for learning categorical characteristics (e.g. film genres or product categories) [WWW2018]
  7. Understanding discourse & opinions on Twitter
     Example annotation terms: http://dbpedia.org/resource/Tim_Berners-Lee, wna:positive-emotion, onyx:hasEmotionIntensity "0.75", onyx:hasEmotionIntensity "0.0", http://dbpedia.org/resource/Solid, wna:negative-emotion
     • Heterogeneity: multimodal, multilingual, informal, "noisy" language
     • Context dependency: interpretation of short tweets requires consideration of context (e.g. time, linked content), e.g. "Dusseldorf" => city or football team
     • Representativeness & bias: demographic distributions in Twitter archives are not known
     • Dynamics & scale: e.g. 8,000 tweets per second, plus interactions (retweets etc.) & context (e.g. 25% of all tweets contain URLs)
     • Evolution & temporal aspects: the evolution of interactions over time is important for most research questions
     Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  8. TweetsKB: a knowledge base of Web-mined societal discourse
     https://data.gesis.org/tweetskb/
     • Collection & archiving of 10 billion tweets over 7 years (continuous crawl of the Twitter 1% API since 2013)
     • Information extraction using NLP methods to extract entities and sentiments (distributed batch processing with Hadoop Map/Reduce)
       o Entity linking with Wikipedia/DBpedia (Yahoo's FEL [Blanco et al. 2015]) ("president"/"potus"/"trump" => dbp:DonaldTrump), to disambiguate tweets and link to background knowledge (e.g. US politicians? Republicans?); high precision (.85), poor recall (.39)
       o Sentiment analysis with SentiStrength [Thelwall et al., 2017], F1 approx. .80
       o Extraction of metadata and lifting into established formats and schemas (SIOC, schema.org), publication using W3C standards (RDF/SPARQL)
     Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
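The slide reports precision and recall for entity linking but an F1 score for sentiment analysis. As a worked illustration, the harmonic mean of the two entity-linking figures can be computed as below. The resulting F1 is derived here and is not a number reported in the source; the function name `f1_score` is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated on the slide for entity linking.
entity_linking_f1 = f1_score(0.85, 0.39)
```

With these inputs the value comes out at roughly 0.53, which shows how strongly the low recall pulls the combined score down despite the high precision.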
  9. TweetsCOV19: a knowledge graph of societal discourse on COVID-19
     https://data.gesis.org/tweetscov19/
     • COVID-19 discourse as foundation for interdisciplinary research on solidarity behaviour & societal changes during the pandemic
     • 8.1 million tweets since October 2019 (continuously updated), extracted using a COVID-19-specific seed list & the TweetsKB pipeline
     • Used as corpus for the CIKM2020 AnalytiCup & by interdisciplinary partners, e.g. the Federal Statistical Office, Media & Communication Studies @ Heinrich-Heine-University, University of Hildesheim, etc.
     Reference: Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020.
  10. Understanding claims & stances on the Web
  11. Understanding claims & stances on the Web
      • Stance, trustworthiness of the claim?
  12. A hierarchical stance detection classifier
      Motivation
      • Problem: identifying the stance of web documents (web pages, tweets) on a specific claim (class distribution highly unbalanced)
      • Applications: the stance of documents (especially disagreement) is important (a) as a signal for the correctness of a statement and (b) for the classification of sources (Twitter users, PLDs)
      References
      • A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, preprint/arXiv.
      • A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG - A Live Knowledge Graph of fact-checked Claims, ISWC2019
  13. A hierarchical stance detection classifier (approach & results)
      Approach
      • Cascading binary classifiers to address problems at each step (e.g. cost of misclassification)
      • Features, e.g. text similarity (Word2Vec etc.), sentiments, LIWC
      • Best models per step: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
      • Experiments with the Fake News Challenge benchmark dataset & baselines
      Results
      • Minor overall performance improvement
      • 27% improvement for the disagree class
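The cascade of binary classifiers can be sketched as a chain of three yes/no decisions. The four labels are those of the Fake News Challenge dataset. The ordering of the stages (related vs. unrelated, then stance vs. neutral discussion, then agree vs. disagree) is an assumption, since the slide only states that there are three steps. The stub classifiers and the feature names `similarity` and `polarity` are hypothetical stand-ins for the trained SVM and CNN models.

```python
def cascade_stance(features, is_related, takes_stance, agrees):
    """Run three binary decisions in sequence and return one of four labels."""
    if not is_related(features):
        return "unrelated"
    if not takes_stance(features):
        return "discuss"
    return "agree" if agrees(features) else "disagree"


# Hypothetical stub classifiers; real stages would be trained models.
def is_related(f):
    return f["similarity"] >= 0.5


def takes_stance(f):
    return abs(f["polarity"]) >= 0.2


def agrees(f):
    return f["polarity"] > 0


label = cascade_stance(
    {"similarity": 0.8, "polarity": -0.6}, is_related, takes_stance, agrees
)
```

Because each stage is a separate model, misclassification costs can be tuned per step, for example with a class-wise penalty in the stage that separates the rare disagree class.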
  14. Overview
      Part I: Understanding & interpreting (user-generated) Web content
      • Content: web pages, social media posts, etc.
      • Extraction, verification, disambiguation of topics, entities, stances, opinions, sentiments (semantics)
      • Understanding language complexity, structure or modality of online resources
      Topics:
      • Extraction & verification of factual knowledge & claims
      • Stance detection of websites
      • Extraction of opinions/trends (Twitter)
      Part II: Understanding & interpreting user behaviour & interactions
      • Behaviour and interactions with online platforms (e.g. Web search engines and social media platforms) & online content (e.g. tweets)
      • Signals: click-through data, queries, shares, likes, behavioral traces (mouse movements, navigation, eye tracking etc.)
      Topics:
      • Understanding competence, information needs, knowledge gain of users from behavioral traces
      • Scenarios: Web search, microtask crowdsourcing
  15. Competence & knowledge acquisition of web users: prediction from in-session behavior?
      • Research question: Is it possible to predict the competence and knowledge acquisition of users on the basis of user interactions such as browsing, scrolling, or behavioral traces (mouse movements, keystrokes, eye tracking)?
      • Approach: studies and machine learning models in two scenarios: (a) Web search and (b) microtask crowdsourcing on platforms like Amazon Mechanical Turk
      • Applications: e.g. classification of web users, improvement of search results, or adaptation in learning and assessment environments
      References
      • Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys, ACM CHI2015.
      • Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-selection, Computer Supported Cooperative Work 28(5): 815-841 (2019)
  16. Acquisition of knowledge during web search? Challenges & results
      • Identifying coherent search missions?
      • Identification of "learning" during search: identification of "informational sessions" (as opposed to "transactional" or "navigational" search [Broder, 2002])
        o Classification with an F1 score of approx. 75% based on user interactions
      • How competent is the user? Predicting and understanding the competence / knowledge level of users based on "in-session" behaviour
      • How well does a user achieve their learning objective or information need? Predicting the knowledge state/gain during a session
        o Correlation of user behaviour (queries, browsing, mouse movements etc.) & knowledge state/gain [CHIIR18]
        o Prediction of knowledge state/gain using supervised ML methods [SIGIR18]
  17. Knowledge level & growth vs user behaviour in web search
      Data & experimental setup
      • Crowdsourcing of behavioral data in search sessions
      • 10 topics/information needs (e.g. "altitude sickness", "tornados") plus pre- and post-tests to determine knowledge state and knowledge gain (KS, KG)
      • Approx. 1,000 crowd workers; 100 sessions per topic
      • Monitoring of user behavior along 76 features in 5 categories: session, query, SERP (search engine result page), browsing, mouse traces
      Results
      • 70% of users show knowledge gain (KG)
      • Negative correlation between KG & topic popularity (avg. accuracy of workers in knowledge tests) (R = -.87)
      • Time spent actively on websites explains 7% of knowledge gain
      • Query complexity explains 25% of knowledge gain
      • Search behavior correlates more strongly with search topic than with KG/KS
      Reference: Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
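The quantities on this slide can be made concrete with a short sketch. It assumes that knowledge gain is the post-test score minus the pre-test score and that "explains x%" refers to the squared Pearson correlation; neither definition is spelled out on the slide. All function names and the toy numbers are hypothetical and are not the study's data.

```python
import math


def knowledge_gain(pre_score, post_score):
    """Knowledge gain as the difference between post- and pre-test score."""
    return post_score - pre_score


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy sessions: pre-test score, post-test score, and one behavioral feature.
pre_scores = [2, 5, 3, 6]
post_scores = [6, 6, 7, 7]
query_complexity = [3, 1, 4, 2]

gains = [knowledge_gain(a, b) for a, b in zip(pre_scores, post_scores)]
r = pearson_r(query_complexity, gains)
explained_variance = r ** 2  # share of variance in KG explained by the feature
```

Under the same reading, the reported R = -.87 between knowledge gain and topic popularity would correspond to about 76% shared variance.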
  18. ML models to predict KG/KS during Web search
      • Categorisation of the sessions along knowledge state (KS) & knowledge gain (KG) in {low, moderate, high} with (low < (mean ± 0.5 SD) < high)
      • Supervised multiclass classification (Naive Bayes, Logistic Regression, SVM, Random Forest, Multilayer Perceptron)
      • KG prediction performance (after 10-fold cross-validation)
      • Feature impact (KG prediction)
      Reference: Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
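The class definition (low < (mean ± 0.5 SD) < high) can be read as a three-way binning around the mean. A minimal sketch follows, assuming the population standard deviation; the slide does not say whether the sample or population SD was used. The function name `categorise` and the toy values are hypothetical.

```python
import statistics


def categorise(values, width=0.5):
    """Label each value low / moderate / high relative to mean ± width * SD."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    lower, upper = mean - width * sd, mean + width * sd
    labels = []
    for v in values:
        if v < lower:
            labels.append("low")
        elif v > upper:
            labels.append("high")
        else:
            labels.append("moderate")
    return labels


# Toy knowledge-gain values for five sessions.
labels = categorise([0, 5, 5, 5, 10])
```

The resulting labels serve as the targets for the supervised multiclass classifiers listed on the slide.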
  19. ML models to predict KG/KS during the search (ongoing work)
      • Lab studies necessary for more reliable data (controlled environment, longer sessions) [completed]
      • Additional behavioral features (eye tracking) [CHIIR2020, CHI2020]
      • Resource features (e.g. complexity, analytic/emotional language, multimodality etc.) as additional signals [IR Journal, under review]
      • Improve ranking/retrieval in web search or in digital archives (SALIENT Project, Leibniz Cooperative Excellence; GESIS Data Search platforms)
  20. Other features to predict competence? Expertise & the "Dunning-Kruger Effect"
      • Incompetence in a particular task reduces the ability to recognise one's own incompetence in the task (Dunning, D., The Dunning-Kruger Effect: On Being Ignorant of One's Own Ignorance, Advances in Experimental Social Psychology 44 (2011), 247)
      Research questions
      • Self-assessment as an additional feature to predict competence?
      • Application in microtask crowdsourcing for the classification of "workers" or in online learning for the classification of learners
      Some results
      • Self-assessment is a reliable feature for predicting competence/future performance
      • More reliable than prior performance in the task alone
      • The tendency to overestimate one's own competence grows with increasing task difficulty
      Figure: Performance ("accuracy") of users classified as "competent" according to (1) prior performance and (2) performance plus self-assessment
      Reference: Gadiraju, U., Fetahu, B., Kawase, R., Siehndel, P., Dietze, S., Using Worker Self-Assessments for Competence-based Pre-Selection in Crowdsourcing Microtasks. In: ACM Transactions on Computer-Human Interaction (ACM TOCHI), Vol. 24, Issue 4, August 2017.
  21. Knowledge Technologies for the Social Sciences (WTS): https://www.gesis.org/en/institute/departments/knowledge-technologies-for-the-social-sciences/
      Data & Knowledge Engineering @ HHU: https://www.cs.hhu.de/en/research-groups/data-knowledge-engineering.html
      @stefandietze, http://stefandietze.net
      Acknowledgements
      • Erdal Baran (GESIS, Germany)
      • Katarina Boland (GESIS, Germany)
      • Stefan Conrad (HHU, Germany)
      • Gianluca Demartini (Brisbane Uni, Australia)
      • Elena Demidova (L3S, Germany)
      • Dimitar Dimitrov (GESIS, Germany)
      • Ujwal Gadiraju (Delft University, NL)
      • Asif Ekbal (IIT Patna, India)
      • Pavlos Fafalios (FORTH ICS, Greece)
      • Peter Holtz (IWM, Tübingen)
      • Ricardo Kawase (Mobile.de, Germany)
      • Vasileios Iosifidis (L3S, Germany)
      • Eirini Ntoutsi (LUH, Germany)
      • Markus Rokicki (L3S, Germany)
      • Arjun Roy (IIT Patna, India)
      • Patrick Siehndel (L3S, Germany)
      • Nicolas Tempelmeier (L3S, Germany)
      • Konstantin Todorov (LIRMM, France)
      • Ran Yu (GESIS, Germany)
      • Benjamin Zapilko (GESIS, Germany)
      • Matthäus Zloch (GESIS, Germany)
      • Xiaofei Zhu (Chongqing University, China)
