Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you, the difficulty is not in providing the content itself but in producing the annotations needed to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive, particularly when the nature of the content requires subject matter experts to be involved.
In this talk, I highlight emerging approaches to tackling this challenge using what's known as weak supervision - using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.
Content + Signals: The value of the entire data estate for machine learning
1. Content + Signals
The value of the entire data estate for machine learning
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to
Corey Harper, Çağatay Demiralp, Marieke van Erp
ConTech Live 2021
2. Outline
• Where I’m coming from
• The Success of Machine Learning
• The Need for Data
• Reducing (Training) Data Acquisition Costs
• Implications + Actions
3. • A national federation of AI research labs
• One ICAI head office
• Science Park Amsterdam
• Five ICAI locations
• Currently:
• Amsterdam (2)
• Delft
• Nijmegen
• Utrecht
ING AI for Fintech
Partnering with Industry
8. DEEP NEURAL NETWORKS
Adams Wei Yu, David Dohan,
Minh-Thang Luong, Rui Zhao,
Kai Chen, Mohammad Norouzi,
Quoc V. Le: QANet: Combining
Local Convolution with Global
Self-Attention for Reading
Comprehension. ICLR (Poster)
2018
9. Source: Sharir, Or, Barak Peleg, and
Yoav Shoham. "The Cost of Training
NLP Models: A Concise Overview." arXiv
preprint arXiv:2004.08900 (2020).
10. THE NEED FOR DATA
Lin, T. Y., Maire, M., Belongie, S.,
Hays, J., Perona, P., Ramanan, D., ...
& Zitnick, C. L. (2014, September).
Microsoft COCO: Common objects in
context. In European conference on
computer vision (pp. 740-755).
Springer, Cham.
11. THE NEED FOR ANNOTATED DATA
Zhang, Yuhao, et al. "Position-aware attention and supervised data improve slot filling."
Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing. 2017.
16. Reduce the Cost of Annotated Data
http://ai.stanford.edu/blog/weak-supervision/
17. Transfer Learning
Source: Symeonidou, Anthi, Viachaslau Sazonau, and Paul Groth. "Transfer Learning for
Biomedical Named Entity Recognition with BioBERT." SEMANTICS Posters&Demos. 2019.
20. Source:
Stephen H. Bach et al. 2019. Snorkel DryBell: A Case Study in Deploying Weak
Supervision at Industrial Scale. In Proceedings of the 2019 International Conference on
Management of Data (SIGMOD '19). ACM, New York, NY, USA, 362-375. DOI:
https://doi.org/10.1145/3299869.3314036
https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
Weak Supervision
21. The really long tail - smell extraction
Ryan Brate, Paul Groth and Marieke van Erp (2020) Towards
Olfactory Information Extraction from Text: A Case Study on
Detecting Smell Experiences in Novels. LaTeCH-CLfL 2020
22. Weak Supervision as Data Programming
http://ai.stanford.edu/blog/weak-supervision/
23. Supervision Sources / Signals
• Heuristics and rules: e.g. existing human-authored rules about the target
domain.
• Topic models, taggers, and classifiers: e.g. machine learning models about
the target domain or a related domain.
• Aggregate statistics: e.g. tracked metrics about the target domain.
• Knowledge or entity graphs: e.g. databases of facts about the target
domain.
https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
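The idea of treating in-house signals as labeling functions can be sketched in a few lines of Python. This is a minimal illustration, not the Snorkel API and not the method of any paper cited here: the three labeling functions, the `SCENT_SOURCES` lexicon, and the smell-detection task (borrowed from the "long tail" slide) are all hypothetical, and a plain majority vote stands in for the learned label model that data programming systems typically use to weigh and combine signals.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a labeling function may decline to vote

# Hypothetical lexicon standing in for a knowledge or entity graph.
SCENT_SOURCES = {"lavender", "tobacco", "sulphur"}
# Hypothetical list of sensory words used by the negative rule below.
SENSORY_WORDS = ("smell", "scent", "odour", "odor", "stench", "fragrance", "aroma")


def lf_smell_word(text: str) -> Optional[int]:
    """Heuristic rule: an explicit smell word suggests a smell experience (label 1)."""
    lowered = text.lower()
    return 1 if any(w in lowered for w in ("smell", "stench", "odour", "odor")) else ABSTAIN


def lf_scent_source(text: str) -> Optional[int]:
    """Lexicon signal: mentions of known scent sources suggest label 1."""
    tokens = set(text.lower().replace(".", " ").replace(",", " ").split())
    return 1 if tokens & SCENT_SOURCES else ABSTAIN


def lf_no_sensory_words(text: str) -> Optional[int]:
    """Negative rule: no sensory vocabulary at all suggests label 0."""
    lowered = text.lower()
    return 0 if not any(w in lowered for w in SENSORY_WORDS) else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], Optional[int]]] = [
    lf_smell_word,
    lf_scent_source,
    lf_no_sensory_words,
]


def weak_label(text: str) -> Optional[int]:
    """Combine the votes by simple majority; return None on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # conflicting signals: leave the example unlabeled
    return ranked[0][0]


if __name__ == "__main__":
    for sentence in (
        "The stench of tobacco filled the hall.",
        "She signed the contract on Monday.",
        "A vase of lavender stood on the table.",
    ):
        print(weak_label(sentence), sentence)
```

Sentences where the signals conflict, such as the lavender example, are left unlabeled rather than guessed. In practice those are the cases where an additional signal or a subject matter expert adds the most value.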
24. Multi-modal Data
Source:
Dunnmon, J. A., Ratner, A. J., Saab, K., Khandwala,
N., Markert, M., Sagreiya, H., ... & Ré, C. (2020).
Cross-modal data programming enables rapid medical
machine learning. Patterns, 100019.
25. End user data programming
Source:
Data Programming by Demonstration: A
Framework for Interactively Learning
Labeling Functions.
S. Evensen, C. Ge, D. Choi, Ç. Demiralp
Findings of EMNLP (Ruler), 2020.
26. Supervision with Observation
Source:
Wang, Xin, Nicolas Thome, and Matthieu Cord. "Gaze latent
support vector machine for image classification improved by
weakly supervised region selection." Pattern Recognition 72
(2017): 59-71.
27. Implications
Premise | Consequence
Improving ability to use expertise | Expertise is a critical resource
Improving ability to use more and different signals | Signal capture becomes imperative
Multiple content sources buttress each other | Understand and use the entire data estate
Machine learning SOTA is accessible | Problem formulation is fundamental
29. Source:
Michael Lauruhn and Paul Groth.
"Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
Action 1: Make a map
31. Conclusion
• Powerful ML models are available today
• Data is the essential driver
• Don’t overlook your resources:
• your content, your expertise, your customer insight
Paul Groth | p.groth@uva.nl | @pgroth | pgroth.com | indelab.org
Editor's notes
330K images (>200K labeled)
1.5 million object instances