Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning


Talk "Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning", given at #MSM'13 (WWW'13), Rio de Janeiro, Brazil. Microposts shared on social platforms instantaneously report facts, opinions or emotions. These posts often mention entities, which change continuously with whatever is currently trending. Recognising these named entities is therefore a challenging task, for which off-the-shelf approaches are not well equipped. We propose NERD-ML, an approach that unifies the benefits of a crowd entity recognizer through Web entity extractors combined with the linguistic strengths of a machine learning classifier.


Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning
Marieke van Erp, Giuseppe Rizzo, Raphaël Troncy
@giusepperizzo

May 13, 2013, Making Sense of Microposts (#MSM2013)
NERD-ML @ MSM'13

Preprocessing
➢ Dataset is converted to CoNLL IOB format
➢ Applied 10-fold cross-validation
➢ Chunked the set of tweets into 50 KB parts in order to comply with NERD file size limitations
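The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tab-separated output, the choice to start every entity with a B- tag, and reading "50KB" as 50 * 1024 bytes of UTF-8 text are all assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token, entity class or "O"); hypothetical input shape


def to_iob(tokens: List[Token]) -> List[str]:
    """Render one micropost as 'token<TAB>tag' lines with B-/I-/O tags.

    Assumption: every entity starts with B-, and consecutive tokens with the
    same class belong to the same entity (adjacent entities of one class
    would be merged by this simple rule).
    """
    lines, prev = [], "O"
    for word, label in tokens:
        if label == "O":
            tag = "O"
        elif label == prev:
            tag = "I-" + label
        else:
            tag = "B-" + label
        lines.append(f"{word}\t{tag}")
        prev = label
    return lines


def chunk_posts(posts: List[str], max_bytes: int = 50 * 1024) -> List[List[str]]:
    """Group posts into chunks whose UTF-8 size stays within max_bytes.

    A single post larger than max_bytes still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for post in posts:
        n = len(post.encode("utf-8")) + 1  # +1 for the newline separator
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(post)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

Splitting on post boundaries rather than raw byte offsets keeps each micropost intact inside one chunk, which makes it easier to align the returned entities with the original tokens.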

NERD extractors
➢ Retrieves named entities from 10 extractors (Web APIs)
➢ Harmonizes the classification according to the NERD Ontology v0.5
http://nerd.eurecom.fr/ontology
➢ 75 entity classes mapped to 4 MSM'13 classes
http://nerd.eurecom.fr
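The class harmonization step can be pictured as a lookup table. The sketch below is illustrative only: the three entries and the fallback to MISC are assumptions, not the actual 75-to-4 mapping used in NERD-ML, and `harmonize` is a hypothetical helper name.

```python
# Illustrative subset of a class mapping; the real table covers 75 classes.
NERD_TO_MSM = {
    "Person": "PER",
    "Organization": "ORG",
    "Location": "LOC",
}


def harmonize(nerd_class: str) -> str:
    """Map a NERD class label (plain name or URI) to PER, ORG, LOC or MISC.

    Assumption: any class without an explicit entry falls back to MISC.
    """
    # Keep only the local name if a full URI such as '...ontology#Person' is given.
    local = nerd_class.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    return NERD_TO_MSM.get(local, "MISC")
```

For example, `harmonize("http://nerd.eurecom.fr/ontology#Person")` yields `"PER"`, while an unlisted class such as `"Product"` falls back to `"MISC"`.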

Ritter et al. (2011)
➢ Off-the-shelf tool tailored to a Twitter stream, based on:
– LabeledLDA (+CRF)
– Textual features (POS, capitalization, suffix, etc.)
– Freebase gazetteers (names of PER, ORG, LOC)
➢ 10 entity classes mapped to 4 classes
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP'11) (2011)

Stanford CRF
➢ Re-trained on the MSM'13 corpora
➢ Parameters based on the english.conll.4class.distsim.crf.ser.gz properties file provided with the Stanford distribution
➢ Baseline of our approach
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05) (2005)

Textual features
➢ POS
➢ Capitalisation information
– initial capital
– all capitalized
– proportion of token capitals
➢ Prefix (first three letters of the token)
➢ Suffix (last three letters of the token)
➢ Whether the token is at the beginning or at the end of the micropost
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP'11) (2011)
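The feature list above translates fairly directly into a per-token function. The following is a sketch under assumptions: POS tags are taken as given from some external tagger, "proportion of token capitals" is read as the share of characters in the token that are uppercase, and the function and key names are hypothetical.

```python
from typing import Dict, List, Union

Feature = Union[str, bool, float]


def textual_features(tokens: List[str], i: int, pos_tags: List[str]) -> Dict[str, Feature]:
    """Compute the textual features listed above for the token at position i."""
    tok = tokens[i]
    caps = sum(ch.isupper() for ch in tok)
    return {
        "pos": pos_tags[i],                    # supplied by an external POS tagger
        "initial_capital": tok[:1].isupper(),
        "all_capitalized": tok.isupper(),
        "capital_proportion": caps / len(tok) if tok else 0.0,
        "prefix": tok[:3],                     # first three letters
        "suffix": tok[-3:],                    # last three letters
        "is_first": i == 0,                    # token opens the micropost
        "is_last": i == len(tokens) - 1,       # token closes the micropost
    }
```

Calling `textual_features(["Obama", "visits", "RIO"], 2, ["NNP", "VBZ", "NNP"])` marks the token as all capitalized, with prefix and suffix both `"RIO"`, and as the last token of the post.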

ML settings
Run01: 7 textual features (POS, initial capital, proportion of capitals, prefix, suffix, end/start token); 0 extractors; ML=k-NN, k=1, Euclidean distance
Run02: 0 textual features; 12 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, Textrazor, Wikimeta, Zemanta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO
Run03: 4 textual features (POS, initial capital, suffix, proportion of capitals); 8 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, Textrazor, Wikimeta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO
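To make the Run01 setting concrete, here is a minimal pure-Python sketch of a 1-nearest-neighbour classifier with Euclidean distance. It assumes the features have already been encoded as numeric vectors (for instance, one-hot for POS, prefix and suffix); that encoding, the function names and the toy vectors are assumptions rather than details from the slides. Run02 and Run03 would replace this classifier with an SVM over extractor outputs, which is not shown here.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (numeric feature vector, IOB label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train: List[Example], query: Sequence[float]) -> str:
    """Return the label of the single closest training example (k = 1)."""
    _, label = min(train, key=lambda ex: euclidean(ex[0], query))
    return label


# Toy training set: [initial_capital, capital_proportion, is_last]
train = [
    ([1.0, 0.2, 0.0], "B-PER"),
    ([0.0, 0.0, 1.0], "O"),
]
label = predict_1nn(train, [1.0, 0.25, 0.0])
```

With k = 1 there is no voting, so each token simply inherits the label of its nearest training token in feature space.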

Precision – MSM'13 training, 10-fold cross-validation

Recall – MSM'13 training, 10-fold cross-validation

F1 – MSM'13 training, 10-fold cross-validation
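For reference, the three measures reported on these slides can be computed as below. This is a generic sketch: it assumes a prediction counts as correct only when it exactly matches a gold annotation (same position, surface form and class), which may differ from the scoring actually used for MSM'13, and `prf1` is a hypothetical helper name.

```python
from typing import Hashable, Set, Tuple


def prf1(gold: Set[Hashable], predicted: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over sets of entity annotations.

    Assumption: exact match between gold and predicted annotations.
    """
    tp = len(gold & predicted)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {(0, "Obama", "PER"), (5, "Rio", "LOC")}
pred = {(0, "Obama", "PER"), (5, "Rio", "ORG"), (9, "x", "MISC")}
p, r, f = prf1(gold, pred)
```

In a 10-fold setup, these values would be computed per fold on the held-out part and then averaged.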

Lessons learned
➢ MISC class is ambiguously defined
➢ 8.1% of the named entities from the training data occur in the test data
➢ Best run is Run03: not all extractors and only some textual features
➢ For the next challenge, what about entity linking?
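An overlap figure like the 8.1% above could be measured along these lines. The sketch assumes the comparison is made on distinct, case-insensitive surface forms and that the denominator is the number of distinct training entities; the slides do not state how the number was computed, and `overlap_share` is a hypothetical name.

```python
from typing import Iterable


def overlap_share(train_entities: Iterable[str], test_entities: Iterable[str]) -> float:
    """Share of distinct training entities that also appear in the test data.

    Assumption: entities are compared by lowercased surface form.
    """
    train = {e.lower() for e in train_entities}
    test = {e.lower() for e in test_entities}
    return len(train & test) / len(train) if train else 0.0


share = overlap_share(["Obama", "Rio", "NASA", "Paris"], ["rio", "London"])
```

A low share means a classifier cannot rely on memorised entity names and has to generalise from context and surface features.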

Thanks for your time and attention
http://www.slideshare.net/giusepperizzo
NERD-ML
http://github.com/giusepperizzo/nerdml




