SlideShare ist ein Scribd-Unternehmen logo
1 von 2
Downloaden Sie, um offline zu lesen
Harnessing Web Page Directories for Large-Scale
                      Classification of Tweets

                                                   Arkaitz Zubiaga1 , Heng Ji2
                                  Computer Science Department12 , Linguistics Department2
                                          Queens College and Graduate Center
                                     City University of New York, New York, NY, USA
                                    arkaitz@zubiaga.org, hengjicuny@gmail.com

ABSTRACT                                                          of conversations, e.g., events or memes. While they eval-
Classification is paramount for an optimal processing of tweets,   uated with 5,407 and 1,036 hand-labeled tweet instances,
albeit performance of classifiers is hindered by the need of       respectively, studying with larger and more diverse datasets
large sets of training data to encompass the diversity of con-    would strengthen the results, so the classifiers could be fur-
tents one can find on Twitter. In this paper, we introduce         ther generalized to Twitter’s diverse contents. In this paper
an inexpensive way of labeling large sets of tweets, which        we present a novel approach for large-scale classification of
can be easily regenerated or updated when needed. We use          tweets. We describe an inexpensive approach for automatic
human-edited web page directories to infer categories from        generation of labeled sets of tweets. Based on distant super-
URLs contained in tweets. By experimenting with a large           vision, we infer the category for each tweet from the URL it
set of more than 5 million tweets categorized accordingly,        points to. We extract categories for each URL from the
we show that our proposed model for tweet classification           Open Directory Project (ODP, http://dmoz.org), a large
can achieve 82% in accuracy, performing only 12.2% worse          human-edited web directory. Our approach allows to eas-
than for web page classification.                                  ily create large collections of labeled tweets, as well as to
                                                                  incrementally update or renew these collections. We exper-
                                                                  iment our proposal with over 5 million tweets, and compare
Categories and Subject Descriptors                                to web page classification, showing that web page directories
H.3.3 [Information Search and Retrieval]: Information             can achieve 82% in terms of accuracy.
filtering
                                                                  2. LEARNING MODEL
Keywords                                                             We propose a learning model based on distant supervision
                                                                  [4], which takes advantage of external information sources to
tweets, classification, distant, large-scale                       enhance supervision. Similar techniques have been adapted
                                                                  by [2] to categorize blog posts using Freebase, and [1] to infer
1. INTRODUCTION                                                   sentiment in tweets from hashtags. Here, we propose that
   Twitter’s evergrowing daily flow of millions of tweets in-      tweets can be assumed to be in the same topic as the URL
cludes all kinds of information, coming from different coun-       they are pointing to, provided that the tweet is referring to
tries, and written in different languages. Classifying tweets      or discussing about the contents of the web page. We set out
into categories is paramount to create narrower streams of        to use the ODP, a community-driven open access web page
data that interested communities could efficiently process.         directory. This way, we can incrementally build a labeled set
For instance, news media could be interested in mining what       of tweets, where each tweet will be labeled with the category
is being said in the news segment, while governments could        defined on the ODP for the URL being pointed.
be interested in tracking issues concerning health.                  The ODP is manually maintained continuously. However,
   Besides tweets being short and informally written, one of      since Twitter evolves rapidly, its stream is riddled with new
the challenges for classifying tweets by topic is hand-labeling   URLs that were not seen before. Thus, most URLs found
a large training set to learn from. Manually labeled sets of      in tweets are not likely to be on the ODP, and the coverage
tweets present several issues: (i) they are costly, and thus      would be very low. To solve this, we consider the domain
tend to be small and hardly representative of all the content     or subdomain of URLs, removing the path. This enables to
on Twitter, (ii) it is unaffordable to annotate tweets in a        match a higher number of URLs in tweets to (sub)domains
wide variety of languages, and (iii) the annotated set might      previously categorized on the ODP. Thus, a tweet pointing
become obsolete as both Twitter and the vocabulary evolve.        to nytimes.com/mostpopular/ will be categorized as News
   Previous research has studied the use of different features     if the ODP says that nytimes.com belongs to News.
for classification purposes. For instance, [5] use a set of so-
cial features to categorize tweets by type, e.g., opinions or     3. DATA COLLECTION
deals. Similarly, [6] categorized groups of tweets into types        We collected tweets containing links, so we could catego-
                                                                  rize tweets from the URLs they pointed to. Given the re-
                                                                  stricted access for white-listed users to the stream of linked
Copyright is held by the author/owner(s).
WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil.      tweets, we tracked tweets containing ’http’ from Twitter’s
ACM 978-1-4503-2038-2/13/05.                                      streaming API, and removed the few tweets that mentioned
the word ’http’ instead of containing a URL. We collected            Table 2 further looks at the performance of learning from
tweets during two non-consecutive 3-day periods: March 20-        and testing on tweets. After computing the cosine similar-
22, and April 17-19, 2012. Having two non-consecutive time-       ity of words in tweet texts and words in associated URL
frames enables to have the former as training data and the        titles, we break down the accuracy into ranges of tweet-title
latter as test data, with a certainty that tweets from the test   similarity. This aims to analyze the extent to which tweets
set are unlikely to be part of the same conversations as in       can be better classified when they are similar to the title of
the training data, avoiding overfitting. The process above         the web page (more akin to web page classification), or when
returned 12,876,106 tweets with URLs for the training set,        they are totally different (e.g., adding user’s own comments).
and 12,894,602 for the test set. Since URLs in tweets are         There is in fact difference in that high-similarity tweets (.8-
usually shortened, we retrieved all the unshortened destina-      1) are classified most accurately, but low-similarity tweets
tion URLs, so we could extract the actual domain each tweet       (0-.2) show promising results by losing only 8.9%.
points to. For matching final URLs to their corresponding
ODP categories, we defined the following two restrictions:             Similarity     0-.2    .2-.4   .4-.6   .6-.8    .8-1
(i) from the list of 16 top-level categories on the ODP, we           Accuracy       0.778   0.808   0.799   0.787    0.854
removed ’Regional’ and ’World’ since, unlike the other cat-
egories, do not represent topics, but a transversal catego-          Table 2: Classification of tweets by similarity.
rization based on countries and languages; we keep the re-
maining 14 categories, and (ii) a small subset of the domains
are categorized in more than one category on the ODP; we
exclude these to avoid inconsistencies, and consider those        5. CONCLUSION
with just one category. Matching our tweet dataset to ODP,           We have described a novel learning model that, based on
we got 2,265,411 tweets for the training set, and 2,851,900       distant supervision and inferring categories for tweets from
tweets for the test set (which covers 19.9% of the tweets         web page directories, enables to build large-scale training
collected). For each tweet, we keep the text of the tweet,        data for tweet classification by topic. This method can help
the category extracted from ODP, as well as the text within       accurately classify the diverse contents found in tweets, not
the <title> tag of the associated URL, so we can compare          only when they resemble web contents. The proposed solu-
web page classification and tweet classification. The result-       tion presents an inexpensive way to build large-scale labeled
ing dataset includes tweets in tens of languages, where only      sets of tweets, which can be easily updated, extended, or
31.1% of the tweets are written in English.                       renewed with fresh data as the vocabulary evolves.
                                                                  Acknowledgments. This work was supported by the U.S.
                                                                  Army Research Laboratory under Cooperative Agreement
4. EXPERIMENTS                                                    No. W911NF- 09-2-0053 (NS-CTA), the U.S. NSF CA-
   We use SVM-multiclass [3] for classification experiments.       REER Award under Grant IIS-0953149, the U.S. NSF EA-
We do 10-fold cross-validation by using 10% splits (of ∼226k      GER Award under Grant No. IIS-1144111, the U.S. DARPA
tweets each) of the training data described above. Since          Deep Exploration and Filtering of Text program and CUNY
the focus of this study is to validate the learning model,        Junior Faculty Award. The views and conclusions contained
we rely on TF-based representation of the bag-of-words for        in this document are those of the authors and should not be
both tweets (with URL removed for representation) and web         interpreted as representing the official policies, either ex-
pages, and leave the comparison with other representations        pressed or implied, of the U.S. Government. The U.S. Gov-
for future work. We show the accuracy results, representing       ernment is authorized to reproduce and distribute reprints
the fraction of correct categorizations.                          for Government purposes notwithstanding any copyright no-
   Table 1 compares the performance of using either web           tation here on.
pages or tweets for training and test. While obviously learn-
ing from web pages to classify web pages performs best,           6. REFERENCES
given the nature of web page directories, classifying tweets      [1] A. Go, R. Bhayani, and L. Huang. Twitter sentiment
from tweets performs only 12.2% worse, achieving 82% per-             classification using distant supervision. CS224N Project
formance. Interestingly, the system can be trained from               Report, Stanford, pages 1–12, 2009.
tweets to categorize web pages as accurately, but classifying     [2] S. D. Husby and D. Barbosa. Topic classification of
tweets from web pages performs 34.7% worse, potentially               blog posts using distant supervision. EACL, pages
because web pages do not include the informal vocabulary              28–36, 2012.
needed to train. Moreover, if we reduce the training set from     [3] T. Joachims. Text categorization with support vector
226k to a manually affordable subset of 10k tweets (note               machines: Learning with many relevant features.
that this is still larger than the 5k and 1k used in previous         Machine learning: ECML-98, pages 137–142, 1998.
research described above), the overall accuracy of learning       [4] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant
from and testing on tweets drops to 0.565 (31.4% loss).               supervision for relation extraction without labeled data.
                                                                      In ACL/CONLL, pages 1003–1011, 2009.
                            Tweet      Web                        [5] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu,
                  Tweet      0.824     0.830                          and M. Demirbas. Short text classification in twitter to
                   Web       0.542     0.938                          improve information filtering. In SIGIR, pages 841–842,
                                                                      2010.
Table 1: Classification of tweets (rows = train,                   [6] A. Zubiaga, D. Spina, V. Fresno, and R. Mart´  ınez.
columns = test).                                                      Classifying trending topics: a typology of conversation
                                                                      triggers on twitter. In CIKM, pages 2461–2464, 2011.

Weitere ähnliche Inhalte

Was ist angesagt?

Fake News Detection using Machine Learning
Fake News Detection using Machine LearningFake News Detection using Machine Learning
Fake News Detection using Machine Learningijtsrd
 
Cross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesCross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesIJSRED
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research ReportAlex Sumner
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGijaia
 
Twitter Sentiment Analysis
Twitter Sentiment AnalysisTwitter Sentiment Analysis
Twitter Sentiment Analysisijtsrd
 
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boostImproved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boosteSAT Journals
 
Aspects of broad folksonomies
Aspects of broad folksonomiesAspects of broad folksonomies
Aspects of broad folksonomiesdermotte
 
Wikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing DocumentsWikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing DocumentsZareen Syed
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionaryEditor IJMTER
 

Was ist angesagt? (11)

Fake News Detection using Machine Learning
Fake News Detection using Machine LearningFake News Detection using Machine Learning
Fake News Detection using Machine Learning
 
Cross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning TechniquesCross breed Spam Categorization Method using Machine Learning Techniques
Cross breed Spam Categorization Method using Machine Learning Techniques
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research Report
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
 
A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...
 
Twitter Sentiment Analysis
Twitter Sentiment AnalysisTwitter Sentiment Analysis
Twitter Sentiment Analysis
 
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boostImproved spambase dataset prediction using svm rbf kernel with adaptive boost
Improved spambase dataset prediction using svm rbf kernel with adaptive boost
 
Aspects of broad folksonomies
Aspects of broad folksonomiesAspects of broad folksonomies
Aspects of broad folksonomies
 
Wikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing DocumentsWikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing Documents
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
 
Query expansion
Query expansionQuery expansion
Query expansion
 

Andere mochten auch

Exploiting Wikipedia for Entity Name Disambiguation in Tweets
Exploiting Wikipedia for Entity Name Disambiguation in TweetsExploiting Wikipedia for Entity Name Disambiguation in Tweets
Exploiting Wikipedia for Entity Name Disambiguation in TweetsM. Atif Qureshi
 
Classifying Microblogs For Disasters
Classifying Microblogs For DisastersClassifying Microblogs For Disasters
Classifying Microblogs For DisastersSarvnaz Karimi
 
Discovering Context
Discovering ContextDiscovering Context
Discovering ContextYegin Genc
 
Semantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports TweetsSemantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports Tweetsmitsmit
 
Tweets Classification
Tweets ClassificationTweets Classification
Tweets ClassificationVarun Gupta
 
CLASSIFICATION OF TWEETS
CLASSIFICATION OF TWEETSCLASSIFICATION OF TWEETS
CLASSIFICATION OF TWEETSMukul Jha
 
SubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntitySubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntityAnkita Kumari
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTrilok Sharma
 
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes ClassifiersDongseo University
 

Andere mochten auch (9)

Exploiting Wikipedia for Entity Name Disambiguation in Tweets
Exploiting Wikipedia for Entity Name Disambiguation in TweetsExploiting Wikipedia for Entity Name Disambiguation in Tweets
Exploiting Wikipedia for Entity Name Disambiguation in Tweets
 
Classifying Microblogs For Disasters
Classifying Microblogs For DisastersClassifying Microblogs For Disasters
Classifying Microblogs For Disasters
 
Discovering Context
Discovering ContextDiscovering Context
Discovering Context
 
Semantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports TweetsSemantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports Tweets
 
Tweets Classification
Tweets ClassificationTweets Classification
Tweets Classification
 
CLASSIFICATION OF TWEETS
CLASSIFICATION OF TWEETSCLASSIFICATION OF TWEETS
CLASSIFICATION OF TWEETS
 
SubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntitySubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an Entity
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVM
 
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
 

Ähnlich wie Harnessing Web Page Directories for Large-Scale Classification of Tweets

Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System
Latent Dirichlet Allocation as a Twitter Hashtag Recommendation SystemLatent Dirichlet Allocation as a Twitter Hashtag Recommendation System
Latent Dirichlet Allocation as a Twitter Hashtag Recommendation SystemShailly Saxena
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterDan Nguyen
 
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECHABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECHIJCI JOURNAL
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionieeepondy
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
IRJET- Finding Related Forum Posts through Intention-Based Segmentation
IRJET-  	  Finding Related Forum Posts through Intention-Based SegmentationIRJET-  	  Finding Related Forum Posts through Intention-Based Segmentation
IRJET- Finding Related Forum Posts through Intention-Based SegmentationIRJET Journal
 
Tweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity RecognitionTweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity Recognition1crore projects
 
On Incentive-based Tagging
On Incentive-based TaggingOn Incentive-based Tagging
On Incentive-based TaggingFrancesco Rizzo
 
On Summarization and Timeline Generation for Evolutionary Tweet Streams
On Summarization and Timeline Generation for Evolutionary Tweet StreamsOn Summarization and Timeline Generation for Evolutionary Tweet Streams
On Summarization and Timeline Generation for Evolutionary Tweet Streams1crore projects
 
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...IRJET Journal
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questionsmoresmile
 
FLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATA
FLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATAFLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATA
FLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATAIJwest
 
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...IJCI JOURNAL
 
Tweet Summarization and Segmentation: A Survey
Tweet Summarization and Segmentation: A SurveyTweet Summarization and Segmentation: A Survey
Tweet Summarization and Segmentation: A Surveyvivatechijri
 
14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdf14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdfMehwishKanwal14
 

Ähnlich wie Harnessing Web Page Directories for Large-Scale Classification of Tweets (20)

Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System
Latent Dirichlet Allocation as a Twitter Hashtag Recommendation SystemLatent Dirichlet Allocation as a Twitter Hashtag Recommendation System
Latent Dirichlet Allocation as a Twitter Hashtag Recommendation System
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitter
 
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECHABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognition
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
IRJET- Finding Related Forum Posts through Intention-Based Segmentation
IRJET-  	  Finding Related Forum Posts through Intention-Based SegmentationIRJET-  	  Finding Related Forum Posts through Intention-Based Segmentation
IRJET- Finding Related Forum Posts through Intention-Based Segmentation
 
Spotlight
SpotlightSpotlight
Spotlight
 
Tweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity RecognitionTweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity Recognition
 
On Incentive-based Tagging
On Incentive-based TaggingOn Incentive-based Tagging
On Incentive-based Tagging
 
On Summarization and Timeline Generation for Evolutionary Tweet Streams
On Summarization and Timeline Generation for Evolutionary Tweet StreamsOn Summarization and Timeline Generation for Evolutionary Tweet Streams
On Summarization and Timeline Generation for Evolutionary Tweet Streams
 
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
 
I0331047050
I0331047050I0331047050
I0331047050
 
Notes on mining social media updated
Notes on mining social media updatedNotes on mining social media updated
Notes on mining social media updated
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questions
 
FLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATA
FLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATAFLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATA
FLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATA
 
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
 
Tweet Summarization and Segmentation: A Survey
Tweet Summarization and Segmentation: A SurveyTweet Summarization and Segmentation: A Survey
Tweet Summarization and Segmentation: A Survey
 
14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdf14420-Article Text-17938-1-2-20201228.pdf
14420-Article Text-17938-1-2-20201228.pdf
 
Twitter System Design
Twitter System DesignTwitter System Design
Twitter System Design
 

Mehr von Gabriela Agustini

Como a cultura maker vai mudar o modo de produção global
Como a cultura maker vai mudar o modo de produção globalComo a cultura maker vai mudar o modo de produção global
Como a cultura maker vai mudar o modo de produção globalGabriela Agustini
 
Cidadãos como protagonistas das transformações sociais
Cidadãos como protagonistas das transformações sociaisCidadãos como protagonistas das transformações sociais
Cidadãos como protagonistas das transformações sociaisGabriela Agustini
 
Movimento Maker e Educação
Movimento Maker e EducaçãoMovimento Maker e Educação
Movimento Maker e EducaçãoGabriela Agustini
 
Diversidade cultural gilberto gil
Diversidade cultural gilberto gilDiversidade cultural gilberto gil
Diversidade cultural gilberto gilGabriela Agustini
 
Social Entrepreneurship - International School of Law and Technology
Social Entrepreneurship - International School of Law and TechnologySocial Entrepreneurship - International School of Law and Technology
Social Entrepreneurship - International School of Law and TechnologyGabriela Agustini
 
A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?
A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?
A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?Gabriela Agustini
 
Makersfor Global Good Report
Makersfor Global Good ReportMakersfor Global Good Report
Makersfor Global Good ReportGabriela Agustini
 
Apresentação olabi institucional interna - abril 17
Apresentação olabi institucional interna - abril 17Apresentação olabi institucional interna - abril 17
Apresentação olabi institucional interna - abril 17Gabriela Agustini
 
Pretalab- apresentação institucional
Pretalab- apresentação institucionalPretalab- apresentação institucional
Pretalab- apresentação institucionalGabriela Agustini
 
Cultura e tecnologia - aula2
Cultura e tecnologia - aula2Cultura e tecnologia - aula2
Cultura e tecnologia - aula2Gabriela Agustini
 
Cultura e tecnologia - aula1
Cultura e tecnologia - aula1Cultura e tecnologia - aula1
Cultura e tecnologia - aula1Gabriela Agustini
 
Global Innovation Gathering featured in Make Magazine Germany
Global Innovation Gathering featured in Make Magazine GermanyGlobal Innovation Gathering featured in Make Magazine Germany
Global Innovation Gathering featured in Make Magazine GermanyGabriela Agustini
 
Inovação de baixo para cima e o poder dos cidadãos
Inovação de baixo para cima e o poder dos cidadãos Inovação de baixo para cima e o poder dos cidadãos
Inovação de baixo para cima e o poder dos cidadãos Gabriela Agustini
 
Makerspaces e hubs de inovação
Makerspaces e hubs de inovaçãoMakerspaces e hubs de inovação
Makerspaces e hubs de inovaçãoGabriela Agustini
 

Mehr von Gabriela Agustini (20)

Como a cultura maker vai mudar o modo de produção global
Como a cultura maker vai mudar o modo de produção globalComo a cultura maker vai mudar o modo de produção global
Como a cultura maker vai mudar o modo de produção global
 
Cidadãos como protagonistas das transformações sociais
Cidadãos como protagonistas das transformações sociaisCidadãos como protagonistas das transformações sociais
Cidadãos como protagonistas das transformações sociais
 
Inovação digital
Inovação digital Inovação digital
Inovação digital
 
Movimento Maker e Educação
Movimento Maker e EducaçãoMovimento Maker e Educação
Movimento Maker e Educação
 
Cultura digital - Aula 4
Cultura digital - Aula 4Cultura digital - Aula 4
Cultura digital - Aula 4
 
Cultura Digital- aula 3
Cultura Digital- aula 3Cultura Digital- aula 3
Cultura Digital- aula 3
 
Cultura Digital- aula 2
Cultura Digital- aula 2Cultura Digital- aula 2
Cultura Digital- aula 2
 
Diversidade cultural gilberto gil
Diversidade cultural gilberto gilDiversidade cultural gilberto gil
Diversidade cultural gilberto gil
 
Social Entrepreneurship - International School of Law and Technology
Social Entrepreneurship - International School of Law and TechnologySocial Entrepreneurship - International School of Law and Technology
Social Entrepreneurship - International School of Law and Technology
 
A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?
A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?
A tecnologia pode salvar a gente? | A gente pode salvar a tecnologia?
 
Makersfor Global Good Report
Makersfor Global Good ReportMakersfor Global Good Report
Makersfor Global Good Report
 
Apresentação olabi institucional interna - abril 17
Apresentação olabi institucional interna - abril 17Apresentação olabi institucional interna - abril 17
Apresentação olabi institucional interna - abril 17
 
7 Forum Nacional de Museus
7 Forum Nacional de Museus7 Forum Nacional de Museus
7 Forum Nacional de Museus
 
Apresentacao metashop
Apresentacao metashopApresentacao metashop
Apresentacao metashop
 
Pretalab- apresentação institucional
Pretalab- apresentação institucionalPretalab- apresentação institucional
Pretalab- apresentação institucional
 
Cultura e tecnologia - aula2
Cultura e tecnologia - aula2Cultura e tecnologia - aula2
Cultura e tecnologia - aula2
 
Cultura e tecnologia - aula1
Cultura e tecnologia - aula1Cultura e tecnologia - aula1
Cultura e tecnologia - aula1
 
Global Innovation Gathering featured in Make Magazine Germany
Global Innovation Gathering featured in Make Magazine GermanyGlobal Innovation Gathering featured in Make Magazine Germany
Global Innovation Gathering featured in Make Magazine Germany
 
Inovação de baixo para cima e o poder dos cidadãos
Inovação de baixo para cima e o poder dos cidadãos Inovação de baixo para cima e o poder dos cidadãos
Inovação de baixo para cima e o poder dos cidadãos
 
Makerspaces e hubs de inovação
Makerspaces e hubs de inovaçãoMakerspaces e hubs de inovação
Makerspaces e hubs de inovação
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Kürzlich hochgeladen (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Harnessing Web Page Directories for Large-Scale Classification of Tweets

  • 1. Harnessing Web Page Directories for Large-Scale Classification of Tweets Arkaitz Zubiaga1 , Heng Ji2 Computer Science Department12 , Linguistics Department2 Queens College and Graduate Center City University of New York, New York, NY, USA arkaitz@zubiaga.org, hengjicuny@gmail.com ABSTRACT of conversations, e.g., events or memes. While they eval- Classification is paramount for an optimal processing of tweets, uated with 5,407 and 1,036 hand-labeled tweet instances, albeit performance of classifiers is hindered by the need of respectively, studying with larger and more diverse datasets large sets of training data to encompass the diversity of con- would strengthen the results, so the classifiers could be fur- tents one can find on Twitter. In this paper, we introduce ther generalized to Twitter’s diverse contents. In this paper an inexpensive way of labeling large sets of tweets, which we present a novel approach for large-scale classification of can be easily regenerated or updated when needed. We use tweets. We describe an inexpensive approach for automatic human-edited web page directories to infer categories from generation of labeled sets of tweets. Based on distant super- URLs contained in tweets. By experimenting with a large vision, we infer the category for each tweet from the URL it set of more than 5 million tweets categorized accordingly, points to. We extract categories for each URL from the we show that our proposed model for tweet classification Open Directory Project (ODP, http://dmoz.org), a large can achieve 82% in accuracy, performing only 12.2% worse human-edited web directory. Our approach allows to eas- than for web page classification. ily create large collections of labeled tweets, as well as to incrementally update or renew these collections. We exper- iment our proposal with over 5 million tweets, and compare Categories and Subject Descriptors to web page classification, showing that web page directories H.3.3 [Information Search and Retrieval]: Information can achieve 82% in terms of accuracy. filtering 2. LEARNING MODEL Keywords We propose a learning model based on distant supervision [4], which takes advantage of external information sources to tweets, classification, distant, large-scale enhance supervision. Similar techniques have been adapted by [2] to categorize blog posts using Freebase, and [1] to infer 1. INTRODUCTION sentiment in tweets from hashtags. Here, we propose that Twitter’s evergrowing daily flow of millions of tweets in- tweets can be assumed to be in the same topic as the URL cludes all kinds of information, coming from different coun- they are pointing to, provided that the tweet is referring to tries, and written in different languages. Classifying tweets or discussing about the contents of the web page. We set out into categories is paramount to create narrower streams of to use the ODP, a community-driven open access web page data that interested communities could efficiently process. directory. This way, we can incrementally build a labeled set For instance, news media could be interested in mining what of tweets, where each tweet will be labeled with the category is being said in the news segment, while governments could defined on the ODP for the URL being pointed. be interested in tracking issues concerning health. The ODP is manually maintained continuously. However, Besides tweets being short and informally written, one of since Twitter evolves rapidly, its stream is riddled with new the challenges for classifying tweets by topic is hand-labeling URLs that were not seen before. Thus, most URLs found a large training set to learn from. Manually labeled sets of in tweets are not likely to be on the ODP, and the coverage tweets present several issues: (i) they are costly, and thus would be very low. To solve this, we consider the domain tend to be small and hardly representative of all the content or subdomain of URLs, removing the path. This enables to on Twitter, (ii) it is unaffordable to annotate tweets in a match a higher number of URLs in tweets to (sub)domains wide variety of languages, and (iii) the annotated set might previously categorized on the ODP. Thus, a tweet pointing become obsolete as both Twitter and the vocabulary evolve. to nytimes.com/mostpopular/ will be categorized as News Previous research has studied the use of different features if the ODP says that nytimes.com belongs to News. for classification purposes. For instance, [5] use a set of so- cial features to categorize tweets by type, e.g., opinions or 3. DATA COLLECTION deals. Similarly, [6] categorized groups of tweets into types We collected tweets containing links, so we could catego- rize tweets from the URLs they pointed to. Given the re- stricted access for white-listed users to the stream of linked Copyright is held by the author/owner(s). WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil. tweets, we tracked tweets containing ’http’ from Twitter’s ACM 978-1-4503-2038-2/13/05. streaming API, and removed the few tweets that mentioned
  • 2. the word ’http’ instead of containing a URL. We collected Table 2 further looks at the performance of learning from tweets during two non-consecutive 3-day periods: March 20- and testing on tweets. After computing the cosine similar- 22, and April 17-19, 2012. Having two non-consecutive time- ity of words in tweet texts and words in associated URL frames enables to have the former as training data and the titles, we break down the accuracy into ranges of tweet-title latter as test data, with a certainty that tweets from the test similarity. This aims to analyze the extent to which tweets set are unlikely to be part of the same conversations as in can be better classified when they are similar to the title of the training data, avoiding overfitting. The process above the web page (more akin to web page classification), or when returned 12,876,106 tweets with URLs for the training set, they are totally different (e.g., adding user’s own comments). and 12,894,602 for the test set. Since URLs in tweets are There is in fact difference in that high-similarity tweets (.8- usually shortened, we retrieved all the unshortened destina- 1) are classified most accurately, but low-similarity tweets tion URLs, so we could extract the actual domain each tweet (0-.2) show promising results by losing only 8.9%. points to. For matching final URLs to their corresponding ODP categories, we defined the following two restrictions: Similarity 0-.2 .2-.4 .4-.6 .6-.8 .8-1 (i) from the list of 16 top-level categories on the ODP, we Accuracy 0.778 0.808 0.799 0.787 0.854 removed ’Regional’ and ’World’ since, unlike the other cat- egories, do not represent topics, but a transversal catego- Table 2: Classification of tweets by similarity. rization based on countries and languages; we keep the re- maining 14 categories, and (ii) a small subset of the domains are categorized in more than one category on the ODP; we exclude these to avoid inconsistencies, and consider those 5. CONCLUSION with just one category. Matching our tweet dataset to ODP, We have described a novel learning model that, based on we got 2,265,411 tweets for the training set, and 2,851,900 distant supervision and inferring categories for tweets from tweets for the test set (which covers 19.9% of the tweets web page directories, enables to build large-scale training collected). For each tweet, we keep the text of the tweet, data for tweet classification by topic. This method can help the category extracted from ODP, as well as the text within accurately classify the diverse contents found in tweets, not the <title> tag of the associated URL, so we can compare only when they resemble web contents. The proposed solu- web page classification and tweet classification. The result- tion presents an inexpensive way to build large-scale labeled ing dataset includes tweets in tens of languages, where only sets of tweets, which can be easily updated, extended, or 31.1% of the tweets are written in English. renewed with fresh data as the vocabulary evolves. Acknowledgments. This work was supported by the U.S. Army Research Laboratory under Cooperative Agreement 4. EXPERIMENTS No. W911NF- 09-2-0053 (NS-CTA), the U.S. NSF CA- We use SVM-multiclass [3] for classification experiments. REER Award under Grant IIS-0953149, the U.S. NSF EA- We do 10-fold cross-validation by using 10% splits (of ∼226k GER Award under Grant No. IIS-1144111, the U.S. DARPA tweets each) of the training data described above. Since Deep Exploration and Filtering of Text program and CUNY the focus of this study is to validate the learning model, Junior Faculty Award. The views and conclusions contained we rely on TF-based representation of the bag-of-words for in this document are those of the authors and should not be both tweets (with URL removed for representation) and web interpreted as representing the official policies, either ex- pages, and leave the comparison with other representations pressed or implied, of the U.S. Government. The U.S. Gov- for future work. We show the accuracy results, representing ernment is authorized to reproduce and distribute reprints the fraction of correct categorizations. for Government purposes notwithstanding any copyright no- Table 1 compares the performance of using either web tation here on. pages or tweets for training and test. While obviously learn- ing from web pages to classify web pages performs best, 6. REFERENCES given the nature of web page directories, classifying tweets [1] A. Go, R. Bhayani, and L. Huang. Twitter sentiment from tweets performs only 12.2% worse, achieving 82% per- classification using distant supervision. CS224N Project formance. Interestingly, the system can be trained from Report, Stanford, pages 1–12, 2009. tweets to categorize web pages as accurately, but classifying [2] S. D. Husby and D. Barbosa. Topic classification of tweets from web pages performs 34.7% worse, potentially blog posts using distant supervision. EACL, pages because web pages do not include the informal vocabulary 28–36, 2012. needed to train. Moreover, if we reduce the training set from [3] T. Joachims. Text categorization with support vector 226k to a manually affordable subset of 10k tweets (note machines: Learning with many relevant features. that this is still larger than the 5k and 1k used in previous Machine learning: ECML-98, pages 137–142, 1998. research described above), the overall accuracy of learning [4] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant from and testing on tweets drops to 0.565 (31.4% loss). supervision for relation extraction without labeled data. In ACL/CONLL, pages 1003–1011, 2009. Tweet Web [5] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, Tweet 0.824 0.830 and M. Demirbas. Short text classification in twitter to Web 0.542 0.938 improve information filtering. In SIGIR, pages 841–842, 2010. Table 1: Classification of tweets (rows = train, [6] A. Zubiaga, D. Spina, V. Fresno, and R. Mart´ ınez. columns = test). Classifying trending topics: a typology of conversation triggers on twitter. In CIKM, pages 2461–2464, 2011.