Genre is one of the textual dimensions that can be used to reconstruct the communicative context needed to assess the value of information with respect to a purpose (business, learning, finding, monitoring, predicting, etc.). When we know the genre of a text, we can surmise the CONTEXT where a text has been created and for which purpose. Therefore we can more confidently decide whether a text contains the information we are looking for. For example, factual texts might have more credibility than opinionated texts. In this respect, genres such as press conferences, declarations or announcements by a White House spokesman might be more reliable than subjective genres, e.g. newspapers’ editorials or op-ed articles. On the other hand, if we want to test the pulse and explore the feelings about a product or a politician, we might give more weight to more emotional genres like blogs, forums or social networks’ microposts.
In recent years, important steps forward have been taken in Automatic Genre Identification (AGI). AGI can be defined as a meta-discipline that draws on and spans Computational Linguistics, NLP, Corpus Linguistics, Information Retrieval, Information Extraction, Text Mining, Text Analytics, Sentiment Analysis and LIS, among others. Promising computational models have been proposed to automatically identify the genre(s) of a text, although no agreement has been reached on the definition of the concept of genre itself. AGI research has shown that genre classes such as blogs, online newspaper front pages, FAQs and DIYs can be automatically identified using a wide range of genre-revealing features -- from linguistic cues to character n-grams -- with a variety of classification algorithms.
In a world where information overload is still pervasive and where technology encourages massive text production through emailing, blogging, tweeting and social network communication, it is likely that the concept of genre and AGI are useful to convert unclassified and unstructured textual data to more structured and contextualized information.
This talk presents a summary of the state-of-the-art in AGI and discusses how genre-aware applications could help extract actionable information from raw textual data.
Towards Contextualized Information: How Automatic Genre Identification Can Help
1. Towards Contextualized Information: How Automatic Genre Identification Can Help
Marina Santini
MarinaSantini.MS@gmail.com
Seminar Series
Laboratory for Cognition, Interaction and Language Technology
(CILTLab)
Linköping University, Tuesday 28 August 2012
71. Context- and Content-revealing Metadata and Text-Internal Annotation
• Context can be “reconstructed” if we know the genre and, more accurately, if we know other textual dimensions such as the domain of a text, the sublanguage used in the text, the sentiment expressed in the text, etc.
Thank you very much for being here today. My name is MS and I have been doing research in AGI for about 10 years. In this talk I present a summary of the state of the art in AGI and show how a textual dimension like genre can help contextualize information.
If there is time, I would like to introduce some viable future projects where the concept of genre plays an important role.
Higher vocational education (Yh-utbildning): Yrkeshögskolan, higher vocational education programme. http://www.kyh.se/pagaende/agile-web-developer/ No start in the autumn term 2012.
Kristian Grossman-Madsen, Programme Manager, Stockholm & Göteborg. kristian.madsen@kyh.se, 08-410 821 31, 0768-85 21 31.
KYH AB, Vanadisvägen 9, 113 46 Stockholm. Tel: 08-410 821 20. www.kyh.se
Since my life has been quite intense since I moved to Sweden, this year I decided to slow down a little for a few months. Currently I am moderating a blog and a LinkedIn group and elaborating a computational theory of genre. Other activities: finding a job position where I can implement some applications I have in mind; finding large -- possibly public -- RAW corpora to test some hypotheses; networking.
The pressing need: exploiting BIG TEXTUAL DATA
Nowadays all kinds of businesses, enterprises and customer care services produce huge amounts of textual data in the form of many different "genres", i.e. emails, memos, notes from call centers, news, user groups, chats, reports, tweets, Facebook pages, blogs, forums, marketing material and so on. The word "genre" means "type of text". All these genres contain valuable but UNSTRUCTURED textual data. It is difficult to search and find the information we need when data is unstructured.
Contextualized information
What do I mean by contextualized information? I mean reconstructing the communicative context and the communicative purpose for which a text has been produced, by analysing how the language is used and how the content is organized in a text. The bag-of-words approach does not return contextualized information. So morphology is important, syntax is important, but the communicative context is also important, because a piece of information that is useful in one context might be useless in another context. Is a text instructional or is it a propaganda text? Is it a newspaper article or an official statement, or a confidential email? Is it a public report or an exploratory study? These kinds of details are not always available from the source from which a text is retrieved. Knowledge of the communicative purpose helps us identify actionable information.
Actionable information
“Actionable information” provides data that can be used to make specific business decisions or, more generally, any crucial decision. Actionable information is specific, to the point, consistent and credible. Contextualized information and actionable information do not necessarily overlap. Actionable information is a piece of information that is crucial for decision making.
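The remark that the bag-of-words approach does not return contextualized information can be illustrated with a minimal sketch. The function name `bag_of_words` and the two example sentences are assumptions made for illustration only; the point is simply that two texts with opposite communicative purposes can end up with the same word counts.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word tokens, discarding order and layout."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Two invented sentences: same words, opposite instructions.
instruction = "Press the button, then do not open the door."
warning = "Do not press the button, then open the door."

# The two bags are identical although the messages differ.
print(bag_of_words(instruction) == bag_of_words(warning))
```

Once word order and text organization are thrown away, the representation can no longer tell these two messages apart, which is why other cues are needed to recover the communicative context.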
The concept of genre is deeply rooted in our lives, culture and society. This means that the concept is spontaneously, almost instinctively, acknowledged even if people do not know the word “genre” itself.
Genre Analysis
As a researcher, my speciality is Automatic Genre Identification.
In order to show the state of the art of AGI, I will summarize some experimental settings presented in the Springer volume.
Seminal paper: with Karlgren and Cutting, the genre of documents becomes a text-internal class. They used a supervised approach (discriminant analysis) and 20 features.
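To make the idea of a supervised, feature-based genre classifier concrete, here is a minimal sketch. It is not the method of the seminal paper: it swaps discriminant analysis for a simple nearest-centroid rule, and it uses two invented surface cues instead of the original 20 features. The function names, the cues and the toy training texts are all assumptions made for illustration.

```python
from __future__ import annotations

import math
import re

FIRST_PERSON = {"i", "me", "my", "we", "our"}


def features(text: str) -> list[float]:
    """Two illustrative surface cues (hypothetical, not the original 20)."""
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    first_person = sum(w.lower() in FIRST_PERSON for w in words)
    return [first_person / n, text.count("?") / n]


def train(labelled: list[tuple[str, str]]) -> dict[str, list[float]]:
    """Compute one mean feature vector (centroid) per genre label."""
    sums: dict[str, list[float]] = {}
    counts: dict[str, int] = {}
    for text, genre in labelled:
        vec = features(text)
        acc = sums.setdefault(genre, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[genre] = counts.get(genre, 0) + 1
    return {g: [v / counts[g] for v in acc] for g, acc in sums.items()}


def classify(text: str, centroids: dict[str, list[float]]) -> str:
    """Return the genre whose centroid is closest in Euclidean distance."""
    vec = features(text)
    return min(centroids, key=lambda g: math.dist(vec, centroids[g]))


# Toy, invented training data with two genre labels.
training = [
    ("I think my weekend was great and we loved our trip.", "blog"),
    ("My cat woke me up again, so I wrote this post.", "blog"),
    ("How do I reset my password? Where can I find the settings?", "faq"),
    ("What is the delivery time? How can we contact support?", "faq"),
]
centroids = train(training)
print(classify("Where do I download the form? What does it cost?", centroids))
```

With only four training texts this is a toy, but it shows the general shape of the approach: texts are turned into a small vector of genre-revealing cues, and a classifier learns from labelled examples which region of that space belongs to which genre.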
Stockholm-Umeå Corpus
These genres are not easily recognized by the classifier.
Kris I is PDF. 7-webgenre collection.
Manual annotation of a document by genre is, like all manual annotation: tiring (error-prone), time-consuming, expensive, controversial.
By noisy environment I mean that genre identification is carried out within a collection larger than 1400 web pages, where the other documents are unclassified, randomly selected, and can belong to virtually any genre.
¾ of noise. Only a 10% decrease in performance.
General-purpose palette for web searches.
Results so far are good and encouraging. Not like the iris dataset.
Genre is an internalized concept. Section: Mastering the convention of different genres.
Language does not exist in the abstract. Language is used differently in different contexts.
For example, let’s take English as the pivot language, and the highly ambiguous word “bank”…
Twictionary: The Dictionary for Twitter. A repository for the meanings and manglings of words and language on Twitter. http://twictionary.pbworks.com/w/page/22547584/FrontPage
Actionable information means having the necessary information immediately available in order to deal with the situation at hand. Vs. actionable (law): if something you say or do is actionable, it is so bad or damaging that a claim could be made against you in a court of law: "His remarks are actionable in my view."
Genre gives us the compositional context. When we know the genre of a document, we know how the content is organized and where we can find the most important information. For instance, on the web, when the genre of a digital text is unknown or not declared explicitly, users often feel at a loss and do not know how to assess how reliable, objective or useful the information is. The same is true within business intelligence, customer care optimization, and in many other practical applications.
Sublanguage provides a situational context influenced by the medium of communication (e.g. telephone, face-to-face, chats, video-conferencing, microblogging, etc.). Sublanguage has nothing to do with terminology (specialized words, aka terms, used in a specialized domain); sublanguage is not register (e.g. cues of formality, casual conversation, etc.); sublanguage is not style. Think of the sublanguage characterizing tweets and the sublanguage used in customer care help centers or chats. They can be enormously different, though they might all be informal, conversational and polite. Sublanguage is formulaic, cross-topical and mostly domain-independent. For instance, the sublanguage used in a car rental help center is similar to the sublanguage used in a first-aid call center. In both cases, there will be a salutation (e.g. good morning), investigation (e.g. How can I help you? When did this happen? Where are you now?), personal detail requests (e.g. what is your name?), and similar.
Domain refers to a field of interest or to a subject matter. It can be medicine, politics, marketing, literary criticism, etc. A domain can have a specific terminology.
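The formulaic nature of a help-center sublanguage (salutation, investigation, personal detail requests) can be sketched as a small trigger-phrase tagger. This is only an illustration under stated assumptions: the move names, the regular expressions in `MOVES` and the function `tag_move` are hypothetical and not taken from any existing system.

```python
import re

# Hypothetical trigger phrases per conversational move (illustrative only).
MOVES = {
    "salutation": [r"\bgood (morning|afternoon|evening)\b", r"\bhello\b"],
    "investigation": [r"\bhow can i help\b", r"\bwhen did\b", r"\bwhere are you\b"],
    "personal_details": [r"\bwhat is your name\b", r"\byour (phone )?number\b"],
}


def tag_move(utterance: str) -> str:
    """Return the first move whose trigger phrase occurs in the utterance."""
    text = utterance.lower()
    for move, patterns in MOVES.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return move
    return "other"


for line in ["Good morning, help desk.", "When did this happen?", "What is your name?"]:
    print(line, "->", tag_move(line))
```

Because the trigger phrases are cross-topical, the same table would apply to a car rental help center and to a first-aid call center; only the domain terminology around them would change.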
Kind of trigger phrases.
By acknowledging the concept of genre, we acknowledge that information is organized differently in different types of texts. In practical terms, this means that the genre of a document has a bearing on the identification of relevant content. Emails follow a quite conventional content organization, where the “core content” might be preceded by salutations, a short introduction and/or additional elements. So they are quite easy to handle.
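As a rough sketch of how the conventional organization of emails can be exploited, the following code separates the “core content” from an opening salutation and a closing sign-off. The patterns `SALUTATION` and `SIGN_OFF`, the function `core_content` and the sample message are all assumptions for illustration; real emails would need far more robust handling.

```python
import re

# Hypothetical opening and closing formulas (illustrative, English only).
SALUTATION = re.compile(r"^(hi|hello|dear)\b", re.IGNORECASE)
SIGN_OFF = re.compile(r"^(best regards|kind regards|regards|cheers)\b", re.IGNORECASE)


def core_content(email_body: str) -> str:
    """Drop a leading salutation line and everything from the sign-off onwards."""
    lines = [line.strip() for line in email_body.strip().splitlines()]
    lines = [line for line in lines if line]
    if lines and SALUTATION.match(lines[0]):
        lines = lines[1:]
    for index, line in enumerate(lines):
        if SIGN_OFF.match(line):
            lines = lines[:index]
            break
    return " ".join(lines)


# Invented sample message.
sample = """Dear Anna,

The report is attached.
Please review it by Friday.

Best regards,
Sam"""
print(core_content(sample))
```

The sketch relies on nothing but genre knowledge: because we expect an email, we know roughly where the salutation and the sign-off sit, and therefore where the relevant content is likely to be.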