Towards Contextualized Information:
How Automatic Genre Identification Can Help

                     Marina Santini
                   MarinaSantini.MS@gmail.com




                         Seminar Series
 Laboratory for Cognition, Interaction and Language Technology
                            (CILTLab)
         Linköping University, Tuesday 28 August 2012
Outline
1. Self-Presentation
2. Automatic Genre Identification (AGI)
  •   The Beginning
  •   State-of-the-Art
3. Why is Genre Important and Useful?
4. Viable Projects
  •   DaisyKB
  •   Contextify
  •   WebRider
  •   SearchInFocus

                                  Marina Santini © 2012.
SELF-PRESENTATION


Background
• Technical Translation (degree, Rome) [worked for more than
  10 years with localization, multilinguality, terminology, and
  unstructured knowledge bases used as translation memories]
• History of the Italian Language (degree, Rome) [corpora,
  manually analysed for philological purposes]
• NLP (MSc, Manchester) [corpus creation and natural
  language processing]
• Computational Linguistics (PhD, Brighton) [language
  processing and automatic text classification of web
  documents]
• Agile Web Development (KYHYh, Stockholm) [how web
  documents are conceived and implemented; what
  functionalities they can offer; how users interact with web
  documents; in which way functionalities affect content, etc.]
Current Activities
• Managing a blog and a LinkedIn Group
• Planning a book (Springer?): A
  Computational Theory of Genre (pre-study
  phase)
• Designing web-based applications for
  contextualized language technology
• etc.


Strong interest…
  apply computational linguistics to BIG
  UNSTRUCTURED TEXTUAL DATA to extract
  contextualized and actionable information

A possible approach…
Raw/Unstructured Textual Data → Linguistic Preprocessing → Automatic Genre Identification (but also domain, sublanguage, register, and other textual and situational dimensions) → Contextualized Information → Actionable Information




The reduced exploitation of big unstructured textual data is
recognized as a major problem causing remarkable economic
loss and ineffective decision-making
AUTOMATIC GENRE
IDENTIFICATION (AGI)

Genre Studies
• Aristotle (4th cent. BC): drama, lyric,
  epic
   – Drama: tragedy, comedy, satyr play

• Literary theory and literary genres [e.g.
  sonnets, ballads, monologues,
  epistolary novels]

• More recently, practical genres in
  academic contexts [e.g. academic
  papers, essays], in workplace and
  professional contexts [e.g. tax forms,
  medical referrals, progress reports,
  patents], in public contexts [e.g. popular
  magazines, public speeches], in
  pedagogy (teaching, writing, e-
Genres on the Web and in other Digital
                       Environments
 •   Emails
 •   PowerPoint slides
 •   Search Pages
 •   FAQs
 •   Personal Home Pages
 •   Corporate Home Pages
 •   Blogs
 •   Facebook Microposts
 •   Tweets
 •   etc.

     – [Migration & Colonization: music genres (jazz, rock, etc.), film
       genres (thriller, drama, western etc.)]


Recent Genre Studies & Analyses: 2008-2010




AGI: 2007-2011
                            PhD Theses…


1. Santini M. (2007). Automatic Identification of Genre in Web Pages.
   University of Brighton (UK)
2. Meyer zu Eissen S. (2007). On Information Need and Categorizing
   Search. PhD thesis. University of Paderborn (Germany).
3. Freund L. (2008). Exploiting task-document relations in support of
   information retrieval in the workplace. University of Toronto (Canada)
4. Jebari C. (2008). Catégorisation Flexible et Incrémentale avec
   Raffinage de Pages Web par Genre. PhD thesis. Tunis El Manar
   University, Tunisia. In French.
5. Mason J. (2009). An n-gram based approach to the automatic
   classification of web pages by genre. Dalhousie University (Canada)
6. Gunnarsson M. (2011). Classification along Genre Dimension.
   Exploring a Multidisciplinary Problem. University of Borås. (Sweden)
7. Clark Malcom (PhD thesis in progress, UK)
8. Vidulin Vedrana (PhD thesis in progress, Croatia)
AGI 2007-2011
      Workshops and Special Issues…



• The WebGenreWiki (must find the time to
  update it)
• Contact me:
  – Facebook page: Genres on the web
    https://www.facebook.com/genresontheweb
  – Email: MarinaSantini.MS@gmail.com
AGI: Companion Volumes




The concept of genre is intuitive…
  but difficult to pin down and to
             agree upon
In the book, we do not
propose a single and
unified definition of
genre. Authors give
their different views on
genre.


Do we really need a definition when
            doing AGI?

 • After all….
   – … once we are convinced that genre is intuitive
     and useful, we could just say that:
      • genre is a classificatory principle based on a
        number of attributes.




What do we need for Automatic
   Genre Identification (AGI)?
• We need:

  – a genre taxonomy (genre palette)
  – a corpus of different genres (genre collection)
  – measurable attributes (genre-revealing
    features) that can be extracted automatically
  – a general-purpose automatic classifier, i.e. an
    off-the-shelf statistical software that builds a
    classification model for us
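
The four ingredients listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not any of the systems discussed in this talk: the two-genre palette, the toy corpus, the two surface features and the nearest-centroid classifier are all invented here to keep the example dependency-free (the work surveyed in these slides mostly relies on SVMs).

```python
import math
from collections import defaultdict

# Genre palette + genre collection (both hypothetical toy data).
CORPUS = [
    ("faq", "How do you reset a password? Click the reset link. What if it fails? Contact support."),
    ("faq", "Where can you download the tool? Is it free? Yes, it is free for students."),
    ("blog", "Today I went to the seminar and I really enjoyed my time there."),
    ("blog", "I think my new project is going well, and I will write more about it soon."),
]

FIRST_PERSON = {"i", "my", "me"}

def features(text):
    """Map a text to a vector of two automatically extractable features."""
    for mark in "?.,":
        text = text.replace(mark, f" {mark} ")
    tokens = text.lower().split()
    words = [t for t in tokens if t.isalpha()]
    n = max(len(words), 1)
    return [
        tokens.count("?") / n,                       # question marks per word
        sum(w in FIRST_PERSON for w in words) / n,   # first-person pronouns per word
    ]

def train(corpus):
    """Build one centroid (mean feature vector) per genre class."""
    grouped = defaultdict(list)
    for genre, text in corpus:
        grouped[genre].append(features(text))
    return {g: [sum(col) / len(vecs) for col in zip(*vecs)]
            for g, vecs in grouped.items()}

def classify(model, text):
    """Return the genre whose centroid is closest to the text's vector."""
    vec = features(text)
    return min(model, key=lambda g: math.dist(vec, model[g]))

model = train(CORPUS)
print(classify(model, "Can you export the data? Yes, use the export menu."))
```

Swapping in a different palette, corpus, feature set or classifier changes only one of the four parts, which is why the choice of each can be studied separately.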
Vector representation & supervised
machine learning algorithms (esp.
              SVM)




PIONEERS:
DOUGLAS BIBER – JUSSI
KARLGREN
Multi-Dimensional Analysis

 Factor Analysis, Factor Scores (Biber, 1988)
 Cluster Analysis (Biber, 1989)
 Additional Statistical Tests (Biber, 2004a; 2004b, etc.)
 66 linguistic features

“I have used the term ‘genre’ (or ‘register’) for text varieties
that are readily recognized and ‘named’ within a culture (e.g.
letters, press editorials, sermon, conversation), while I have used
the term ‘text type’ for varieties that are defined linguistically
(rather than perceptually)” (Biber, 1993).

  Factor 2 - Biber (1988)

  Cluster Analysis - Biber (1989)
  1.   intimate interpersonal interaction
  2.   informational interaction
  3.   scientific exposition
  4.   learned exposition
  5.   imaginative narrative
  6.   general narrative exposition
  7.   situated reportage
  8.   involved persuasion
Karlgren and Cutting (1994):
Recognizing Text Genres with Simple Metrics
       Using Discriminant Analysis




• 20 features
• Discriminant analysis
• Brown corpus




From Biber’s text types to genres of electronic
    corpora: Karlgren and Cutting (1994)




POSs & SUC




RECENT COMPUTATIONAL
MODELS FOR AGI

AGI: Some Scenarios
•   Serge Sharoff
•   Kim & Ross
•   Santini
•   Stein et al.




Morphology & the Linguist

              Serge Sharoff

• Aim: Find a genre palette allowing
  comparison among corpora (Web as Corpus
  initiative) and across languages
• A functional genre palette inspired by J.
  Sinclair
• Many corpora: English and Russian
• Classifier: SVM
• Features: POS trigrams (577 for Russian;
  593 for English)
  – Example of a POS trigram: ADV ADJ NOUN
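
To make the POS-trigram features concrete, here is a small sketch of how such features could be counted and turned into a vector. It assumes the text has already been tagged; the tag sequence and the three-trigram inventory are hypothetical and far smaller than the 577 and 593 trigrams mentioned above.

```python
from collections import Counter

def pos_trigrams(tags):
    """Count every run of three consecutive POS tags."""
    return Counter(zip(tags, tags[1:], tags[2:]))

def to_vector(counts, inventory):
    """Relative frequency of each trigram in a fixed inventory."""
    total = sum(counts.values()) or 1
    return [counts[trigram] / total for trigram in inventory]

# Hypothetical tagger output for one short sentence.
tags = ["DET", "ADV", "ADJ", "NOUN", "VERB", "ADV", "ADV", "VERB"]

# Hypothetical feature inventory (a real one would hold hundreds of trigrams).
inventory = [
    ("ADV", "ADJ", "NOUN"),
    ("DET", "ADV", "ADJ"),
    ("NOUN", "NOUN", "NOUN"),
]

counts = pos_trigrams(tags)
vector = to_vector(counts, inventory)
print(vector)
```

Each document becomes one such vector, and the vectors are what a classifier such as an SVM is trained on.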
The expert (the linguist)
       decides:




Results




1. It might be that the features are not ideal for these genres
2. It might be that there is a problem with the distribution of
   genre classes: the information genre is not well
   represented and the classifier does not learn how to
   discriminate against the other genres
3. It might be that the genres of Sharoff’s palette are too
   broad and vague and thus confusing for the classifier
Harmonic Descriptor Representation (HDR)

         Kim & Ross

• Aim: to apply genres for the
  classification of Digital Libraries
• Features: HDR = FP, LP or AP
  (between 1 and T/(N x MP))
• Number of features: 7431 (2477 words * 3)
• Classifier: SVM
• KRYS I + 7-webgenre collection
  (total: 31 genre classes, 3452
  documents)
31 genres




Accuracies




•   Inter-rater disagreement: people do not
    necessarily agree on genre labels to be
    assigned to a web document




What about morphology & syntax?
         What about noise?

                   Santini

• Aim: automatically identify the genre of web
  pages
• Collection: 7-webgenre collection + others
• Features: 100 facets
• Genre palette: 7 webgenres
• Classifier: inferential model (subjective
  Bayesian method)
7-webgenre collection
• Balanced (200 web pages per
  genre class)
• Genre palette
• Not annotated manually
• Built following 2 principles:
  – Objective sources
  – Consistent genre granularity




100 Facets




Inferential model
• It is a simple probabilistic model based on
  rules.
• It allows some ”reasoning” through the
  use of weights (closer to artificial
  intelligence than machine learning)

• The goal is to identify the correct genre of
  1400 web pages belonging to 7 genres in
  a noisy environment.
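
The slides name a subjective Bayesian method but do not give its formulas. One common formulation of that idea is the odds-likelihood update, in which each rule carries a weight that multiplies the prior odds of a hypothesis when its evidence is observed. The sketch below assumes that formulation; the facet names, the weights and the uniform prior are invented for illustration and are not taken from the inferential model described here.

```python
def to_odds(probability):
    return probability / (1 - probability)

def to_probability(odds):
    return odds / (1 + odds)

# Hypothetical rules: genre -> {facet: weight applied when the facet is present}.
RULES = {
    "blog": {"first_person": 4.0, "date_stamp": 3.0},
    "faq": {"question_marks": 6.0, "second_person": 2.0},
    "eshop": {"price": 5.0, "second_person": 1.5},
}

def score(observed, rules, prior=None):
    """Update the prior odds of each genre with the weights of observed facets."""
    if prior is None:
        prior = 1 / len(rules)          # assumption: uniform prior over genres
    scores = {}
    for genre, weights in rules.items():
        odds = to_odds(prior)
        for facet, weight in weights.items():
            if facet in observed:
                odds *= weight          # evidence present: multiply the odds
        scores[genre] = to_probability(odds)
    return scores

observed = {"question_marks", "second_person"}
scores = score(observed, RULES)
best = max(scores, key=scores.get)
print(best, scores)
```

Each genre is scored independently, so the values need not sum to one; a page can also be left unlabelled or given several labels by thresholding the scores, which is one way to cope with noisy pages.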
Comparisons (I)




Comparisons (II)




Results
• An effort to control the effect of noise on
  classification results should be made



Three experimental settings, three
    different genre needs….
1. Genre comparison across corpora
2. Efficient AGI for digital libraries
3. Taming the wild web, where everything is
   uncertain and noisy


               WEGA prototype:
a retrieval model for genre-enabled web search

Genre retrieval model

                Stein, Meyer zu Eissen,
                        Lipka

• Aim: provide genre labels to the result set
• Genre collection and palette: KI-04 corpus: 8
  webgenres
• Firefox add-on
• Model: ”lightweight GenreRich model” (linear
  discriminant analysis)
• Features: HTML, link features, character features,
  vocabulary concentration features (< 100 features)
WEGA (WEb Genre Analysis)




KI-04 genre collection: 8
                 webgenres




•   How can we define the optimal genre palette for the task/purpose in
    focus?
Genre Classes & Human
          Recognition


• How can we decide on the most
  representative genre classes? Let’s ask
  users… yes indeed, but how?

• 1) questionnaires (Karlgren)
• 2) card sorting (Rosso & Haas)
• 3) task-oriented studies (Crowston et al.)

Questionnaires: ”what genres
     are available on the
          internet?”




                                        •   Ever evolving
                                            taxonomy



User Warrant
              Rosso & Haas
• Collecting genre terminology in the users’
  own words (3 participants)
  – Make the users classify web pages and create
    piles

• Users choose the best of the collected genre
  terminology (102 participants)

• User validation of the genre palette (257
  participants)

• Genres’ usefulness of web search (32
Final
Genre
Palette:
  18
Genres




Genres & Tasks
             Crowston et al.

• 3 groups of respondents : teachers, journalists,
  engineers,
• Respondents were asked to carry out a web
  search for a real task of their own choice
  –   What is your search goal?
  –   What type of web page would you call this?
  –   What is it about the page that makes you call that?
  –   Was this page useful to you?

What type of web page would you call this?

• 522 unique terms → about 300
• How can we find an optimal genre palette for
  the task/purpose in focus shareable by professionals?




Syracuse corpus & AGI
ACL 2010 (Uppsala):
 FINE-GRAINED GENRE CLASSIFICATION
 USING STRUCTURAL LEARNING ALGORITHMS
 Zhili Wu, Katja Markert and Serge Sharoff

• The whole corpus: 3027 annotated webpages
  divided into 292 genres.
• Focussing on genres containing 15 or more
  examples, the corpus is of about 2293 examples
  and 52 genres.
• How do we work out the optimal number of genres a classifier can
  digest with a useful performance?




In summary
• AGI can be easily done. Results are
  promising.

• AGI presents a fascinating classification
  problem where classes are cultural, social
  and evolving artifacts.

• R&D of AGI systems will help understand
  complex classification problems in many
  other disciplines (e.g. social sciences,
  neuroscience, psychology, etc.).
Resumed: Do we need a genre
   definition when doing AGI?
• Yes. We need a computational theory of
  genre that allows us to shed some light on
  the correlations and interaction of different
  factors. Without a theoretical definition or a
  characterization of the concept of genre, it
  is not clear how to make decisions
  about…


Current Open Issues
•   The distribution of genre classes in the learning set
•   The optimal genre-revealing features for the genres in focus
•   The generality of the genre classes
•   The creation of a genre taxonomy whose classes BOTH humans and
    automatic classifiers can easily discriminate
•   The inter-rater disagreement
•   An increasing corpus size (can we work out a critical mass for a genre
    corpus?)
•   Noise (noisy and evolving digital environments)
•   An ever evolving corpus of genre classes
•   Optimal genre palette for the task/purpose in focus
•   The optimal number of genres a classifier can digest with a useful
    performance
A computational theory of genre
 Without a genre definition, all these
 experiments remain random, uncorrelated,
 fragmented…




WHY IS GENRE IMPORTANT
AND USEFUL?

Why is genre important?
• It is a context carrier: being based on
  recurrent conventions and predictable
  expectations, genre provides the
  communicative context and the
  communicative purpose for which a text
  has been produced. It is both a semantic
  and a pragmatic concept (meaning +
  context)

    Think of what happens in your mind when
    you come across a specific genre. Eg, FAQs,
    reviews, interviews, academic papers,
The Information Interaction in Context
conference (IIiX) explores the relationships
between and within the contexts that affect
information retrieval (IR) and information
seeking, how these contexts impact
information behavior, and how knowledge
of information contexts and behaviors
improves the design of interactive
information systems.




                                                 The fourth IIiX (Information Interaction in
                                                 Context) symposium 2012 will be held in
                                                 Nijmegen, the Netherlands from August 21
                                                 to 24 2012.
                                                 IIiX 2012 is organized in cooperation with
                                                 the ACM and ACM SIGIR
Major Benefits
Being a context carrier, genre contributes to:
• Complexity reduction and predictivity: a
  text receives identity through belonging to a
  certain genre and this identity reduces
  the cognitive effort
• Improved findability: genre helps find data
  that is more ”relevant” to our information
  needs
• Increased information understanding: genre
  competence increases self-protection
  against digital crimes (e.g. phishing, hoaxes,
  cyberbullying) because it can help spot genre
  anomalies and consequently malicious
Genre Conventions dominate
     Language since early age
• Mastering genres is an important factor in
  children’s linguistic development:
  – ”As they grow older, [children] become more
    accomplished and learn how to use linguistic
    features that are specific to the genre, for
    example the appropriate use of present or
    past tense”. Source: Understanding Children’s
    Development, p. 419, 2011.


Genre is ubiquitous
• Language does not exist in the abstract.
• Language use changes with the situation,
  purpose, audience, emotional state, etc.

• We might express the same meaning with
  different words according to different
  communicative contexts.


Benefits in Language
            Technology
• Genre competence and AGI:
  – could improve many current NLP subfields,
    e.g. automatic summarization, machine
    translation, or Natural Language Generation
    (NLG)
  – Information Retrieval (IR) would benefit by
    identifying more relevant documents to the
    queries
  – Business Intelligence (BI) and Customer
    Experience Management (CEM) would find
    actionable information in the deluge of data
GENRE-AWARE Advanced Text Analytics

VIABLE PROJECTS


Action Projects
•   Resource: DaisyKB - Fine-grained multilingual knowledge base
•   Tool: contextify - MetaDataTagger: creating metadata for genre,
    sublanguage, domain

•   Information System:    WebRider – socially- and emotionally-
    intelligent web search

•   Actionable information:    SearchInFocus Pilot study:
    ”Using query log analysis for BI and CEM”. Applying
    findability for Business Intelligence (BI) and Customer Experience
    Management (CEM).

•   More…
Fine-grained multilingual knowledge base

DAISYKB


The word ”bank”




The 2 English senses are translated using 2 different Italian words.
 Senses and words across languages are linked together via the
                   abstract representation…




The importance of the abstract
          representation
• It enables cross-linguality and
  multilinguality
• Consistency across all languages.




Single words are often not enough
   to understand the meaning…
• DaisyKB can have fields with:
  –   Collocations
  –   Terminology
  –   Companies
  –   Domain
  –   Frequent co-occurring words
  –   Frequent queries containing the keyword
  –   Sublanguage
  –   Genre
  –   Etc.

Object-Oriented Approach
• Each entry is like an object
• Each field is like a method in an object,
  you just call it when you need it
• An object can be called at different levels
  of granularity
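
A minimal sketch of the "entry as object" idea, assuming Python classes. The class names, field names, concept identifiers and the Italian sample words are all hypothetical; they only illustrate how a field could be requested on demand, either for the whole entry or for one domain.

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    concept_id: str                     # language-independent abstract representation
    translations: dict                  # language code -> word
    domain: str = ""
    collocations: list = field(default_factory=list)

@dataclass
class Entry:
    lemma: str
    language: str
    senses: list = field(default_factory=list)

    def translate(self, target, domain=None):
        """Return translations for all senses, or only for one domain."""
        return [
            sense.translations[target]
            for sense in self.senses
            if target in sense.translations
            and (domain is None or sense.domain == domain)
        ]

# Illustrative entry for the word "bank" with two senses.
bank = Entry("bank", "en", [
    Sense("FINANCIAL_INSTITUTION", {"it": "banca"}, "finance", ["bank account"]),
    Sense("RIVER_SIDE", {"it": "riva"}, "geography", ["river bank"]),
])

print(bank.translate("it"))                    # coarse granularity: every sense
print(bank.translate("it", domain="finance"))  # fine granularity: one domain
```

Because the senses are linked through `concept_id` rather than through word pairs, adding a further language would only mean adding one more key to `translations`.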




Pre-population…
• Migrating existing resources to populate
  DaisyKB.



• The dictionary structure must be well
  thought out in terms of flexibility and design in
  order to accommodate future needs.


Benefits
•   Standardization of scattered resources
•   Flexibility: it can be updated any time, systematically
•   Consistency
•   Coherence
•   Reduced management
•   Reduced idiosyncrasies and errors
•   Increased efficiency
•   Reusable for many products or activities
•   It can be open-source and collaborative
•   It can be built with XML and programmed with XSLT
    for quick updating and deletion or insertion of new
    fields.
Contextify is a metadata tagger: text classification according to
domain, genre, sublanguage (register, style, sentiment, emotions,
opinions…)

CONTEXTIFY


Context- and Content-revealing
    Metadata and Text-Internal
            Annotation
• Context can be ”reconstructed” if we
  know the genre, and more accurately if
  we know other textual dimensions such
  as the domain of a text, the sublanguage
  used in the text, the sentiment expressed
  in a text, etc.
Context is the King
• Context helps disambiguate words and assess the
  relevance/importance of texts
• Context helps identify the most important information.
• How can we capture context from a text? In this application, I
  would start with genre, sublanguage, and domain i.e. three
  textual dimensions that say something about the
  communicative context in which a text or a document has
  been issued:

   – A ”weird” word like ” Spweet ” is not a typo if it belongs to a
     Twitter micropost (genre and sublanguage: tweet spam)

   – A ”normal” word like ”mouse” is a specialized term if it belongs to
     the computer domain.

   – Figurative senses and metaphors: surfing (sport, internet
     communication), agile (ordinary word, software), sentence (law,
     grammar), appeal (ordinary language: ”appeal for help” or legal
     sublanguage: to lodge an appeal, genre: newspaper, court act)
     etc.
Content Enrichment




Benefits: Contextify
• Helps identify:
  – the most reliable/relevant documents
  – how language is used in different communicative
    contexts
  – linguistic conventions
  – textual conventions
  – etc.
• Contextualized information can be exploited
  to improve automatic summarization,
  machine translation, terminology extraction,
  indexing, etc.
A genre-aware information system

WEB RIDER


Web Rider is…
• “WebRider” is a metaphor to describe a
  “socially and emotionally intelligent”
  information system that helps web users
  make sense of information on the web by
  internalizing genre cues.

• The claim: genre cues contribute to
  information understanding and decision
  making, thus assisting users to ride safely
  the web.
Inspirators
• Social and emotional intelligence
  [Daniel Goleman, psychologist] is the
  competence that lies behind mutual
  understanding, group interaction,
  social behaviour and all kinds of
  social actions.

• Genres are social actions [Genre as
  social action (Carolyn Miller, 1984)]




Simply put…
• Social actions – the driving forces behind the web – are
  manifested through genres.

• Genres -- e.g. FAQs, press releases, product descriptions,
  instructions, guides, etc. -- are recurring and recognized
  patterns of communication that can help contextualize
  information.

• An information system with Genre Competence would make
  web users socially and emotionally intelligent because,
  through the deep understanding of genre conventions and
  expectations, they would be able to evaluate the
  genuineness, reliability, authenticity, and the actual purpose of
  information distributed on the web.

The Technology underlying
           WebRider
• Under discussion…




Main Benefits
• Increased information understanding.
  WebRider would promote more adequate
  social and emotional behaviours through its
  genre competence
• [Increased Accuracy and Relevance: more
  accurate information systems & more
  relevant search results]

• Ethics: Protection against phishing, scams,
  malicious behaviour, cyberbullying, etc. while
  searching
Pilot study: SearchInFocus

FINDABILITY FOR BI & CEM


If you think of BI and CEM in terms of
      searchability, findability and
             actionability…
– “Merrill Lynch estimates that more than 85
  percent of all business information exists as
  unstructured data –commonly appearing in
  e‐mails, memos, notes from call centers and
  support operations, news, user groups,
  chats, reports, letters, surveys, white papers,
  marketing material, research, presentations
  and Web pages.” [DM Review Magazine,
  February 2003 Issue]

– ECONOMIC LOSS!
Simple search is not enough…
• Of course, it is possible to use simple search.
  But simple search is unrewarding, because it is
  based on single terms.
  – ”a search is made on the term felony. In a simple
    search, the term felony is used, and everywhere
    there is a reference to felony, a hit to an
    unstructured document is made. But a simple
    search is crude. It does not find references to
    crime, arson, murder, embezzlement, vehicular
    homicide, and such, even though these crimes
    are types of felonies” [ Source: Inmon, B. & A.
    Nesavich, "Unstructured Textual Data in the
    Organization" from "Managing Unstructured data
    in the organization", Prentice Hall 2008, pp. 1–13]
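
The difference between a simple search and a search that knows about narrower terms can be sketched as follows. The mini-taxonomy, the document collection and the function names are hypothetical; the narrower terms for "felony" are the ones listed in the quotation above.

```python
# Hypothetical taxonomy: broader term -> narrower terms.
NARROWER = {
    "felony": ["arson", "murder", "embezzlement", "vehicular homicide"],
}

# Hypothetical unstructured documents.
DOCS = {
    "memo-1": "The suspect was charged with a felony.",
    "memo-2": "Investigators suspect arson at the warehouse.",
    "memo-3": "Quarterly sales figures are attached.",
}

def expand(term, taxonomy):
    """Return the term followed by all of its narrower terms, recursively."""
    terms = [term]
    for child in taxonomy.get(term, []):
        terms.extend(expand(child, taxonomy))
    return terms

def search(docs, term, taxonomy=None):
    """Simple search when no taxonomy is given, expanded search otherwise."""
    terms = expand(term, taxonomy) if taxonomy else [term]
    return [doc_id for doc_id, text in docs.items()
            if any(t in text.lower() for t in terms)]

simple_hits = search(DOCS, "felony")
expanded_hits = search(DOCS, "felony", NARROWER)
print(simple_hits, expanded_hits)
```

The simple search misses the arson memo because it matches only the literal query term; the expanded search finds it because the taxonomy supplies the missing link.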
Text Analytics
• A set of NLP techniques that provide some
  structure to textual documents.
• Common components:
  –   Tokenization
  –   Morphological Analysis
  –   Syntactic Analysis
  –   Named Entity Recognition
  –   Sentiment Analysis
  –   Automatic Summarization
  –   Etc.
Text Analytics
• Commercial:
  – Attensity
  – Clarabridge
  – Temis
  – Lexalytics
  – Texify
  – SAS
  – IBM Cognos
  – etc.
• Open Source:
  – GATE
  – NLTK
  – UIMA
  – RapidMiner
  – etc.

SearchInFocus: searching for
     actionable intelligence
• Pilot study: ”Using query log analysis
  for BI and CEM”
  [In collaboration with Findwise(?)]




WHERE TO START?


Strategy


Big unstructured textual data → Use cases: what kind of actionable information do users need? → Wireframes → User test → BETA: open-source web-based prototypes → User feedback & collaborative improvement




Academia
• Many students can work on it at the same
  time, with different languages
• There are many linguistic, cognitive and
  technical aspects to be analysed
• Creation of multilingual shareable
  resources
• Open source algorithms enriched by a
  collaborative effort

Thank you for your attention




         Questions?




Towards Contextualized Information: How Automatic Genre Identification Can Help

  • 1. Towards Contextualized Information: How Automatic Genre Identification Can Help Marina Santini MarinaSantini.MS@gmail.com Seminar Series Laboratory for Cognition, Interaction and Language Technology (CILTLab) Linköping University, Tuesday 28 August 2012
  • 2. Outline 1. Self-Presentation 2. Automatic Genre Identification (AGI) • The Beginning • State-of-the-Art 3. Why is Genre Important and Useful? 4. Viable Projects • DaisyKB • Contextify • WebRider • SearchInFocus
  • 3. SELF-PRESENTATION
  • 4. Background • Technical Translation (degree, Rome) [worked with localization, multilinguality, terminology, unstructured knowledge bases used as translation memory, for more than 10 years] • History of the Italian Language (degree, Rome) [corpora, manually analysed for philological purposes] • NLP (MSc, Manchester) [corpus creation and natural language processing] • Computational Linguistics (PhD, Brighton) [language processing and automatic text classification of web documents] • Agile Web Development (KYHYh, Stockholm) [how web documents are conceived and implemented; what functionalities they can offer; how users interact with web documents; in which way functionalities affect content, etc.]
  • 5. Current Activities • Managing a blog and a LinkedIn Group • Planning a book (Springer?): A Computational Theory of Genre (pre-study phase) • Designing web-based applications for contextualized language technology • etc.
  • 6. Strong interest… apply computational linguistics to BIG UNSTRUCTURED TEXTUAL DATA to extract contextualized and actionable information A possible approach… Automatic Genre Identification (but also domain, sublanguage, register, and other textual and situational dimensions) [Diagram: Raw/Unstructured Textual Data → Linguistic Preprocessing → Contextualized Information → Actionable Information] The reduced exploitation of big unstructured textual data is recognized as a major problem causing remarkable economic loss and ineffective decision-making
  • 7. AUTOMATIC GENRE IDENTIFICATION (AGI)
  • 8. Genre Studies • Aristotle (4th cent. BC): drama, lyrics, epics – Drama: tragedy, comedy, satyr • Literary theory and literary genres [e.g. sonnets, ballads, monologues, epistolary novels] • More recently, practical genres in academic contexts [e.g. academic papers, essays], in workplace and professional contexts [e.g. tax forms, medical referrals, progress reports, patents], public contexts [e.g. popular magazines, public speeches], in pedagogy (teaching, writing, e-
  • 9. Genres on the Web and in other Digital Environments • Emails • Powerpoint slides • Search Pages • FAQs • Personal Home Pages • Corporate Home Pages • Blogs • Facebook Microposts • Tweets • etc. – [Migration & Colonization: music genres (jazz, rock, etc.), film genres (thriller, drama, western etc.)]
  • 10. Recent Genre Studies & Analyses: 2008-2010
  • 11. AGI: 2007-2011 PhD Theses… 1. Santini M. (2007). Automatic Identification of Genre in Web Pages. University of Brighton (UK) 2. Meyer zu Eissen S. (2007). On Information Need and Categorizing Search. PhD thesis. University of Paderborn (Germany). 3. Freund L. (2008). Exploiting task-document relations in support of information retrieval in the workplace. University of Toronto (Canada) 4. Jebari C. (2008). Catégorisation Flexible et Incrémentale avec Raffinage de Pages Web par Genre. PhD thesis. Tunis El Manar University, Tunisia. In French. 5. Mason J. (2009). An n-gram based approach to the automatic classification of web pages by genre. Dalhousie University (Canada) 6. Gunnarsson M. (2011). Classification along Genre Dimension. Exploring a Multidisciplinary Problem. University of Borås. (Sweden) 7. Clark Malcolm (PhD thesis in progress, UK) 8. Vidulin Vedrana (PhD thesis in progress, Croatia)
  • 12. AGI 2007-2011 Workshops and Special Issues… • The WebGenreWiki (must find the time to update it) • Contact me: – Facebook page: Genres on the web https://www.facebook.com/genresontheweb – Email: MarinaSantini.MS@gmail.com
  • 13. AGI: Companion Volumes
  • 14. The concept of genre is intuitive… but difficult to pin down and to agree upon In the book, we do not propose a single and unified definition of genre. Authors give their different views on genre.
  • 15. Do we really need a definition when doing AGI? • After all…. – … once we are convinced that genre is intuitive and useful, we could just say that: • genre is a classificatory principle based on a number of attributes.
  • 16. What do we need for Automatic Genre Identification (AGI)? • We need: – a genre taxonomy (genre palette) – a corpus of different genres (genre collection) – measurable attributes (genre-revealing features) that can be extracted automatically – a general-purpose automatic classifier, i.e. an off-the-shelf statistical software that builds a classification model for us
  • 17. Vector representation & supervised machine learning algorithms (esp. SVM)
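
The four ingredients listed in slides 16 and 17 (genre palette, genre collection, genre-revealing features, classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-genre palette, the toy texts, and the three features are hypothetical, and a nearest-centroid classifier stands in for the SVM named on the slide so that the example needs only the Python standard library.

```python
from collections import Counter
import math

# Hypothetical toy genre collection: (text, genre label) pairs.
# The palette here has only two genres: "faq" and "blog".
TRAIN = [
    ("Q: how do I reset my password? A: click the reset link.", "faq"),
    ("Q: where is my order? A: check the tracking page.", "faq"),
    ("Today I went hiking and I loved every minute of it.", "blog"),
    ("I finally tried the new cafe and I think it is great.", "blog"),
]


def features(text):
    """Map a text to a small vector of (assumed) genre-revealing features."""
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    counts = Counter(tokens)
    return [
        text.count("?") / n,                # question marks per token
        counts["i"] / n,                    # first-person pronoun rate
        (counts["q:"] + counts["a:"]) / n,  # Q/A markers per token
    ]


def centroids(train):
    """Average the feature vectors of each genre class."""
    sums, sizes = {}, Counter()
    for text, label in train:
        vec = features(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}


def classify(text, model):
    """Return the genre whose centroid is closest in Euclidean distance."""
    vec = features(text)
    return min(model, key=lambda label: math.dist(vec, model[label]))


MODEL = centroids(TRAIN)
print(classify("Q: can I pay by card? A: yes.", MODEL))
```

In a realistic setting the feature vectors would be handed to an off-the-shelf learner such as an SVM; the vector representation step stays the same.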
  • 18. PIONEERS: DOUGLAS BIBER – JUSSI KARLGREN
  • 19. “I have used the term ‘genre’ (or ‘register’) for text varieties that are readily recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term ‘text type’ for varieties that are defined linguistically (rather than perceptually)” (Biber, 1993). Multi-Dimensional Analysis • Factor Analysis, Factor Scores (Biber, 1988) • Cluster Analysis (Biber, 1989) • Additional Statistical Tests (Biber, 2004a; 2004b, etc.) • 66 linguistic features [Figure: Factor 2 - Biber (1988)] Cluster Analysis - Biber (1989) 1. intimate interpersonal interaction 2. informational interaction 3. scientific exposition 4. learned exposition 5. imaginative narrative 6. general narrative exposition 7. situated reportage 8. involved persuasion
  • 20. Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis • 20 features • Discriminant analysis • Brown corpus
  • 21. From Biber’s text types to genres of electronic corpora: Karlgren and Cutting (1994)
  • 22. POSs & SUC
  • 23. RECENT COMPUTATIONAL MODELS FOR AGI
  • 24. AGI: Some Scenarios • Serge Sharoff • Kim & Ross • Santini • Stein et al.
  • 25. Morphology & the Linguist Serge Sharoff • Aim: Find a genre palette allowing comparison among corpora (Web As Corpus initiative) and across languages • A functional genre palette inspired by J. Sinclair • Many corpora: English and Russian • Classifier: SVM • Features: POS trigrams (577 for Russian; 593 for English) • Example of a POS trigram: ADV ADJ NOUN
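
As an illustration of the POS-trigram features mentioned in slide 25, the sketch below counts consecutive triples of part-of-speech tags. The tag sequence is hypothetical; in practice it would come from a POS tagger, and the slide does not say which tagger or tagset was used.

```python
from collections import Counter


def pos_trigrams(tags):
    """Return counts of consecutive POS-tag triples in one tagged text."""
    return Counter(" ".join(tags[i:i + 3]) for i in range(len(tags) - 2))


# Hypothetical tag sequence for one short text.
tags = ["ADV", "ADJ", "NOUN", "VERB", "ADV", "ADJ", "NOUN"]
counts = pos_trigrams(tags)
for trigram, count in sorted(counts.items()):
    print(trigram, count)
```

Each distinct trigram becomes one dimension of the document vector, which is how a few hundred features per language can arise from a small tagset.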
  • 26. The expert (the linguist) decides:
  • 27. Results 1. It might be that the features are not ideal for these genres 2. It might be that there is a problem with the distribution of genre classes: the information genre is not well represented and the classifier does not learn how to discriminate it from the other genres 3. It might be that the genres of Sharoff’s palette are too broad and vague and thus confusing for the classifier
  • 28. Harmonic Descriptor Representation (HDR) 2477 words Kim & Ross • Aim: to apply genres for the classification of Digital Libraries • Features: HDR = FP, LP or AP (between 1 and T/(N x MP)) • Number of features: 7431 (2477*3) • Classifier: SVM • KRYS I + 7-webgenre collection (total: 31 genre classes, 3452 documents)
  • 30. Accuracies • Inter-rater disagreement: people do not necessarily agree on genre labels to be assigned to a web document
  • 31. What about morphology & syntax? What about noise? Santini • Aim: automatically identify the genre of web pages • Collection: 7-webgenre collection + others • Features: 100 facets • Genre palette: 7 webgenres • Classifier: inferential model (subjective Bayesian method)
  • 32. 7-webgenre collection • Balanced (200 web pages per genre class) • Genre palette • Not annotated manually • Built following 2 principles: – Objective sources – Consistent genre granularity
  • 33. 100 Facets
  • 34. Inferential model • It is a simple probabilistic model based on rules. • It allows some “reasoning” through the use of weights (closer to artificial intelligence than machine learning) • The goal is to identify the correct genre of 1400 web pages belonging to 7 genres in a noisy environment.
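
The slides describe the inferential model only as rule-based, weighted, and using a subjective Bayesian method. One common way to realise such a model is the odds-likelihood form of Bayes' rule, sketched below. This is an assumption about the general technique, not the author's actual implementation; the weights `ls` and `ln` and the example values are hypothetical.

```python
def posterior_probability(prior, evidence):
    """Update a prior probability with weighted rules in odds form.

    evidence is a list of (observed, ls, ln) tuples, where ls is the
    weight applied when a facet is observed in the page and ln is the
    weight applied when it is absent.
    """
    odds = prior / (1.0 - prior)
    for observed, ls, ln in evidence:
        odds *= ls if observed else ln
    return odds / (1.0 + odds)


# With 7 equally likely genres the prior is 1/7 (odds 1:6).
# One observed facet with a hypothetical weight of 6 brings the odds to 1:1.
print(posterior_probability(1 / 7, [(True, 6.0, 0.5)]))
```

Computing this posterior for each of the 7 genres and picking the highest one would give a classification decision, with the weights acting as the "reasoning" knobs the slide mentions.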
  • 35. Comparisons (I)
  • 36. Comparisons (II)
  • 37. Results • An effort to control the effect of noise on classification results should be made
  • 38. Three experimental settings, three different genre needs… 1. Genre comparison across corpora 2. Efficient AGI for digital libraries 3. Taming the wild web, where everything is uncertain and noisy WEGA prototype: a retrieval model for genre-enabled web search
  • 39. Genre retrieval model Stein, Meyer zu Eissen, Lipka • Aim: provide genre labels to the result set • Genre collection and palette: KI-04 corpus: 8 webgenres • Firefox add-on • Model: “lightweight GenreRich model” (linear discriminant analysis) • Features: HTML, link features, character features, vocabulary concentration features (< 100 features)
  • 40. WEGA (WEb Genre Analysis)
  • 41. KI-04 genre collection: 8 webgenres • How can we define the optimal genre palette for the task/purpose in focus?
  • 42. Genre Classes & Human Recognition • How can we decide on the most representative genre classes? Let’s ask users… yes indeed, but how? • 1) questionnaires (Karlgren) • 2) card sorting (Rosso & Haas) • 3) task-oriented studies (Crowston et al.)
  • 43. Questionnaires: “what genres are available on the internet?” • Ever evolving taxonomy
  • 44. User Warrant Rosso & Haas • Collecting genre terminology in the users’ own words (3 participants) – Make the users classify web pages and create piles • Users choose the best of the collected genre terminology (102 participants) • User validation of the genre palette (257 participants) • Genres’ usefulness for web search (32
  • 45. Final Genre Palette: 18 Genres
  • 46. Genres & Tasks Crowston et al. • 3 groups of respondents: teachers, journalists, engineers • Respondents were asked to carry out a web search for a real task of their own choice – What is your search goal? – What type of web page would you call this? – What is it about the page that makes you call that? – Was this page useful to you?
  • 47. What type of web page would you call this? • 522 unique terms → about 300 • How can we find an optimal genre palette for the task/purpose in focus shareable by professionals?
  • 48. Syracuse corpus & AGI ACL 2010 (Uppsala): Fine-Grained Genre Classification Using Structural Learning Algorithms. Zhili Wu, Katja Markert and Serge Sharoff • The whole corpus: 3027 annotated webpages divided into 292 genres. • Focussing on genres containing 15 or more examples, the corpus is of about 2293 examples and 52 genres. • How do we work out the optimal number of genres a classifier can digest with a useful performance?
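
The filtering step described in slide 48 (keeping only genres with 15 or more examples) can be sketched as follows. The labels are hypothetical toy data, not the Syracuse corpus, and the function name is invented for illustration.

```python
from collections import Counter


def frequent_genres(labels, min_examples=15):
    """Keep only the genre classes that have at least min_examples pages."""
    counts = Counter(labels)
    return {genre: n for genre, n in counts.items() if n >= min_examples}


# Hypothetical label list: one genre label per annotated web page.
labels = ["faq"] * 20 + ["blog"] * 15 + ["recipe"] * 3
kept = frequent_genres(labels)
print(len(kept), "genres,", sum(kept.values()), "examples")
```

The same thresholding trades coverage for learnability: raising `min_examples` shrinks the palette but gives the classifier more evidence per class.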
  • 49. In summary • AGI can be easily done. Results are promising. • AGI presents a fascinating classification problem where classes are cultural, social and evolving artifacts. • R&D of AGI systems will help understand complex classification problems in many other disciplines (e.g. social sciences, neuroscience, psychology, etc.).
  • 50. Resumed: Do we need a genre definition when doing AGI? • Yes. We need a computational theory of genre that allows us to shed some light on the correlations and interaction of different factors. Without a theoretical definition or a characterization of the concept of genre, it is not clear how to make decisions about…
  • 51. Current Open Issues • The distribution of genre classes in the learning set • The optimal genre-revealing features for the genres in focus • The generality of the genre classes • The creation of a genre taxonomy that BOTH humans and automatic classifiers can easily discriminate • The inter-rater disagreement • An increasing corpus size (can we work out a critical mass for a genre corpus?) • Noise (noisy and evolving digital environments) • An ever evolving corpus of genre classes • Optimal genre palette for the task/purpose in focus • The optimal number of genres a classifier can digest with a useful performance
  • 52. A computational theory of genre Without a genre definition, all these experiments remain random, uncorrelated, fragmented…
  • 53. WHY IS GENRE IMPORTANT AND USEFUL?
  • 54. Why is genre important? • It is a context carrier: being based on recurrent conventions and predictable expectations, genre provides the communicative context and the communicative purpose for which a text has been produced. It is both a semantic and a pragmatic concept (meaning + context) Think of what happens in your mind when you come across a specific genre. E.g. FAQs, reviews, interviews, academic papers,
  • 55. IIiX conference: The Information Interaction in Context conference (IIiX) explores the relationships between and within the contexts that affect information retrieval (IR) and information seeking, how these contexts impact information behavior, and how knowledge of information contexts and behaviors improves the design of interactive information systems. The fourth IIiX (Information Interaction in Context) symposium 2012 will be held in Nijmegen, the Netherlands from August 21 to 24 2012. IIiX 2012 is organized in cooperation with the ACM and ACM SIGIR
  • 56. Major Benefits Being a context carrier, genre contributes to: • Complexity reduction and predictivity: a text receives identity through belonging to a certain genre and this identity reduces the cognitive effort • Improved findability: genre helps find data that is more “relevant” to our information needs • Increased information understanding: genre competence increases self-protection against digital crimes (e.g. phishing, hoaxes, cyberbullying) because it can help spot genre anomalies and consequently malicious
  • 57. Genre Conventions dominate Language since early age • Mastering genres is an important factor in children’s linguistic development: – “As they grow older, [children] become more accomplished and learn how to use linguistic features that are specific to the genre, for example the appropriate use of present or past tense”. Source: Understanding Children’s Development, p. 419, 2011.
  • 58. Genre is ubiquitous • Language does not exist in abstract. • Language use changes with the situation, purpose, audience, emotional state, etc. • We might express the same meaning with different words according to different communicative contexts.
  • 59. Benefits in Language Technology • Genre competence and AGI: – could improve many current NLP subfields, e.g. automatic summarization, machine translation, or Natural Language Generation (NLG) – Information Retrieval (IR) would benefit by identifying more relevant documents to the queries – Business Intelligence (BI) and Customer Experience Management (CEM) would find actionable information in the deluge of data
  • 60. GENRE-AWARE Advanced Text Analytics VIABLE PROJECTS
  • 61. Action Projects • Resource: DaisyKB - Fine-grained multilingual knowledge base • Tool: contextify - MetaDataTagger: creating metadata for genre, sublanguage, domain • Information System: WebRider – socially- and emotionally-intelligent web search • Actionable information: SearchInFocus Pilot study: “Using query log analysis for BI and CEM”. Applying findability for Business Intelligence (BI) and Customer Experience Management (CEM). • More…
  • 62. Fine-grained multilingual knowledge base DAISYKB
  • 63. The word “bank”
  • 64. The 2 English senses are translated using 2 different Italian words. Senses and words across languages are linked together via the abstract representation…
  • 65. The importance of the abstract representation • It enables cross-linguality and multilinguality • Consistency across all languages.
  • 66. Single words are often not enough to understand the meaning… • DaisyKB can have fields with: – Collocations – Terminology – Companies – Domain – Frequent co-occurring words – Frequent queries containing the keyword – Sublanguage – Genre – Etc.
  • 67. Object-Oriented Approach • Each entry is like an object • Each field is like a method in an object, you just call it when you need it • An object can be called at different levels of granularity
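
Slides 63 to 67 can be read as a small data model: an entry holds senses, each sense points to an abstract, language-independent representation, and optional fields are consulted only when needed. The sketch below is one possible reading. The class names, concept identifiers, and field names are hypothetical, and the Italian words (banca, riva) are supplied here as an example since the slide does not name them.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    concept_id: str                             # abstract representation shared by all languages
    words: dict = field(default_factory=dict)   # language code -> word
    fields: dict = field(default_factory=dict)  # optional fields: domain, genre, collocations...

    def get(self, name, default=None):
        """Look up an optional field only when it is needed."""
        return self.fields.get(name, default)


@dataclass
class Entry:
    keyword: str
    senses: list = field(default_factory=list)

    def translate(self, language, domain):
        """Return the word for the sense whose domain matches, if any."""
        for sense in self.senses:
            if sense.get("domain") == domain:
                return sense.words.get(language)
        return None


bank = Entry("bank", [
    Sense("FINANCIAL_INSTITUTION", {"en": "bank", "it": "banca"}, {"domain": "finance"}),
    Sense("RIVER_SIDE", {"en": "bank", "it": "riva"}, {"domain": "geography"}),
])
print(bank.translate("it", "finance"))
```

Because both senses hang off an abstract concept rather than an English word, adding a third language only means adding one more key to each `words` mapping.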
  • 68. Pre-population… • Migrating existing resources to populate DaisyKB. • The dictionary structure must be well thought out in terms of flexibility and design in order to accommodate future needs.
  • 69. Benefits • Standardization of scattered resources • Flexibility: it can be updated any time, systematically • Consistency • Coherence • Reduced management • Reduced idiosyncrasies and errors • Increased efficiency • Reusable for many products or activities • It can be open-source and collaborative • It can be built with XML and programmed with XSLT for quick updating and deletion or insertion of new fields.
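
To make the last benefit concrete, here is a minimal sketch of inserting a new field into an XML-encoded entry. The element and attribute names are hypothetical, and Python's standard `xml.etree.ElementTree` module is used as a stand-in for the XSLT processing mentioned on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for one DaisyKB entry.
ENTRY_XML = """<entry keyword="bank">
  <sense id="s1"><domain>finance</domain></sense>
  <sense id="s2"><domain>geography</domain></sense>
</entry>"""


def add_field(entry, sense_id, name, value):
    """Insert a new field element under the sense with the given id."""
    for sense in entry.findall("sense"):
        if sense.get("id") == sense_id:
            ET.SubElement(sense, name).text = value
            return True
    return False


entry = ET.fromstring(ENTRY_XML)
added = add_field(entry, "s1", "genre", "annual report")
print(ET.tostring(entry, encoding="unicode"))
```

Since a new field is just a new child element, existing consumers that ignore unknown elements keep working after the update.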
  • 70. Contextify is a metadata tagger: text classification according to: domain, genre, sublanguage (register, style, sentiment, emotions, opinions…) CONTEXTIFY
  • 71. Context- and Content-revealing Metadata and Text-Internal Annotation • Context can be “reconstructed” if we know the genre, and more accurately, if we know other textual dimensions such as the domain of a text, the sublanguage used in the text, the sentiment expressed in a text, etc.
  • 72. Context is the King • Context helps disambiguate words and assess the relevance/importance of texts • Context helps identify the most important information. • How can we capture context from a text? In this application, I would start with genre, sublanguage, and domain, i.e. three textual dimensions that say something about the communicative context in which a text or a document has been issued: – A “weird” word like “Spweet” is not a typo if it belongs to a Twitter micropost (genre and sublanguage: tweet spam) – A “normal” word like “mouse” is a specialized term if it belongs to the computer domain. – Figurative senses and metaphors: surfing (sport, internet communication), agile (ordinary word, software), sentence (law, grammar), appeal (ordinary language: “appeal for help” or legal sublanguage: to lodge an appeal, genre: newspaper, court act) etc.
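
The disambiguation idea in slide 72 can be sketched as a lookup keyed on both the word and a context dimension. This is a deliberately simple illustration: the sense inventory, the glosses, and the function name are hypothetical, and only the domain dimension is modelled.

```python
# Hypothetical sense inventory keyed by (word, domain).
# A domain of None marks the default, everyday sense.
SENSES = {
    ("mouse", "computing"): "pointing device",
    ("mouse", None): "small rodent",
    ("sentence", "law"): "punishment imposed by a court",
    ("sentence", "grammar"): "unit of text",
}


def resolve(word, domain=None):
    """Pick the sense for a word given the domain metadata of its text."""
    return SENSES.get((word, domain), SENSES.get((word, None)))


print(resolve("mouse", "computing"))
print(resolve("mouse"))
```

Genre and sublanguage could be added as further key components in the same way, which is the kind of metadata a tagger like Contextify is meant to supply.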
  • 73. Content Enrichment
  • 74. Benefits: Contextify • Helps identify: – the most reliable/relevant documents – how language is used in different communicative contexts – linguistic conventions – textual conventions – etc. • Contextualized information can be exploited to improve automatic summarization, machine translation, terminology extraction, indexing, etc.
  • 75. WEB RIDER • A genre-aware information system
  • 76. Web Rider is… • “WebRider” is a metaphor to describe a “socially and emotionally intelligent” information system that helps web users make sense of information on the web by internalizing genre cues. • The claim: genre cues contribute to information understanding and decision making, thus helping users ride the web safely.
  • 77. Inspirations • Social and emotional intelligence [Daniel Goleman, psychologist] is the competence that lies behind mutual understanding, group interaction, social behaviour and all kinds of social actions. • Genres are social actions [Genre as social action (Carolyn Miller, 1984)]
  • 78. Simply put… • Social actions – the driving forces behind the web – are manifested through genres. • Genres (e.g. FAQs, press releases, product descriptions, instructions, guides, etc.) are recurring and recognized patterns of communication that can help contextualize information. • An information system with Genre Competence would make web users socially and emotionally intelligent because, through a deep understanding of genre conventions and expectations, they would be able to evaluate the genuineness, reliability, authenticity, and actual purpose of information distributed on the web.
  • 79. The Technology underlying WebRider • Under discussion…
  • 80. Main Benefits • Increased information understanding: WebRider would promote more adequate social and emotional behaviours through its genre competence. • [Increased accuracy and relevance: more accurate information systems and more relevant search results] • Ethics: protection against phishing, scams, malicious behaviour, cyberbullying, etc. while searching.
  • 81. Pilot study: SearchInFocus FINDABILITY FOR BI & CEM
  • 82. If you think of BI and CEM in terms of searchability, findability and actionability… – “Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.” [DM Review Magazine, February 2003 Issue] – ECONOMIC LOSS!
  • 83. Simple search is not enough… • Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms. – “a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization" from "Managing Unstructured data in the organization", Prentice Hall 2008, pp. 1–13]
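The contrast drawn in the quotation on slide 83 can be illustrated with a small Python sketch. It compares a literal single-term search with a search expanded by narrower terms. The `FELONY_TYPES` list, the toy documents and both function names are assumptions made for illustration; a real system would draw its narrower terms from a thesaurus or ontology.

```python
# Hand-made list of narrower terms, following the crimes named in the quotation.
# The table and the toy documents below are illustrative assumptions.
FELONY_TYPES = {"felony": ["arson", "murder", "embezzlement", "vehicular homicide"]}


def simple_search(term, documents):
    """Return indices of documents that contain the literal term."""
    term = term.lower()
    return [i for i, doc in enumerate(documents) if term in doc.lower()]


def expanded_search(term, documents, narrower=FELONY_TYPES):
    """Return indices of documents containing the term or any narrower term."""
    terms = [term.lower()] + narrower.get(term.lower(), [])
    return [
        i for i, doc in enumerate(documents)
        if any(t in doc.lower() for t in terms)
    ]


docs = [
    "The suspect was charged with a felony.",
    "Police are investigating an arson attack on the warehouse.",
    "The quarterly report shows rising sales.",
]
```

With these toy documents, `simple_search("felony", docs)` finds only the first document, while `expanded_search("felony", docs)` also finds the arson report. Substring matching is used only to keep the sketch short.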
  • 84. Text Analytics • A set of NLP techniques that provide some structure to textual documents. • Common components: – Tokenization – Morphological Analysis – Syntactic Analysis – Named Entity Recognition – Sentiment Analysis – Automatic Summarization – Etc.
  • 85. Text Analytics • Commercial: – Attensity – Clarabridge – Temis – Lexalytics – Texify – SAS – IBM Cognos – etc. • Open Source: – GATE – NLTK – UIMA – Rapid Miner – etc.
  • 86. SearchInFocus: searching for actionable intelligence • Pilot study: “Using query log analysis for BI and CEM” [In collaboration with Findwise(?)]
  • 87. WHERE TO START?
  • 88. Strategy • Big unstructured textual data • Use cases: what kind of actionable information do users need? • Wireframes • User test • BETA: open-source web-based prototypes • User feedback & collaborative improvement
  • 89. Academia • Many students can work on it at the same time, with different languages • There are many linguistic, cognitive and technical aspects to be analysed • Creation of multilingual shareable resources • Open source algorithms enriched by a collaborative effort
  • 90. Thank you for your attention. Questions?

Editor's notes

  1. Thank you very much for being here today. My name is MS and I have been doing research in AGI for about 10 years. In this talk I present a summary of the state of the art in AGI and show how a textual dimension like genre can help contextualize information.
  2. If there is time, I would like to introduce some future viable projects where the concept of genre plays an important role.
  3. Yh-utbildning: Yrkeshögskolan, i.e. higher vocational education (yrkeshögskoleutbildning). http://www.kyh.se/pagaende/agile-web-developer/ No start in the autumn term 2012. Kristian Grossman-Madsen, Programme Manager, Stockholm & Göteborg, kristian.madsen@kyh.se, 08-410 821 31, 0768-85 21 31. KYH AB, Vanadisvägen 9, 113 46 Stockholm, Tel: 08-410 821 20, www.kyh.se
  4. Since my life has been quite intense since I moved to Sweden, this year I decided to slow down a little for a few months. Currently I am moderating a blog and a LinkedIn group and elaborating a computational theory of genre. Other activities: finding a job position where I can implement some applications I have in mind; finding large, possibly public, RAW corpora to test some hypotheses; networking.
  5. The pressing need: exploiting BIG TEXTUAL DATA. Nowadays all kinds of businesses, enterprises and customer care services produce huge amounts of textual data in the form of many different "genres", i.e. emails, memos, notes from call centers, news, user groups, chats, reports, tweets, Facebook pages, blogs, forums, marketing material and so on. The word "genre" means "type of text". All these genres contain valuable but UNSTRUCTURED textual data. It is difficult to search and find the information we need when data is unstructured. Contextualized information: What do I mean by contextualized information? I mean reconstructing the communicative context and the communicative purpose for which a text has been produced, by analysing how the language is used and how the content is organized in the text. The bag-of-words approach does not return contextualized information. So morphology is important and syntax is important, but the communicative context is also important, because a piece of information that is useful in one context might be useless in another. Is a text instructional or is it propaganda? Is it a newspaper article, an official statement, or a confidential email? Is it a public report or an exploratory study? These kinds of details are not always available from the source from which a text is retrieved. Knowledge of the communicative purpose helps us identify actionable information. Actionable information: “Actionable information” provides data that can be used to make specific business decisions or, more generally, any crucial decision. Actionable information is specific, to the point, consistent and credible. Contextualized information and actionable information do not necessarily overlap. Actionable information is a piece of information that is crucial for decision making.
  6. The concept of genre is deeply rooted in our lives, culture and society. This means that this concept is spontaneously, almost instinctively, acknowledged even if people do not know the word “genre” itself.
  7. Genre Analysis
  8. As a researcher, my speciality is Automatic Genre Identification.
  9. In order to show the state of the art of AGI, I will summarize some experimental settings presented in the Springer volume.
  10. Seminal paper: with Karlgren and Cutting, the genre of documents becomes a text-internal class. They used a supervised approach (discriminant analysis) and 20 features.
  11. Stockholm-Umeå Corpus
  12. These genres are not easily recognized by the classifier.
  13. Kris I is pdf. 7-webgenre collection
  14. Manual annotation of a document by genre is, like all manual annotation: tiring (error-prone), time-consuming, expensive, controversial.
  15. Complex features: concession clause. Fine-grained features: mental verbs, activity verbs.
  16. By noisy environment I mean that genre identification is carried out within a collection larger than 1400 web pages, where the other documents are unclassified, selected randomly, and can belong to virtually any genre.
  17. ¾ of noise. Only a 10% decrease in performance.
  18. General-purpose palette for web searches
  19. Results so far are good and encouraging. Not like the iris dataset.
  20. Genre is an internalized concept. Section: Mastering the conventions of different genres.
  21. Language does not exist in the abstract. Language is used differently in different contexts.
  22. For example, let’s take English as the pivot language, and the highly ambiguous word “bank”…
  23. Twictionary: The Dictionary for Twitter. A repository for the meanings and manglings of words and language on Twitter (http://twictionary.pbworks.com/w/page/22547584/FrontPage). Actionable information means having the necessary information immediately available in order to deal with the situation at hand. Versus the legal sense of actionable (ac‧tion‧a‧ble, law): if something you say or do is actionable, it is so bad or damaging that a claim could be made against you in a court of law: “His remarks are actionable in my view.” Genre gives us the compositional context. When we know the genre of a document, we know how the content is organized and where we can find the most important information. For instance, on the web, when the genre of a digital text is unknown or not declared explicitly, users often feel at a loss and do not know how to assess how reliable, objective or useful information is. The same is true within business intelligence, customer care optimization, and many other practical applications. Sublanguage provides a situational context influenced by the medium of communication (e.g. telephone, face-to-face, chats, video-conferencing, microblogging, etc.). Sublanguage has nothing to do with terminology (specialized words, aka terms, used in a specialized domain); sublanguage is not register (e.g. cues of formality, casual conversation, etc.); sublanguage is not style. Think of the sublanguage characterizing tweets and the sublanguage used in customer care help centers or chats. They can be enormously different, though they might all be informal, conversational and polite. Sublanguage is formulaic, cross-topical and mostly domain-independent. For instance, the sublanguage used in a car rental help center is similar to the sublanguage used in a first-aid call center. In both cases, there will be a salutation (e.g. good morning), investigation (e.g. How can I help you? When did this happen? Where are you now?), personal detail requests (e.g. what is your name?), and similar. Domain refers to a field of interest or to a subject matter. It can be medicine, politics, marketing, literary criticism, etc. A domain can have a specific terminology.
  24. Kind of trigger phrases. By acknowledging the concept of genre, we acknowledge that information is organized differently in different types of texts. In practical terms, this means that the genre of a document has a bearing on the identification of relevant content. Emails follow a fairly conventional content organization, where the “core content” might be preceded by salutations, a short introduction and/or additional elements. So they are quite easy to handle.
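As a rough illustration of how the conventional organization of an email can be exploited, the Python sketch below strips a leading salutation and a trailing sign-off to isolate the “core content”. It is a sketch under stated assumptions, not a method described in the talk: the cue patterns `SALUTATION` and `SIGN_OFF` and the function `core_content` are hypothetical, and the cue lists are far from complete.

```python
import re

# Hypothetical cue patterns for the conventional opening and closing of an email.
SALUTATION = re.compile(
    r"^(dear|hi|hello|good (morning|afternoon|evening))\b", re.IGNORECASE
)
SIGN_OFF = re.compile(
    r"^(best regards|kind regards|regards|sincerely|thanks|thank you)\b",
    re.IGNORECASE,
)


def core_content(email_body):
    """Return the non-empty lines between the salutation and the sign-off."""
    lines = [line.strip() for line in email_body.splitlines() if line.strip()]
    # Drop the first line if it looks like a salutation.
    if lines and SALUTATION.match(lines[0]):
        lines = lines[1:]
    # Cut everything from the first line that looks like a sign-off.
    for i, line in enumerate(lines):
        if SIGN_OFF.match(line):
            lines = lines[:i]
            break
    return lines
```

One known limitation of this sketch: a content line that happens to start with a sign-off cue (for example “Thanks for the report”) would be treated as the end of the message.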
  25. Findwise….