SlideShare ist ein Scribd-Unternehmen logo
1 von 92
Downloaden Sie, um offline zu lesen
AMAI - 2009 -           September 8th, 2009 (Mexico D.F.)




Text Mining and Open-ended Questions
          in Sample Surveys




                    Ludovic Lebart
         Centre National de la Recherche Scientifique
              Telecom-ParisTech, Paris, France



                                                               1
Text Mining and Open-ended Questions
                   in Sample Surveys


                       Summary / Outline


1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions
Text Mining and Open-ended Questions
                         in Sample Surveys



1) Principles of Data Mining and Text mining: A reminder
2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions
1- Principles of Data Mining and Text mining: A reminder




    The « data Mining » approach...

✔   Ancient techniques are easier to use,

✔   Ancient techniques are improved

✔   New techniques are conceived

✔   New fields of application

✔   New products:  Softwares 

✔   Need for a selection of methods, of simple and clear strategy for
    data processing
1- Principles of Data Mining and Text mining: A reminder



              Survey data processing and data mining

Reminder : (Fayadd et al.)

Data Mining (KDD) is the non-trivial process of identifying patterns
in huge data sets, these patterns being supposed to be

valid,
     novel,
          potentially useful,
                                 and ultimately…        understandable


These huge data sets could be unstructured, non representative.

The main goal being to automatically extract from the ore (raw data)
the genuine diamond of truth…. (Benzécri 1973)
1- Principles of Data Mining and Text mining: A reminder


           Survey data : Homogeneity of content, of coding…
       … different from the usual inputs of Data Mining programs.

Despite the fact that we may deal with several observational levels
(households, individuals, trajectories or biographical data,
areas or regions…), there is a consistency and a unity of content
in a survey data set - together with general hypotheses formulated
beforehand - that are not present in the usual data mining input data.

A survey (whatever its complexity) is a costly set of measurements
that follows a specific decision.


 In this context, a lot of meta-information is generally available
 (Demographic, economic, sociologic, epidemiologic, etc)
 that provides a framework for the interpretation phase.
1- Principles of Data Mining and Text mining: A reminder


    “Text Mining” and Multivariate exploratory statistical
                     analysis of texts


✔   Initial paradigm:
✔


✔     - Extracting statistical units from texts 
✔


✔     - Complementing lexicometry with a multivariate approach 
✔


✔     - Applying visualization tools to lexical tables 
✔


✔   Evolution and diversification of techniques and approaches




                                                                         7
1- Principles of Data Mining and Text mining: A reminder


                   The fields of Text Mining



    WEB                                    Press



Scientific papers, abstracts                Information Retrieval



Open-ended questions, free responses

                 Qualitative interviews, Discourses, Reports


                                     Complaints


                                                                      8
Text Mining and Open-ended Questions
                       in Sample Surveys



1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?
3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions
2- Open-ended Questions: Why? How?


                Open questions : Why?
 x To shorten interview time:
 Open ended questions are less costly in terms of interview time,
 and generate less fatigue and tension (voluminous lists of items)


x To gather spontaneous information:
Marketing survey questions contain many questions of this type.
" What do you recall (or: what do you like) about this ad? 


x To probe the response to a closed-end question.:
This is the follow up additional question "Why?".
Explanations concerning a response already given have to be
provided in a spontaneous fashion.


  x To get information relating to non-comparable variables:
  Example : Environmental activism, dietary habits….
                                                                     10
2- Open-ended Questions: Why? How?



  Open questions : Why ?

                               Cost
DRAWBACKS
                               Complexity
                               Specificity




ADVANTAGES                      Speed
                                Freedom
                                Specificity




                                                  11
2- Open-ended Questions: Why? How?


           Comparison between open and closed questions

A classical experiment, quoted by Schuman and Presser (1981),
stresses the difficulty of comparing the two types of questionning.

When asked
"what is the most important problem facing this country [USA] at present",
 16% of Americans mention crime and violence (grouped free responses),
whereas the same item asked in a closed question
produces 35% of the same response.

The explanation given by authors is the following:
 lack of security is often considered as a local, not a national problem,
so that the item crime and violence is not mentioned
spontaneously very often.

Closing the question indicates that this response is a relevant or
possible response, resulting in a higher response percentage.
2- Open-ended Questions: Why? How?


           Heuristic value of open-ended questions


In some particular contexts, the absence of a response item can
 play a positive role.
It can establish a climate of confidence and communication,
and lead to better results when certain subjects are brought up.

This is what is indicated by the work of Sudman and Bradburn (1974)
concerning questions having to do with "threats", and of Bradburn et al.
(1979) concerning questions about alcohol and sexuality.

In international studies, it is important to know whether people interviewed
 in different countries understand the closed questions in the same way.
(case of the follow up :”Why” ).

As a matter of fact, it is also legitimate to raise this same issue with respect
to regional and generational differences.
2- Open-ended Questions: Why? How?

       Heuristic value of open-ended questions (continuation)

In some other particular contexts, the cultural gap between those who
have conceived the questionnaire and the interviewees is hidden by the
purely numerical coding of the closed questions.

In a national survey about the attitudes of economically impaired people
towards the minimum wage system in France, a classical open question
was asked at the end of the interview:

“Would you like to add something about some topics that could be
missing in this questionnaire, about the minimum wage system ?”

One answer (among many others of the same vein) was
“ We eat potatoes and eggs, despite my diabetes and my cholesterol,
 because there are cheap.”

Another: “Thank you for coming. It proves that you are thinking of me”.

Some respondents are far from the problematic “Attitude towards an institution”
2- Open-ended Questions: Why? How?

   Empirical Post-Coding of free responses

 (Drawbacks of this type of processing)

 Coder bias: Coder bias is added to interviewer bias,
 since the coder makes decisions and formulates
  interpretations, introducing a «personal touch ».


 Alteration of form: Information is destroyed in its form
 and often weakened in its content: quality of expression,
 level of vocabulary, and general interview tonality are lost.


Weakening of content: (case of responses that are composed,
complex, vague and diversified).

Infrequent responses are eliminated a priori.

                                                                 15
2- Open-ended Questions: Why? How?


Example 1: Open Question « Life » (international sample surveys)

The following open-ended question was asked :

  "What is the single most important thing in life for you?"
    It was followed by the probe:
  "What other things are very important to you?".


  This question was included in a multinational survey conducted
  in seven countries (Japan, France, Germany, Italy, Nederland,
  United Kingdom, USA) in the late nineteen eighties
  (Hayashi et al., 1992).



  Our illustrative example is limited to the British sample
  (Sample size: 1043).

                                                                   16
2- Open-ended Questions: Why? How?


Example 1: Open Question « Life »: Examples of responses



   GenderEduc. Age Responses

     1   1   4   happiness in people around me, contented family,
                 would make me happy
     1   2   2   my own time, not dictated by other people
     1   2   2   freedom of choice as to what I do in my leisure time
     1   3   2   I suppose work
     1   2   1   firm, my work, which is my dad's firm
     2   1   6   just the memory of my last husband
     2   2   6   well-being of my handicapped son
     1   1   5   my wife, she gave me courage to carry on even in the
                 bad times
     2   2   3   my sons, my kids are very important to me, being on
                 my own, I am responsible for their education
     1   3   3   job, being a teacher I love my job, for the well-being
                 of the children                                          17
2- Open-ended Questions: Why? How?

           Example 2: Open Questions / Copy-Test

Following a viewing of a television commercial on breakfast cereals (copy-test),
several open questions were asked.
One of them, which we shall use as our example, is :

                 What was the main idea of this commercial?

In addition a number of closed questions were also asked (socio-demographic
characteristics of respondents, purchase intent toward product seen).
Purchase intent being an important issue, this question plays a major role in
the discussions that follow.

Two examples of responses to that open question.

1 - That it has complex carbohydrates in it, it has energy releaser
and it tastes good... It showed people eating grape nuts.

2 - It gives you energy in the morning, nothing else.
2- Open-ended Questions: Why? How?


    Example 3: An international survey (Tokyo Gas Company)



A survey in three cities (Tokyo, New York, Paris) about dietary habits.


The common open-ended questions were:

 "What dishes do you like and eat often?
(With a probe: "Any other dishes you like and eat often?").
“ What would be an ideal meal?”

Akuto H.(Ed.) (1992). International Comparison of Dietary Cultures,
Nihon Keizai Shimbun, Tokyo.

 Akuto H., Lebart L. (1992). Le Repas Idéal. Analyse de Réponses
Libres en Anglais, Français, Japonais. Les Cahiers de l'Analyse des
Données, vol XVII, n°3, Dunod, Paris
2- Open-ended Questions: Why? How?

Example 3: An international survey (continuation)

               Four responses (New York)

"What dishes do you like and eat often?
“What would be an ideal meal?”

---- 1
SPAGHETTI,CHINESE
++++
CAESAR SALAD,LOBSTER TAILS,BAKED POTATO, CHOCOLATE MOUSSE

---- 2
SEAFOOD,GREEN SALAD,CHINESE FOOD
++++
CHAMPAGNE,CAVIAR,GREEN SALAD,GRILLED SEAFOOD

---- 3
CHINESE FOOD
++++
CHINESE FOOD,FRENCH FOOD,VEAL,BREAD
---- 4
PASTA
++++
BEARNAISE BEEF,CHINESE FOOD,ITALIAN FOOD,PASTA
2- Open-ended Questions: Why? How?


   Example 4: Evaluación de vinos mediante notas y comentarios


               Guia de Catas de Castilla y León (2005)
        522 vinos de Castilla y León pertenecientes a 207 bodegas




5 denominaciones: Bierzo, Cigales, Ribera del Duero, Rueda, Toro
2- Open-ended Questions: Why? How?



     Example 4: Evaluación de vinos mediante notas y comentarios
                               Example of two texts

---- Nota= 80 Valdelosfriales-2003
Joven típico, con notas de tempranillo y balsámicos; en boca amable y frutoso.



---- Nota=91 Tares P3-2001 premium
Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo
caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado,
tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones
de tierra húmeda y pólvora en el largo final.
Text Mining and Open-ended Questions
                       in Sample Surveys




1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data
4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions
3- From texts to numerical data


 Statistical units derived from texts 



                        CORPUS

             Texts

          Sentences or responses


   Segments or quasi-segments



 Words,     lemmas,     n-grams

           Characters
           s




                                            24
3- From texts to numerical data

                Ambiguity of frequencies:
statistical frequency versus « linguistic frequency  »

               Sample surveys
              (statistical frequency)

                Closed questions


                  Open questions
                  ouvertes

                         Texts
                ( linguistic frequency)



                                                         25
3- From texts to numerical data



          Example 1: Question « Life » -         continuation



The counts for the first phase of numeric coding are as follows:
Out of 1043 responses, there are 13 669 occurrences,
with 1 413 distinct words.
When the words appearing at least 16 times are selected,
there remain 10 357 occurrences of these words,
with 135 distinct words (types).

The same questionnaire also had a number of closed-end
questions (among them, the socio-demographic characteristics
of the respondents, which play a major role).


In this example we focus on a partitioning of the sample into
 nine categories, obtained by cross-tabulating age
(three categories) with educational level (three categories).
                                                                   26
3- From texts to numerical data


       Example 1: Selected statistical units


Words Appearing at Least Sixteen Times (Alphabetic Order)
       in the 1043 responses to the open question


Word Frequency   Word Frequency Word Frequency

I        248     go           19      of           312
I'm       22     going        26      on            59
a        298     good        303      other         33
able      55     grandchildren 30     others        17
about     31     happiness 227        our           29
after     26     happy       137      out           34
all       86     have         99      own           16
and      504     having       70      peace         77
anything 19      health      609      people        61
are       65     healthy      45      really        28



                                                         27
3- From texts to numerical data


Example of morpho-syntactic information


 Gender Educ. Age     Tagged responses

  1    1   4   happiness/NN in/IN people/NNS around/IN me/PRP
               ,/, contented/VBN family/NN ,/, would/MD make/VB
               me/PRP happy/JJ
  1    2   2   my/PRP$ own/JJ time/NN ,/, not/RB dictated/VBN
               by/IN other/JJ people/NNS
  1    2   2   freedom/NN of/IN choice/NN as/IN to/TO what/WP
               I/PRP do/VB in/IN my/PRP$ leisure/NN time/NN
  1    3   2   I/PRP suppose/VBP work/NN
  1    2   1   firm/NN ,/, my/PRP$ work/NN ,/, which/WDT is/VBZ
               my/PRP$ dad's/NNS firm/NN
  2    1   6   just/RB the/DT memory/NN of/IN my/PRP$ last/JJ
               husband/NN
  2    2   6   wellbeing/NN of/IN my/PRP$ handicapped/JJ son/NN
  1    1   5   my/PRP$ wife/NN ,/, she/PRP gave/VBD me/PRP
               courage/NN to/TO carry/VB on/IN even/RB in/IN
               the/DT bad/JJ times/NNS



                                                                  28
3- From texts to numerical data

Example of a lexical contingency table


          - First partition: three age categories

          - less than 30 years [noted -30],
          - between 30 years and 55 years [-55]
          - over 55 years [+ 55] .

          - Second partition: three educational levels

          - No degree or Low [noted L],
          - Medium [M],
          - High level [H]




                                                         29
3- From texts to numerical data

      Example 1: A lexical contingency table

Partial listing of lexical table cross-tabulating 135 words of frequency
     greater than or equal to 16 with 9 age-education categories

          L-30 L-55 L+55 M-30 M-55 M+55 H-30 H-55 H+55

  I           2   46 92      30    25 19      11     21    2
  I'm         2    5   9     3      2 1        0      0    0
  a          10   56 66      54    44 19      20     22    7
  able        1    9 16      9      7 4        4      5    0
  about       0    3 13      7      1 2        4      1    0
  after       1    8 11      3      1 2        0      0    0
  all         1   24 19      8     18 6        3      5    2
  and         8   89 148     86    73 30      25     32   13
  anything    0    4   9     1      3 0        1      1    0




                                                                    30
3- From texts to numerical data

Example 2: "What is the main idea in this commercial"
       Words appearing more than 9 times (100 responses)

Number Word       Frequency       Number      Word   Frequency

1     I            14              25      in          27
2     a            59              26      is          37
3     about        15              27      it         133
4     all          21              28      it's        28
5     and          42              29      long        14
6     are          25              30      morning      9
7     been         12              37      nothing     25
8     carbohydrate 14              32      nutritional 9
9     carbohydrates33              33      nutritious 12
10    cereal       34              34      nuts        25
11    complex      25              35      of          25
12    crunchy       9              36      people      28
13    eaten        10              37      showed      11
14    eating       19              38      taste       11
15    energy       33              39      that        80
16    for          57              40      that's      13
17    give          9              41      he          82
18    gives        11              42      they        50
19    good         52              43      to          32
20    grape        25              44      was         19
21    has          30              45      with        11
22    have         27              46      years       11
23    healthy      23              47      you         81
24    how           9
3- From texts to numerical data
 Example 2: "What is the main idea in this commercial"          Examples of repeated segments

 SEGM FREQ LENGTH "TEXT of SEGMENT"
-------------------------------------
-----------------------------------------a
   1    8    3 a long time
-----------------------------------------are
   2    6    4 are good for you
-----------------------------------------carbohydrates
   3    5    3 carbohydrates in it
-----------------------------------------complex
   4   15    2 complex carbohydrates
-----------------------------------------for
   5   37    2 for you
-----------------------------------------give
   6    7    3 give you energy
-----------------------------------------gives
   7   11    2 gives you
   8    9    3 gives you energy
-----------------------------------------good
   9   24    2 good for
  10   22    3 good for you
-----------------------------------------grape
  11   25    2 grape nuts
-----------------------------------------have
  12    6    3 have been eating
-----------------------------------------healthy
  13    6    3 healthy for you
-----------------------------------------is
  14    9    4 is good for you
-----------------------------------------it
  15   26    2 it has
  16   19    2 it is
  17   14    2 it was
  18    8    3 it gives you
  19    8    3 it has a
  20    6    3 it has complex
  21    5    3 it is good
  22    6    4 it gives you energy
-----------------------------------------people
3- From texts to numerical data

           Example 3: An international survey (Tokyo Gas Company)

The common open-ended question : "What dishes do you like and eat often?
(With a probe: "Any other dishes you like and eat often?").

- Sub-sample 1 (city of Tokyo) : 1008 individuals.
The global corpus of open responses contains 6219 occurrences of
832 distinct words. 139 words appear at least 7 times, leading to 4975
occurrences.

- Sub-sample 2 (city of New-York) contains 634 individuals.
 (6511 occurrences of 638 distinct words).
The processing takes into account the 83 words appearing at least 12 times.

- Sub-sample 3 (city of Paris) contains 1000 individuals.
The global corpus contains 11108 occurrences of 1229 distinct words.
The processing takes into account the 112 words appearing at least 18 times,
leading to 7806 occurrences.

- The three sets of respondents are broken down into into six categories
(three categories of age, combined with the gender).
3- From texts to numerical data

Example 3: An international survey (Tokyo Gas Company)
!------------------------------------!
!      words (frequency order)        !
!-------!---------------------!------!
! num. !      used words       ! freq.!
!-------!---------------------!------!
!    12 ! CHICKEN              ! 254 !
!    73 ! STEAK                ! 101 !
!
!
     49 ! PASTA
     22 ! FISH
                               !
                               !
                                   95 !
                                   87 !                              City of New York
!    60 ! SALAD                !   85 !
!     1 ! AND                  !   85 !
!
!
     23 ! FOOD
     52 ! PIZZA
                               !
                               !
                                   82 !
                                   62 !
                                                      The common open-ended question :
!    79 ! VEGETABLES           !   57 !               "What dishes do you like and eat
!     4 ! BEEF                 !   56 !
!    71 ! SPAGHETTI            !   55 !               often?
!
!
     13 ! CHINESE
     80 ! WITH
                               !
                               !
                                   54 !
                                   48 !
                                                      (With a probe: "Any other dishes you
!    59 ! ROAST                !   47 !               like and eat often?").
!    58 ! RICE                 !   45 !
!    67 ! SHRIMP               !   45 !
!    43 ! MACARONI             !   42 !
!    56 ! POTATOES             !   39 !
!    35 ! HAMBURGERS           !   36 !
!    75 ! TUNA                 !   35 !
!    26 ! FRIED                !   33 !
!    77 ! VEAL                 !   33 !
!    38 ! ITALIAN              !   31 !
!     2 ! BAKED                !   29 !
!    48 ! PARMESAN             !   29 !
!    55 ! POTATO               !   27 !
!    46 ! MEATBALLS            !   25 !
!     3 ! BEANS                !   24 !
!    45 ! MEAT                 !   24 !
!    76 ! TURKEY               !   24 !
!    14 ! CHOPS                !   23 !
!    34 ! HAMBURGER            !   22 !
!------------------------------------!
3- From texts to numerical data

     Example 4: Evaluación de vinos mediante notas y comentarios


Pos.        Palabra     Frec

1           de          891                 Lematización:
2           y           806                 -Singulares y plurales
3           en          694
                                            - Masculino y femenino
                                            - Formas verbales a infinitivo
4           boca        433                 -…
5           con         356
                                            Eliminados artículos,
6           fruta       334
                                            preposiciones …
7           un          308

8           nariz       246                 Conservadas palabras
                                            utilizadas al menos 8 veces.
9           muy         237
10          la          211                 Quedan
10          notas       211                 -250 palabras
12          que         168
                                            -443 vinos

13          taninos     167
                                                            P1 P2 ... P250
14          el          159
                                       Vino 1     0              1   ...     2
15          una         152
                                       Vino 2     1              0   ...     1
16          madera      140
                                       Vino 3     0              0   ...     1
17          bien        116
                                       . . . . . . .             .   . .     .
18          toque       108
                                       Vino 443   1              2   ...     0
3- From texts to numerical data

          Example 4: Evaluación de vinos mediante notas y comentarios (Continuation)


           boca (mouth)
               fruta (fruit)
               muy (very)
             nariz (nose)
               nota (note)
       taninos (tannins)
         madera (wood)
             buen (good)
                 rojo (red)
           negro (black)
              toque (hint)
               bien (well)
                final (end)
      maduro (ripened)
 balsámico (balsamic)
               vino (wine)
            todavía (still)
     elegante (elegant)
  agradable (pleasant)
           jugoso (juicy)
       medio (medium)
                 fino (fine)
algo (some/something)
         cereza (cherry)
                ser (to be)
             ligero (light)
             suave (mild)
     potente (powerful)
         acidez (acidity)
                               0   50      100       150      200         250   300   350   400
                                                           Frequency
Text Mining and Open-ended Questions
                       in Sample Surveys




1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.
5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions
4) Basic statistical tools: Visualization, Characteristic words.



✔   Applying visualization tools to lexical tables
✔

    ●   Principal axes analyses of lexical tables
    ●   Classification (clustering) of words and texts
    ✔


✔   Selecting characteristic units and responses
✔   (or: sentences)
✔

    ●   Characteristic units (words, segments, lemmas)
    ●   Selecting « Modal responses »




                                                                               38
4) Basic statistical tools: Visualization, Characteristic words.


 Briefly, one can summarize the principles of methods for performing these
data reductions:

 Principal axes methods, largely based upon linear algebra, produce
graphical representations on which the geometric proximities among row-
points and among column-points translate statistical associations among rows
and among columns. Correspondence analysis belongs to this family of
methods.

 Clustering or classification methods that create groupings of rows or of
columns into clusters (or into families of hierarchical clusters) including the
SOM (Self Organizing Maps, or Kohonen maps).

     These two families of methods can be used on the same data matrix and
    they complement one another very effectively.

     Selection of characteristic units and responses (or: sentences)
    Characteristic units (words, segments, lemmas)Selecting « Modal
    responses »
4) Basic statistical tools: Visualization, Characteristic words.

                                         General factor for individual i




                      x = aj f + ε
                         i
                          j                         i                j
                                                                     i
Value of variable j
for individual i                               Residual (hopefully small)


                                           Coefficient of variable j


            Known              =          Unknown

 Visualization through principal coordinates, « a breakthrough in 1904 ».

Charles Spearman (1904) – “General intelligence, objectively determined and
    measured”. Amer. Journal of Psychology, 15, p 201-293.
                                                                                     40
4) Basic statistical tools: Visualization, Characteristic words.




Garnett J.-C. (1919) - General ability, cleverness and purpose. British J. of Psych.,
9, p 345-366.
Thurstone L. L. (1947) - Multiple Factor Analysis. The University of Chicago Press,
Chicago.



 x = a j f + b j g + ... + ε
      i
       j                       i                        i                         j
                                                                                  i




                                                                                      41
4) Basic statistical tools: Visualization, Characteristic words.



         Singular Values Decomposition is a theorem, not a model
 Eckart C., Young G. (1936) - The approximation of one matrix by another of lower rank.
 Psychometrika, l, p 211-218.

 Eckart C., Young G. (1939) - A principal axis transformation for non- hermitian matrices.
 Bull. Amer. Math. Assoc., 45, p 118-121.




          =   λ1        ×         + ... + λ α        ×         + ... + λ p          ×


   X               v1       u'1                 vα       u'α                   vp       u'p




A precursor: Pearson K. (1901) - On lines and planes of closest fit to
systems of points in space. Phil. Mag. 2, n°ll, p 559-572.


                                                                                              42
4) Basic statistical tools: Visualization, Characteristic words.

            Image “Cheetah” (Data Compression, Mark Nelson)
               and table (200 x 320) containing levels of grey.




 95    88    88      87    95    88    95    95    95   106    95    78    65    71    78    77    77   etc.
143   144   151     151   153   170   183   181   162   140   116   128   133   144   159   166   170
153   151   162     166   162   151   126   117   128   143   147   175   181   170   166   132   116
143   144   133     130   143   153   159   175   192   201   188   162   135   116   101   106   118
123   112   116     130   143   147   162   183   166   135   123   120   116   116   129   140   159
133   151   162     166   170   188   166   128   116   132   140   126   143   151   144   155   176
160   168   166     159   135   101    93    98   120   128   126   147   154   158   176   181   181
154   155   153     144   126   106   118   133   136   153   159   153   162   162   154   143   128
159   153   147     159   150   154   155   153   158   170   159   147   130   136   140   150   150
151   144   147     176   188   170   166   183   170   166   153   130   132   154   162   120   135
155   181   183     162   144   147   147   144   126   120   123   129   130   112   101   135   150
166   147   129     123   133   144   133   117   109   118   132   112   109   120   136   120   136
136   130   136     147   147   140   136   144   140   132   129   151   153   140   128   153   147
130   133   140     124   136   152   166   147   144   151   159   140   123   130   123   109   112
126   120   143     145   162   153   155   175   154   144   136   130   120   112   123   123   144
144   159   155     155   162   166   158   147   140   147   126   123   132   135   136   144   147
136   143   162     175   136   110   112   135   120   118   126   151   150   130   129   133   147
133   151   143     106    85    93   128   136   140   140   144   143   126   117   116   129   124
……………………………..etc.
                                                                                                        43
4) Basic statistical tools: Visualization, Characteristic words.


Eigenvalues of the Correspondence Analysis of the previous table

  Trace before diagonalization: 0.15930 Trace after diagonalization: 0.15930
 eigenvalues
    1:    0.045 28.549 28.549 **************************************************
    2:    0.028 17.695 46.243 ******************************
    3:    0.019 12.205 58.448 *********************
    4:    0.012 7.306 65.754 ************
    5:    0.007 4.674 70.428 ********
    6:    0.006 3.516 73.944 ******
    7:    0.005 2.944 76.888 *****
    8:    0.003 2.179 79.067 ***
    9:    0.003 1.869 80.936 ***
   10:     0.002 1.531 82.467 **
   11:     0.002 1.371 83.838 **
   12:     0.002 1.106 84.944 *
   13:     0.002 1.066 86.010 *
   14:     0.002 0.956 86.965 *
   15:     0.001 0.791 87.756 *
   16:     0.001 0.758 88.514 *
   17:     0.001 0.690 89.204 *
   18:     0.001 0.567 89.771
   19:     0.001 0.554 90.325
   20:     0.001 0.477 90.801
   21:     0.001 0.422 91.223
   22:     0.001 0.406 91.629
   23:     0.001 0.384 92.013
   24:     0.001 0.339 92.352                                                      44
4) Basic statistical tools: Visualization, Characteristic words.

Reconstitution of the Cheetah with 2, 4, 6, 8, 10, 12, 20, 30, 40 principal axes




                                                                                45
4) Basic statistical tools: Visualization, Characteristic words.

A pedagogical example: Description of « Textual Graphs »




                                                                         46
4) Basic statistical tools: Visualization, Characteristic words.

 Each area “answers” to the fictitious “open-question” :
          Which are your neighbouring areas?

**** Ain
   Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie
**** Aisne
   Aisne Ardennes Marne Nord Oise Seine_Marne Somme
**** Allier
   Allier Cher Creuse Loire Nievre Puy_de_Dome Hte_Saone
**** Alpes_Prov
   Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse
**** Alpes_Hautes
   Alpes_Hautes Alpes_Prov Drome Isere Savoie
**** Alpes_Marit
   Alpes_Marit Alpes_Prov Var
**** Ardeche
   Ardeche Drome Gard Loire Hte_Loire Lozere
**** Ardennes
   Ardennes Aisne Marne Meuse
……………………….
                                                                         47
4) Basic statistical tools: Visualization, Characteristic words.

The idea: When a pattern exists
within a text, some techniques may                                       This map is blindly
detect it and exhibit it.                                                produced from the
                                                                         previous texts.




                                                                                      48
4) Basic statistical tools: Visualization, Characteristic words.




   Example: CORRESPONDENCE ANALYSIS of a simple
                 lexical table


Correspondence analysis can be presented in several different ways.

• It is difficult to trace the method's history accurately
(see, e.g., Hill, 1974 ; Benzecri, 1976 ; Nishisato, 1980; Gifi, 1990).

•The underlying theory probably dates back to...

•Fisher (1936) , Guttman (1941), and Hayashi (1956).




                                                                                 49
4) Basic statistical tools: Visualization, Characteristic words.




• Correspondence analysis and principal components
analysis are used under different circumstances:

• Principal components analysis is used for tables
consisting of continuous measurements.

• Correspondence analysis is best applied to
contingency tables (cross-tabulations) frequently
encountered when analyzing textual data.

• By extension, it also provides a satisfactory
description of data tables with binary coding.




                                                                         50
4) Basic statistical tools: Visualization, Characteristic words.




• Cross-tabulations or contingency tables are among the
most common data structures used for analyzing
qualitative data.

• By looking simultaneously at two partitions at a time of a
population or sample, a cross-tabulation enables us to
work with variations in the data by response categories, a
necessary step for the interpretation of results.




                                                                          51
4) Basic statistical tools: Visualization, Characteristic words.



• We will use as a leading example a small
contingency table.
• However, this kind of exploratory method is chiefly
useful when we are dealing with very large data
tables
•                     (pedagogical paradox)


• In the following table, the 14 rows are words used
in responses to an open-ended question given by
2000 respondents.
•The 5 columns are the educational levels of the
respondents.

                                                                         52
4) Basic statistical tools: Visualization, Characteristic words.

A contingency table crossing words and education level

                        No    Elem.Trade High Coll-               Total
    Words               degr. Sch. Sch. Sch. ege

    Money                 51     64     32     29      17         193
    Future                53     90     78     75      22         318
    Unemployment          71    111     50     40      11         283
    Decision               1      7      5      5       4          22
    Difficult              7     11      4      3       2          27
    Economic               7     13     12     11      11          54
    Selfishness           21     37     14     26       9         107
    Occupation            12     35     19      6       7          79
    Finances              10      7      7      3       1          28
    War                    4      7      7      6       2          26
    Housing                8     22      7     10       5          52
    Fear                  25     45     38     38      13         159
    Health                18     27     20     19       9          93
    Work                  35     61     29     14      12         151

    Total               323     537    322    285    125         1592




                                                                               53
4) Basic statistical tools: Visualization, Characteristic words.

              Row-profiles of the same table


                    No  Elem. Trade High             Coll Total
   Words          degree Sch. Sch. Sch.              ege

Money              26.4     33.2     16.6    15.0      8.8     100.0
Future             16.7     28.3     24.5    23.6      6.9     100.0
Unemployment       25.1     39.2     17.7    14.1      3.9     100.0
Decision            4.5     31.8     22.7    22.7     18.2     100.0
Difficult          25.9     40.7     14.8    11.1      7.4     100.0
Economic           13.0     24.1     22.2    20.4     20.4     100.0
Selfishness        19.6     34.6     13.1    24.3      8.4     100.0
Occupation         15.2     44.3     24.1     7.6      8.9     100.0
Finances           35.7     25.0     25.0    10.7      3.6     100.0
War                15.4     26.9     26.9    23.1      7.7     100.0
Housing            15.4     42.3     13.5    19.2      9.6     100.0
Fear               15.7     28.3     23.9    23.9      8.2     100.0
Health             19.4     29.0     21.5    20.4      9.7     100.0
Work               23.2     40.4     19.2     9.3      7.9     100.0

Total              20.3     33.7     20.2    17.9       7.9    100.0



                                                                           54
4) Basic statistical tools: Visualization, Characteristic words.


              Column-profiles of the same table


                    No   Elem. Trade            High Coll- Total
   Words          degree Sch. Sch.              Sch. ege

Money                15.8     11.9     9.9     10.2     13.6    12.1
Future               16.4     16.8    24.2     26.3     17.6    20.0
Unemployment         22.0     20.7    15.5     14.0      8.8    17.8
Decision               .3      1.3     1.6      1.8      3.2     1.4
Difficult             2.2      2.0     1.2      1.1      1.6     1.7
Economic              2.2      2.4     3.7      3.9      8.8     3.4
Selfishness           6.5      6.9     4.3      9.1      7.2     6.7
Occupation            3.7      6.5     5.9      2.1      5.6     5.0
Finances              3.1      1.3     2.2      1.1       .8     1.8
War                   1.2      1.3     2.2      2.1      1.6     1.6
Housing               2.5      4.1     2.2      3.5      4.0     3.3
Fear                  7.7      8.4    11.8     13.3     10.4    10.0
Health                5.6      5.0     6.2      6.7      7.2     5.8
Work                 10.8     11.4     9.0      4.9      9.6     9.5

Total              100.0 100.0 100.0 100.0 100.0 100.0


                                                                           55
4) Basic statistical tools: Visualization, Characteristic words.

                                              1   j     p
                                         1                       general term of the
                                                                 contingency table
Symmetry                           F=
                                         i        fij
                                 (n,p)
of the two
spaces:                                  n
                 row-profiles                               column-profiles
rows and                 1   j       p                              j   j'
columns                                                     1
                    i

                    i'                                      i

                                                            n

                                         p                                      n
                    n points in R                               p points in R
                  • •• • • •       •   • •                        • • •• • •
                                 • • •••
                     • • • •      • •• • ••                     • • • •• • ••
                  • • •• •          •                           • • •   • •
                  • • •      •    • • • •                       •        •     •
                                 • • • ••                                  • •
                                                                            • •
                                         •
                     p
                                                            Rn
                                      • • •
                    R
                                                                                       56
4) Basic statistical tools: Visualization, Characteristic words.


                                                        A x is 2           (21 % )                                          .
                                                                                                              F in a n c e s
                                                                   .1 5
        H IG H
           .                    F u tu r e
                                 .
                    W ar                                                                            N o DE GRE E
                     .                                                                                            .
                       F ea r                                                                                   .
                         .                                                                         U n em p l o y m e n t
                                    T R A D E S e lfi s h n e s s
                                          .            .                                                          A x is 1
                                          . H e a l th
                       - .2                  - .1                  0                     .1  .                    .2
                                                                                         M on ey                   (57 % )
                                                                                                 .
                                                                                                                        .
                                                                                              EL EM         D iff ic u lt
                                                                          H o u s in g
                                                                          .                             W o rk .
                                                               - .1 5

 .D e c i s i o n
                                                                                          O c c u p a tio n
                                                                                                    .
E c o n o m ic   C O LL E G E
  .            .
                                                                                                                                57
4) Basic statistical tools: Visualization, Characteristic words.



        Characteristic elements (words, lemmas, segments)

        The corpus contains several parts (categories of respondents).

Notations:
kij     -sub-frequency of word i in the part j of the corpus;
ki.     -frequency of word i in the whole corpus;
k.j     -frequency (size) of part j;
k..     -size of the corpus (or, simply, k).

We are interested in the statistical significance of sub-frequency kij .

Is the word i abnormally frequent in part j ? Is it abnormally rare?

The comparison between the relative frequency of word i in part j
and the relative frequency of word i in the entire corpus leads to a classical
statistical test using either the hypergeometric distribution or its normal
                                                                            58
approximation.
4) Basic statistical tools: Visualization, Characteristic words.


The 4 parameters for computing characteristic elements

                      T E X T       P A R T S

    W O R D S




                                           ki j             ki .



                                           k. j            k. .



        k. .    size of corpus
         ki .   frequency of word in corpus

        ki j    frequency of word in text part

        k. j    size of text part

                                                                      59
Text Mining and Open-ended Questions
                   in Sample Surveys


                        Summary / Outline



1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts
6) About textual data in general

7) Conclusions
                                                                   60
5) Applications: Open questions, sample surveys, texts


Example 1: Open Question « Life » (International sample surveys)

The following open-ended question was asked :

  "What is the single most important thing in life for you?"
    It was followed by the probe:
  "What other things are very important to you?".


      The two forthcoming diapositives show the principal plane
       produced by a correspondence analysis of the previous
       lexical contingency table (section 3).

       Proximity between 2 category-points (columns) means
       similarity of lexical profiles of the 2 categories.

       Proximity between 2 word-points (rows) means similarity
       of lexical profiles of these words.
c h u r c h w e lf a r e                            Example 1 (« Life » question)
                                                    m in d                                              son
                                                                       E 3-A G E 3                                                                if
Correspondence                                                                                                                                 th em
Words - Categories                                   k id s                                                      fo r
                                                       p ea ce
                                                                                                      E 2 -A G E 3
   s e c u rity                               co n ten tm en t
                                                                   c h il d r e n
                                                                  E 2-A G E 2

                                                                                    E 1 -A G E 2                                           a re
                                             E 3-A G E 2           f a m i ly                                                         f ro m           sh o u ld
                       g e n e ra l                                                                                                v er y w if e                   m e
 le is u re fr e e d o m                                      h a p p in ess
                  sta n d a rd                                     e m plo y m e nt                                d og                                               h elp
                                            tim e
                            house                                                             w o rld          d a u g h ter

                                  m u sic
                                                                                             a nd                                              w o u ld
       e d u catio n              w o rk                                                                                       I
                                                    m one y                                     be                         w ell
                                                                                                to                                      E 1 -A G E 3
                                                                                                        not                                            a n y thin g
                                                E 2 -A G E 1                                    you
          s a t is f .
                         lo v e                                                                                                                                       da y
                                                                  E 1 -A G E 1
       jo b                                                                                                       ha ve                   g o in g
                                                                                                                            ke e p
                                                                                                        c o m f o r ta b l y
                                                     t h in g s                                 c o m f o r t a b le                                               m o re
                                                                                                                    m u ch
                  E 3 -A G E 1
                                               f rie nd s                                            th in k                                                       c an
    f u tu re                                                      ca r                out                                         go
                                                                                                                                                                              62
p ea c e o f m in d                  Example 1 (« Life » question)
                                          E3-A G E3
                                                                                 Location of
 p e a c e in t h e w o r ld                               E 2 -A G E 3          Segments
                    w e lfa r e o f m y fa m ily
                                        E2-A G E2          h a p p in e ss , g o o d h e a lth
                                               E 1 -A G E 2
                        E3-A G E2                                  la w a n d o r d e r




                                                              a n ic e h o m e
a g o o d st a n d a r d o f liv in g                                                 I d o n 't kn o w
                                 h a v in g e n o u g h m o n e y to l ive    E1-A G E3
                          E 2 -A G E 1
                                     E1-A G E1
a g o o d jo b                                                  c a n 't t h i n k o f a n y t h i n g e l s e

                 fr ie n d s a n d fa m i ly



           E 3 -A G E 1
                                                                                                                 63
5) Applications: Open questions, sample surveys, texts

Example 1 (« Life » question) Characteristic words
         words                   %W                %glob       Fr.W Fr.glob TestValue
        H-30 = -30 * high (Young, High education)

         1 f rie n d s               2    .8   7    1 .1   1    17    11   6    3   .4   4
         2 do                        1    .3   5      .4   5     8     4   7    2   .6   0
         3 w ant                     1    .0   1      .3   0     6     3   1    2   .4   4
         4 b e in g                  2    .1   9    1 .1   1    13    11   6    2   .1   8
         5 jo b                      2    .5   3    1 .3   6    15    14   2    2   .1   6
         6 h a v in g                1    .5   2      .6   7     9     7   0    2   .1   1
         7 t h in g s                     .8   4      .2   7     5     2   8    2   .0   6
        - -- - -- - -- - -- - -- -
         2 w ife                       .0 0           .6 5       0     68      -2 .1 0
         1 h e a lt h                2 .7 0         5 .8 5      16    609      -3 .5 9

        H+55 = +55 * high (Older, High education)

         1 m in d                        2 .5 5       .4 5       5     47           2 .9 1
         2 w e lfa r e                   1 .5 3       .2 1       3     22           2 .4 2
         3 peace                         2 .5 5       .7 4       5     77           2 .1 7
                                                                                             64
5) Applications: Open questions, sample surveys, texts

Example 1 (« Life » question) : Modal Responses
      Category 7                Less than 30 years, high level of education

     1 .3 3 - 1 f r i e n d s , f r i e n d s , m y h o m e l i f e
     1 .1 2 - 2 b e i n g c o n t e n t h a v i n g e n o u g h m o n e y t o d o w h a t y o u w a n t t o d o ,
                 w ith in re a s o n , h a v in g g o o d frie n d s , h a v in g a fu lfillin g jo b to d o ,
                 h a v in g s o m e id e a o f w h a t y o u w a n t to d o a n d th e fre e d o m to c h o o s e ,
                 p ro te c tio n o f th e e n v iro n m e n t
      1 .0 5 - 3 to h a v e g o o d frie n d s a ro u n d h a v in g a g o o d jo b , liv in g in a g o o d a re a ,
                 h a v in g lo ts o f fre e d o m to d o th e th in g s y o u w a n t to d o
       .9 3 - 4 g o o d l i v i n g e d u c a t i o n , g o o d j o b , m o n e y

      Category 9                Over 55 years, high level of education

      .9 7 - 1    to g e th e rn e s s , p e a c e o f m in d , g o o d h e a lth , re lig io n , n o
      .6 4 - 2    n o t t o d i e , p e a c e o f m i n d , d o n 't l i k e p e o p l e l i v i n g e n v i o u s o f e a c h
                  o th e r
      .6 3 - 3    p e a c e o f m in d g o o d h e a lth , h a p p in e s s , e n o u g h m o n e y to k e e p a
                  s ta n d a rd o f liv in g
      .3 8 - 4    w e lfa re o f m y fa m ily w o rk , s a tis fa c tio n , g o o d h e a lth , tra v e l                        65
5) Applications: Open questions, sample surveys, texts

                        Example 1: Similar survey in Japan
                       (Same open question, same categories of
                       respondents)




                                                          66
5) Applications: Open questions, sample surveys, texts

          Example 1: Similar survey in Japan: visualization
          of the characteristic words for 2 categories




                                                          67
5) Applications: Open questions, sample surveys, texts

Example 2: Open Questions / Copy-Test
5) Applications: Open questions, sample surveys, texts


         E x a m p le s o f 2 C h a ra c te ris tic re s p o n s e s fo r 4 c a te g o rie s
                  TEXT 1   PWNB        =   Prob.w.n.buy
                                                                               Example 2: Open
    --    1   to tell you about how long people have eaten them.               Questions / Copy-
    --    1   the complex carbohydrate that are in this cereal.
    --    1   the people who eat this cereal and the product. that's all.      Test
    --    2   it's supposed to be healthy, it has good carbohydrates in it.


                 TEXT 2    Hesi    =       Hesitates

.   --    1   it gives you energy in the morning. nothing else.

.   --    2   grape nuts cereal gives you energy
    --    2   it has complex carbohydrates. they showed the man eating it with
    --    2   strawberries and bananas.


                  TEXT 3     PWB       =    Probably would buy


    --    1   it's nutritious for you. nothing else.

    --    2   that,is good for you,          that,s all it said to me


                  TEXT 4     DW B      =    Definitely would buy

    --    1   they are bigger nuggets. low in carbohydrates, that's all.

    --    2   it has nutty flavor, it is nutritious
5) Applications: Open questions, sample surveys, texts

  Example 3: International survey (Tokyo Gas Company).                    A survey in three cities (Tokyo,
  New York, Paris) about dietary habits. Open question: "What dishes do you like and eat often?




New York: First principal plane. Table crossing words and age x gender categories
5) Applications: Open questions, sample surveys, texts

Example 3: International survey (continuation). Question: "What dishes do you like and eat often?




New York: First principal plane. Example of confidence areas for categories (Bootstrap)
5) Applications: Open questions, sample surveys, texts

Example 3: International survey (continuation). Question: "What dishes do you like and eat often?




New York: First principal plane. Example of confidence areas for words (Bootstrap)
5) Applications: Open questions, sample surveys, texts

Example 3: International survey (continuation). Question: "What dishes do you like and eat often?




New York: First principal plane. Example of Kohonen Map (Self Organizing map).
5) Applications: Open questions, sample surveys, texts


                                            Axis 2 : 1.75%
                                                                                                                                   First Principal Plane
     Mesoneros de Castilla (03)
                                                                   6.0
                                                                                                                                   WINES & MARKS


                        Example 4: Comments about 522 Spanish wines.
                                                    Nota y commentarios activos
                                                                   4.5
                Valdelosfrailes (03)

                                                                                Gran Reserva
                 Fuentenarro (02)


                                   Tinto joven                     3.0


        Torondos (02)                               Tinto crianza

               Gayubar (02)                                                                                               Vega Sicilia 'Único' (94)
                                                                                                                        Viña Sastre Pesus(01)
                                                                   1.5
       Valdetán (02)
                                                                                                                94
                                                                                                                                 Eje de calidad
                            78

                                       79

                                                                                                            93

                                       80
                                                    82                                   88
                                                                                                                       97   Jaros Chafandín (01)
                                            81           83                                        90 91   92
                         -3.0                -1.5             84          85   86                   1.5                                     Axis 1: 3.52%
                                                                                              89
                                                                                    87                               San Román (01)
                                                                                                                      Numanthia (02)
                                                                                                                         Bienvenida Sitio de El Palo (01)
                                                                                                                         Bienvenida Sitio de El Palo (02)
                                                                                                                        95 Termanthia (02)
                                                                                                                             Tares P3 (01)
            Carramimbre (03)                                                                                           Gran Elías Mora (00)
                                                                   -1.5
           Viña Eremos (03)


Marqués de Peñamonte (01)
                                            Tinto reserva
                                                      Tinto roble
5) Applications: Open questions, sample surveys, texts

             Example 4: Comments about 522 Spanish wines (continuation)
                    lowest marks                                                  highest       marks
             agradable             reducido                             potente   estilo                      denso
                                                                                                 vez
 frutal      sobremadurez          discreto                             puro      concentrado                 salado
                                                                                                 graso
 crianza     sequedad              frutosidad                           dejar     necesitar                   impresionante
                         tuestes                                                                 torrefacto
 algo        medio                 ensamblado                           mineral   potencial
                         cierto                                                                  granuloso
 limpio      tempranillo           seco                                 primer    sabroso
                         abierto              rojo
                                                                                                 gran         enérgico
 ligero      ligeramente           clásico                              moderno   sorprende
                         algún                típico        salino                               tiempo
             americano
 beber                   demasiado dominar expresión        fino        carnoso   tacto
 evolucionar capa        franco                             donde       amargo    complejo        todo
                                              compotado
 fácil                                                      mucho                 largo           noble
                                              suave
                                                            ser                                   cascajo
tradicional                                   Ribera
                                              cesta         bouquet
rústico                                                                                     coco
                                                            sílex
joven                                         toque                                         pólvora
                                                            intenso
roble                                                                                       voluptuoso
                                                            firme
lineal                                                                                      magnífico
                                                            vino
corto     amable                                            chocolate
    herbáceo
          consistencia


-1,9         -1,5         -1,1        -0,7           -0,3               0,1          0,5            0,9             1,3


        Mark81           82      83    84                   85          86         87            88 89                    90
                                             Average mark: 85.16
5) Applications: Open questions, sample surveys, texts
                                                                                                            Example 4: Comments about 522
                                                                                            Axis2
Variables suplementarias                                                                                     Spanish wines (continuation)
            Mesoneros de Castilla (03)
                                                                                                                                               Jaros Chafandín (01)
                                                                                                                                    Vega Sicilia 'Único' (94)

                                                                               4.5
                                                                                                                                 Viña Sastre Pesus(01)
                        Valdelosfrailes (03)
                                                                                             Gran Reserva                        Punta Esencia (01)
                       Fuentenarro (02)
                                                                                                                             Astrales (02)

                                                                               3.0                      20-24,9€
                                                       0-4,9€    5-9,9€
  Torondos (02) Valdecuadrón (02)                                                            15-19,9€         25-29,9€
                                       Tinto joven                        10-14,9€                                       50-99,9€
                Gayubar (02)
                                                                Tinto crianza
                                                                               1.5
                   Viñatorondos (03)                                                                                                            94
Valdetán (02)                       79                                                                                                                      100-300€
                78
                                                                                                                                               93
                                           81                                                                                       91                 97
                             80
             Viña Valdable (03)                            82      83                                        88    30-49,9€ 90            92
           - 3.0                               - 1.5                      84                                       89
                                                                                                                              1.5                            Axis1
Marqués de Olivara (98)                                                                85    86     87
            Rauda (01)
         El Marqués (02)                                                                                                                                 95

   Carramimbre (03) Valsotillo (01)                                            - 1.5
   Viña Eremos (03)
                                                                                                          San Román (01)
                                                       Tinto reserva                                     Numanthia (02)
  Marqués de Peñamonte (01)
                                                                                     Bienvenida Sitio de El Palo (01)                       Termanthia (02)
                                                                Tinto roble
                                                                                                                      Gran Elías Mora (00) Tares P3 (01)
                                                                                        Bienvenida Sitio de El Palo (02)
5) Applications: Open questions, sample surveys, texts

           Example 4: Comments about 522 Spanish wines (continuation)
---- Wine 212 (mark= 85) Legaris-2001
Tuestes, gominolas y buenos balsámicos marcan la intensidad media frutal de este
crianza. En boca aparece muy lineal, con consistencia media; el retrogusto frutal todavía
tapado por una madera algo rústica.

---- Wine 30 (mark=91) Tares P3-2001 premium
Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo
caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado,
tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones
de tierra húmeda y pólvora en el largo final.

---- Wine 314 (mark=97) Vega Sicilia 'Único-1994
Hay que realizar un ejercicio de disciplina gustativa de primer rango para describir este
gran vino. el bouquet es fresco, bien armado de fruta roja que se ve potenciada por tintes
de chocolates, tabacos, notas de sotobosque y una madera que se manifiesta pero que
resulta difícil de localizar y menos de concretar. Tenemos el caso raro de un tinto que
sale ileso del paso del tiempo sin lucir su armadura, que es la barrica. En boca joven,
aunque ya tiene su cuerpo vigoroso y enérgico bastante ensamblado, con la excepción de
algunos taninos saltamontes que quedan para domesticar. Largo y vibrante final que
mezcla madurez con una notable finura fresca.
Text Mining and Open-ended Questions
                    in Sample Surveys



1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general
7) Conclusions
6) About textual data in general



Processing Strategy




A priori Grouping (Lexical contingency table)

Juxtaposition of Lexical contingency tables

Instrumental Partition

Direct Analysis of the sparse Lexical table




                                                    79
6) About textual data in general


        Importance of Meta-data

                                Meta-data
                                linguistics

                                        Grammar / Syntax
  Textual data
                                       Semantics networks

                                        External Corpora
                                        externes
Other a priori structures
  sociolinguistics,
   chronology, etc.




                                                            80
6) About textual data in general



     The four phases of a linguistic analysis
                                                          (A bxg flower)
                               A big flower      A bug flower
Morphology
                                    A bag flower      A bog flower


Syntax            The spoon speaks                  (The speaks)



Semantics
                      A man thinks                 (A stone thinks)



Pragmatics                 A challenge to I.A.


                                                                   81
6) About textual data in general



      Homography,        Polysemy,          Synonymy

                                                         To bear


 Homographs:                BORE                       A tedious person


                                                          To bore


                                                              Task
Polysemous words:                 DUTY
                                                            Tax
      DRUG                   medicine


               Addicting product
                                                                          82
6) About textual data in general



           Semantic content of a lexical profile


      Distributional linguistics (Z. Harris)

                 X is sometimes purring
                 X mews
                 X has whiskers
                 X likes milk
                 X likes chasing mice




At the end, the point « X » will be superimposed with the point « CAT»


                                                                   83
6) About textual data in general




Semantic similarity is not a transitive relationship

Example of semantic chains:


(1) calm–wisdom–discretion–wariness–fear–panic,


(2) fact–feature –aspect–appearance–illusion .




                                                       84
6) About textual data in general



New additional variables
 Nouns (proper, common)
 Verbs (auxiliary, modal, usual…)
 Adjective
 Pronoun
 Determiner
 Adverb
 Preposition
 Conjunction


                                                  85
6) About textual data in general


new variables, new metrics




                                      86
Text Mining and Open-ended Questions
                        in Sample Surveys


                      Summary / Outline

1) Principles of Data Mining and Text mining: A reminder

2) Open-ended Questions: Why? How?

3) From texts to numerical data

4) Basic statistical tools: Visualization, Characteristic words.

5) Applications: Open questions, sample surveys, texts

6) About textual data in general

7) Conclusions
7) Conclusions


  As a conclusion...

For each open-ended question,

and for each partition of the sample of respondents,

we obtain, without any preliminary coding or other intervention:

• A visualization of proximities between words and categories.

• Characteristic elements or words for each category .

• Modal responses for each category (a kind of automatic summary).

[Remember also that the open question “Why” following a closed
question provides an indispensable assessment of the real
understanding of the question].
7) Conclusions


   As a conclusion... (continuation)

All these processing are carried out under the supervision of robust
assessment procedures:

- Non-parametric statistical tests,
- Bootstrap validation.

We are not dealing here with a novel sophisticated modelling.

It is rather a painstaking effort to stick to the real concerns of the
respondent, i.e.: the customer, the user, the client.

With the rapid development of online surveys, the spreading of e-mails
and blogs, the presented set of tools is expected to be a noteworthy
component in a new methodology of customer knowledge.
7) Conclusions – Short Bibliography



- Akuto H. (1992). International Comparison of Dietary Culture. Nihon Keizai Simbun, Tokyo.
- Bécue M., Lebart L. (1996). Clustering of texts using semantic graphs. Application to open-ended questions in surveys,
  Proceedings of the IFCS 96 Symposium, Kobe, Springer Verlag, Tokyo (in press).
- Bécue-Bertaut M., Pagès J., Alvarez-Esteban R., Vásquez Burguete J.L. (2006) Détermination d’une note globale, synthèse
  d’une évaluation numérique et d’appréciations libres. Application aux études de marché. (in French) Actes des JADT-2006.
- Bécue-Bertaut, M., Álvarez Esteban R., Pagès (2008,)
  http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2006/tocJADT2006.htm
  Rating of products through scores and free-text assertions. Comparing and combining both. Food Quality and Preference,
  19/1, 122-134.
- Belson W.A., Duncan J.A. (1962): A Comparison of the check-list and the open response questioning system, Applied
  Statistics, 2, 120-132.
- Benzécri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York.
- Biber D. (1995). Dimensions of register variation. Cambridge Univ. Press, Cambridge.
- Bradburn N., Sudman S., and associates (1979): Improving Interview Method and Questionnaire Design, Jossey Bass,
  San Francisco.
- Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis,
  J. of the Amer. Soc. for Information Science, 41 (6), 391-407.
- Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Armand colin, Paris.
- Hayashi C., Suzuki T., Sasaki M. (1992): Data Analysis for Social Comparative research: International Perspective,
  North-Holland, Amsterdam.
- Lebart L. (1982). Exploratory analysis of large sparse matrices, with application to textual data, COMPSTAT,
  Physica Verlag, 67-76.
- Lebart L., Salem A., Bécue M., (2000), Análisis estadístico de textos, Editorial Milenio, Lleida.
- Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer, Dordrecht.
- Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis. John Wiley. N.Y.
- Ritter H., Kohonen T. (1989). Self Organizing Semantic Maps. Biol. Cybern. 61, 241-254.
- Salem A. (1984). La typologie des segments répétés dans un corpus, fondée sur l'analyse d'un tableau croisant mots et textes,
  Cahiers de l'Analyse des Données, 489-500.
- Sasaki M., Suzuki T. (1989): New directions in the study of general social attitudes : trends and cross-national perspectives,
   Behaviormetrika, 26, 9-30.
- Schuman H., Presser F. (1981): Question and Answers in Attitude Surveys, Academic Press, New York.
- Sudman S., Bradburn N. (1974): Response Effects in Survey, Aldine, Chicago.
Surveys data and software (DtmVic)
    can be downloaded from
        www.dtm-vic.com
Thank You
Gracias
                Grazie
     Obrigado
                         Merci


                                 92

Weitere ähnliche Inhalte

Ähnlich wie 4 text mining and open ended questions in sample surveys ludovic lebart cnrs

Practical Issues in Social Research Methods
Practical Issues in Social Research MethodsPractical Issues in Social Research Methods
Practical Issues in Social Research Methodsjdubrow2000
 
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...JohannWanja
 
06 internet and_educational_research_palitha_edirisingha
06 internet and_educational_research_palitha_edirisingha06 internet and_educational_research_palitha_edirisingha
06 internet and_educational_research_palitha_edirisinghaPalitha Edirisingha
 
Question Types in Natural Language Processing
Question Types in Natural Language ProcessingQuestion Types in Natural Language Processing
Question Types in Natural Language ProcessingCraig Trim
 
Experiments on Pattern-based Ontology Design
Experiments on Pattern-based Ontology DesignExperiments on Pattern-based Ontology Design
Experiments on Pattern-based Ontology Designevabl444
 
Unit 4 identification of research problem
Unit 4 identification of research problemUnit 4 identification of research problem
Unit 4 identification of research problemAsima shahzadi
 
Research methodoligies in architecture
Research methodoligies in architectureResearch methodoligies in architecture
Research methodoligies in architectureSamanth kumar
 
Edirisingha ethics unisa2012_12_june2012
Edirisingha ethics unisa2012_12_june2012Edirisingha ethics unisa2012_12_june2012
Edirisingha ethics unisa2012_12_june2012Palitha Edirisingha
 
SCI 100 Question Development WorksheetJeimy JimenezAnswer .docx
SCI 100 Question Development WorksheetJeimy JimenezAnswer .docxSCI 100 Question Development WorksheetJeimy JimenezAnswer .docx
SCI 100 Question Development WorksheetJeimy JimenezAnswer .docxbagotjesusa
 
Literature review dr. munir 2018 sept
Literature review dr. munir 2018 septLiterature review dr. munir 2018 sept
Literature review dr. munir 2018 septDrMunirudheenA
 
NLP Workshop Presentation at Universitat de Barcelona
NLP Workshop Presentation at Universitat de BarcelonaNLP Workshop Presentation at Universitat de Barcelona
NLP Workshop Presentation at Universitat de BarcelonaSergiPons5
 
Introduction to Thesis
Introduction to ThesisIntroduction to Thesis
Introduction to ThesisUltraman Taro
 
How Do UK Students, Researchers and Academics use the Internet
How Do UK Students, Researchers and Academics use the InternetHow Do UK Students, Researchers and Academics use the Internet
How Do UK Students, Researchers and Academics use the InternetCaroline Williams
 
Dr.saleem gul assignment summary
Dr.saleem gul assignment summaryDr.saleem gul assignment summary
Dr.saleem gul assignment summaryJaved Riza
 
Research Process Steps
Research Process StepsResearch Process Steps
Research Process StepsShakeel Ahmed
 

Ähnlich wie 4 text mining and open ended questions in sample surveys ludovic lebart cnrs (20)

Practical Issues in Social Research Methods
Practical Issues in Social Research MethodsPractical Issues in Social Research Methods
Practical Issues in Social Research Methods
 
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling ...
 
06 internet and_educational_research_palitha_edirisingha
06 internet and_educational_research_palitha_edirisingha06 internet and_educational_research_palitha_edirisingha
06 internet and_educational_research_palitha_edirisingha
 
Question Types in Natural Language Processing
Question Types in Natural Language ProcessingQuestion Types in Natural Language Processing
Question Types in Natural Language Processing
 
Experiments on Pattern-based Ontology Design
Experiments on Pattern-based Ontology DesignExperiments on Pattern-based Ontology Design
Experiments on Pattern-based Ontology Design
 
Research
ResearchResearch
Research
 
Unit 4 identification of research problem
Unit 4 identification of research problemUnit 4 identification of research problem
Unit 4 identification of research problem
 
Research methodoligies in architecture
Research methodoligies in architectureResearch methodoligies in architecture
Research methodoligies in architecture
 
meta analysis
meta analysis meta analysis
meta analysis
 
Edirisingha ethics unisa2012_12_june2012
Edirisingha ethics unisa2012_12_june2012Edirisingha ethics unisa2012_12_june2012
Edirisingha ethics unisa2012_12_june2012
 
SCI 100 Question Development WorksheetJeimy JimenezAnswer .docx
SCI 100 Question Development WorksheetJeimy JimenezAnswer .docxSCI 100 Question Development WorksheetJeimy JimenezAnswer .docx
SCI 100 Question Development WorksheetJeimy JimenezAnswer .docx
 
Literature review dr. munir 2018 sept
Literature review dr. munir 2018 septLiterature review dr. munir 2018 sept
Literature review dr. munir 2018 sept
 
NLP Workshop Presentation at Universitat de Barcelona
NLP Workshop Presentation at Universitat de BarcelonaNLP Workshop Presentation at Universitat de Barcelona
NLP Workshop Presentation at Universitat de Barcelona
 
Introduction to Thesis
Introduction to ThesisIntroduction to Thesis
Introduction to Thesis
 
Research methods l1 3
Research methods l1 3Research methods l1 3
Research methods l1 3
 
How Do UK Students, Researchers and Academics use the Internet
How Do UK Students, Researchers and Academics use the InternetHow Do UK Students, Researchers and Academics use the Internet
How Do UK Students, Researchers and Academics use the Internet
 
2 purpose of business research, inductive & deductive approaches
2 purpose of business research, inductive & deductive approaches2 purpose of business research, inductive & deductive approaches
2 purpose of business research, inductive & deductive approaches
 
Dr.saleem gul assignment summary
Dr.saleem gul assignment summaryDr.saleem gul assignment summary
Dr.saleem gul assignment summary
 
Research Process Steps
Research Process StepsResearch Process Steps
Research Process Steps
 
IFI7159 M4
IFI7159 M4IFI7159 M4
IFI7159 M4
 

Mehr von Evelyn Femat

Habitos del internauta mexicano 2014 amipci
Habitos del internauta mexicano 2014 amipciHabitos del internauta mexicano 2014 amipci
Habitos del internauta mexicano 2014 amipciEvelyn Femat
 
20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...
20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...
20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...Evelyn Femat
 
19 una misma mandarina, nuevos gajos luis woldenberg nodo
19 una misma mandarina, nuevos gajos luis woldenberg nodo19 una misma mandarina, nuevos gajos luis woldenberg nodo
19 una misma mandarina, nuevos gajos luis woldenberg nodoEvelyn Femat
 
18 accelerating sales and profits herb sorensen tns
18 accelerating sales and profits herb sorensen tns18 accelerating sales and profits herb sorensen tns
18 accelerating sales and profits herb sorensen tnsEvelyn Femat
 
17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...
17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...
17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...Evelyn Femat
 
16 las mediciones neurobiológicas alicia martin del campo qualimerc
16 las mediciones neurobiológicas alicia martin del campo qualimerc16 las mediciones neurobiológicas alicia martin del campo qualimerc
16 las mediciones neurobiológicas alicia martin del campo qualimercEvelyn Femat
 
15 la investigación de mercado en un mundo de redes sociales diego meller livra
15 la investigación de mercado en un mundo de redes sociales diego meller livra15 la investigación de mercado en un mundo de redes sociales diego meller livra
15 la investigación de mercado en un mundo de redes sociales diego meller livraEvelyn Femat
 
14 la palabra clave para la investigación en proximos 10 años abraham geifman...
14 la palabra clave para la investigación en proximos 10 años abraham geifman...14 la palabra clave para la investigación en proximos 10 años abraham geifman...
14 la palabra clave para la investigación en proximos 10 años abraham geifman...Evelyn Femat
 
13 por qué la creatividad jose terrats phenoma
13 por qué la creatividad jose terrats phenoma13 por qué la creatividad jose terrats phenoma
13 por qué la creatividad jose terrats phenomaEvelyn Femat
 
12 la palabra clave en investigación adrzej rattinger merca20
12 la palabra clave en investigación adrzej rattinger merca2012 la palabra clave en investigación adrzej rattinger merca20
12 la palabra clave en investigación adrzej rattinger merca20Evelyn Femat
 
11 pecha kucha alici martin del campo qualimerc
11 pecha kucha alici martin del campo qualimerc11 pecha kucha alici martin del campo qualimerc
11 pecha kucha alici martin del campo qualimercEvelyn Femat
 
10 pecha kucha alejandro herrera tns ri
10 pecha kucha alejandro herrera tns ri10 pecha kucha alejandro herrera tns ri
10 pecha kucha alejandro herrera tns riEvelyn Femat
 
9 cual será la palabra im en 2020 oscar balcazar serta
9 cual será la palabra im en 2020 oscar balcazar serta9 cual será la palabra im en 2020 oscar balcazar serta
9 cual será la palabra im en 2020 oscar balcazar sertaEvelyn Femat
 
8 presentacion amai pecha kucha
8 presentacion amai pecha kucha8 presentacion amai pecha kucha
8 presentacion amai pecha kuchaEvelyn Femat
 
7 from asking to anticipating lindsay atkins millward brown
7 from asking to anticipating lindsay atkins millward brown7 from asking to anticipating lindsay atkins millward brown
7 from asking to anticipating lindsay atkins millward brownEvelyn Femat
 
6 la vida en linea daniela buenfil kitelab
6 la vida en linea daniela buenfil kitelab6 la vida en linea daniela buenfil kitelab
6 la vida en linea daniela buenfil kitelabEvelyn Femat
 
5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brain
5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brain5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brain
5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brainEvelyn Femat
 
4 appreciation leonardo lopez ibope
4 appreciation leonardo lopez ibope4 appreciation leonardo lopez ibope
4 appreciation leonardo lopez ibopeEvelyn Femat
 
3 targeting efectivo ma luisa villegas nielsen
3 targeting efectivo ma luisa villegas nielsen3 targeting efectivo ma luisa villegas nielsen
3 targeting efectivo ma luisa villegas nielsenEvelyn Femat
 
2 nuevas formas, nuevos métodos javier alagon estadística aplicada
2 nuevas formas, nuevos métodos javier alagon estadística aplicada2 nuevas formas, nuevos métodos javier alagon estadística aplicada
2 nuevas formas, nuevos métodos javier alagon estadística aplicadaEvelyn Femat
 

Mehr von Evelyn Femat (20)

Habitos del internauta mexicano 2014 amipci
Habitos del internauta mexicano 2014 amipciHabitos del internauta mexicano 2014 amipci
Habitos del internauta mexicano 2014 amipci
 
20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...
20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...
20 nuevas reglas para la investigacion sobre la actividad publicitaria jorge ...
 
19 una misma mandarina, nuevos gajos luis woldenberg nodo
19 una misma mandarina, nuevos gajos luis woldenberg nodo19 una misma mandarina, nuevos gajos luis woldenberg nodo
19 una misma mandarina, nuevos gajos luis woldenberg nodo
 
18 accelerating sales and profits herb sorensen tns
18 accelerating sales and profits herb sorensen tns18 accelerating sales and profits herb sorensen tns
18 accelerating sales and profits herb sorensen tns
 
17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...
17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...
17 nuevas reglas en el juego de la investigación a. lopez y e. gonzalez millw...
 
16 las mediciones neurobiológicas alicia martin del campo qualimerc
16 las mediciones neurobiológicas alicia martin del campo qualimerc16 las mediciones neurobiológicas alicia martin del campo qualimerc
16 las mediciones neurobiológicas alicia martin del campo qualimerc
 
15 la investigación de mercado en un mundo de redes sociales diego meller livra
15 la investigación de mercado en un mundo de redes sociales diego meller livra15 la investigación de mercado en un mundo de redes sociales diego meller livra
15 la investigación de mercado en un mundo de redes sociales diego meller livra
 
14 la palabra clave para la investigación en proximos 10 años abraham geifman...
14 la palabra clave para la investigación en proximos 10 años abraham geifman...14 la palabra clave para la investigación en proximos 10 años abraham geifman...
14 la palabra clave para la investigación en proximos 10 años abraham geifman...
 
13 por qué la creatividad jose terrats phenoma
13 por qué la creatividad jose terrats phenoma13 por qué la creatividad jose terrats phenoma
13 por qué la creatividad jose terrats phenoma
 
12 la palabra clave en investigación adrzej rattinger merca20
12 la palabra clave en investigación adrzej rattinger merca2012 la palabra clave en investigación adrzej rattinger merca20
12 la palabra clave en investigación adrzej rattinger merca20
 
11 pecha kucha alici martin del campo qualimerc
11 pecha kucha alici martin del campo qualimerc11 pecha kucha alici martin del campo qualimerc
11 pecha kucha alici martin del campo qualimerc
 
10 pecha kucha alejandro herrera tns ri
10 pecha kucha alejandro herrera tns ri10 pecha kucha alejandro herrera tns ri
10 pecha kucha alejandro herrera tns ri
 
9 cual será la palabra im en 2020 oscar balcazar serta
9 cual será la palabra im en 2020 oscar balcazar serta9 cual será la palabra im en 2020 oscar balcazar serta
9 cual será la palabra im en 2020 oscar balcazar serta
 
8 presentacion amai pecha kucha
8 presentacion amai pecha kucha8 presentacion amai pecha kucha
8 presentacion amai pecha kucha
 
7 from asking to anticipating lindsay atkins millward brown
7 from asking to anticipating lindsay atkins millward brown7 from asking to anticipating lindsay atkins millward brown
7 from asking to anticipating lindsay atkins millward brown
 
6 la vida en linea daniela buenfil kitelab
6 la vida en linea daniela buenfil kitelab6 la vida en linea daniela buenfil kitelab
6 la vida en linea daniela buenfil kitelab
 
5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brain
5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brain5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brain
5 tendencias mundiales su impacto en el mexicano l. ruvalcaba y c. galindo brain
 
4 appreciation leonardo lopez ibope
4 appreciation leonardo lopez ibope4 appreciation leonardo lopez ibope
4 appreciation leonardo lopez ibope
 
3 targeting efectivo ma luisa villegas nielsen
3 targeting efectivo ma luisa villegas nielsen3 targeting efectivo ma luisa villegas nielsen
3 targeting efectivo ma luisa villegas nielsen
 
2 nuevas formas, nuevos métodos javier alagon estadística aplicada
2 nuevas formas, nuevos métodos javier alagon estadística aplicada2 nuevas formas, nuevos métodos javier alagon estadística aplicada
2 nuevas formas, nuevos métodos javier alagon estadística aplicada
 

Kürzlich hochgeladen

Call Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any TimeCall Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any Timedelhimodelshub1
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessSeta Wicaksana
 
Future Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted VersionFuture Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted VersionMintel Group
 
2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis UsageNeil Kimberley
 
Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Seta Wicaksana
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMintel Group
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Pereraictsugar
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...lizamodels9
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy Verified Accounts
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...lizamodels9
 
Innovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfInnovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfrichard876048
 
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCRashishs7044
 
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...lizamodels9
 
Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Kirill Klimov
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCRashishs7044
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailAriel592675
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaoncallgirls2057
 
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCRashishs7044
 
Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737Riya Pathan
 
The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024christinemoorman
 

Kürzlich hochgeladen (20)

Call Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any TimeCall Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any Time
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful Business
 
Future Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted VersionFuture Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted Version
 
2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage
 
Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 Edition
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Perera
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail Accounts
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
 
Innovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfInnovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdf
 
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
 
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...
 
Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detail
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
 
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
 
Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737
 
The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024The CMO Survey - Highlights and Insights Report - Spring 2024
The CMO Survey - Highlights and Insights Report - Spring 2024
 

4 text mining and open ended questions in sample surveys ludovic lebart cnrs

  • 1. AMAI - 2009 - September 8th, 2009 (Mexico D.F.) Text Mining and Open-ended Questions in Sample Surveys Ludovic Lebart Centre National de la Recherche Scientifique Telecom-ParisTech, Paris, France 1
  • 2. Text Mining and Open-ended Questions in Sample Surveys Summary / Outline 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions
  • 3. Text Mining and Open-ended Questions in Sample Surveys 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions
  • 4. 1- Principles of Data Mining and Text mining: A reminder The « data Mining » approach... ✔ Ancient techniques are easier to use, ✔ Ancient techniques are improved ✔ New techniques are conceived ✔ New fields of application ✔ New products:  Softwares  ✔ Need for a selection of methods, of simple and clear strategy for data processing
  • 5. 1- Principles of Data Mining and Text mining: A reminder Survey data processing and data mining Reminder : (Fayadd et al.) Data Mining (KDD) is the non-trivial process of identifying patterns in huge data sets, these patterns being supposed to be valid, novel, potentially useful, and ultimately… understandable These huge data sets could be unstructured, non representative. The main goal being to automatically extract from the ore (raw data) the genuine diamond of truth…. (Benzécri 1973)
  • 6. 1- Principles of Data Mining and Text mining: A reminder Survey data : Homogeneity of content, of coding… … different from the usual inputs of Data Mining programs. Despite the fact that we may deal with several observational levels (households, individuals, trajectories or biographical data, areas or regions…), there is a consistency and a unity of content in a survey data set - together with general hypotheses formulated beforehand - that are not present in the usual data mining input data. A survey (whatever its complexity) is a costly set of measurements that follows a specific decision. In this context, a lot of meta-information is generally available (Demographic, economic, sociologic, epidemiologic, etc) that provides a framework for the interpretation phase.
  • 7. 1- Principles of Data Mining and Text mining: A reminder “Text Mining” and Multivariate exploratory statistical analysis of texts ✔ Initial paradigm: ✔ ✔   - Extracting statistical units from texts  ✔ ✔   - Complementing lexicometry with a multivariate approach  ✔ ✔   - Applying visualization tools to lexical tables  ✔ ✔ Evolution and diversification of techniques and approaches 7
  • 8. 1- Principles of Data Mining and Text mining: A reminder The fields of Text Mining WEB Press Scientific papers, abstracts Information Retrieval Open-ended questions, free responses Qualitative interviews, Discourses, Reports Complaints 8
  • 9. Text Mining and Open-ended Questions in Sample Surveys 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions
  • 10. 2- Open-ended Questions: Why? How? Open questions : Why? x To shorten interview time: Open ended questions are less costly in terms of interview time, and generate less fatigue and tension (voluminous lists of items) x To gather spontaneous information: Marketing survey questions contain many questions of this type. " What do you recall (or: what do you like) about this ad?  x To probe the response to a closed-end question.: This is the follow up additional question "Why?". Explanations concerning a response already given have to be provided in a spontaneous fashion. x To get information relating to non-comparable variables: Example : Environmental activism, dietary habits…. 10
  • 11. 2- Open-ended Questions: Why? How? Open questions : Why ? Cost DRAWBACKS Complexity Specificity ADVANTAGES Speed Freedom Specificity 11
  • 12. 2- Open-ended Questions: Why? How? Comparison between open and closed questions A classical experiment, quoted by Schuman and Presser (1981), stresses the difficulty of comparing the two types of questionning. When asked "what is the most important problem facing this country [USA] at present", 16% of Americans mention crime and violence (grouped free responses), whereas the same item asked in a closed question produces 35% of the same response. The explanation given by authors is the following: lack of security is often considered as a local, not a national problem, so that the item crime and violence is not mentioned spontaneously very often. Closing the question indicates that this response is a relevant or possible response, resulting in a higher response percentage.
  • 13. 2- Open-ended Questions: Why? How? Heuristic value of open-ended questions In some particular contexts, the absence of a response item can play a positive role. It can establish a climate of confidence and communication, and lead to better results when certain subjects are brought up. This is what is indicated by the work of Sudman and Bradburn (1974) concerning questions having to do with "threats", and of Bradburn et al. (1979) concerning questions about alcohol and sexuality. In international studies, it is important to know whether people interviewed in different countries understand the closed questions in the same way. (case of the follow up :”Why” ). As a matter of fact, it is also legitimate to raise this same issue with respect to regional and generational differences.
  • 14. 2- Open-ended Questions: Why? How? Heuristic value of open-ended questions (continuation) In some other particular contexts, the cultural gap between those who have conceived the questionnaire and the interviewees is hidden by the purely numerical coding of the closed questions. In a national survey about the attitudes of economically impaired people towards the minimum wage system in France, a classical open question was asked at the end of the interview: “Would you like to add something about some topics that could be missing in this questionnaire, about the minimum wage system ?” One answer (among many others of the same vein) was “ We eat potatoes and eggs, despite my diabetes and my cholesterol, because there are cheap.” Another: “Thank you for coming. It proves that you are thinking of me”. Some respondents are far from the problematic “Attitude towards an institution”
  • 15. 2- Open-ended Questions: Why? How? Empirical Post-Coding of free responses (Drawbacks of this type of processing) Coder bias: Coder bias is added to interviewer bias, since the coder makes decisions and formulates interpretations, introducing a «personal touch ». Alteration of form: Information is destroyed in its form and often weakened in its content: quality of expression, level of vocabulary, and general interview tonality are lost. Weakening of content: (case of responses that are composed, complex, vague and diversified). Infrequent responses are eliminated a priori. 15
  • 16. 2- Open-ended Questions: Why? How? Example 1: Open Question « Life » (international sample surveys) The following open-ended question was asked : "What is the single most important thing in life for you?" It was followed by the probe: "What other things are very important to you?". This question was included in a multinational survey conducted in seven countries (Japan, France, Germany, Italy, Nederland, United Kingdom, USA) in the late nineteen eighties (Hayashi et al., 1992). Our illustrative example is limited to the British sample (Sample size: 1043). 16
  • 17. 2- Open-ended Questions: Why? How? Example 1: Open Question « Life »: Examples of responses GenderEduc. Age Responses 1 1 4 happiness in people around me, contented family, would make me happy 1 2 2 my own time, not dictated by other people 1 2 2 freedom of choice as to what I do in my leisure time 1 3 2 I suppose work 1 2 1 firm, my work, which is my dad's firm 2 1 6 just the memory of my last husband 2 2 6 well-being of my handicapped son 1 1 5 my wife, she gave me courage to carry on even in the bad times 2 2 3 my sons, my kids are very important to me, being on my own, I am responsible for their education 1 3 3 job, being a teacher I love my job, for the well-being of the children 17
  • 18. 2- Open-ended Questions: Why? How? Example 2: Open Questions / Copy-Test Following a viewing of a television commercial on breakfast cereals (copy-test), several open questions were asked. One of them, which we shall use as our example, is : What was the main idea of this commercial? In addition a number of closed questions were also asked (socio-demographic characteristics of respondents, purchase intent toward product seen). Purchase intent being an important issue, this question plays a major role in the discussions that follow. Two examples of responses to that open question. 1 - That it has complex carbohydrates in it, it has energy releaser and it tastes good... It showed people eating grape nuts. 2 - It gives you energy in the morning, nothing else.
  • 19. 2- Open-ended Questions: Why? How? Example 3: An international survey (Tokyo Gas Company) A survey in three cities (Tokyo, New York, Paris) about dietary habits. The common open-ended questions were: "What dishes do you like and eat often? (With a probe: "Any other dishes you like and eat often?"). “ What would be an ideal meal?” Akuto H.(Ed.) (1992). International Comparison of Dietary Cultures, Nihon Keizai Shimbun, Tokyo. Akuto H., Lebart L. (1992). Le Repas Idéal. Analyse de Réponses Libres en Anglais, Français, Japonais. Les Cahiers de l'Analyse des Données, vol XVII, n°3, Dunod, Paris
  • 20. 2- Open-ended Questions: Why? How? Example 3: An international survey (continuation) Four responses (New York) "What dishes do you like and eat often? “What would be an ideal meal?” ---- 1 SPAGHETTI,CHINESE ++++ CAESAR SALAD,LOBSTER TAILS,BAKED POTATO, CHOCOLATE MOUSSE ---- 2 SEAFOOD,GREEN SALAD,CHINESE FOOD ++++ CHAMPAGNE,CAVIAR,GREEN SALAD,GRILLED SEAFOOD ---- 3 CHINESE FOOD ++++ CHINESE FOOD,FRENCH FOOD,VEAL,BREAD ---- 4 PASTA ++++ BEARNAISE BEEF,CHINESE FOOD,ITALIAN FOOD,PASTA
  • 21. 2- Open-ended Questions: Why? How? Example 4: Evaluación de vinos mediante notas y comentarios Guia de Catas de Castilla y León (2005) 522 vinos de Castilla y León pertenecientes a 207 bodegas 5 denominaciones: Bierzo, Cigales, Ribera del Duero, Rueda, Toro
  • 22. 2- Open-ended Questions: Why? How? Example 4: Evaluación de vinos mediante notas y comentarios Example of two texts ---- Nota= 80 Valdelosfriales-2003 Joven típico, con notas de tempranillo y balsámicos; en boca amable y frutoso. ---- Nota=91 Tares P3-2001 premium Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado, tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones de tierra húmeda y pólvora en el largo final.
  • 23. Text Mining and Open-ended Questions in Sample Surveys 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions
  • 24. 3- From texts to numerical data  Statistical units derived from texts  CORPUS Texts Sentences or responses Segments or quasi-segments Words, lemmas, n-grams Characters s 24
  • 25. 3- From texts to numerical data Ambiguity of frequencies: statistical frequency versus « linguistic frequency  » Sample surveys (statistical frequency) Closed questions Open questions ouvertes Texts ( linguistic frequency) 25
  • 26. 3- From texts to numerical data Example 1: Question « Life » - continuation The counts for the first phase of numeric coding are as follows: Out of 1043 responses, there are 13 669 occurrences, with 1 413 distinct words. When the words appearing at least 16 times are selected, there remain 10 357 occurrences of these words, with 135 distinct words (types). The same questionnaire also had a number of closed-end questions (among them, the socio-demographic characteristics of the respondents, which play a major role). In this example we focus on a partitioning of the sample into nine categories, obtained by cross-tabulating age (three categories) with educational level (three categories). 26
  • 27. 3- From texts to numerical data Example 1: Selected statistical units Words Appearing at Least Sixteen Times (Alphabetic Order) in the 1043 responses to the open question Word Frequency Word Frequency Word Frequency I 248 go 19 of 312 I'm 22 going 26 on 59 a 298 good 303 other 33 able 55 grandchildren 30 others 17 about 31 happiness 227 our 29 after 26 happy 137 out 34 all 86 have 99 own 16 and 504 having 70 peace 77 anything 19 health 609 people 61 are 65 healthy 45 really 28 27
  • 28. 3- From texts to numerical data Example of morpho-syntactic information Gender Educ. Age Tagged responses 1 1 4 happiness/NN in/IN people/NNS around/IN me/PRP ,/, contented/VBN family/NN ,/, would/MD make/VB me/PRP happy/JJ 1 2 2 my/PRP$ own/JJ time/NN ,/, not/RB dictated/VBN by/IN other/JJ people/NNS 1 2 2 freedom/NN of/IN choice/NN as/IN to/TO what/WP I/PRP do/VB in/IN my/PRP$ leisure/NN time/NN 1 3 2 I/PRP suppose/VBP work/NN 1 2 1 firm/NN ,/, my/PRP$ work/NN ,/, which/WDT is/VBZ my/PRP$ dad's/NNS firm/NN 2 1 6 just/RB the/DT memory/NN of/IN my/PRP$ last/JJ husband/NN 2 2 6 wellbeing/NN of/IN my/PRP$ handicapped/JJ son/NN 1 1 5 my/PRP$ wife/NN ,/, she/PRP gave/VBD me/PRP courage/NN to/TO carry/VB on/IN even/RB in/IN the/DT bad/JJ times/NNS 28
  • 29. 3- From texts to numerical data Example of a lexical contingency table - First partition: three age categories - less than 30 years [noted -30], - between 30 years and 55 years [-55] - over 55 years [+ 55] . - Second partition: three educational levels - No degree or Low [noted L], - Medium [M], - High level [H] 29
  • 30. 3- From texts to numerical data Example 1: A lexical contingency table Partial listing of lexical table cross-tabulating 135 words of frequency greater than or equal to 16 with 9 age-education categories L-30 L-55 L+55 M-30 M-55 M+55 H-30 H-55 H+55 I 2 46 92 30 25 19 11 21 2 I'm 2 5 9 3 2 1 0 0 0 a 10 56 66 54 44 19 20 22 7 able 1 9 16 9 7 4 4 5 0 about 0 3 13 7 1 2 4 1 0 after 1 8 11 3 1 2 0 0 0 all 1 24 19 8 18 6 3 5 2 and 8 89 148 86 73 30 25 32 13 anything 0 4 9 1 3 0 1 1 0 30
  • 31. 3- From texts to numerical data Example 2: "What is the main idea in this commercial" Words appearing more than 9 times (100 responses) Number Word Frequency Number Word Frequency 1 I 14 25 in 27 2 a 59 26 is 37 3 about 15 27 it 133 4 all 21 28 it's 28 5 and 42 29 long 14 6 are 25 30 morning 9 7 been 12 37 nothing 25 8 carbohydrate 14 32 nutritional 9 9 carbohydrates33 33 nutritious 12 10 cereal 34 34 nuts 25 11 complex 25 35 of 25 12 crunchy 9 36 people 28 13 eaten 10 37 showed 11 14 eating 19 38 taste 11 15 energy 33 39 that 80 16 for 57 40 that's 13 17 give 9 41 he 82 18 gives 11 42 they 50 19 good 52 43 to 32 20 grape 25 44 was 19 21 has 30 45 with 11 22 have 27 46 years 11 23 healthy 23 47 you 81 24 how 9
  • 32. 3- From texts to numerical data Example 2: "What is the main idea in this commercial" Examples of repeated segments SEGM FREQ LENGTH "TEXT of SEGMENT" ------------------------------------- -----------------------------------------a 1 8 3 a long time -----------------------------------------are 2 6 4 are good for you -----------------------------------------carbohydrates 3 5 3 carbohydrates in it -----------------------------------------complex 4 15 2 complex carbohydrates -----------------------------------------for 5 37 2 for you -----------------------------------------give 6 7 3 give you energy -----------------------------------------gives 7 11 2 gives you 8 9 3 gives you energy -----------------------------------------good 9 24 2 good for 10 22 3 good for you -----------------------------------------grape 11 25 2 grape nuts -----------------------------------------have 12 6 3 have been eating -----------------------------------------healthy 13 6 3 healthy for you -----------------------------------------is 14 9 4 is good for you -----------------------------------------it 15 26 2 it has 16 19 2 it is 17 14 2 it was 18 8 3 it gives you 19 8 3 it has a 20 6 3 it has complex 21 5 3 it is good 22 6 4 it gives you energy -----------------------------------------people
  • 33. 3- From texts to numerical data Example 3: An international survey (Tokyo Gas Company) The common open-ended question : "What dishes do you like and eat often? (With a probe: "Any other dishes you like and eat often?"). - Sub-sample 1 (city of Tokyo) : 1008 individuals. The global corpus of open responses contains 6219 occurrences of 832 distinct words. 139 words appear at least 7 times, leading to 4975 occurrences. - Sub-sample 2 (city of New-York) contains 634 individuals. (6511 occurrences of 638 distinct words). The processing takes into account the 83 words appearing at least 12 times. - Sub-sample 3 (city of Paris) contains 1000 individuals. The global corpus contains 11108 occurrences of 1229 distinct words. The processing takes into account the 112 words appearing at least 18 times, leading to 7806 occurrences. - The three sets of respondents are broken down into into six categories (three categories of age, combined with the gender).
  • 34. 3- From texts to numerical data Example 3: An international survey (Tokyo Gas Company) !------------------------------------! ! words (frequency order) ! !-------!---------------------!------! ! num. ! used words ! freq.! !-------!---------------------!------! ! 12 ! CHICKEN ! 254 ! ! 73 ! STEAK ! 101 ! ! ! 49 ! PASTA 22 ! FISH ! ! 95 ! 87 ! City of New York ! 60 ! SALAD ! 85 ! ! 1 ! AND ! 85 ! ! ! 23 ! FOOD 52 ! PIZZA ! ! 82 ! 62 ! The common open-ended question : ! 79 ! VEGETABLES ! 57 ! "What dishes do you like and eat ! 4 ! BEEF ! 56 ! ! 71 ! SPAGHETTI ! 55 ! often? ! ! 13 ! CHINESE 80 ! WITH ! ! 54 ! 48 ! (With a probe: "Any other dishes you ! 59 ! ROAST ! 47 ! like and eat often?"). ! 58 ! RICE ! 45 ! ! 67 ! SHRIMP ! 45 ! ! 43 ! MACARONI ! 42 ! ! 56 ! POTATOES ! 39 ! ! 35 ! HAMBURGERS ! 36 ! ! 75 ! TUNA ! 35 ! ! 26 ! FRIED ! 33 ! ! 77 ! VEAL ! 33 ! ! 38 ! ITALIAN ! 31 ! ! 2 ! BAKED ! 29 ! ! 48 ! PARMESAN ! 29 ! ! 55 ! POTATO ! 27 ! ! 46 ! MEATBALLS ! 25 ! ! 3 ! BEANS ! 24 ! ! 45 ! MEAT ! 24 ! ! 76 ! TURKEY ! 24 ! ! 14 ! CHOPS ! 23 ! ! 34 ! HAMBURGER ! 22 ! !------------------------------------!
  • 35. 3- From texts to numerical data Example 4: Evaluación de vinos mediante notas y comentarios Pos. Palabra Frec 1 de 891 Lematización: 2 y 806 -Singulares y plurales 3 en 694 - Masculino y femenino - Formas verbales a infinitivo 4 boca 433 -… 5 con 356 Eliminados artículos, 6 fruta 334 preposiciones … 7 un 308 8 nariz 246 Conservadas palabras utilizadas al menos 8 veces. 9 muy 237 10 la 211 Quedan 10 notas 211 -250 palabras 12 que 168 -443 vinos 13 taninos 167 P1 P2 ... P250 14 el 159 Vino 1 0 1 ... 2 15 una 152 Vino 2 1 0 ... 1 16 madera 140 Vino 3 0 0 ... 1 17 bien 116 . . . . . . . . . . . 18 toque 108 Vino 443 1 2 ... 0
  • 36. 3- From texts to numerical data Example 4: Evaluación de vinos mediante notas y comentarios (Continuation) boca (mouth) fruta (fruit) muy (very) nariz (nose) nota (note) taninos (tannins) madera (wood) buen (good) rojo (red) negro (black) toque (hint) bien (well) final (end) maduro (ripened) balsámico (balsamic) vino (wine) todavía (still) elegante (elegant) agradable (pleasant) jugoso (juicy) medio (medium) fino (fine) algo (some/something) cereza (cherry) ser (to be) ligero (light) suave (mild) potente (powerful) acidez (acidity) 0 50 100 150 200 250 300 350 400 Frequency
  • 37. Text Mining and Open-ended Questions in Sample Surveys 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions
  • 38. 4) Basic statistical tools: Visualization, Characteristic words. ✔ Applying visualization tools to lexical tables ✔ ● Principal axes analyses of lexical tables ● Classification (clustering) of words and texts ✔ ✔ Selecting characteristic units and responses ✔ (or: sentences) ✔ ● Characteristic units (words, segments, lemmas) ● Selecting « Modal responses » 38
  • 39. 4) Basic statistical tools: Visualization, Characteristic words. Briefly, one can summarize the principles of methods for performing these data reductions: Principal axes methods, largely based upon linear algebra, produce graphical representations on which the geometric proximities among row- points and among column-points translate statistical associations among rows and among columns. Correspondence analysis belongs to this family of methods. Clustering or classification methods that create groupings of rows or of columns into clusters (or into families of hierarchical clusters) including the SOM (Self Organizing Maps, or Kohonen maps). These two families of methods can be used on the same data matrix and they complement one another very effectively. Selection of characteristic units and responses (or: sentences) Characteristic units (words, segments, lemmas)Selecting « Modal responses »
  • 40. 4) Basic statistical tools: Visualization, Characteristic words. General factor for individual i x = aj f + ε i j i j i Value of variable j for individual i Residual (hopefully small) Coefficient of variable j Known = Unknown Visualization through principal coordinates, « a breakthrough in 1904 ». Charles Spearman (1904) – “General intelligence, objectively determined and measured”. Amer. Journal of Psychology, 15, p 201-293. 40
  • 41. 4) Basic statistical tools: Visualization, Characteristic words. Garnett J.-C. (1919) - General ability, cleverness and purpose. British J. of Psych., 9, p 345-366. Thurstone L. L. (1947) - Multiple Factor Analysis. The University of Chicago Press, Chicago. x = a j f + b j g + ... + ε i j i i j i 41
  • 42. 4) Basic statistical tools: Visualization, Characteristic words. Singular Values Decomposition is a theorem, not a model Eckart C., Young G. (1936) - The approximation of one matrix by another of lower rank. Psychometrika, l, p 211-218. Eckart C., Young G. (1939) - A principal axis transformation for non- hermitian matrices. Bull. Amer. Math. Assoc., 45, p 118-121. = λ1 × + ... + λ α × + ... + λ p × X v1 u'1 vα u'α vp u'p A precursor: Pearson K. (1901) - On lines and planes of closest fit to systems of points in space. Phil. Mag. 2, n°ll, p 559-572. 42
  • 43. 4) Basic statistical tools: Visualization, Characteristic words. Image “Cheetah” (Data Compression, Mark Nelson) and table (200 x 320) containing levels of grey. 95 88 88 87 95 88 95 95 95 106 95 78 65 71 78 77 77 etc. 143 144 151 151 153 170 183 181 162 140 116 128 133 144 159 166 170 153 151 162 166 162 151 126 117 128 143 147 175 181 170 166 132 116 143 144 133 130 143 153 159 175 192 201 188 162 135 116 101 106 118 123 112 116 130 143 147 162 183 166 135 123 120 116 116 129 140 159 133 151 162 166 170 188 166 128 116 132 140 126 143 151 144 155 176 160 168 166 159 135 101 93 98 120 128 126 147 154 158 176 181 181 154 155 153 144 126 106 118 133 136 153 159 153 162 162 154 143 128 159 153 147 159 150 154 155 153 158 170 159 147 130 136 140 150 150 151 144 147 176 188 170 166 183 170 166 153 130 132 154 162 120 135 155 181 183 162 144 147 147 144 126 120 123 129 130 112 101 135 150 166 147 129 123 133 144 133 117 109 118 132 112 109 120 136 120 136 136 130 136 147 147 140 136 144 140 132 129 151 153 140 128 153 147 130 133 140 124 136 152 166 147 144 151 159 140 123 130 123 109 112 126 120 143 145 162 153 155 175 154 144 136 130 120 112 123 123 144 144 159 155 155 162 166 158 147 140 147 126 123 132 135 136 144 147 136 143 162 175 136 110 112 135 120 118 126 151 150 130 129 133 147 133 151 143 106 85 93 128 136 140 140 144 143 126 117 116 129 124 ……………………………..etc. 43
  • 44. 4) Basic statistical tools: Visualization, Characteristic words. Eigenvalues of the Correspondence Analysis of the previous table Trace before diagonalization: 0.15930 Trace after diagonalization: 0.15930 eigenvalues 1: 0.045 28.549 28.549 ************************************************** 2: 0.028 17.695 46.243 ****************************** 3: 0.019 12.205 58.448 ********************* 4: 0.012 7.306 65.754 ************ 5: 0.007 4.674 70.428 ******** 6: 0.006 3.516 73.944 ****** 7: 0.005 2.944 76.888 ***** 8: 0.003 2.179 79.067 *** 9: 0.003 1.869 80.936 *** 10: 0.002 1.531 82.467 ** 11: 0.002 1.371 83.838 ** 12: 0.002 1.106 84.944 * 13: 0.002 1.066 86.010 * 14: 0.002 0.956 86.965 * 15: 0.001 0.791 87.756 * 16: 0.001 0.758 88.514 * 17: 0.001 0.690 89.204 * 18: 0.001 0.567 89.771 19: 0.001 0.554 90.325 20: 0.001 0.477 90.801 21: 0.001 0.422 91.223 22: 0.001 0.406 91.629 23: 0.001 0.384 92.013 24: 0.001 0.339 92.352 44
  • 45. 4) Basic statistical tools: Visualization, Characteristic words. Reconstitution of the Cheetah with 2, 4, 6, 8, 10, 12, 20, 30, 40 principal axes 45
  • 46. 4) Basic statistical tools: Visualization, Characteristic words. A pedagogical example: Description of « Textual Graphs » 46
  • 47. 4) Basic statistical tools: Visualization, Characteristic words. Each area “answers” to the fictitious “open-question” : Which are your neighbouring areas? **** Ain Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie **** Aisne Aisne Ardennes Marne Nord Oise Seine_Marne Somme **** Allier Allier Cher Creuse Loire Nievre Puy_de_Dome Hte_Saone **** Alpes_Prov Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse **** Alpes_Hautes Alpes_Hautes Alpes_Prov Drome Isere Savoie **** Alpes_Marit Alpes_Marit Alpes_Prov Var **** Ardeche Ardeche Drome Gard Loire Hte_Loire Lozere **** Ardennes Ardennes Aisne Marne Meuse ………………………. 47
  • 48. 4) Basic statistical tools: Visualization, Characteristic words. The idea: When a pattern exists within a text, some techniques may This map is blindly detect it and exhibit it. produced from the previous texts. 48
  • 49. 4) Basic statistical tools: Visualization, Characteristic words. Example: CORRESPONDENCE ANALYSIS of a simple lexical table Correspondence analysis can be presented in several different ways. • It is difficult to trace the method's history accurately (see, e.g., Hill, 1974 ; Benzecri, 1976 ; Nishisato, 1980; Gifi, 1990). •The underlying theory probably dates back to... •Fisher (1936) , Guttman (1941), and Hayashi (1956). 49
  • 50. 4) Basic statistical tools: Visualization, Characteristic words. • Correspondence analysis and principal components analysis are used under different circumstances: • Principal components analysis is used for tables consisting of continuous measurements. • Correspondence analysis is best applied to contingency tables (cross-tabulations) frequently encountered when analyzing textual data. • By extension, it also provides a satisfactory description of data tables with binary coding. 50
  • 51. 4) Basic statistical tools: Visualization, Characteristic words. • Cross-tabulations or contingency tables are among the most common data structures used for analyzing qualitative data. • By looking simultaneously at two partitions at a time of a population or sample, a cross-tabulation enables us to work with variations in the data by response categories, a necessary step for the interpretation of results. 51
  • 52. 4) Basic statistical tools: Visualization, Characteristic words. • We will use as a leading example a small contingency table. • However, this kind of exploratory method is chiefly useful when we are dealing with very large data tables • (pedagogical paradox) • In the following table, the 14 rows are words used in responses to an open-ended question given by 2000 respondents. •The 5 columns are the educational levels of the respondents. 52
  • 53. 4) Basic statistical tools: Visualization, Characteristic words. A contingency table crossing words and education level No Elem.Trade High Coll- Total Words degr. Sch. Sch. Sch. ege Money 51 64 32 29 17 193 Future 53 90 78 75 22 318 Unemployment 71 111 50 40 11 283 Decision 1 7 5 5 4 22 Difficult 7 11 4 3 2 27 Economic 7 13 12 11 11 54 Selfishness 21 37 14 26 9 107 Occupation 12 35 19 6 7 79 Finances 10 7 7 3 1 28 War 4 7 7 6 2 26 Housing 8 22 7 10 5 52 Fear 25 45 38 38 13 159 Health 18 27 20 19 9 93 Work 35 61 29 14 12 151 Total 323 537 322 285 125 1592 53
  • 54. 4) Basic statistical tools: Visualization, Characteristic words. Row-profiles of the same table No Elem. Trade High Coll Total Words degree Sch. Sch. Sch. ege Money 26.4 33.2 16.6 15.0 8.8 100.0 Future 16.7 28.3 24.5 23.6 6.9 100.0 Unemployment 25.1 39.2 17.7 14.1 3.9 100.0 Decision 4.5 31.8 22.7 22.7 18.2 100.0 Difficult 25.9 40.7 14.8 11.1 7.4 100.0 Economic 13.0 24.1 22.2 20.4 20.4 100.0 Selfishness 19.6 34.6 13.1 24.3 8.4 100.0 Occupation 15.2 44.3 24.1 7.6 8.9 100.0 Finances 35.7 25.0 25.0 10.7 3.6 100.0 War 15.4 26.9 26.9 23.1 7.7 100.0 Housing 15.4 42.3 13.5 19.2 9.6 100.0 Fear 15.7 28.3 23.9 23.9 8.2 100.0 Health 19.4 29.0 21.5 20.4 9.7 100.0 Work 23.2 40.4 19.2 9.3 7.9 100.0 Total 20.3 33.7 20.2 17.9 7.9 100.0 54
  • 55. 4) Basic statistical tools: Visualization, Characteristic words. Column-profiles of the same table No Elem. Trade High Coll- Total Words degree Sch. Sch. Sch. ege Money 15.8 11.9 9.9 10.2 13.6 12.1 Future 16.4 16.8 24.2 26.3 17.6 20.0 Unemployment 22.0 20.7 15.5 14.0 8.8 17.8 Decision .3 1.3 1.6 1.8 3.2 1.4 Difficult 2.2 2.0 1.2 1.1 1.6 1.7 Economic 2.2 2.4 3.7 3.9 8.8 3.4 Selfishness 6.5 6.9 4.3 9.1 7.2 6.7 Occupation 3.7 6.5 5.9 2.1 5.6 5.0 Finances 3.1 1.3 2.2 1.1 .8 1.8 War 1.2 1.3 2.2 2.1 1.6 1.6 Housing 2.5 4.1 2.2 3.5 4.0 3.3 Fear 7.7 8.4 11.8 13.3 10.4 10.0 Health 5.6 5.0 6.2 6.7 7.2 5.8 Work 10.8 11.4 9.0 4.9 9.6 9.5 Total 100.0 100.0 100.0 100.0 100.0 100.0 55
  • 56. 4) Basic statistical tools: Visualization, Characteristic words. 1 j p 1 general term of the contingency table Symmetry F= i fij (n,p) of the two spaces: n row-profiles column-profiles rows and 1 j p j j' columns 1 i i' i n p n n points in R p points in R • •• • • • • • • • • •• • • • • ••• • • • • • •• • •• • • • •• • •• • • •• • • • • • • • • • • • • • • • • • • • • • •• • • • • • p Rn • • • R 56
  • 57. 4) Basic statistical tools: Visualization, Characteristic words. A x is 2 (21 % ) . F in a n c e s .1 5 H IG H . F u tu r e . W ar N o DE GRE E . . F ea r . . U n em p l o y m e n t T R A D E S e lfi s h n e s s . . A x is 1 . H e a l th - .2 - .1 0 .1 . .2 M on ey (57 % ) . . EL EM D iff ic u lt H o u s in g . W o rk . - .1 5 .D e c i s i o n O c c u p a tio n . E c o n o m ic C O LL E G E . . 57
  • 58. 4) Basic statistical tools: Visualization, Characteristic words. Characteristic elements (words, lemmas, segments) The corpus contains several parts (categories of respondents). Notations: kij -sub-frequency of word i in the part j of the corpus; ki. -frequency of word i in the whole corpus; k.j -frequency (size) of part j; k.. -size of the corpus (or, simply, k). We are interested in the statistical significance of sub-frequency kij . Is the word i abnormally frequent in part j ? Is it abnormally rare? The comparison between the relative frequency of word i in part j and the relative frequency of word i in the entire corpus leads to a classical statistical test using either the hypergeometric distribution or its normal 58 approximation.
  • 59. 4) Basic statistical tools: Visualization, Characteristic words. The 4 parameters for computing characteristic elements T E X T P A R T S W O R D S ki j ki . k. j k. . k. . size of corpus ki . frequency of word in corpus ki j frequency of word in text part k. j size of text part 59
  • 60. Text Mining and Open-ended Questions in Sample Surveys Summary / Outline 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions 60
  • 61. 5) Applications: Open questions, sample surveys, texts Example 1: Open Question « Life » (International sample surveys) The following open-ended question was asked : "What is the single most important thing in life for you?" It was followed by the probe: "What other things are very important to you?". The two forthcoming diapositives show the principal plane produced by a correspondence analysis of the previous lexical contingency table (section 3). Proximity between 2 category-points (columns) means similarity of lexical profiles of the 2 categories. Proximity between 2 word-points (rows) means similarity of lexical profiles of these words.
  • 62. c h u r c h w e lf a r e Example 1 (« Life » question) m in d son E 3-A G E 3 if Correspondence th em Words - Categories k id s fo r p ea ce E 2 -A G E 3 s e c u rity co n ten tm en t c h il d r e n E 2-A G E 2 E 1 -A G E 2 a re E 3-A G E 2 f a m i ly f ro m sh o u ld g e n e ra l v er y w if e m e le is u re fr e e d o m h a p p in ess sta n d a rd e m plo y m e nt d og h elp tim e house w o rld d a u g h ter m u sic a nd w o u ld e d u catio n w o rk I m one y be w ell to E 1 -A G E 3 not a n y thin g E 2 -A G E 1 you s a t is f . lo v e da y E 1 -A G E 1 jo b ha ve g o in g ke e p c o m f o r ta b l y t h in g s c o m f o r t a b le m o re m u ch E 3 -A G E 1 f rie nd s th in k c an f u tu re ca r out go 62
  • 63. p ea c e o f m in d Example 1 (« Life » question) E3-A G E3 Location of p e a c e in t h e w o r ld E 2 -A G E 3 Segments w e lfa r e o f m y fa m ily E2-A G E2 h a p p in e ss , g o o d h e a lth E 1 -A G E 2 E3-A G E2 la w a n d o r d e r a n ic e h o m e a g o o d st a n d a r d o f liv in g I d o n 't kn o w h a v in g e n o u g h m o n e y to l ive E1-A G E3 E 2 -A G E 1 E1-A G E1 a g o o d jo b c a n 't t h i n k o f a n y t h i n g e l s e fr ie n d s a n d fa m i ly E 3 -A G E 1 63
  • 64. 5) Applications: Open questions, sample surveys, texts Example 1 (« Life » question) Characteristic words words %W %glob Fr.W Fr.glob TestValue H-30 = -30 * high (Young, High education) 1 f rie n d s 2 .8 7 1 .1 1 17 11 6 3 .4 4 2 do 1 .3 5 .4 5 8 4 7 2 .6 0 3 w ant 1 .0 1 .3 0 6 3 1 2 .4 4 4 b e in g 2 .1 9 1 .1 1 13 11 6 2 .1 8 5 jo b 2 .5 3 1 .3 6 15 14 2 2 .1 6 6 h a v in g 1 .5 2 .6 7 9 7 0 2 .1 1 7 t h in g s .8 4 .2 7 5 2 8 2 .0 6 - -- - -- - -- - -- - -- - 2 w ife .0 0 .6 5 0 68 -2 .1 0 1 h e a lt h 2 .7 0 5 .8 5 16 609 -3 .5 9 H+55 = +55 * high (Older, High education) 1 m in d 2 .5 5 .4 5 5 47 2 .9 1 2 w e lfa r e 1 .5 3 .2 1 3 22 2 .4 2 3 peace 2 .5 5 .7 4 5 77 2 .1 7 64
  • 65. 5) Applications: Open questions, sample surveys, texts Example 1 (« Life » question) : Modal Responses Category 7 Less than 30 years, high level of education 1 .3 3 - 1 f r i e n d s , f r i e n d s , m y h o m e l i f e 1 .1 2 - 2 b e i n g c o n t e n t h a v i n g e n o u g h m o n e y t o d o w h a t y o u w a n t t o d o , w ith in re a s o n , h a v in g g o o d frie n d s , h a v in g a fu lfillin g jo b to d o , h a v in g s o m e id e a o f w h a t y o u w a n t to d o a n d th e fre e d o m to c h o o s e , p ro te c tio n o f th e e n v iro n m e n t 1 .0 5 - 3 to h a v e g o o d frie n d s a ro u n d h a v in g a g o o d jo b , liv in g in a g o o d a re a , h a v in g lo ts o f fre e d o m to d o th e th in g s y o u w a n t to d o .9 3 - 4 g o o d l i v i n g e d u c a t i o n , g o o d j o b , m o n e y Category 9 Over 55 years, high level of education .9 7 - 1 to g e th e rn e s s , p e a c e o f m in d , g o o d h e a lth , re lig io n , n o .6 4 - 2 n o t t o d i e , p e a c e o f m i n d , d o n 't l i k e p e o p l e l i v i n g e n v i o u s o f e a c h o th e r .6 3 - 3 p e a c e o f m in d g o o d h e a lth , h a p p in e s s , e n o u g h m o n e y to k e e p a s ta n d a rd o f liv in g .3 8 - 4 w e lfa re o f m y fa m ily w o rk , s a tis fa c tio n , g o o d h e a lth , tra v e l 65
  • 66. 5) Applications: Open questions, sample surveys, texts Example 1: Similar survey in Japan (Same open question, same categories of respondents) 66
  • 67. 5) Applications: Open questions, sample surveys, texts Example 1: Similar survey in Japan: visualization of the characteristic words for 2 categories 67
  • 68. 5) Applications: Open questions, sample surveys, texts Example 2: Open Questions / Copy-Test
  • 69. 5) Applications: Open questions, sample surveys, texts E x a m p le s o f 2 C h a ra c te ris tic re s p o n s e s fo r 4 c a te g o rie s TEXT 1 PWNB = Prob.w.n.buy Example 2: Open -- 1 to tell you about how long people have eaten them. Questions / Copy- -- 1 the complex carbohydrate that are in this cereal. -- 1 the people who eat this cereal and the product. that's all. Test -- 2 it's supposed to be healthy, it has good carbohydrates in it. TEXT 2 Hesi = Hesitates . -- 1 it gives you energy in the morning. nothing else. . -- 2 grape nuts cereal gives you energy -- 2 it has complex carbohydrates. they showed the man eating it with -- 2 strawberries and bananas. TEXT 3 PWB = Probably would buy -- 1 it's nutritious for you. nothing else. -- 2 that,is good for you, that,s all it said to me TEXT 4 DW B = Definitely would buy -- 1 they are bigger nuggets. low in carbohydrates, that's all. -- 2 it has nutty flavor, it is nutritious
  • 70. 5) Applications: Open questions, sample surveys, texts Example 3: International survey (Tokyo Gas Company). A survey in three cities (Tokyo, New York, Paris) about dietary habits. Open question: "What dishes do you like and eat often? New York: First principal plane. Table crossing words and age x gender categories
  • 71. 5) Applications: Open questions, sample surveys, texts Example 3: International survey (continuation). Question: "What dishes do you like and eat often? New York: First principal plane. Example of confidence areas for categories (Bootstrap)
  • 72. 5) Applications: Open questions, sample surveys, texts Example 3: International survey (continuation). Question: "What dishes do you like and eat often? New York: First principal plane. Example of confidence areas for words (Bootstrap)
  • 73. 5) Applications: Open questions, sample surveys, texts Example 3: International survey (continuation). Question: "What dishes do you like and eat often? New York: First principal plane. Example of Kohonen Map (Self Organizing map).
  • 74. 5) Applications: Open questions, sample surveys, texts Axis 2 : 1.75% First Principal Plane Mesoneros de Castilla (03) 6.0 WINES & MARKS Example 4: Comments about 522 Spanish wines. Nota y commentarios activos 4.5 Valdelosfrailes (03) Gran Reserva Fuentenarro (02) Tinto joven 3.0 Torondos (02) Tinto crianza Gayubar (02) Vega Sicilia 'Único' (94) Viña Sastre Pesus(01) 1.5 Valdetán (02) 94 Eje de calidad 78 79 93 80 82 88 97 Jaros Chafandín (01) 81 83 90 91 92 -3.0 -1.5 84 85 86 1.5 Axis 1: 3.52% 89 87 San Román (01) Numanthia (02) Bienvenida Sitio de El Palo (01) Bienvenida Sitio de El Palo (02) 95 Termanthia (02) Tares P3 (01) Carramimbre (03) Gran Elías Mora (00) -1.5 Viña Eremos (03) Marqués de Peñamonte (01) Tinto reserva Tinto roble
  • 75. 5) Applications: Open questions, sample surveys, texts Example 4: Comments about 522 Spanish wines (continuation) lowest marks highest marks agradable reducido potente estilo denso vez frutal sobremadurez discreto puro concentrado salado graso crianza sequedad frutosidad dejar necesitar impresionante tuestes torrefacto algo medio ensamblado mineral potencial cierto granuloso limpio tempranillo seco primer sabroso abierto rojo gran enérgico ligero ligeramente clásico moderno sorprende algún típico salino tiempo americano beber demasiado dominar expresión fino carnoso tacto evolucionar capa franco donde amargo complejo todo compotado fácil mucho largo noble suave ser cascajo tradicional Ribera cesta bouquet rústico coco sílex joven toque pólvora intenso roble voluptuoso firme lineal magnífico vino corto amable chocolate herbáceo consistencia -1,9 -1,5 -1,1 -0,7 -0,3 0,1 0,5 0,9 1,3 Mark81 82 83 84 85 86 87 88 89 90 Average mark: 85.16
  • 76. 5) Applications: Open questions, sample surveys, texts Example 4: Comments about 522 Axis2 Variables suplementarias Spanish wines (continuation) Mesoneros de Castilla (03) Jaros Chafandín (01) Vega Sicilia 'Único' (94) 4.5 Viña Sastre Pesus(01) Valdelosfrailes (03) Gran Reserva Punta Esencia (01) Fuentenarro (02) Astrales (02) 3.0 20-24,9€ 0-4,9€ 5-9,9€ Torondos (02) Valdecuadrón (02) 15-19,9€ 25-29,9€ Tinto joven 10-14,9€ 50-99,9€ Gayubar (02) Tinto crianza 1.5 Viñatorondos (03) 94 Valdetán (02) 79 100-300€ 78 93 81 91 97 80 Viña Valdable (03) 82 83 88 30-49,9€ 90 92 - 3.0 - 1.5 84 89 1.5 Axis1 Marqués de Olivara (98) 85 86 87 Rauda (01) El Marqués (02) 95 Carramimbre (03) Valsotillo (01) - 1.5 Viña Eremos (03) San Román (01) Tinto reserva Numanthia (02) Marqués de Peñamonte (01) Bienvenida Sitio de El Palo (01) Termanthia (02) Tinto roble Gran Elías Mora (00) Tares P3 (01) Bienvenida Sitio de El Palo (02)
  • 77. 5) Applications: Open questions, sample surveys, texts Example 4: Comments about 522 Spanish wines (continuation) ---- Wine 212 (mark= 85) Legaris-2001 Tuestes, gominolas y buenos balsámicos marcan la intensidad media frutal de este crianza. En boca aparece muy lineal, con consistencia media; el retrogusto frutal todavía tapado por una madera algo rústica. ---- Wine 30 (mark=91) Tares P3-2001 premium Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado, tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones de tierra húmeda y pólvora en el largo final. ---- Wine 314 (mark=97) Vega Sicilia 'Único-1994 Hay que realizar un ejercicio de disciplina gustativa de primer rango para describir este gran vino. el bouquet es fresco, bien armado de fruta roja que se ve potenciada por tintes de chocolates, tabacos, notas de sotobosque y una madera que se manifiesta pero que resulta difícil de localizar y menos de concretar. Tenemos el caso raro de un tinto que sale ileso del paso del tiempo sin lucir su armadura, que es la barrica. En boca joven, aunque ya tiene su cuerpo vigoroso y enérgico bastante ensamblado, con la excepción de algunos taninos saltamontes que quedan para domesticar. Largo y vibrante final que mezcla madurez con una notable finura fresca.
  • 78. Text Mining and Open-ended Questions in Sample Surveys 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions
  • 79. 6) About textual data in general Processing Strategy A priori Grouping (Lexical contingency table) Juxtaposition of Lexical contingency tables Instrumental Partition Direct Analysis of the sparse Lexical table 79
  • 80. 6) About textual data in general Importance of Meta-data Meta-data linguistics Grammar / Syntax Textual data Semantics networks External Corpora externes Other a priori structures sociolinguistics, chronology, etc. 80
  • 81. 6) About textual data in general The four phases of a linguistic analysis (A bxg flower) A big flower A bug flower Morphology A bag flower A bog flower Syntax The spoon speaks (The speaks) Semantics A man thinks (A stone thinks) Pragmatics A challenge to I.A. 81
  • 82. 6) About textual data in general Homography, Polysemy, Synonymy To bear Homographs: BORE A tedious person To bore Task Polysemous words: DUTY Tax DRUG medicine Addicting product 82
  • 83. 6) About textual data in general Semantic content of a lexical profile Distributional linguistics (Z. Harris) X is sometimes purring X mews X has whiskers X likes milk X likes chasing mice At the end, the point « X » will be superimposed with the point « CAT» 83
  • 84. 6) About textual data in general Semantic similarity is not a transitive relationship Example of semantic chains: (1) calm–wisdom–discretion–wariness–fear–panic, (2) fact–feature –aspect–appearance–illusion . 84
  • 85. 6) About textual data in general New additional variables Nouns (proper, common) Verbs (auxiliary, modal, usual…) Adjective Pronoun Determiner Adverb Preposition Conjunction 85
  • 86. 6) About textual data in general new variables, new metrics 86
  • 87. Text Mining and Open-ended Questions in Sample Surveys Summary / Outline 1) Principles of Data Mining and Text mining: A reminder 2) Open-ended Questions: Why? How? 3) From texts to numerical data 4) Basic statistical tools: Visualization, Characteristic words. 5) Applications: Open questions, sample surveys, texts 6) About textual data in general 7) Conclusions
  • 88. 7) Conclusions As a conclusion... For each open-ended question, and for each partition of the sample of respondents, we obtain, without any preliminary coding or other intervention: • A visualization of proximities between words and categories. • Characteristic elements or words for each category . • Modal responses for each category (a kind of automatic summary). [Remember also that the open question “Why” following a closed question provides an indispensable assessment of the real understanding of the question].
  • 89. 7) Conclusions As a conclusion... (continuation) All these processing are carried out under the supervision of robust assessment procedures: - Non-parametric statistical tests, - Bootstrap validation. We are not dealing here with a novel sophisticated modelling. It is rather a painstaking effort to stick to the real concerns of the respondent, i.e.: the customer, the user, the client. With the rapid development of online surveys, the spreading of e-mails and blogs, the presented set of tools is expected to be a noteworthy component in a new methodology of customer knowledge.
  • 90. 7) Conclusions – Short Bibliography - Akuto H. (1992). International Comparison of Dietary Culture. Nihon Keizai Simbun, Tokyo. - Bécue M., Lebart L. (1996). Clustering of texts using semantic graphs. Application to open-ended questions in surveys, Proceedings of the IFCS 96 Symposium, Kobe, Springer Verlag, Tokyo (in press). - Bécue-Bertaut M., Pagès J., Alvarez-Esteban R., Vásquez Burguete J.L. (2006) Détermination d’une note globale, synthèse d’une évaluation numérique et d’appréciations libres. Application aux études de marché. (in French) Actes des JADT-2006. - Bécue-Bertaut, M., Álvarez Esteban R., Pagès (2008,) http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2006/tocJADT2006.htm Rating of products through scores and free-text assertions. Comparing and combining both. Food Quality and Preference, 19/1, 122-134. - Belson W.A., Duncan J.A. (1962): A Comparison of the check-list and the open response questioning system, Applied Statistics, 2, 120-132. - Benzécri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York. - Biber D. (1995). Dimensions of register variation. Cambridge Univ. Press, Cambridge. - Bradburn N., Sudman S., and associates (1979): Improving Interview Method and Questionnaire Design, Jossey Bass, San Francisco. - Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis, J. of the Amer. Soc. for Information Science, 41 (6), 391-407. - Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Armand colin, Paris. - Hayashi C., Suzuki T., Sasaki M. (1992): Data Analysis for Social Comparative research: International Perspective, North-Holland, Amsterdam. - Lebart L. (1982). Exploratory analysis of large sparse matrices, with application to textual data, COMPSTAT, Physica Verlag, 67-76. - Lebart L., Salem A., Bécue M., (2000), Análisis estadístico de textos, Editorial Milenio, Lleida. - Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer, Dordrecht. - Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis. John Wiley. N.Y. - Ritter H., Kohonen T. (1989). Self Organizing Semantic Maps. Biol. Cybern. 61, 241-254. - Salem A. (1984). La typologie des segments répétés dans un corpus, fondée sur l'analyse d'un tableau croisant mots et textes, Cahiers de l'Analyse des Données, 489-500. - Sasaki M., Suzuki T. (1989): New directions in the study of general social attitudes : trends and cross-national perspectives, Behaviormetrika, 26, 9-30. - Schuman H., Presser F. (1981): Question and Answers in Attitude Surveys, Academic Press, New York. - Sudman S., Bradburn N. (1974): Response Effects in Survey, Aldine, Chicago.
  • 91. Surveys data and software (DtmVic) can be downloaded from www.dtm-vic.com
  • 92. Thank You Gracias Grazie Obrigado Merci 92