Blog Sentiment Analysis using NOSQL techniques

S. Kartik, Ph.D.
EMC Distinguished Engineer
Global Field CTO, Data Computing Division, EMC


© Copyright 2011 EMC Corporation. All rights reserved.       1
NOSQL is Not Only SQL!

© Copyright 2011 EMC Corporation. All rights reserved.   2
What is Sentiment Analysis?
• Sentiment analysis or opinion mining refers to the application of natural
  language processing, computational linguistics, and text analytics to
  identify and extract subjective information in source materials.
  – Source: Wikipedia

• For bulk text data such as blogs, Sentiment Analysis requires a combination
  of natural language processing and statistical techniques to quantify the
  results
  – Hence NOSQL techniques are used for this sort of analysis



© Copyright 2011 EMC Corporation. All rights reserved.                         3
What are our customers saying about us?
• Discern trends and categories in on-line conversations?
  – Search for relevant blogs
  – ‘Fingerprinting’ based on word frequencies
  – Similarity measure
  – Identify ‘clusters’ of documents




© Copyright 2011 EMC Corporation. All rights reserved.                                          4
Natural Language Processing
    • Non-trivial!
    • Natural language is hard to interpret
      – Ontologies are very specific to industries and technologies
        (e.g., medical vs. telco, …)
      – Abbreviations and modern geek-speak are hard to interpret
        (LOL, RTFM, ROTFL, BSOD, …)
    • Natural language is inherently ambiguous
      – “The thieves stole the paintings. They were then sold.”
      – “The thieves stole the paintings. They were then caught.”
    • Strongly recommended: an excellent book on NLTK
      – Natural Language Processing with Python – O’Reilly


© Copyright 2011 EMC Corporation. All rights reserved.                     5
Tools and Techniques
• Greenplum Map/Reduce for blog text processing in
  parallel
• Python Natural Language Toolkit (NLTK) to parse blogs
• Histograms for word frequencies using Greenplum SQL
• Construct metrics describing document similarity
  using SQL
• Use statistical analysis with clustering techniques to
  group blogs with similar word usage




© Copyright 2011 EMC Corporation. All rights reserved.     6
What are our customers saying about us?

Method

• Construct document histograms
• Transform histograms into document “fingerprints”
• Use clustering techniques to discover similar documents.




© Copyright 2011 EMC Corporation. All rights reserved.                                             7
What are our customers saying about us?
Constructing document histograms

• Parsing and extracting HTML files
• Using natural language processing for tokenization and stemming
• Cleansing inconsistencies
• Transforming unstructured data into structured data



© Copyright 2011 EMC Corporation. All rights reserved.                                           8
What are our customers saying about us?
“Fingerprinting”
- Compare the frequency of terms within a document with the frequency of
  those terms across all documents
- Term frequency–inverse document frequency (tf-idf weight)
- Easily calculated from formulas over the document histograms (see the
  sketch below)
- The result is a vector in n-space.
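
As a rough illustration (plain Python, not the Greenplum SQL used later in the deck), tf-idf weights can be computed directly from per-document term histograms; tfidf_vectors below is a hypothetical helper:

    import math
    from collections import Counter

    def tfidf_vectors(doc_histograms):
        """doc_histograms: dict of doc_id -> Counter of term -> count.
        Returns dict of doc_id -> dict of term -> tf-idf weight."""
        n_docs = len(doc_histograms)
        # Document frequency: in how many documents does each term appear?
        df = Counter()
        for hist in doc_histograms.values():
            df.update(hist.keys())
        vectors = {}
        for doc_id, hist in doc_histograms.items():
            total_terms = sum(hist.values())
            vectors[doc_id] = {
                term: (count / total_terms) * math.log(n_docs / df[term])
                for term, count in hist.items()
            }
        return vectors

    # Example: three tiny "documents" represented as term histograms
    docs = {
        1: Counter({"skin": 3, "canon": 1}),
        2: Counter({"skin": 1, "sturdi": 2}),
        3: Counter({"canon": 4}),
    }
    print(tfidf_vectors(docs)[1])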



© Copyright 2011 EMC Corporation. All rights reserved.                                           9
What are our customers saying about us?
k-means clustering:
- Iterative algorithm for finding items that are similar within an
  n-dimensional space
- Two fundamental steps (sketched below):
  1. Measuring distance to a centroid (the center of a cluster)
  2. Moving the center of a cluster to minimize the sum of distances to all
     the members of the cluster.
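
A minimal numpy sketch of the two steps (Euclidean distance here for brevity, whereas the deck later uses cosine distance; assign_clusters and recenter are hypothetical helper names):

    import numpy as np

    def assign_clusters(vectors, centroids):
        """Step 1: measure each vector's distance to every centroid and
        assign it to the nearest one. Returns an array of cluster ids."""
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    def recenter(vectors, assignments, k):
        """Step 2: move each centroid to the mean of its members, which
        minimizes the summed distance to those members.
        Assumes every cluster keeps at least one member."""
        return np.array([vectors[assignments == j].mean(axis=0) for j in range(k)])

    # Toy example: 6 points in 2-D, 2 clusters
    pts = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
    centroids = pts[np.random.choice(len(pts), 2, replace=False)]
    labels = assign_clusters(pts, centroids)
    centroids = recenter(pts, labels, 2)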


© Copyright 2011 EMC Corporation. All rights reserved.                                                10
What are our customers saying about us?




© Copyright 2011 EMC Corporation. All rights reserved.   11
What are our customers saying about us?




© Copyright 2011 EMC Corporation. All rights reserved.   12
What are our customers saying about us?




© Copyright 2011 EMC Corporation. All rights reserved.   13
What are our customers saying about us?

• innovation
• leader
• design

• bug
• installation
• download

• speed
• graphics
• improvement


© Copyright 2011 EMC Corporation. All rights reserved.                    14
Accessing the data
     • find blogsplog/model -exec echo "$PWD/{}" \; > filelist.txt
     • Build the directory list into a set of files that we will access:
                 -INPUT:
                    NAME: filelist
                    FILE:
                      - maple:/Users/demo/blogsplog/filelist.txt
                    COLUMNS:
                      - path text


     • For each record in the list "open()" the file and read it in its entirety
                 -MAP:
                   NAME:      read_data
                   PARAMETERS: [path text]
                   RETURNS: [id int, path text, body text]
                   LANGUAGE: python
                   FUNCTION: |
                      (_, fname) = path.rsplit('/', 1)
                      (id, _) = fname.split('.')
                      body = open(path).read()…

       id |                path                 |             body
      ------+---------------------------------------+------------------------------------
       2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...
          1 | /Users/demo/blogsplog/model/1.html | <!DOCTYPE html PUBLIC "...
         10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...
       2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "...
      ...
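
The MAP body above is truncated (“…”); a rough standalone sketch of the same logic in plain Python, outside Greenplum MapReduce (names and the usage loop are assumptions):

    def read_data(path):
        """Derive a numeric id from the file name and read the whole blog file,
        mirroring the truncated MAP function above."""
        (_, fname) = path.rsplit('/', 1)     # '.../model/2482.html' -> '2482.html'
        (doc_id, _) = fname.split('.', 1)    # '2482.html' -> '2482'
        with open(path) as f:
            body = f.read()
        return int(doc_id), path, body

    # Example: iterate over the file list produced by the find command above,
    # assuming filelist.txt exists in the working directory.
    with open('filelist.txt') as filelist:
        for line in filelist:
            doc_id, path, body = read_data(line.strip())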



© Copyright 2011 EMC Corporation. All rights reserved.                                      15
How do we deal with HTML?
    • First, strip out HTML tags in the blog
              – Use standard NLTK HTML Parsers with some method
                overrides
    • Tokenizing is breaking up the text into distinct
      pieces (usually words)
    • Stemming gets rid of ending variation
              – Stemming, stemmed, stems → stem
    • Stop Words are removed (and, but, if,…)
    • Get rid of non-Unicode characters
    • Lastly, discard very small words (<4 characters)
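
Taken together, these steps form a short cleaning pipeline. A minimal standalone sketch using current NLTK APIs (word_tokenize, PorterStemmer, and the stopwords corpus, which must be downloaded separately) rather than the older WordTokenizer class shown on the following slides:

    import re
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    def clean_terms(text):
        """Tokenize, stem, drop stop words, non-ASCII debris, and short words."""
        text = re.sub(r'[^\x00-\x7f]', ' ', text)   # drop non-ASCII characters
        terms = []
        for token in word_tokenize(text):
            stem = stemmer.stem(token).lower()
            if len(stem) < 4:                        # discard very small words
                continue
            if stem in stop_words:                   # remove stop words
                continue
            terms.append(stem)
        return terms

    print(clean_terms("The thieves stole the paintings. They were then sold."))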

© Copyright 2011 EMC Corporation. All rights reserved.            16
Parse the documents into word lists
• Convert HTML documents into parsed, tokenized, stemmed term lists with stop-word
  removal
• Use the HTMLParser library to parse the HTML documents and extract titles and body
  contents:
    if 'parser' not in SD:
        from HTMLParser import HTMLParser
        ...
        class MyHTMLParser(HTMLParser):
            def __init__(self):
                HTMLParser.__init__(self)
                ...
            def handle_data(self, data):
                data = data.strip()
                if self.inhead:
                    if self.tag == 'title':
                        self.title = data
                if self.inbody:
                    ...
    parser = SD['parser']
    parser.reset()
    ...




 © Copyright 2011 EMC Corporation. All rights reserved.                                   17
Parse the documents into word lists
• Use nltk to tokenize, stem, and remove common terms:

    if 'parser' not in SD:
        from nltk import WordTokenizer, PorterStemmer, corpus
        ...
        class MyHTMLParser(HTMLParser):
            def __init__(self):
                ...
                self.tokenizer = WordTokenizer()
                self.stemmer = PorterStemmer()
                self.stopwords = dict(map(lambda x: (x, True),
                                          corpus.stopwords.words()))
            ...
            def handle_data(self, data):
                ...
                if self.inbody:
                    tokens = self.tokenizer.tokenize(data)
                    stems = map(self.stemmer.stem, tokens)
                    for x in stems:
                        if len(x) < 4: continue
                        x = x.lower()
                        if x in self.stopwords: continue
                        self.doc.append(x)
            ...
    parser = SD['parser']
    parser.reset()
    ...




© Copyright 2011 EMC Corporation. All rights reserved.                                                      18
Parse the documents into word lists
• Use nltk to tokenize, stem, and remove common terms:

    if 'parser' not in SD:
        from nltk import WordTokenizer, PorterStemmer, corpus
        ...
        class MyHTMLParser(HTMLParser):
            def __init__(self):
                ...
                self.tokenizer = WordTokenizer()
                self.stemmer = PorterStemmer()
                self.stopwords = dict(map(lambda x: (x, True),
                                          corpus.stopwords.words()))
            def handle_data(self, data):
                ...
                if self.inbody:
                    tokens = self.tokenizer.tokenize(data)
                    stems = map(self.stemmer.stem, tokens)
                    for x in stems:
                        if len(x) < 4: continue
                        x = x.lower()
                        if x in self.stopwords: continue
                        self.doc.append(x)
            ...
    parser = SD['parser']
    parser.reset()
    ...

    shell$ gpmapreduce -f blog-terms.yml
    mapreduce_75643_run_1
    DONE

    sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

      id |     title     |                              doc
    ------+---------------+-----------------------------------------------------------------
     2482 | noodlepie     | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
        1 | Bhootakannadi | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
       10 | Tea Set       | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
    ...




© Copyright 2011 EMC Corporation. All rights reserved.                                                                        19
Create histograms of word frequencies
Extract a term dictionary of terms that show up in more than ten blogs:

                   sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
                      FROM (
                        SELECT id, term, count(*) AS c
                        FROM (
                          SELECT id, unnest(doc) AS term
                          FROM blog_terms
                        ) term_unnest
                        GROUP BY id, term
                      ) doc_terms
                      WHERE term IS NOT NULL
                      GROUP BY term
                      HAVING count(*) > 10;

                       term   | freq | num_blogs
                    ----------+------+-----------
                     sturdi   |   19 |        13
                     canon    |   97 |        40
                     group    |   48 |        17
                     skin     |  510 |       152
                     linger   |   19 |        17
                     blunt    |   20 |        17




© Copyright 2011 EMC Corporation. All rights reserved.                       20
Create histograms of word frequencies
Use the term frequencies to construct the term dictionary…

       sql#      SELECT array(SELECT term FROM blog_term_freq) dictionary;

                                dictionary
       ---------------------------------------------------------------------
       {sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...




…then use the term dictionary to construct feature vectors for every document, mapping document
  terms to the features in the dictionary:
       sql#     SELECT id, gp_extract_feature_histogram(dictionary, doc)
               FROM blog_terms, blog_features;

        id |                     term_count
       -----+----------------------------------------------------------------
       2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}
          1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}
         10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}
       ...
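
Conceptually, gp_extract_feature_histogram counts how often each dictionary term occurs in a document, aligned to the dictionary ordering. A rough pure-Python equivalent (a sketch, not the Greenplum function):

    from collections import Counter

    def extract_feature_histogram(dictionary, doc_terms):
        """Return a dense count vector aligned with the dictionary ordering."""
        counts = Counter(doc_terms)
        return [counts.get(term, 0) for term in dictionary]

    dictionary = ['sturdi', 'canon', 'group', 'skin']
    doc = ['skin', 'skin', 'canon']
    print(extract_feature_histogram(dictionary, doc))   # [0, 1, 0, 2]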




© Copyright 2011 EMC Corporation. All rights reserved.                                            21
Create histograms of word frequencies
Format of a sparse vector
     id |                    term_count
    -----+----------------------------------------------------------------------
    ...
      10 | {3,1,40,...}:{0,2,0,...}
    ...




   Dense representation of the vector
     id |                     term_count
    -----+----------------------------------------------------------------------
    ...
      10 | {0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
    ...
                            dictionary
    ----------------------------------------------------------------------------
          {sturdi,canon,group,skin,linger,blunt,detect,giver,...}




      Representing the document
    {skin, skin, ...}
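
Reading the sparse form: the first array holds run lengths and the second holds values, so {3,1,40,...}:{0,2,0,...} expands to three 0s, one 2, forty 0s, and so on, matching the dense representation above. A small decoding sketch (hypothetical helper name):

    def svec_to_dense(run_lengths, values):
        """Expand a run-length encoded sparse vector ({counts}:{values}) into a dense list."""
        dense = []
        for count, value in zip(run_lengths, values):
            dense.extend([value] * count)
        return dense

    # {3,1,40,...}:{0,2,0,...}  ->  [0, 0, 0, 2, 0, 0, ...]
    print(svec_to_dense([3, 1, 40], [0, 2, 0])[:8])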




© Copyright 2011 EMC Corporation. All rights reserved.                             22
Transform the blog terms into statistically useful measures
Use the feature vectors to construct tf-idf (term frequency–inverse document frequency)
  vectors.
These measure the importance of terms within a document relative to the whole corpus.

       sql# SELECT id, (term_count*logidf) tfxidf
           FROM blog_histogram, (
             SELECT log(count(*)/count_vec(term_count)) logidf
             FROM blog_histogram
           ) blog_logidf;

        id |                       tfxidf
       -----+-------------------------------------------------------------------
       2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
          1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
         10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
       ...




© Copyright 2011 EMC Corporation. All rights reserved.                                    23
Create document clusters around iteratively defined centroids
Now that we have tf-idf vectors, we have a statistically meaningful representation,
  which enables all sorts of real analytics.
The current example is k-means clustering, which requires two operations.


First, we compute a distance metric between the documents and a random selection
   of centroids, for instance the angular distance derived from cosine similarity:

sql# SELECT id, tfxidf, cid,
     ACOS((tfxidf %*% centroid) /
       (svec_l2norm(tfxidf) * svec_l2norm(centroid))
     ) AS distance
    FROM blog_tfxidf, blog_centroids;

  id |                       tfxidf                              | cid | distance
 -----+-------------------------------------------------------------------+-----+------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 1 | 1.53672977
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 2 | 1.55720354
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 3 | 1.55040145
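
Outside the database, the same distance can be sketched with numpy, mirroring the ACOS/svec_l2norm expression above (hypothetical helper name):

    import numpy as np

    def angular_distance(tfxidf, centroid):
        """Arc-cosine of the cosine similarity, as in the SQL expression above."""
        cos_sim = np.dot(tfxidf, centroid) / (np.linalg.norm(tfxidf) * np.linalg.norm(centroid))
        return np.arccos(np.clip(cos_sim, -1.0, 1.0))   # clip guards against rounding error

    print(angular_distance(np.array([0.0, 8.25, 0.34]), np.array([0.1, 0.2, 0.3])))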




© Copyright 2011 EMC Corporation. All rights reserved.                                          24
Create document clusters around iteratively defined centroids
Next, re-center each cluster on the mean (average) of its members:


sql# SELECT cid, sum(tfxidf)/count(*) AS centroid
    FROM (
      SELECT id, tfxidf, cid,
       row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank
      FROM blog_distance
    ) blog_rank
    WHERE rank = 1
    GROUP BY cid;

 cid |                              centroid
-----+------------------------------------------------------------------------
   3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
   2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
   1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}



Repeat the previous two operations until the centroids converge, and you have k-means clustering.
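
A sketch of that driver loop in numpy, reusing the assign_clusters and recenter helpers from the earlier k-means sketch, with a simple convergence test (an assumption; the deck does not specify a stopping criterion):

    import numpy as np

    def kmeans(vectors, k, tol=1e-6, max_iter=100, seed=0):
        """Iterate assignment and re-centering until the centroids stop moving.
        Relies on assign_clusters() and recenter() from the earlier sketch."""
        rng = np.random.default_rng(seed)
        centroids = vectors[rng.choice(len(vectors), k, replace=False)]
        for _ in range(max_iter):
            labels = assign_clusters(vectors, centroids)     # step 1: distances
            new_centroids = recenter(vectors, labels, k)     # step 2: re-center
            if np.linalg.norm(new_centroids - centroids) < tol:
                break
            centroids = new_centroids
        return labels, centroids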




© Copyright 2011 EMC Corporation. All rights reserved.                                              25
Summary

• Accessing the data (MapReduce)
     id |                 path                  |             body
   ------+---------------------------------------+------------------------------------
    2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...

• Parse the documents into word lists (MapReduce)
     id |   title   |                              doc
   ------+-----------+-----------------------------------------------------------------
    2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

• Create histograms of word frequencies (SQL)
     id |                    term_count
   -----+----------------------------------------------------------------
    2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

• Transform the blog terms into statistically useful measures (SQL)
     id |                       tfxidf
   -----+-------------------------------------------------------------------
    2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

• Create document clusters around iteratively defined centroids (MADlib or SQL window functions)
     id |                       tfxidf                                       | cid | distance
   -----+--------------------------------------------------------------------+-----+------------
    2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   1 | 1.53672977




© Copyright 2011 EMC Corporation. All rights reserved.                                                                                                         26
Conclusions
    • A simple exercise in natural language text processing
    • The technique uses a combination of non-SQL processing (Map/Reduce)
      and SQL processing (Greenplum)
    • Takes qualitative, subjective text data and converts the problem into
      a form amenable to statistical analysis using k-means clustering
    • Rich possibilities for business impact using these techniques are
      emerging across industries.


© Copyright 2011 EMC Corporation. All rights reserved.       27

Weitere ähnliche Inhalte

Andere mochten auch

Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
RWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data GovernanceRWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data GovernanceDATAVERSITY
 
CDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success StoriesCDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success StoriesDATAVERSITY
 
Webinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance ProgramWebinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance ProgramDATAVERSITY
 
LDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata ManagementLDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata ManagementDATAVERSITY
 
Smart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data ManagementSmart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data ManagementDATAVERSITY
 
Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)DATAVERSITY
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachDATAVERSITY
 
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance ChiefRWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance ChiefDATAVERSITY
 

Andere mochten auch (9)

Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
RWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data GovernanceRWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data Governance
 
CDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success StoriesCDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success Stories
 
Webinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance ProgramWebinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance Program
 
LDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata ManagementLDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata Management
 
Smart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data ManagementSmart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data Management
 
Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
 
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance ChiefRWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
 

Ähnlich wie Wed 1430 kartik_subramanian_color

Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)Sumit Raj
 
Empower your Enterprise with language intelligence_Francisco Webber
Empower your Enterprise with language intelligence_Francisco Webber Empower your Enterprise with language intelligence_Francisco Webber
Empower your Enterprise with language intelligence_Francisco Webber Dataconomy Media
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...Dataconomy Media
 
NLP, Expert system and pattern recognition
NLP, Expert system and pattern recognitionNLP, Expert system and pattern recognition
NLP, Expert system and pattern recognitionMohammad Ilyas Malik
 
Data Segmenting in Anzo
Data Segmenting in AnzoData Segmenting in Anzo
Data Segmenting in AnzoLeeFeigenbaum
 
Semantic Web and Machine Learning Tutorial
Semantic Web and Machine Learning TutorialSemantic Web and Machine Learning Tutorial
Semantic Web and Machine Learning Tutorialbutest
 
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...Dr. Haxel Consult
 
Terminology in openEHR
Terminology in openEHRTerminology in openEHR
Terminology in openEHRPablo Pazos
 
Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?Fabien Coppens
 
RESTful SOA and the Spring Framework (EMCWorld 2011)
RESTful SOA and the Spring Framework (EMCWorld 2011)RESTful SOA and the Spring Framework (EMCWorld 2011)
RESTful SOA and the Spring Framework (EMCWorld 2011)EMC
 
On the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsOn the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsFilip Krikava
 
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...devismileyrockz
 
pre-defence.pptx
pre-defence.pptxpre-defence.pptx
pre-defence.pptxkhanz8
 
Ecm mythbusters the_real_story_behind_vendor_marketing
Ecm mythbusters the_real_story_behind_vendor_marketingEcm mythbusters the_real_story_behind_vendor_marketing
Ecm mythbusters the_real_story_behind_vendor_marketingQuestexConf
 
Engineering Interoperable and Reliable Systems
Engineering Interoperable and Reliable SystemsEngineering Interoperable and Reliable Systems
Engineering Interoperable and Reliable SystemsRick Warren
 
Natural Language Processing Advancements By Deep Learning - A Survey
Natural Language Processing Advancements By Deep Learning - A SurveyNatural Language Processing Advancements By Deep Learning - A Survey
Natural Language Processing Advancements By Deep Learning - A SurveyAkshayaNagarajan10
 
EclipseConEurope2012 SOA - Models As Operational Documentation
EclipseConEurope2012 SOA - Models As Operational DocumentationEclipseConEurope2012 SOA - Models As Operational Documentation
EclipseConEurope2012 SOA - Models As Operational DocumentationMarc Dutoo
 
Ai Brain Docs Solution Oct 2012
Ai Brain Docs Solution Oct 2012Ai Brain Docs Solution Oct 2012
Ai Brain Docs Solution Oct 2012tom_marsh
 
Towards a Marketplace of Open Source Software Data
Towards a Marketplace of Open Source Software DataTowards a Marketplace of Open Source Software Data
Towards a Marketplace of Open Source Software DataFernando Silva Parreiras
 

Ähnlich wie Wed 1430 kartik_subramanian_color (20)

Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Empower your Enterprise with language intelligence_Francisco Webber
Empower your Enterprise with language intelligence_Francisco Webber Empower your Enterprise with language intelligence_Francisco Webber
Empower your Enterprise with language intelligence_Francisco Webber
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
 
NLP, Expert system and pattern recognition
NLP, Expert system and pattern recognitionNLP, Expert system and pattern recognition
NLP, Expert system and pattern recognition
 
eZ Product Vision Keynote
eZ Product Vision KeynoteeZ Product Vision Keynote
eZ Product Vision Keynote
 
Data Segmenting in Anzo
Data Segmenting in AnzoData Segmenting in Anzo
Data Segmenting in Anzo
 
Semantic Web and Machine Learning Tutorial
Semantic Web and Machine Learning TutorialSemantic Web and Machine Learning Tutorial
Semantic Web and Machine Learning Tutorial
 
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
 
Terminology in openEHR
Terminology in openEHRTerminology in openEHR
Terminology in openEHR
 
Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?Complex Event Processing: What?, Why?, How?
Complex Event Processing: What?, Why?, How?
 
RESTful SOA and the Spring Framework (EMCWorld 2011)
RESTful SOA and the Spring Framework (EMCWorld 2011)RESTful SOA and the Spring Framework (EMCWorld 2011)
RESTful SOA and the Spring Framework (EMCWorld 2011)
 
On the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsOn the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF Models
 
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
 
pre-defence.pptx
pre-defence.pptxpre-defence.pptx
pre-defence.pptx
 
Ecm mythbusters the_real_story_behind_vendor_marketing
Ecm mythbusters the_real_story_behind_vendor_marketingEcm mythbusters the_real_story_behind_vendor_marketing
Ecm mythbusters the_real_story_behind_vendor_marketing
 
Engineering Interoperable and Reliable Systems
Engineering Interoperable and Reliable SystemsEngineering Interoperable and Reliable Systems
Engineering Interoperable and Reliable Systems
 
Natural Language Processing Advancements By Deep Learning - A Survey
Natural Language Processing Advancements By Deep Learning - A SurveyNatural Language Processing Advancements By Deep Learning - A Survey
Natural Language Processing Advancements By Deep Learning - A Survey
 
EclipseConEurope2012 SOA - Models As Operational Documentation
EclipseConEurope2012 SOA - Models As Operational DocumentationEclipseConEurope2012 SOA - Models As Operational Documentation
EclipseConEurope2012 SOA - Models As Operational Documentation
 
Ai Brain Docs Solution Oct 2012
Ai Brain Docs Solution Oct 2012Ai Brain Docs Solution Oct 2012
Ai Brain Docs Solution Oct 2012
 
Towards a Marketplace of Open Source Software Data
Towards a Marketplace of Open Source Software DataTowards a Marketplace of Open Source Software Data
Towards a Marketplace of Open Source Software Data
 

Mehr von DATAVERSITY

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...DATAVERSITY
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceDATAVERSITY
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data LiteracyDATAVERSITY
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for YouDATAVERSITY
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?DATAVERSITY
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling FundamentalsDATAVERSITY
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectDATAVERSITY
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?DATAVERSITY
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...DATAVERSITY
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsDATAVERSITY
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayDATAVERSITY
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise AnalyticsDATAVERSITY
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best PracticesDATAVERSITY
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?DATAVERSITY
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best PracticesDATAVERSITY
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageDATAVERSITY
 

Mehr von DATAVERSITY (20)

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and Governance
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data Literacy
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for You
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling Fundamentals
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic Project
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and Forwards
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement Today
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best Practices
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best Practices
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive Advantage
 

Kürzlich hochgeladen

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Wed 1430 kartik_subramanian_color

What are our customers saying about us?
Constructing document histograms
• Parsing and extracting HTML files
• Using natural language processing for tokenization and stemming
• Cleansing inconsistencies
• Transforming unstructured data into structured data
© Copyright 2011 EMC Corporation. All rights reserved.   8
What are our customers saying about us?
“Fingerprinting”
– Term frequency of words within a document vs. the frequency with which those words occur across all documents
– Term frequency-inverse document frequency (tf-idf weight)
– Easily calculated from formulas over the document histograms
– The result is a vector in n-space
© Copyright 2011 EMC Corporation. All rights reserved.   9
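A minimal sketch of this fingerprinting step in plain Python, assuming each document histogram is available as a term-to-count mapping; the function and variable names here are illustrative and are not part of the original deck:

    import math
    from collections import Counter

    def tfidf_vectors(histograms):
        """histograms: list of Counters, one per document (term -> count). Illustrative helper."""
        n_docs = len(histograms)
        # Document frequency: number of documents each term appears in
        df = Counter()
        for hist in histograms:
            df.update(hist.keys())
        # tf-idf weight for every term of every document
        return [
            {term: count * math.log(n_docs / df[term]) for term, count in hist.items()}
            for hist in histograms
        ]

    docs = [Counter({'skin': 2, 'canon': 1}), Counter({'skin': 1}), Counter({'group': 3})]
    print(tfidf_vectors(docs))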
What are our customers saying about us?
k-means clustering:
– An iterative algorithm for finding items that are similar within an n-dimensional space
– Two fundamental steps:
  1. Measure the distance from each item to a centroid (the center of a cluster)
  2. Move the center of each cluster to minimize the summed distance to all the members of the cluster
© Copyright 2011 EMC Corporation. All rights reserved.   10
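A minimal sketch of those two steps with dense NumPy vectors; this illustrates the algorithm only and is not the Greenplum/MADlib implementation shown later in the deck:

    import numpy as np

    def cosine_distance(x, c):
        # Angle between the two vectors (the same distance metric used later in the deck)
        return np.arccos(np.clip(x @ c / (np.linalg.norm(x) * np.linalg.norm(c)), -1.0, 1.0))

    def kmeans(vectors, k, iterations=20, seed=0):
        rng = np.random.default_rng(seed)
        centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
        for _ in range(iterations):
            # Step 1: assign each vector to its nearest centroid
            labels = np.array([
                np.argmin([cosine_distance(v, c) for c in centroids]) for v in vectors
            ])
            # Step 2: move each centroid to the mean of its current members
            centroids = np.array([
                vectors[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
        return labels, centroids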
What are our customers saying about us?
[Slides 11–13 contain figures only; no transcript text beyond the slide title.]
© Copyright 2011 EMC Corporation. All rights reserved.   11–13
What are our customers saying about us?
• innovation
• leader
• design
• bug
• installation
• download
• speed
• graphics
• improvement
© Copyright 2011 EMC Corporation. All rights reserved.   14
Accessing the data
• Build the list of blog files to process:
    find blogsplog/model -exec echo "$PWD/{}" \; > filelist.txt
• Build the directory list into a set of files that we will access:
    - INPUT:
        NAME: filelist
        FILE:
          - maple:/Users/demo/blogsplog/filelist.txt
        COLUMNS:
          - path text
• For each record in the list, open() the file and read it in its entirety:
    - MAP:
        NAME: read_data
        PARAMETERS: [path text]
        RETURNS: [id int, path text, body text]
        LANGUAGE: python
        FUNCTION: |
          (_, fname) = path.rsplit('/', 1)
          (id, _) = fname.split('.')
          body = open(path).read()
          …

      id  | path                                  | body
    ------+---------------------------------------+------------------------------------
     2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...
        1 | /Users/demo/blogsplog/model/1.html    | <!DOCTYPE html PUBLIC "...
       10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...
     2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "...
    ...
© Copyright 2011 EMC Corporation. All rights reserved.   15
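For local experimentation, the MAP function above can be sketched as a standalone Python function; the directory glob and function name below are illustrative assumptions, not part of the original job:

    import glob
    import os

    def read_data(path):
        # Derive the document id from the file name (e.g. ".../2482.html" -> 2482); illustrative helper
        fname = os.path.basename(path)
        doc_id = int(fname.split('.')[0])
        with open(path, encoding='utf-8', errors='ignore') as f:
            body = f.read()
        return doc_id, path, body

    if __name__ == '__main__':
        for path in glob.glob('blogsplog/model/*.html'):   # assumed local layout
            doc_id, path, body = read_data(path)
            print(doc_id, path, len(body))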
How do we deal with HTML?
• First, strip out HTML tags in the blog
  – Use standard NLTK HTML parsers with some method overrides
• Tokenizing is breaking up the text into distinct pieces (usually words)
• Stemming gets rid of ending variation
  – Stemming, stemmed, stems → stem
• Stop words are removed (and, but, if, …)
• Get rid of non-Unicode characters
• Lastly, discard very small words (< 4 characters)
© Copyright 2011 EMC Corporation. All rights reserved.   16
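A minimal sketch of this cleaning pipeline, assuming a current NLTK installation; it uses word_tokenize and the bundled stop-word corpus rather than the older WordTokenizer class shown on the following slides, and a crude regular expression in place of the HTML parser:

    # Requires: nltk.download('punkt') and nltk.download('stopwords')
    import re
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STEMMER = PorterStemmer()
    STOPWORDS = set(stopwords.words('english'))
    TAG_RE = re.compile(r'<[^>]+>')            # crude tag stripper, for illustration only

    def clean(html_text):
        text = TAG_RE.sub(' ', html_text)      # strip HTML tags
        terms = []
        for token in word_tokenize(text):      # tokenize
            stem = STEMMER.stem(token).lower() # stem and normalize case
            if len(stem) < 4:                  # discard very small words
                continue
            if stem in STOPWORDS:              # remove stop words
                continue
            terms.append(stem)
        return terms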
Parse the documents into word lists
• Convert HTML documents into parsed, tokenized, stemmed term lists with stop-word removal
• Use the HTMLParser library to parse the HTML documents and extract titles and body contents:

    if 'parser' not in SD:
        from HTMLParser import HTMLParser
        ...
        class MyHTMLParser(HTMLParser):
            def __init__(self):
                HTMLParser.__init__(self)
                ...
            def handle_data(self, data):
                data = data.strip()
                if self.inhead:
                    if self.tag == 'title':
                        self.title = data
                if self.inbody:
                    ...
    parser = SD['parser']
    parser.reset()
    ...
© Copyright 2011 EMC Corporation. All rights reserved.   17
Parse the documents into word lists
• Use nltk to tokenize, stem, and remove common terms:

    if 'parser' not in SD:
        from nltk import WordTokenizer, PorterStemmer, corpus
        ...
        class MyHTMLParser(HTMLParser):
            def __init__(self):
                ...
                self.tokenizer = WordTokenizer()
                self.stemmer = PorterStemmer()
                self.stopwords = dict(map(lambda x: (x, True), corpus.stopwords.words()))
                ...
            def handle_data(self, data):
                ...
                if self.inbody:
                    tokens = self.tokenizer.tokenize(data)
                    stems = map(self.stemmer.stem, tokens)
                    for x in stems:
                        if len(x) < 4: continue
                        x = x.lower()
                        if x in self.stopwords: continue
                        self.doc.append(x)
                    ...
    parser = SD['parser']
    parser.reset()
    ...
© Copyright 2011 EMC Corporation. All rights reserved.   18
Parse the documents into word lists
• Use nltk to tokenize, stem, and remove common terms (the same code as the previous slide), then run the job and inspect the results:

    shell$ gpmapreduce -f blog-terms.yml
    mapreduce_75643_run_1
    DONE

    sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

      id  | title            | doc
    ------+------------------+-----------------------------------------------------------------
     2482 | noodlepie        | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
        1 | Bhootakannadi    | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
       10 | Tea Set          | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
    ...
© Copyright 2011 EMC Corporation. All rights reserved.   19
Create histograms of word frequencies
Extract a term dictionary of terms that show up in more than ten blogs:

    sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
         FROM (
             SELECT id, term, count(*) AS c
             FROM (
                 SELECT id, unnest(doc) AS term FROM blog_terms
             ) term_unnest
             GROUP BY id, term
         ) doc_terms
         WHERE term IS NOT NULL
         GROUP BY term
         HAVING count(*) > 10;

      term   | freq | num_blogs
    ---------+------+-----------
     sturdi  |   19 |        13
     canon   |   97 |        40
     group   |   48 |        17
     skin    |  510 |       152
     linger  |   19 |        17
     blunt   |   20 |        17
© Copyright 2011 EMC Corporation. All rights reserved.   20
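The same term-dictionary computation, sketched in plain Python for illustration; blog_terms here is assumed to be a mapping from blog id to term list, which is not a name taken from the deck:

    from collections import Counter

    def term_dictionary(blog_terms, min_blogs=10):
        """blog_terms: dict of blog id -> list of stemmed terms. Illustrative helper."""
        freq = Counter()        # total occurrences of each term across all blogs
        num_blogs = Counter()   # number of blogs each term appears in
        for doc in blog_terms.values():
            counts = Counter(doc)
            freq.update(counts)
            num_blogs.update(counts.keys())
        # keep only terms that appear in more than min_blogs blogs
        return {t: (freq[t], num_blogs[t]) for t in freq if num_blogs[t] > min_blogs}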
Create histograms of word frequencies
Use the term frequencies to construct the term dictionary…

    sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;

                                 dictionary
    ---------------------------------------------------------------------
     {sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

…then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:

    sql# SELECT id, gp_extract_feature_histogram(dictionary, doc)
         FROM blog_terms, blog_features;

      id  | term_count
    ------+----------------------------------------------------------------
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}
        1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}
       10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}
    ...
© Copyright 2011 EMC Corporation. All rights reserved.   21
Create histograms of word frequencies
Format of a sparse vector:

      id  | term_count
    ------+----------------------------------------------------------------------
    ...
       10 | {3,1,40,...}:{0,2,0,...}
    ...

Dense representation of the same vector:

      id  | term_count
    ------+----------------------------------------------------------------------
    ...
       10 | {0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
    ...

      dictionary
    ----------------------------------------------------------------------------
     {sturdi,canon,group,skin,linger,blunt,detect,giver,...}

Representing the document {skin, skin, ...}
© Copyright 2011 EMC Corporation. All rights reserved.   22
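The {run lengths}:{values} notation above is a run-length encoding: each value in the second list is repeated as many times as its matching count in the first list. A minimal decoder, with an illustrative helper name that is not part of the deck:

    def expand_sparse_vector(run_lengths, values):
        """Expand a run-length encoded sparse vector into its dense form."""
        dense = []
        for count, value in zip(run_lengths, values):
            dense.extend([value] * count)
        return dense

    # {3,1,40}:{0,2,0}  ->  three 0s, one 2, forty 0s
    print(expand_sparse_vector([3, 1, 40], [0, 2, 0]))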
Transform the blog terms into statistically useful measures
Use the feature vectors to construct tf-idf (term frequency-inverse document frequency) vectors; these are a measure of the importance of terms:

    sql# SELECT id, (term_count * logidf) tfxidf
         FROM blog_histogram, (
             SELECT log(count(*) / count_vec(term_count)) logidf
             FROM blog_histogram
         ) blog_logidf;

      id  | tfxidf
    ------+-------------------------------------------------------------------
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
        1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
       10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
    ...
© Copyright 2011 EMC Corporation. All rights reserved.   23
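Written out, the query above applies the standard tf-idf weighting, where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t (the symbols are the conventional ones, not names from the deck):

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}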
Create document clusters around iteratively defined centroids
Now that we have tf-idf vectors, we have a statistically meaningful metric that enables all sorts of real analytics. The current example is k-means clustering, which requires two operations. First, we compute a distance metric between the documents and a random selection of centroids, for instance the angular distance derived from cosine similarity:

    sql# SELECT id, tfxidf, cid,
                ACOS((tfxidf %*% centroid) /
                     (svec_l2norm(tfxidf) * svec_l2norm(centroid))) AS distance
         FROM blog_tfxidf, blog_centroids;

      id  | tfxidf                                                             | cid | distance
    ------+--------------------------------------------------------------------+-----+------------
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   1 | 1.53672977
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   2 | 1.55720354
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   3 | 1.55040145
© Copyright 2011 EMC Corporation. All rights reserved.   24
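The distance computed above is the angle between a document vector x and a centroid c:

    d(x, c) = \arccos\left(\frac{x \cdot c}{\lVert x \rVert_2 \, \lVert c \rVert_2}\right)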
Create document clusters around iteratively defined centroids
Next, use an averaging metric to re-center the mean of a cluster:

    sql# SELECT cid, sum(tfxidf)/count(*) AS centroid
         FROM (
             SELECT id, tfxidf, cid,
                    row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank
             FROM blog_distance
         ) blog_rank
         WHERE rank = 1
         GROUP BY cid;

     cid | centroid
    -----+------------------------------------------------------------------------
       3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
       2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
       1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

Repeat the previous two operations until the centroids converge, and you have k-means clustering.
© Copyright 2011 EMC Corporation. All rights reserved.   25
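The re-centering step is the usual k-means centroid update, where C_k denotes the set of documents currently assigned to cluster k:

    c_k = \frac{1}{\lvert C_k \rvert} \sum_{x \in C_k} x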
Summary
• Accessing the data (MapReduce)
      id  | path                                  | body
    ------+---------------------------------------+------------------------------------
     2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...
• Parse the documents into word lists (MapReduce)
      id  | title     | doc
    ------+-----------+-----------------------------------------------------------------
     2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
• Create histograms of word frequencies (SQL)
      id  | term_count
    ------+----------------------------------------------------------------
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}
• Transform the blog terms into statistically useful measures (SQL)
      id  | tfxidf
    ------+-------------------------------------------------------------------
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
• Create document clusters around iteratively defined centroids (MADlib or SQL window functions)
      id  | tfxidf                                                             | cid | distance
    ------+--------------------------------------------------------------------+-----+------------
     2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   1 | 1.53672977
© Copyright 2011 EMC Corporation. All rights reserved.   26
Conclusions
• A simple exercise in natural-language text processing
• The technique uses a combination of non-SQL processing (Map/Reduce) and SQL processing (Greenplum)
• It takes qualitative, subjective text data and converts the problem into a form amenable to statistical analysis using k-means clustering
• Rich possibilities for business impact with these techniques are emerging across industries
© Copyright 2011 EMC Corporation. All rights reserved.   27