SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
News and Blog Analysis
      with Lydia
 Charles Ward – Stony Brook University
 Karthik Balaji, Levon Lloyd – General Sentiment




October 2nd, 2009
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Large-Scale News/Blog Analysis
  The Lydia news/blog analysis system does a daily
  analysis of over 1000+ English and foreign-language
  online newspapers, plus blogs, and other text sources.
  We currently track tens of millions of named entities in
  the news and blogs, providing spatial, temporal,
  relational and sentiment analysis.
  Customer's track entities of interest using reports
  generated in our user interface.
Lydia Text Analysis Phases
  Lydia performs named entity recognition and analysis over large
  text corpora.
      Spidering: Lydia spiders and parses thousands of online news
      sources. We also handle the feed of social media provided by
      Spinn3r.
      Named Entity Recognition: Lydia identifies and classifies
      occurrences of named entities (people, places, companies,
      etc.)
      Sentiment Analysis: Lydia assigns sentiment scores to
      identified entities using shallow NLP techniques.
      Entity Statistics Aggregation: Lydia digests marked-up text
      and produces usable entity statistics.
      Data Exploration: Aggregated entity statistics are made
      available through user interfaces and programming APIs for
      detailed exploration of the data.
Lydia Architecture
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Frequency Time Series
  Michael Vick references (2004-2009)




  Mel Gibson references (2004-2009)
Sentiment Analysis
  Michael Phelps sentiment score (June 2008-Feb
  2009)




  David Paterson sentiment score (Jan 2008-Jul 2009)
Comparative Analysis
  Peyton Manning vs. Eli Manning
Heatmaps




 Arnold Schwarzenegger   Alabama
Ethnic Biases in News Coverage




 Frequency of coverage of entities Percentage of population self-
    with Hispanic names in the        reporting as Hispanic in the 2000
    U.S. news, 2004-2008              census. Courtesy of Wikipedia.
Ethnic Biases in News Coverage


  (a) African
  (b) Hispanic
  (c) East Asian
  (d) Indian
  (e) Eastern
  European
  (f) Muslim
Juxtaposition Analysis
   Top Juxtapositions for Barack Obama




   Juxtapositions between Barack Obama and John McCain
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Hadoop in Lydia
  The legacy Perl NLP pipeline runs in parallel on
  Hadoop Streaming, generating articles with marked-up
  entities which are stored as compressed XML in
  HDFS.

  To build or update Lydia entity statistics and indexes
  for a single text corpus, over 80 map-reduce jobs are
  necessary.

  We have developed a custom workflow management
  framework in Amazon EC2 to manage our data and
  processing.
Lydia Workflow Framework
 High-level concepts:
   A depository is a statistics dataset derived from a
   text corpus. It consists of artifacts.
     Stored as a directory structure in HDFS
   An artifact is a homogeneous dataset of a specific
   type.
     Examples:
        Key-value artifacts, e.g. entity name -> frequency time
        series
        Lucene index artifacts (entity and article indexes)
     Stored as a directory in HDFS containing several map-
     reduce job output subdirectories named as date ranges
     (we do updates on a daily granularity).
Artifact Dependencies
 Most Lydia artifacts are derived from other artifacts:
Artifact Storage

 Lydia artifacts are stored in HDFS inside the
   depository directory:
   /dailies (depository name)
     /EXACT_DUP_ARTICLES (artifact name)
        /2004_11_01-2009_03_31 (date range-named MR output)
           /part-00000
           ...
           /part-00017
        /2009_04_01-2009_04_02
           ...
        /2009_04_03. . .
Job Input Selection
   Artifact updates are incrementally propagated through
   the dependency graph:




   Multiple date ranges (sometimes overlapping) typically
   exist for each artifact.
   Some small artifacts get fully rebuilt on every update.
Depository Build Scheduling
   The same tool is used for the initial depository build
   and for updating it with new data.

   Any set of target artifacts to build can be specified,
   similarly to a makefile. Prerequisites of the targets are
   automatically identified.

   Artifacts are built in the correct order according to
   dependencies.

   The build process runs as a sequence of Hadoop
   map-reduce jobs and occasional serial jobs.
Amazon EC2
  We run Hadoop on Amazon EC2.
  – Quickly scale capacity as requirements change.
  10 extra large nodes for weekly data processing.
  Amazon S3 is our persistent data store.
  All our web services are hosted in dedicated amazon
  nodes.
  S3 is not meeting our required level-of-service
   – Moving to EBS
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Depository Server
  Random access to the Lydia depository, e.g.:
    Monthly frequency time series of Barack Obama in all
    U.S. sources
    Top juxtapositions for Continental Airlines in February
    2009
    Sentiment time series for Michael Phelps in all U.S.
    sources
  Uses the mapfiles generated by map-reduce jobs.
  Currently is not distributed (but we can put
  different depositories on different machines).
  Provides a caching subsystem to reduce the
  number of HDFS accesses.
Artifact Date Range Merging

   The depository server combines results from
   multiple groups of mapfiles on the fly.
   (MR output = date range = mapfile group)
     This may result in performance problems and
     memory shortage (direct memory buffers).
     Solution: limit the number of covering date ranges
     to be O(log N) after N daily updates.
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Conclusion

  Great improvement (up to 20x) in the
  Lydia system performance and
  scalability from using Hadoop.
  Lydia w/ Hadoop makes new types of
  automated analysis of web-scale content
  possible.

Weitere ähnliche Inhalte

Was ist angesagt?

Using the whole web as your dataset
Using the whole web as your datasetUsing the whole web as your dataset
Using the whole web as your datasetTuri, Inc.
 
RDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataRDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataGiorgos Santipantakis
 
MongoDB and Hadoop Handling for Big Data
MongoDB and Hadoop Handling for Big DataMongoDB and Hadoop Handling for Big Data
MongoDB and Hadoop Handling for Big DataMuhammad zubair
 
Talis Platform: A Linked Data Engine
Talis Platform: A Linked Data EngineTalis Platform: A Linked Data Engine
Talis Platform: A Linked Data EngineLeigh Dodds
 
ORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaCrossref
 
Big data presentation
Big data presentationBig data presentation
Big data presentationSreeSowmya7
 
Collecting and using funding data in your publications
Collecting and using funding data in your publicationsCollecting and using funding data in your publications
Collecting and using funding data in your publicationsCrossref
 
Automated creation of analytic catalog records for born digital journal articles
Automated creation of analytic catalog records for born digital journal articlesAutomated creation of analytic catalog records for born digital journal articles
Automated creation of analytic catalog records for born digital journal articlesNASIG
 
Collecting and Using Funding Data Crossref
Collecting and Using Funding Data CrossrefCollecting and Using Funding Data Crossref
Collecting and Using Funding Data CrossrefRelawan Jurnal Indonesia
 
Microsoft and Revolution Analytics -- what's the add-value? 20150629
Microsoft and Revolution Analytics -- what's the add-value? 20150629Microsoft and Revolution Analytics -- what's the add-value? 20150629
Microsoft and Revolution Analytics -- what's the add-value? 20150629Mark Tabladillo
 
Big Data - Linked In_DEEPU
Big Data - Linked In_DEEPUBig Data - Linked In_DEEPU
Big Data - Linked In_DEEPUDeepu M
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataTrieu Nguyen
 
Mining a Large Web Corpus
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web CorpusRobert Meusel
 

Was ist angesagt? (20)

Hadoop
HadoopHadoop
Hadoop
 
Using the whole web as your dataset
Using the whole web as your datasetUsing the whole web as your dataset
Using the whole web as your dataset
 
Big data computing
Big data computingBig data computing
Big data computing
 
RDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataRDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival data
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
MongoDB and Hadoop Handling for Big Data
MongoDB and Hadoop Handling for Big DataMongoDB and Hadoop Handling for Big Data
MongoDB and Hadoop Handling for Big Data
 
Talis Platform: A Linked Data Engine
Talis Platform: A Linked Data EngineTalis Platform: A Linked Data Engine
Talis Platform: A Linked Data Engine
 
ORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE Indonesia
 
Big data presentation
Big data presentationBig data presentation
Big data presentation
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Collecting and using funding data in your publications
Collecting and using funding data in your publicationsCollecting and using funding data in your publications
Collecting and using funding data in your publications
 
Automated creation of analytic catalog records for born digital journal articles
Automated creation of analytic catalog records for born digital journal articlesAutomated creation of analytic catalog records for born digital journal articles
Automated creation of analytic catalog records for born digital journal articles
 
Collecting and Using Funding Data Crossref
Collecting and Using Funding Data CrossrefCollecting and Using Funding Data Crossref
Collecting and Using Funding Data Crossref
 
Microsoft and Revolution Analytics -- what's the add-value? 20150629
Microsoft and Revolution Analytics -- what's the add-value? 20150629Microsoft and Revolution Analytics -- what's the add-value? 20150629
Microsoft and Revolution Analytics -- what's the add-value? 20150629
 
Big Data - Linked In_DEEPU
Big Data - Linked In_DEEPUBig Data - Linked In_DEEPU
Big Data - Linked In_DEEPU
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Hadoop
HadoopHadoop
Hadoop
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
 
Mining a Large Web Corpus
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web Corpus
 

Andere mochten auch

Natural language processing
Natural language processingNatural language processing
Natural language processingHansi Thenuwara
 
Natural language processing
Natural language processingNatural language processing
Natural language processingprashantdahake
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 
2.2 Demonstrate the understanding of Programming Life Cycle
2.2 Demonstrate the understanding of Programming Life Cycle2.2 Demonstrate the understanding of Programming Life Cycle
2.2 Demonstrate the understanding of Programming Life CycleFrankie Jones
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language ProcessingJaganadh Gopinadhan
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingrohitnayak
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Basics of Robotics
Basics of RoboticsBasics of Robotics
Basics of RoboticsAmeya Gandhi
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentationlpaviglianiti
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligenceu053675
 

Andere mochten auch (16)

Nlp
NlpNlp
Nlp
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
NLP
NLPNLP
NLP
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
AI Robotics
AI RoboticsAI Robotics
AI Robotics
 
2.2 Demonstrate the understanding of Programming Life Cycle
2.2 Demonstrate the understanding of Programming Life Cycle2.2 Demonstrate the understanding of Programming Life Cycle
2.2 Demonstrate the understanding of Programming Life Cycle
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Artificial inteligence
Artificial inteligenceArtificial inteligence
Artificial inteligence
 
Basics of Robotics
Basics of RoboticsBasics of Robotics
Basics of Robotics
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentation
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 

Ähnlich wie Hw09 Understanding Natural Language

Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Big Data .. Are you ready for the next wave?
Big Data .. Are you ready for the next wave?Big Data .. Are you ready for the next wave?
Big Data .. Are you ready for the next wave?Mahmoud Sabri
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Darko Marjanovic
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Milos Milovanovic
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Architecting Data Lakes on AWS
Architecting Data Lakes on AWSArchitecting Data Lakes on AWS
Architecting Data Lakes on AWSSajith Appukuttan
 
The Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) HadThe Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) HadDeborah Gastineau
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiativeMansi Mehra
 
Building big data solutions on azure
Building big data solutions on azureBuilding big data solutions on azure
Building big data solutions on azureEyal Ben Ivri
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Yahoo Developer Network
 

Ähnlich wie Hw09 Understanding Natural Language (20)

INTRODUCTION OF BIG DATA
INTRODUCTION OF BIG DATAINTRODUCTION OF BIG DATA
INTRODUCTION OF BIG DATA
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Ess1000 glossary
Ess1000 glossaryEss1000 glossary
Ess1000 glossary
 
Big Data .. Are you ready for the next wave?
Big Data .. Are you ready for the next wave?Big Data .. Are you ready for the next wave?
Big Data .. Are you ready for the next wave?
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Architecting Data Lakes on AWS
Architecting Data Lakes on AWSArchitecting Data Lakes on AWS
Architecting Data Lakes on AWS
 
The Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) HadThe Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) Had
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
Building big data solutions on azure
Building big data solutions on azureBuilding big data solutions on azure
Building big data solutions on azure
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed
 
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
Chachra, "Improving Discovery Systems Through Post Processing of Harvested Data"
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 

Mehr von Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mehr von Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Kürzlich hochgeladen

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Kürzlich hochgeladen (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Hw09 Understanding Natural Language

  • 1. News and Blog Analysis with Lydia Charles Ward – Stony Brook University Karthik Balaji, Levon Lloyd – General Sentiment October 2nd, 2009
  • 2. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 3. Large-Scale News/Blog Analysis The Lydia news/blog analysis system does a daily analysis of over 1000+ English and foreign-language online newspapers, plus blogs, and other text sources. We currently track tens of millions of named entities in the news and blogs, providing spatial, temporal, relational and sentiment analysis. Customer's track entities of interest using reports generated in our user interface.
  • 4. Lydia Text Analysis Phases Lydia performs named entity recognition and analysis over large text corpora. Spidering: Lydia spiders and parses thousands of online news sources. We also handle the feed of social media provided by Spinn3r. Named Entity Recognition: Lydia identifies and classifies occurrences of named entities (people, places, companies, etc.) Sentiment Analysis: Lydia assigns sentiment scores to identified entities using shallow NLP techniques. Entity Statistics Aggregation: Lydia digests marked-up text and produces usable entity statistics. Data Exploration: Aggregated entity statistics are made available through user interfaces and programming APIs for detailed exploration of the data.
  • 6. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 7. Frequency Time Series Michael Vick references (2004-2009) Mel Gibson references (2004-2009)
  • 8. Sentiment Analysis Michael Phelps sentiment score (June 2008-Feb 2009) David Paterson sentiment score (Jan 2008-Jul 2009)
  • 9. Comparative Analysis Peyton Manning vs. Eli Manning
  • 11. Ethnic Biases in News Coverage Frequency of coverage of entities Percentage of population self- with Hispanic names in the reporting as Hispanic in the 2000 U.S. news, 2004-2008 census. Courtesy of Wikipedia.
  • 12. Ethnic Biases in News Coverage (a) African (b) Hispanic (c) East Asian (d) Indian (e) Eastern European (f) Muslim
  • 13. Juxtaposition Analysis Top Juxtapositions for Barack Obama Juxtapositions between Barack Obama and John McCain
  • 14. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 15. Hadoop in Lydia The legacy Perl NLP pipeline runs in parallel on Hadoop Streaming, generating articles with marked-up entities which are stored as compressed XML in HDFS. To build or update Lydia entity statistics and indexes for a single text corpus, over 80 map-reduce jobs are necessary. We have developed a custom workflow management framework in Amazon EC2 to manage our data and processing.
  • 16. Lydia Workflow Framework High-level concepts: A depository is a statistics dataset derived from a text corpus. It consists of artifacts. Stored as a directory structure in HDFS An artifact is a homogeneous dataset of a specific type. Examples: Key-value artifacts, e.g. entity name -> frequency time series Lucene index artifacts (entity and article indexes) Stored as a directory in HDFS containing several map- reduce job output subdirectories named as date ranges (we do updates on a daily granularity).
  • 17. Artifact Dependencies Most Lydia artifacts are derived from other artifacts:
  • 18. Artifact Storage Lydia artifacts are stored in HDFS inside the depository directory: /dailies (depository name) /EXACT_DUP_ARTICLES (artifact name) /2004_11_01-2009_03_31 (date range-named MR output) /part-00000 ... /part-00017 /2009_04_01-2009_04_02 ... /2009_04_03. . .
  • 19. Job Input Selection Artifact updates are incrementally propagated through the dependency graph: Multiple date ranges (sometimes overlapping) typically exist for each artifact. Some small artifacts get fully rebuilt on every update.
  • 20. Depository Build Scheduling The same tool is used for the initial depository build and for updating it with new data. Any set of target artifacts to build can be specified, similarly to a makefile. Prerequisites of the targets are automatically identified. Artifacts are built in the correct order according to dependencies. The build process runs as a sequence of Hadoop map-reduce jobs and occasional serial jobs.
  • 21. Amazon EC2 We run Hadoop on Amazon EC2. – Quickly scale capacity as requirements change. 10 extra large nodes for weekly data processing. Amazon S3 is our persistent data store. All our web services are hosted in dedicated amazon nodes. S3 is not meeting our required level-of-service – Moving to EBS
  • 22. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 23. Depository Server Random access to the Lydia depository, e.g.: Monthly frequency time series of Barack Obama in all U.S. sources Top juxtapositions for Continental Airlines in February 2009 Sentiment time series for Michael Phelps in all U.S. sources Uses the mapfiles generated by map-reduce jobs. Currently is not distributed (but we can put different depositories on different machines). Provides a caching subsystem to reduce the number of HDFS accesses.
  • 24. Artifact Date Range Merging The depository server combines results from multiple groups of mapfiles on the fly. (MR output = date range = mapfile group) This may result in performance problems and memory shortage (direct memory buffers). Solution: limit the number of covering date ranges to be O(log N) after N daily updates.
  • 25. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 26. Conclusion Great improvement (up to 20x) in the Lydia system performance and scalability from using Hadoop. Lydia w/ Hadoop makes new types of automated analysis of web-scale content possible.