SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Search	
  	
  	
  	
  	
  	
  	
  Discover	
  	
  	
  	
  	
  	
  	
  Analyze	
  




Large	
  Scale	
  Search,	
  Discovery	
  and	
  
Analy5cs	
  with	
  Solr,	
  Mahout	
  and	
  
Hadoop	
  




Grant	
  Ingersoll	
  
Chief	
  Scien:st	
  
Lucid	
  Imagina:on	
  


                                                                                                              	
  	
  	
  	
  	
  	
  |	
     	
  1	
  	
  
Search	
  is	
  Dead,	
  Long	
  Live	
  Search	
  



l    Good	
  keyword	
  search	
  is	
  a	
                       Documents
      commodity	
  and	
  easy	
  to	
  get	
  
      up	
  and	
  running	
  


l    The	
  Bar	
  is	
  Raised	
                   Content                      User
                                                   Relationships               Interaction
       –  Relevance	
  is	
  (always	
  will	
  
          be?)	
  hard	
  


l    Holis:c	
  view	
  of	
  the	
  data	
  
      AND	
  the	
  users	
  is	
  cri:cal	
  
                                                                    Access




                                                                                      	
  	
  	
  	
  	
  	
  |	
     	
  2	
  	
  
Topics	
  



l    Quick	
  Background	
  and	
  needs	
  
l    Architecture	
  
       –  Abstract	
  
       –  Prac:cal	
  

l    SDA	
  In	
  Prac:ce	
  
       –  Components	
  
       –  Challenges	
  and	
  Lessons	
  Learned	
  

l    Wrap	
  Up	
  




                                                        	
  	
  	
  	
  	
  	
  |	
     	
  3	
  	
  
Why	
  Search,	
  Discovery	
  and	
  Analy8cs	
  (SDA)?	
  

                                     l    User	
  Needs	
  
                                            –  Real-­‐:me,	
  ad	
  hoc	
  access	
  to	
  content	
  
                                            –  Aggressive	
  Priori:za:on	
  based	
  on	
  Importance	
  
             Search
                                            –  Serendipity	
  
                                            –  Feedback/Learning	
  from	
  past	
  



                                     l    Business	
  Needs	
  
 Analytics             Discovery            –  Deeper	
  insight	
  into	
  users	
  
                                            –  Leverage	
  exis:ng	
  internal	
  knowledge	
  
                                            –  Cost	
  effec:ve	
  




                                                                                                  	
  	
  	
  	
  	
  	
  |	
     	
  4	
  	
  
What	
  Do	
  Developers	
  Need	
  for	
  SDA?	
  



l    Fast, efficient, scalable search
       –  Bulk and Near Real Time Indexing
       –  Handle billions of records w/ sub-second search and faceting

l    Large scale, cost effective storage and processing capabilities
       –  Need whole data consumption and analysis
       –  Experimentation/Sampling tools
       –  Distributed In Memory where appropriate

l    NLP and machine learning tools that scale to enhance discovery and
      analysis




                                                                         	
  	
  	
  	
  	
  	
  |	
     	
  5	
  	
  
Abstract	
  -­‐>	
  Prac8cal	
  SDA	
  Architecture	
  
                              Access (API, UI,Visualization)

                                     Search, Discovery and Analytics              Glue
                           Stats Mahout, R, GATE, Others
                              Pig, Machine   Docs     User                       Admin
                          Package Learning  Access Modeling

                                           Experiment Mgmt                       Service
                                                                                  Mgmt
         Content                      Computation and Storage
        Acquisition
                                                 DB
                                                                  Dist.          Data
                         Search                 NoSQL
                                                                 Process         Mgmt
                                                 KV

                         Shards                  Shards                Shards
                          Shards                  Shards                Shards
                            Shards                   Logs                  DFS




                      Provisioning, Monitoring, Infrastructure


                                                                                           	
  	
  	
  	
  	
  	
  |	
     	
  6	
  	
  
Computa8on	
  and	
  Storage	
  


         Solr                              Hadoop                                 HBase

•  Document Index                 •  Stores Logs,                      •  Metric Storage
•  Document                          Raw files,                        •  User Histories
   Storage?                          intermediate                      •  Document
                                     files, etc.                          Storage?
•  SolrCloud makes                •  WebHDFS
   sharding easy
                                  •  Small file are an
                                     unnatural act

Challenges	
  
      •  Who	
  is	
  the	
  authorita:ve	
  store?	
  Solr	
  or	
  HBase?	
  
      •  Real	
  :me	
  vs.	
  Batch	
  
      •  Where	
  should	
  analysis	
  be	
  done?	
  
                                                                                          	
  	
  	
  	
  	
  	
  |	
     	
  7	
  	
  
Search	
  In	
  Prac8ce	
  



l    Three	
  primary	
  concerns	
  
       –  Performance/Scaling	
  


       –  Relevance	
  


       –  Opera:ons:	
  monitoring,	
  failover,	
  etc.	
  


l    Business	
  typically	
  cares	
  more	
  about	
  relevance	
  
l    Devs	
  more	
  about	
  performance	
  (and	
  then	
  ops)	
  




                                                                         	
  	
  	
  	
  	
  	
  |	
     	
  8	
  	
  
Search	
  with	
  Solr:	
  Scaling	
  and	
  NRT	
  



l    SolrCloud	
  takes	
  care	
  of	
  distributed	
  indexing	
  and	
  search	
  needs	
  
       –  Transac:on	
  logs	
  for	
  recovery	
  
       –  Automa:c	
  leader	
  elec:on,	
  so	
  no	
  more	
  master/worker	
  
       –  Have	
  to	
  declare	
  number	
  of	
  shards	
  now,	
  but	
  spliang	
  coming	
  soon	
  
       –  Use	
  CloudSolrServer	
  in	
  SolrJ	
  

l    NRT	
  Config	
  :ps:	
  
       –  1	
  second	
  sod	
  commits	
  for	
  NRT	
  updates	
  
       –  1	
  minute	
  hard	
  commits	
  (no	
  searcher	
  reopen)	
  




                                                                                                            	
  	
  	
  	
  	
  	
  |	
     	
  9	
  	
  
Search:	
  Relevance	
  



l    ABT	
  –	
  Always	
  Be	
  Tes:ng	
  
       –  Experiment	
  management	
  is	
  cri:cal	
  
       –  Top	
  X	
  +	
  Random	
  Sampling	
  of	
  Long	
  Tail	
  
       –  Click	
  logs	
  
l    Track	
  Everything!	
  
       –  Queries	
  
       –  Clicks	
  
       –  Displayed	
  Documents	
  	
  
       –  Mouse/Scroll	
  tracking???	
  

l    Phrases	
  are	
  your	
  friend	
  




                                                                          	
  	
  	
  	
  	
  	
  |	
     	
  10	
  	
  
Discovery	
  Components	
  

       Serendipity                      Organization                        Data Quality

•    Trends                        •  Importance                     •  Document factor
•    Topics                        •  Clustering                        Distributions
•    Recommendations               •  Classification                    •  Length
•    Related Items                    •  Named Entities                 •  Boosts
•    More Like This                •  Time Factors                   •  Duplicates
•    Did you mean?                 •  Faceting
•    Stat. Interesting
     Phrases

Challenges	
  
        •  Many	
  of	
  these	
  are	
  intense	
  calcula:ons	
  or	
  itera:ve	
  
        •  Many	
  are	
  subjec:ve	
  and	
  require	
  a	
  lot	
  of	
  experimenta:on	
  


                                                                                           	
  	
  	
  	
  	
  	
  |	
     	
  11	
  	
  
Discovery	
  with	
  Mahout	
  



l    Mahout’s	
  3	
  “C”s	
  provide	
  tools	
  for	
  helping	
  across	
  many	
  aspects	
  of	
  discovery	
  
        –  Collabora:ve	
  Filtering	
  
        –  Classifica:on	
  
        –  Clustering	
  
l    Also:	
  	
  
        –  Colloca:ons	
  (Sta:s:cally	
  Interes:ng	
  Phrases)	
  
        –  SVD	
  
        –  Others	
  
l    Challenges:	
  
        –  High	
  cost	
  to	
  itera:ve	
  machine	
  learning	
  algorithms	
  
        –  Mahout	
  is	
  very	
  command	
  line	
  oriented	
  
        –  Some	
  areas	
  less	
  mature	
  

                                                                                                                 	
  	
  	
  	
  	
  	
  |	
     	
  12	
  	
  
Aside:	
  Experiment	
  Management	
  



l    Plan	
  for	
  running	
  experiments	
  from	
  the	
  beginning	
  across	
  Search	
  and	
  
      Discovery	
  components	
  
       –  Your	
  analy:cs	
  engine	
  should	
  help!	
  
l    Types	
  of	
  Experiments	
  to	
  consider	
  
       –  Indexing/Analysis	
  
       –  Query	
  parsing	
  
       –  Scoring	
  formulas	
  
       –  Machine	
  Learning	
  Models	
  
       –  Recommenda:ons,	
  many	
  more	
  
l    Make	
  it	
  easy	
  to	
  do	
  A/B	
  tes:ng	
  across	
  all	
  experiments	
  and	
  compare	
  and	
  
      contrast	
  the	
  results	
  



                                                                                                                     	
  	
  	
  	
  	
  	
  |	
     	
  13	
  	
  
Analy8cs	
  Components	
  



l    Commonly	
  used	
  components	
  
       –  Solr	
  
       –  R	
  Stats	
  
       –  Hive	
  
       –  Pig	
  
       –  Commercial	
  


l    Star:ng	
  with	
  Search	
  and	
  Discovery	
  metrics	
  and	
  analysis	
  gives	
  context	
  into	
  
      where	
  to	
  make	
  investments	
  for	
  broader	
  analy:cs	
  




                                                                                                               	
  	
  	
  	
  	
  	
  |	
     	
  14	
  	
  
Analy8cs	
  in	
  Prac8ce	
  



l    Simple	
  Counts:	
  
       –  Facets	
  
       –  Term	
  and	
  Document	
  frequencies	
  
       –  Clicks	
  
l    Search	
  and	
  Discovery	
  example	
  metrics	
  
       –  Relevance	
  measures	
  like	
  Mean	
  Reciprocal	
  Rank	
  
       –  Histograms/Drilldowns	
  around	
  Number	
  of	
  Results	
  
       –  Log	
  and	
  naviga:on	
  analysis	
  


l    Data	
  cleanliness	
  analysis	
  is	
  helpful	
  for	
  finding	
  poten:al	
  issues	
  in	
  content	
  




                                                                                                                     	
  	
  	
  	
  	
  	
  |	
     	
  15	
  	
  
Wrap	
  



l    Search,	
  Discovery	
  and	
  Analy:cs,	
  when	
  combined	
  into	
  a	
  single,	
  coherent	
  
      system	
  provides	
  powerful	
  insight	
  into	
  both	
  your	
  content	
  and	
  your	
  users	
  


l    Lucid	
  has	
  combined	
  many	
  of	
  these	
  things	
  into	
  LucidWorks	
  Big	
  Data	
  
       –  hrp://www.lucidimagina:on.com/products/lucidworks-­‐search-­‐plasorm/
          lucidworks-­‐big-­‐data	
  



l    Design	
  for	
  the	
  big	
  picture	
  when	
  building	
  search-­‐based	
  applica:ons	
  




                                                                                                                 	
  	
  	
  	
  	
  	
  |	
     	
  16	
  	
  
Find	
  It	
  



l    hrp://www.lucidimagina:on.com/products/lucidworks-­‐search-­‐plasorm/
      lucidworks-­‐big-­‐data	
  
l    hrp://www.lucidimagina:on.com	
  


l    grant@lucidimagina:on.com	
  
l    @gsingers	
  




                                                                          	
  	
  	
  	
  	
  	
  |	
     	
  17	
  	
  

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Eric Baldeschwieler
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetupiwrigley
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14iwrigley
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storagehybrid cloud
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Productionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesProductionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesMapR Technologies
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoopAdam Muise
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarPlatfora
 
Getting Started with Big Data in the Cloud
Getting Started with Big Data in the CloudGetting Started with Big Data in the Cloud
Getting Started with Big Data in the CloudRightScale
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
 

Was ist angesagt? (20)

Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
 
hadoop @ Ibmbigdata
hadoop @ Ibmbigdatahadoop @ Ibmbigdata
hadoop @ Ibmbigdata
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storage
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Treasure Data and Heroku
Treasure Data and HerokuTreasure Data and Heroku
Treasure Data and Heroku
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Productionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesProductionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best Practices
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
Hadoop Data Reservoir Webinar
Hadoop Data Reservoir WebinarHadoop Data Reservoir Webinar
Hadoop Data Reservoir Webinar
 
Getting Started with Big Data in the Cloud
Getting Started with Big Data in the CloudGetting Started with Big Data in the Cloud
Getting Started with Big Data in the Cloud
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 

Ähnlich wie Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinarTed Dunning
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR Technologies
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and MahoutGrant Ingersoll
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopDataWorks Summit
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceTed Dunning
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemGrant Ingersoll
 
Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Umesh Ramalingachar
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data LakesKiran Kamreddy
 
Streaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionStreaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionDATAVERSITY
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendCaserta
 
2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solr2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solrLucidworks (Archived)
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
Intergen Twilight - Corralling the Document Chaos
Intergen Twilight - Corralling the Document ChaosIntergen Twilight - Corralling the Document Chaos
Intergen Twilight - Corralling the Document ChaosIntergen
 
Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceRobert H. McDonald
 
Piloting agile project management
Piloting agile project managementPiloting agile project management
Piloting agile project managementNatalie Collins
 
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Calpont Corporation
 

Ähnlich wie Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr (20)

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinar
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over Hadoop
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
 
FAST Search for SharePoint
FAST Search for SharePointFAST Search for SharePoint
FAST Search for SharePoint
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data Lakes
 
Streaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionStreaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise Adoption
 
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & TalendIntroducing the Big Data Ecosystem with Caserta Concepts & Talend
Introducing the Big Data Ecosystem with Caserta Concepts & Talend
 
2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solr2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solr
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Intergen Twilight - Corralling the Document Chaos
Intergen Twilight - Corralling the Document ChaosIntergen Twilight - Corralling the Document Chaos
Intergen Twilight - Corralling the Document Chaos
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability Science
 
Piloting agile project management
Piloting agile project managementPiloting agile project management
Piloting agile project management
 
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012Analytic Platforms in the Real World with 451Research and Calpont_July 2012
Analytic Platforms in the Real World with 451Research and Calpont_July 2012
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Kürzlich hochgeladen (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr

  • 1. Search              Discover              Analyze   Large  Scale  Search,  Discovery  and   Analy5cs  with  Solr,  Mahout  and   Hadoop   Grant  Ingersoll   Chief  Scien:st   Lucid  Imagina:on              |    1    
  • 2. Search  is  Dead,  Long  Live  Search   l  Good  keyword  search  is  a   Documents commodity  and  easy  to  get   up  and  running   l  The  Bar  is  Raised   Content User Relationships Interaction –  Relevance  is  (always  will   be?)  hard   l  Holis:c  view  of  the  data   AND  the  users  is  cri:cal   Access            |    2    
  • 3. Topics   l  Quick  Background  and  needs   l  Architecture   –  Abstract   –  Prac:cal   l  SDA  In  Prac:ce   –  Components   –  Challenges  and  Lessons  Learned   l  Wrap  Up              |    3    
  • 4. Why  Search,  Discovery  and  Analy8cs  (SDA)?   l  User  Needs   –  Real-­‐:me,  ad  hoc  access  to  content   –  Aggressive  Priori:za:on  based  on  Importance   Search –  Serendipity   –  Feedback/Learning  from  past   l  Business  Needs   Analytics Discovery –  Deeper  insight  into  users   –  Leverage  exis:ng  internal  knowledge   –  Cost  effec:ve              |    4    
  • 5. What  Do  Developers  Need  for  SDA?   l  Fast, efficient, scalable search –  Bulk and Near Real Time Indexing –  Handle billions of records w/ sub-second search and faceting l  Large scale, cost effective storage and processing capabilities –  Need whole data consumption and analysis –  Experimentation/Sampling tools –  Distributed In Memory where appropriate l  NLP and machine learning tools that scale to enhance discovery and analysis            |    5    
  • 6. Abstract  -­‐>  Prac8cal  SDA  Architecture   Access (API, UI,Visualization) Search, Discovery and Analytics Glue Stats Mahout, R, GATE, Others Pig, Machine Docs User Admin Package Learning Access Modeling Experiment Mgmt Service Mgmt Content Computation and Storage Acquisition DB Dist. Data Search NoSQL Process Mgmt KV Shards Shards Shards Shards Shards Shards Shards Logs DFS Provisioning, Monitoring, Infrastructure            |    6    
  • 7. Computa8on  and  Storage   Solr Hadoop HBase •  Document Index •  Stores Logs, •  Metric Storage •  Document Raw files, •  User Histories Storage? intermediate •  Document files, etc. Storage? •  SolrCloud makes •  WebHDFS sharding easy •  Small file are an unnatural act Challenges   •  Who  is  the  authorita:ve  store?  Solr  or  HBase?   •  Real  :me  vs.  Batch   •  Where  should  analysis  be  done?              |    7    
  • 8. Search  In  Prac8ce   l  Three  primary  concerns   –  Performance/Scaling   –  Relevance   –  Opera:ons:  monitoring,  failover,  etc.   l  Business  typically  cares  more  about  relevance   l  Devs  more  about  performance  (and  then  ops)              |    8    
  • 9. Search  with  Solr:  Scaling  and  NRT   l  SolrCloud  takes  care  of  distributed  indexing  and  search  needs   –  Transac:on  logs  for  recovery   –  Automa:c  leader  elec:on,  so  no  more  master/worker   –  Have  to  declare  number  of  shards  now,  but  spliang  coming  soon   –  Use  CloudSolrServer  in  SolrJ   l  NRT  Config  :ps:   –  1  second  sod  commits  for  NRT  updates   –  1  minute  hard  commits  (no  searcher  reopen)              |    9    
  • 10. Search:  Relevance   l  ABT  –  Always  Be  Tes:ng   –  Experiment  management  is  cri:cal   –  Top  X  +  Random  Sampling  of  Long  Tail   –  Click  logs   l  Track  Everything!   –  Queries   –  Clicks   –  Displayed  Documents     –  Mouse/Scroll  tracking???   l  Phrases  are  your  friend              |    10    
  • 11. Discovery  Components   Serendipity Organization Data Quality •  Trends •  Importance •  Document factor •  Topics •  Clustering Distributions •  Recommendations •  Classification •  Length •  Related Items •  Named Entities •  Boosts •  More Like This •  Time Factors •  Duplicates •  Did you mean? •  Faceting •  Stat. Interesting Phrases Challenges   •  Many  of  these  are  intense  calcula:ons  or  itera:ve   •  Many  are  subjec:ve  and  require  a  lot  of  experimenta:on              |    11    
  • 12. Discovery  with  Mahout   l  Mahout’s  3  “C”s  provide  tools  for  helping  across  many  aspects  of  discovery   –  Collabora:ve  Filtering   –  Classifica:on   –  Clustering   l  Also:     –  Colloca:ons  (Sta:s:cally  Interes:ng  Phrases)   –  SVD   –  Others   l  Challenges:   –  High  cost  to  itera:ve  machine  learning  algorithms   –  Mahout  is  very  command  line  oriented   –  Some  areas  less  mature              |    12    
  • 13. Aside:  Experiment  Management   l  Plan  for  running  experiments  from  the  beginning  across  Search  and   Discovery  components   –  Your  analy:cs  engine  should  help!   l  Types  of  Experiments  to  consider   –  Indexing/Analysis   –  Query  parsing   –  Scoring  formulas   –  Machine  Learning  Models   –  Recommenda:ons,  many  more   l  Make  it  easy  to  do  A/B  tes:ng  across  all  experiments  and  compare  and   contrast  the  results              |    13    
  • 14. Analy8cs  Components   l  Commonly  used  components   –  Solr   –  R  Stats   –  Hive   –  Pig   –  Commercial   l  Star:ng  with  Search  and  Discovery  metrics  and  analysis  gives  context  into   where  to  make  investments  for  broader  analy:cs              |    14    
  • 15. Analy8cs  in  Prac8ce   l  Simple  Counts:   –  Facets   –  Term  and  Document  frequencies   –  Clicks   l  Search  and  Discovery  example  metrics   –  Relevance  measures  like  Mean  Reciprocal  Rank   –  Histograms/Drilldowns  around  Number  of  Results   –  Log  and  naviga:on  analysis   l  Data  cleanliness  analysis  is  helpful  for  finding  poten:al  issues  in  content              |    15    
  • 16. Wrap   l  Search,  Discovery  and  Analy:cs,  when  combined  into  a  single,  coherent   system  provides  powerful  insight  into  both  your  content  and  your  users   l  Lucid  has  combined  many  of  these  things  into  LucidWorks  Big  Data   –  hrp://www.lucidimagina:on.com/products/lucidworks-­‐search-­‐plasorm/ lucidworks-­‐big-­‐data   l  Design  for  the  big  picture  when  building  search-­‐based  applica:ons              |    16    
  • 17. Find  It   l  hrp://www.lucidimagina:on.com/products/lucidworks-­‐search-­‐plasorm/ lucidworks-­‐big-­‐data   l  hrp://www.lucidimagina:on.com   l  grant@lucidimagina:on.com   l  @gsingers              |    17