Large Scale Search, Discovery and Analysis in Action

Grant Ingersoll
Chief Scientist

September 18, 2012




Confidential © Copyright 2012
Search is Dead, Long Live Search

• Good keyword search is a commodity and easy to get up and running

• The Bar is Raised
   • Relevance is (always will be?) hard

• Holistic view of the data AND the users is critical

• Search, Discovery and Analytics are the key to unlocking this view of users and data

[Diagram labels: Documents; Content Relationships; User Interaction; Access]


 Confidential and Proprietary
 © 2012 LucidWorks
Topics

• Background and needs

• Architecture
• Road Ahead

• SDA In Action
   • Components
   • Challenges and Lessons Learned

• Wrap Up
Sample Use Cases

    • Claims processing and analysis, including fraud analysis

    • Large scale content acquisition and access for:
       • Defense, intelligence and pharmaceutical applications
       • Views of data surrounding natural disasters and other tragedies
          for research, archiving and therapeutic purposes

    • Analysis of Website and social media interactions

    • Access and processing of genetic information for improved
       medical treatments

    • Log processing and fraud detection in telecommunications
In Focus: Personalized Medicine


[Diagram labels: Patient DNA; Alignment and other analysis; Genetic Variations; Standard Therapies; Alternative Therapies; Search and Faceting]

In Focus: Log Processing in Telecommunications


    • Each year, large sums of money are lost due to
       fraudulent calls and poor service

    • Logs are usually semi-structured and contain vital
       information about errors and fraud

    • Deeper batch analytics can provide insight into
       patterns across vast amounts of data

    • Search of call and network information (via logs) is
       critical to providing deeper analysis and understanding
       of these errors and fraudulent activities
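As a rough illustration of working with semi-structured logs, here is a minimal sketch that parses "key=value" records and counts call outcomes. The record layout, field names (ts, caller, callee, status) and sample lines are invented for this example; they are not taken from the talk or from any real telecom system.

```python
from collections import Counter

def parse_log_line(line):
    """Parse one 'key=value key=value ...' record into a dict.

    Tokens without '=' are collected under the 'message' key so that
    free-text portions of a semi-structured line are not lost.
    """
    fields = {}
    free_text = []
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
        else:
            free_text.append(token)
    if free_text:
        fields["message"] = " ".join(free_text)
    return fields

def count_by_field(lines, field):
    """Count how often each value of `field` occurs across the lines."""
    counts = Counter()
    for line in lines:
        record = parse_log_line(line)
        if field in record:
            counts[record[field]] += 1
    return counts

# Hypothetical sample records, invented for illustration only.
sample = [
    "ts=2012-09-18T10:00:01 caller=A callee=B status=OK",
    "ts=2012-09-18T10:00:05 caller=A callee=C status=DROPPED network timeout",
    "ts=2012-09-18T10:00:09 caller=D callee=B status=DROPPED",
]
status_counts = count_by_field(sample, "status")
```

Parsed records like these could then be indexed for search or fed to a batch job; the counting step stands in for the simplest form of that analysis.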

Computation and Storage

LucidWorks Search/Solr
• Document Index
• Faceting
• SolrCloud makes sharding easy

Hadoop
• Stores Logs, Raw files, intermediate files, etc.
• WebHDFS
• Small files are an unnatural act

HBase
• Metric Storage
• User Histories/Profile
• Document Storage

Challenges
    • Who is the authoritative store?
    • Real time vs. Batch
    • Where should analysis be done?
Search In Practice

• Three primary concerns
   • Performance/Scaling

   • Relevance

   • Operations: monitoring, failover, etc.

• Business typically cares more about relevance

• Devs care more about performance at first…

Search: Relevance

• Always Be Testing
  • Experiment management is critical
  • Top X + sampling
  • Click Logs
• Track Everything!
   • Queries
   • Clicks
   • Displayed Documents
   • Mouse/Scroll tracking?
• Phrases are your friends
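
One reading of "Top X + sampling" is: judge relevance on the X most frequent queries, plus a random sample of the remaining tail so that rare queries are also represented. The sketch below follows that reading. The function name, parameters and sample queries are assumptions made for illustration.

```python
import random
from collections import Counter

def build_test_queries(query_log, top_x, sample_size, seed=0):
    """Return (head, sampled_tail) query lists for relevance testing.

    head:         the top_x most frequent distinct queries
    sampled_tail: a random sample of the remaining distinct queries
    """
    counts = Counter(query_log)
    ranked = [query for query, _ in counts.most_common()]
    head = ranked[:top_x]
    tail = ranked[top_x:]
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    sampled_tail = rng.sample(tail, min(sample_size, len(tail)))
    return head, sampled_tail

# Hypothetical query log, invented for illustration only.
log = ["solr"] * 5 + ["hadoop"] * 3 + ["mahout", "hbase", "lucene"]
head, sampled_tail = build_test_queries(log, top_x=2, sample_size=2)
```

Fixing the seed means the same tail queries are re-judged from one experiment to the next, which keeps results comparable over time.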
Discovery Components


Serendipity
• Trends
• Topics
• Recommendations
• Related Items
• More Like This
• Did you mean?
• Statistically Interesting Phrases

Organization
• Importance
• Clustering
• Classification
   • Named Entities
• Time Factors
• Faceting

Data Quality
• Document factor Distributions
   • Length
   • Boosts
• Duplicates

Challenges
• Many of these are computationally intensive or iterative
• Many are subjective and require a lot of experimentation

Discovery with Mahout

• Mahout’s 3 “C”s provide tools for helping across many
   aspects of discovery
   • Collaborative Filtering
   • Classification
   • Clustering
• Also:
   • Collocations (Statistically Interesting Phrases)
   • Singular Value Decomposition (SVD)
   • Others
• Challenges:
   • High cost of iterative machine learning algorithms
   • Mahout is very command line oriented
   • Some areas less mature
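
To make "statistically interesting phrases" concrete, the sketch below scores adjacent word pairs with the log-likelihood ratio (G-squared) over a 2x2 contingency table, a statistic commonly used for collocation finding. It is plain Python written for illustration and does not use or reproduce Mahout's code; the function names and the sample sentence are assumptions.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: A and B together    k12: A without B
    k21: B without A         k22: neither A nor B
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    total = 0.0
    for k, row, col in cells:
        if k > 0:  # empty cells contribute nothing
            total += k * math.log(k * n / (row * col))
    return 2.0 * total

def score_bigrams(tokens):
    """Return {(first, second): LLR score} for every adjacent word pair."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    first_counts, second_counts = Counter(), Counter()
    for (a, b), count in bigrams.items():
        first_counts[a] += count
        second_counts[b] += count
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first_counts[a] - k11   # a followed by something else
        k21 = second_counts[b] - k11  # b preceded by something else
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores

# Hypothetical text, invented for illustration only.
scores = score_bigrams("new york is big and new york is old".split())
```

A higher score means the pair co-occurs more (or less) often than independence would predict, so in practice the top-scoring pairs are reviewed or thresholded before being treated as phrases.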
Aside: Experiment Management

• Plan for running experiments from the beginning
   across Search and Discovery components
   • Your engine should help!
• Types of Experiments to consider
   • Indexing/Analysis
   • Query parsing
   • Scoring formulas
   • Machine Learning Models
   • Recommendations, many more
• Make it easy to do A/B testing across all experiments
   and compare and contrast the results
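
One simple way to support A/B testing across experiments is deterministic bucketing: hash the user and experiment name so that each user always sees the same variant without any stored state. The sketch below assumes string user ids and experiment names; the function name and variant labels are hypothetical and not part of any LucidWorks API.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment.

    Including the experiment name in the hash input means the same user
    can land in different buckets for different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# Hypothetical usage: route a query to one of two scoring formulas.
variant = assign_variant("user-42", "scoring-formula-test")
```

Logging the assigned variant alongside queries and clicks is what makes it possible to compare results between variants afterwards.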



Analytics in Practice

• Many of the components discussed provide analytical
  features
  • Leverage existing tools: R, etc.
• Simple Counts:
  • Facets
  • Term and Document frequencies
  • Clicks
• Search and Discovery example metrics
  • Relevance measures like Mean Reciprocal Rank
  • Histograms/Drilldowns around Number of Results
  • Log and navigation analysis

• Data cleanliness analysis is helpful for finding potential
   issues in content
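
Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the first relevant document appears (0 when none appears). A minimal sketch follows, assuming ranked result lists and sets of relevant document ids per query; the document ids are invented for illustration.

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """Compute MRR over parallel lists of ranked results and relevant ids."""
    if not results_per_query:
        return 0.0
    total = 0.0
    for results, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(results_per_query)

# Hypothetical judgments: first relevant hit at rank 1, rank 3, and never.
results = [["d1", "d2"], ["d3", "d4", "d5"], ["d6"]]
relevant = [{"d1"}, {"d5"}, {"d9"}]
mrr = mean_reciprocal_rank(results, relevant)  # (1 + 1/3 + 0) / 3
```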

Wrap

• Search, Discovery and Analytics, when combined into
   a single, coherent system, provide powerful insight
   into both your content and your users

• LucidWorks has combined many of these things into
   LucidWorks Big Data
   • http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based
   applications



Discussion and Resources

     • Questions?


     • http://www.lucidworks.com

     • grant@lucidworks.com
     • @gsingers





Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Kürzlich hochgeladen (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Large Scale Search, Discovery and Analytics in Action

  • 1. Large Scale Search, Discovery and Analysis in Action
      Grant Ingersoll, Chief Scientist
      September 18, 2012
      Confidential © Copyright 2012
  • 2. Search is Dead, Long Live Search
      • Good keyword search is a commodity and easy to get up and running
      • The Bar is Raised
          • Relevance is (always will be?) hard
      • Holistic view of the data AND the users is critical
      • Search, Discovery and Analytics are the key to unlocking this view of users and data
      [Diagram labels: Documents, Content Relationships, User Interaction, Access]
      Confidential and Proprietary © 2012 LucidWorks
  • 3. Topics
      • Background and needs
      • Architecture
      • Road Ahead
      • SDA In Action
          • Components
          • Challenges and Lessons Learned
      • Wrap Up
  • 5. Sample Use Cases
      • Claims processing and analysis, including fraud analysis
      • Large scale content acquisition and access for:
          • Defense, intelligence and pharmaceutical applications
          • Views of data surrounding natural disasters and other tragedies for research, archiving and therapeutic purposes
      • Analysis of Website and social media interactions
      • Access and processing of genetic information for improved medical treatments
      • Log processing and fraud detection in telecommunications
  • 6. In Focus: Personalized Medicine
      [Diagram labels: Patient DNA; Alignment and other analysis; Genetic Variations; Standard Therapies; Alternative Therapies; Search and Faceting]
  • 7. In Focus: Log Processing in Telecommunications
      • Each year, large sums of money are lost due to fraudulent calls and poor service
      • Logs are usually semi-structured and contain vital information about errors and fraud
      • Deeper batch analytics can provide insight into patterns across vast amounts of data
      • Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
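Slide 7 describes logs as semi-structured. As a rough illustration of what that means for processing, the sketch below parses a `key=value` style line into a record and counts records by status. The log format, the field names (`caller`, `callee`, `status`, `duration`) and the function names are all invented for this example; the talk does not specify any of them.

```python
import re
from collections import Counter

# Hypothetical log format, invented for illustration only:
#   2012-09-18T10:15:02 caller=5551234 callee=5559876 status=DROPPED duration=12
LINE_RE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<fields>.*)$")
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Turn one semi-structured log line into a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {"timestamp": match.group("timestamp")}
    # findall returns (key, value) pairs, which dict.update accepts directly.
    record.update(FIELD_RE.findall(match.group("fields")))
    return record


def count_by_status(lines):
    """Count records per status value, skipping lines that fail to parse."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and "status" in record:
            counts[record["status"]] += 1
    return counts
```

Records in this shape could be indexed for search or fed to a batch job; which path fits depends on the analysis, as slide 13 discusses.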
  • 13. Computation and Storage
      LucidWorks Search/Solr
          • Document Index
          • Faceting
          • SolrCloud makes sharding easy
      Hadoop
          • Stores Logs, Raw files, intermediate files, etc.
          • WebHDFS
          • Small files are an unnatural act
      HBase
          • Metric Storage
          • User Histories/Profile
          • Document Storage
      Challenges
          • Who is the authoritative store?
          • Real time vs. Batch
          • Where should analysis be done?
  • 14. Search In Practice
      • Three primary concerns
          • Performance/Scaling
          • Relevance
          • Operations: monitoring, failover, etc.
      • Business typically cares more about relevance
      • Devs care more about performance at first…
  • 15. Search: Relevance
      • Always Be Testing
          • Experiment management is critical
          • Top X + sampling
          • Click Logs
      • Track Everything!
          • Queries
          • Clicks
          • Displayed Documents
          • Mouse/Scroll tracking?
      • Phrases are your friends
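To make the "Track Everything!" advice on slide 15 concrete, here is a minimal sketch of a search event record that keeps the query, the displayed documents and the clicks together, plus one simple aggregate over it. The `SearchEvent` class, its fields and the lowercase/strip query normalization are assumptions made for this example, not something the talk prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SearchEvent:
    """One logged search: the query, what was shown, and what was clicked."""
    query: str
    displayed: list                                # document ids in ranked order
    clicked: list = field(default_factory=list)    # document ids the user clicked


def click_through_rate(events):
    """Fraction of searches per query that received at least one click."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for event in events:
        key = event.query.strip().lower()  # naive normalization (an assumption)
        shown[key] += 1
        if event.clicked:
            clicked[key] += 1
    return {query: clicked[query] / shown[query] for query in shown}
```

Queries with many impressions and a low click-through rate are natural candidates for the "Top X + sampling" relevance review the slide mentions.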
  • 16. Discovery Components
      Serendipity
          • Trends
          • Topics
          • Recommendations
          • Related Items
          • More Like This
          • Did you mean?
          • Stat. Interesting Phrases
      Organization
          • Importance
          • Clustering
          • Classification
          • Named Entities
          • Time Factors
          • Faceting
      Data Quality
          • Document factor Distributions
          • Length
          • Boosts
          • Duplicates
      Challenges
          • Many of these are intense calculations or iterative
          • Many are subjective and require a lot of experimentation
  • 17. Discovery with Mahout
      • Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
          • Collaborative Filtering
          • Classification
          • Clustering
      • Also:
          • Collocations (Statistically Interesting Phrases)
          • Singular Value Decomposition (SVD)
          • Others
      • Challenges:
          • High cost to iterative machine learning algorithms
          • Mahout is very command line oriented
          • Some areas less mature
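Slide 17 lists collocations (statistically interesting phrases) among the discovery tools. One common way to score them is Dunning's log-likelihood ratio over a 2x2 contingency table of bigram counts. The sketch below is a small single-machine illustration of that statistic under that assumption; it is not Mahout's code or API, and the function names are made up for this example.

```python
import math
from collections import Counter


def _x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table: k11 = both words, k22 = neither."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def score_bigrams(tokens):
    """Score every adjacent word pair in a token list by log-likelihood ratio."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), n in bigrams.items():
        first[a] += n     # bigrams starting with a
        second[b] += n    # bigrams ending with b
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11               # a followed by something else
        k21 = second[b] - k11              # something else followed by b
        k22 = total - k11 - k12 - k21      # neither a first nor b second
        scores[(a, b)] = log_likelihood_ratio(k11, k12, k21, k22)
    return scores
```

Pairs whose words co-occur far more often than their individual frequencies predict receive high scores; at the data sizes the talk has in mind, the counting step is what would be distributed.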
  • 18. Aside: Experiment Management
      • Plan for running experiments from the beginning across Search and Discovery components
      • Your engine should help!
      • Types of Experiments to consider
          • Indexing/Analysis
          • Query parsing
          • Scoring formulas
          • Machine Learning Models
          • Recommendations, many more
      • Make it easy to do A/B testing across all experiments and compare and contrast the results
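One building block for the A/B testing that slide 18 calls for is a stable assignment of users to variants, so that a user sees the same behavior on every request and experiments do not share bucket boundaries. The sketch below hashes the experiment name together with the user id. The function name, the variant labels and the choice of SHA-256 are assumptions for illustration, not part of the talk.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment.

    Hashing the experiment name with the user id keeps assignments stable
    across requests and independent between experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Logging the assigned variant with each query and click event makes it possible to compare results per variant later.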
  • 19. Analytics in Practice
      • Many of the components discussed provide analytical features
      • Leverage existing tools: R, etc.
      • Simple Counts:
          • Facets
          • Term and Document frequencies
          • Clicks
      • Search and Discovery example metrics
          • Relevance measures like Mean Reciprocal Rank
          • Histograms/Drilldowns around Number of Results
          • Log and navigation analysis
      • Data cleanliness analysis is helpful for finding potential issues in content
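Slide 19 names Mean Reciprocal Rank as an example relevance measure. For each query, the reciprocal rank is 1 divided by the position of the first relevant result (0 if none is returned), and MRR is the average over all queries. The sketch below assumes relevance judgments are available as sets of document ids, for instance derived from click logs; the function names are illustrative.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    results = list(results)
    if not results:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in results) / len(results)
```

For example, three queries whose first relevant results sit at rank 2, at rank 1, and nowhere in the list give reciprocal ranks of 0.5, 1.0 and 0.0, so the MRR is 0.5.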
  • 20. Wrap
      • Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
      • LucidWorks has combined many of these things into LucidWorks Big Data
          • http://www.lucidworks.com/products/lucidworks-big-data
      • Design for the big picture when building search-based applications
  • 21. Discussion and Resources
      • Questions?
      • http://www.lucidworks.com
      • grant@lucidworks.com
      • @gsingers

Editor's Notes

  1. The bar is raised: when we first started Lucid, the problems were all around standing up Lucene or Solr or dealing with performance issues. Now the large majority of them are about taking search to the next level: better relevance, personalization, recommendations, etc.
  2. How do you gain insight? The search box is the UI for data these days. Feed improvements back into the system for users. Extract key metrics for business understanding.
  3. Make into images?
  4. All about ad hoc and bulk storage and computation. All about the analytics that drive your computation. Glue to make it all work together: data where it needs to be, when it needs to be there. All are examples of ways to do this. There are actually a fair number of viable alternatives for all of these pieces, all in open source. I tend to stick to Apache and “commercial” friendly licenses, where possible.
  5. Analytics: Discovery: recommendations, trends, related searches
  6. Authoritative store: managing across, consistency, etc. Analysis should be done where it most makes sense given the location of the data and the type of analysis being done. Hadoop and HBase stuff are all pretty straightforward.
  7. Log and navigation: clicks, search trails, etc. Data cleanliness: never-viewed docs that are related to other documents.
  8. Big Picture: too often devs are stuck in the weeds.