DataWeek Keynote: Large Scale Search, Discovery and Analysis in Action

Ivan Provalov
Research Engineer
Office of the Chief Scientist
September 25, 2012

Confidential © Copyright 2012
User Interactions With Big Data


    Data stored in      Accessed via       User
    DFS                 Command line       System administrator
    Key-value store     Query language     Engineer
    Index               Keyword search     End user




    Confidential and Proprietary
2   © 2012 LucidWorks
Is Search Enough?

    • Keyword search is a commodity
    • Holistic view of the data and the user interactions with that data
    • Search, Discovery and Analytics are the key to unlocking this view
      of users and data

    [Figure: keyword search example, query "endeavour shuttle bay area" with a Search button; caption: Search, Discovery and Analytics]

Why Search, Discovery and Analytics?

    • User Needs
       - real-time, ad hoc access to content
       - aggressive prioritization based on importance
       - serendipity
       - feedback/learning from past
    • Business Needs
       - deeper insight into users
       - leverage existing internal knowledge
       - cost effective

    [Figure: diagram linking Search, Discovery and Analytics]

Topics

    • Background and needs
    • Architecture
    • Search, Discovery and Analytics in action
    • Road map
    • Wrap up




Search

    • Performance
    • Real time
    • Relevance and importance
    • Presenting results
    • Experiment management




Discovery

    • Content clustering
    • Discovering near duplicate documents
    • Finding ‘dark data’
    • Making recommendations
    • Uncovering trends
    • Recognizing topics
    • More like this
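
As a rough illustration of the "near duplicate documents" item, one common approach is word shingling combined with Jaccard similarity. The sketch below is not taken from the platform described in this talk; the function names, the shingle size and the 0.5 threshold are all assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return pairs of doc ids whose shingle sets overlap at or above threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if jaccard(sets[left], sets[right]) >= threshold:
                pairs.append((left, right))
    return pairs


# Hypothetical documents, for illustration only.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely unrelated text about search engines",
}
print(near_duplicates(docs))
```

The pairwise comparison is quadratic in the number of documents, so at large scale it would normally be replaced by a hashing scheme such as MinHash; that is outside the scope of this sketch.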




Analytics

    • Term frequency
    • Facets
    • Click analysis
    • Relevancy metrics
    • Zero results queries
    • Hot spots
    • Statistically interesting phrases
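
Two of the items above, zero results queries and click analysis, can be computed from a query log with very little code. The sketch below assumes a hypothetical log record with `query`, `num_found` and `clicks` fields; real click frameworks will use their own schema.

```python
from collections import Counter


def zero_result_queries(log):
    """Count queries that returned no hits, most frequent first."""
    counts = Counter(
        entry["query"].strip().lower()
        for entry in log
        if entry["num_found"] == 0
    )
    return counts.most_common()


def click_through_rate(log):
    """Fraction of result-bearing searches followed by at least one click."""
    with_results = [e for e in log if e["num_found"] > 0]
    if not with_results:
        return 0.0
    clicked = sum(1 for e in with_results if e["clicks"] > 0)
    return clicked / len(with_results)


# Hypothetical log entries, for illustration only.
log = [
    {"query": "Solr", "num_found": 10, "clicks": 1},
    {"query": "hadop", "num_found": 0, "clicks": 0},
    {"query": "Hadop ", "num_found": 0, "clicks": 0},
    {"query": "mahout", "num_found": 5, "clicks": 0},
]
print(zero_result_queries(log))
print(click_through_rate(log))
```

Frequent zero-result queries are often misspellings or missing content, which makes them a cheap starting point for relevance work.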




Some Use Cases

    • Video streaming
       - classification
       - recommendations
    • Financial, transportation,
      telecommunications
       - fraud detection
    • Social media
       - trend monitoring
    • Information technology
       - log monitoring
    • Healthcare
       - identifying patients for clinical studies

In Focus: Personalized Medicine


     [Figure: pipeline diagram. Labels: Patient DNA; Alignment and other analysis; Genetic Variations; Standard Therapies; Alternative Therapies; Search and Faceting]

In Focus: Log Processing in Telecommunications


     • Each year, large sums of money are lost due to
       fraudulent calls and poor service

     • Logs are usually semi-structured and contain vital
       information about errors and fraud

     • Deeper batch analytics can provide insight into patterns
       across vast amounts of data

     • Search of call and network information (via logs) is
       critical to providing deeper analysis and understanding
       of these errors and fraudulent activities
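
To make the "semi-structured logs" point concrete, the sketch below parses key=value call records and lists callers with repeated failed calls. The log format, the field names (`caller`, `status`) and the failure threshold are assumptions made for this example; repeated failures are only a crude signal, not a fraud detection method.

```python
import re
from collections import defaultdict

FIELD = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Split a log line into a timestamp and a dict of key=value fields."""
    timestamp, _, rest = line.strip().partition(" ")
    return {"timestamp": timestamp, **dict(FIELD.findall(rest))}


def callers_with_many_failures(lines, min_failures=3):
    """Return callers with at least min_failures non-OK call records."""
    failures = defaultdict(int)
    for line in lines:
        record = parse_line(line)
        caller, status = record.get("caller"), record.get("status")
        if caller is None or status is None:
            continue  # skip lines that do not look like call records
        if status != "OK":
            failures[caller] += 1
    return sorted(c for c, n in failures.items() if n >= min_failures)


# Hypothetical log lines, for illustration only.
lines = [
    "2012-09-25T10:00:01 caller=1001 callee=2001 status=DROPPED",
    "2012-09-25T10:00:05 caller=1001 callee=2002 status=DROPPED",
    "2012-09-25T10:00:09 caller=1002 callee=2003 status=OK",
    "2012-09-25T10:00:12 caller=1001 callee=2004 status=REJECTED",
    "2012-09-25T10:00:15 caller=1002 callee=2005 status=DROPPED",
    "malformed line without fields",
]
print(callers_with_many_failures(lines))
```

The same parsed records could be indexed for search or fed to a batch job; the parsing step is the common starting point.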
What Does a Search, Discovery and Analytics Platform Need?
     • Fast, efficient, scalable search
         - bulk and near real time indexing
         - handle billions of records with sub-second search and faceting


     • Large scale, cost effective storage and processing capabilities
         - need to consume and analyze the whole data set
         - experimentation/sampling tools


     • NLP and machine learning tools that scale to enhance discovery
       and analysis




Building a Search, Discovery and Analytics Platform

    [Figure: platform architecture diagram with the following blocks]
    • API
    • Inputs (bulk and real time)
    • Search, Discovery, Analytics
    • Processing & Storage
    • Management
    • Provisioning, Monitoring & Configuration

LucidWorks Big Data

    [Figure: LucidWorks Big Data architecture, built up over several slides; the final build shows the following blocks]
    • API: Big Data, LucidWorks, WebHDFS
    • Inputs
    • Search, Discovery, Analytics: Analytics Service, Document Service
    • Processing & Storage
    • Mgmt: Admin, Service Mgmt, Data Mgmt
    • Provisioning, Monitoring & Configuration

Components – LucidWorks Search

     Component                        Benefit
     LucidWorks Search (2.1.1)        Lucene/Solr 4.0-dev, sharded with
     • connector framework            SolrCloud, near-real time indexing,
     • security                       transaction logs for recovery.
     • user click framework
     • business process integration
     • administration
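
Since the search component is built on Lucene/Solr, a faceted query can be expressed with standard Solr request parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The sketch below only builds the request URL; the host, collection name and field names are hypothetical, and LucidWorks Search may expose its own endpoints on top of Solr.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that asks for facet counts on the given fields."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field may be repeated, once per field to facet on.
    params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


url = build_facet_query(
    "http://localhost:8983/solr/collection1",  # hypothetical host and collection
    "status:error",
    ["region", "error_code"],                  # hypothetical field names
)
print(url)
```

Sending this URL with any HTTP client returns the matching documents together with per-field facet counts, which is what drives drill-down navigation in a search UI.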

Components - Hadoop

     Component                      Benefit
     Apache Hadoop (1.0.3)          Distributed computing and
                                    processing for ETL and analytics
                                    jobs.
     Apache HBase (0.92)            Key-value store allowing fast access
                                    to the data.

     Apache Oozie (modified 3.2)    Workflow orchestration.




Components - Analysis/ML/NLP

     Component                             Benefit
     Apache Mahout (trunk)                 Distributed machine learning
     • k-means clustering                  processing framework.
     • statistically interesting phrases
     • similar documents
     • classification
     Apache UIMA (2.4.0)                   Text processing and annotations.

     Apache OpenNLP (1.5.2)                Machine learning toolkit for natural
     • named entity extraction             language processing.
     Behemoth (modified trunk)             Simplifies M/R data extraction and
                                           abstracts annotation frameworks.
     Apache Pig (0.9.2)                    Helps with writing analytics M/R
     • ETL                                 programs.
     • log analysis
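
For readers unfamiliar with k-means clustering, the single-machine sketch below shows the two alternating steps (assign points to the nearest centroid, then recompute centroids). It is plain Python written for illustration and is not Mahout's distributed implementation; the function name and the fixed starting centroids are assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of floats, starting from given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its points.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: centroids did not move
        centroids = updated
    return centroids, assignment


# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers, labels)
```

In a text setting the points would be document vectors (for example TF-IDF weights), and a distributed framework would spread both steps across the cluster.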


Components - Middleware

     Component                      Benefit
     Apache ZooKeeper (3.4.3)       Service discovery.
     • Netflix Curator



     Apache Kafka (0.7)             Log consumption and event-based
                                    real-time document processing
                                    framework.




Components - SDA Engine

     • RESTful services (Restlet 2.1)
     • ZooKeeper + Netflix Curator
     • Authentication and authorization
     • Proxies for LucidWorks and
       WebHDFS API
     • Workflow engine




Road Map

     • Analytics themes
         - relevance
         - data quality
         - discovery
         - integration with other packages (R)
     • Machine learning
         - NLP
         - recommendations
     • Experiment management




Conclusions

     • Search, Discovery and Analytics, when combined into a single,
       integrated system, provide powerful insight into both your
       content and your users
     • LucidWorks has combined many of these capabilities into
       LucidWorks Big Data




LucidWorks Big Data

     • Unified development platform for Big Data applications
     • Integrated open source stack: Lucene/Solr, Hadoop,
       Mahout, NLP
     • Single, uniform REST API
     • Pre-tuned by open source industry experts
     • Out of the box provisioning - hosted or on premises




Search | Discover | Analyze




                                      www.lucidworks.com/bigdata
                                    ivan.provalov@lucidworks.com
                                              @iprovalov
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION

  • 1. Large Scale Search, Discovery and Analysis in Action. Ivan Provalov, Research Engineer, Office of the Chief Scientist. September 25, 2012. Confidential © Copyright 2012
  • 2. User Interactions With Big Data (diagram rows): Data / DFS / Command Line / System Administrator; Data / Key Value Store / Query Language / Engineer; Data / Index / Keyword Search / End User. Confidential and Proprietary © 2012 LucidWorks
  • 3. Is Search Enough? • Keyword search is a commodity • Holistic view of the data and the user interactions with that data • Search, Discovery and Analytics are the key to unlocking this view of users and data (Search box example: "endeavour shuttle bay area". Diagram label: Search, Discovery and Analytics.)
  • 4. Why Search, Discovery and Analytics? • User Needs - real-time, ad hoc access to content - aggressive prioritization based on importance - serendipity - feedback/learning from past • Business Needs - deeper insight into users - leverage existing internal knowledge - cost effective (Diagram labels: Search, Discovery, Analytics.)
  • 5. Topics • Background and needs • Architecture • Search, Discovery and Analytics in action • Road map • Wrap up
  • 6. Search • Performance • Real time • Relevance and importance • Presenting results • Experiment management
  • 7. Discovery • Content clustering • Discovering near duplicate documents • Finding ‘dark data’ • Making recommendations • Uncovering trends • Recognizing topics • More like this
  • 8. Analytics • Term frequency • Facets • Click analysis • Relevancy metrics • Zero results queries • Hot spots • Statistically interesting phrases
  • 9. Some Use Cases • Video streaming - classification - recommendations • Financial, transportation, telecommunications - fraud detection • Social media - trend monitoring • Information technology - logs monitoring • Healthcare - identifying patients for clinical studies
  • 10. In Focus: Personalized Medicine (diagram labels): Patient DNA; Alignment and other analysis; Genetic Variations; Search and Faceting; Standard Therapies; Alternative Therapies
  • 11. In Focus: Log Processing in Telecommunications • Each year, large sums of money are lost due to fraudulent calls and poor service • Logs are usually semi-structured and contain vital information about errors and fraud • Deeper batch analytics can provide insight into patterns across vast amounts of data • Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
  • 12. What Does a Search, Discovery and Analytics Platform Need? • Fast, efficient, scalable search - bulk and near real time indexing - handle billions of records with sub-second search and faceting • Large scale, cost effective storage and processing capabilities - need whole data consumption and analysis - experimentation/sampling tools • NLP and machine learning tools that scale to enhance discovery and analysis
  • 13. Building a Search, Discovery and Analytics Platform (diagram blocks): API; Inputs (Bulk & Real Time); Search, Discovery, Analytics; Processing & Storage; Management; Provisioning, Monitoring & Configuration
  • 14. LucidWorks Big Data (diagram blocks): API; Inputs; Search, Discovery, Analytics; Management; Processing & Storage; Provisioning, Monitoring & Configuration
  • 15. LucidWorks Big Data (diagram blocks): API; Inputs; Search, Discovery, Analytics; Management; Processing & Storage; Provisioning, Monitoring & Configuration
  • 16. LucidWorks Big Data (diagram blocks): API; Inputs; Search, Discovery, Analytics; Analytics Service; Document Service; Management; Processing & Storage; Provisioning, Monitoring & Configuration
  • 17. LucidWorks Big Data (diagram blocks): API; Inputs; Search, Discovery, Analytics; Mgmt; Analytics Service; Document Service; Admin Service; Processing & Storage; Mgmt; Data Mgmt; Provisioning, Monitoring & Configuration
  • 18. LucidWorks Big Data (diagram blocks): API; Inputs; Search, Discovery, Analytics; Mgmt; Analytics Service; Document Service; Admin Service; Processing & Storage; Mgmt; Data Mgmt; Provisioning, Monitoring & Configuration
  • 19. LucidWorks Big Data (diagram blocks): API; Big Data; LucidWorks; WebHDFS; Inputs; Search, Discovery, Analytics; Mgmt; Analytics Service; Document Service; Admin Service; Processing & Storage; Mgmt; Data Mgmt; Provisioning, Monitoring & Configuration
  • 20. Components – LucidWorks Search • Component: LucidWorks Search (2.1.1) - connector framework - security - user click framework - business process integration - administration • Benefit: Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real time indexing, transaction logs for recovery.
  • 21. Components - Hadoop • Apache Hadoop (1.0.3): Distributed computing and processing for ETL and analytics jobs. • Apache HBase (0.92): Key-value store allowing fast access to the data. • Apache Oozie (modified 3.2): Workflow orchestration.
  • 22. Components - Analysis/ML/NLP • Apache Mahout (trunk): Distributed machine learning processing framework. - k-means clustering - statistically interesting phrases - similar documents - classification • Apache UIMA (2.4.0): Text processing and annotations. • Apache OpenNLP (1.5.2): Machine learning toolkit for natural language processing. - named entity extraction • Behemoth (modified trunk): Makes M/R data extraction easier, abstracts annotation frameworks. • Apache Pig (0.9.2): Helps with writing analytics M/R programs. - ETL - log analysis
  • 23. Components - Middleware • Apache ZooKeeper (3.4.3): Service discovery. - Netflix Curator • Apache Kafka (0.7): Logs consumption and event-based real-time document processing framework.
  • 24. Components - SDA Engine • RESTful services (Restlet 2.1) • ZooKeeper + Netflix Curator • Authentication and authorization • Proxies for LucidWorks and WebHDFS API • Workflow engine
  • 25. Road Map • Analytics themes - relevance - data quality - discovery - integration with other packages (R) • Machine learning - NLP - recommendations • Experiment management
  • 26. Conclusions • Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users • LucidWorks has combined many of these things into LucidWorks Big Data
  • 27. LucidWorks Big Data • Unified development platform for Big Data applications • Integrated open source stack: Lucene/Solr, Hadoop, Mahout, NLP • Single, uniform REST API • Pre-tuned by open source industry experts • Out of the box provisioning - hosted or on premise
  • 28. Search | Discover | Analyze • www.lucidworks.com/bigdata • ivan.provalov@lucidworks.com • @iprovalov
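
Slide 8 lists term frequency and zero-results queries among the analytics. The following is a minimal sketch of those two measures, not the LucidWorks implementation. It assumes a hypothetical query log held in memory as (query, result count) pairs; `QUERY_LOG`, `term_frequency` and `zero_result_queries` are illustrative names that do not come from the slides.

```python
from collections import Counter

# Hypothetical query log (an assumption for illustration):
# each entry is (query text, number of results returned).
QUERY_LOG = [
    ("endeavour shuttle bay area", 12),
    ("shuttle launch", 40),
    ("endevour shutle", 0),
    ("shuttle launch", 38),
    ("bay area flyover", 0),
    ("endevour shutle", 0),
]


def term_frequency(log):
    """Count how often each whitespace-separated term appears across queries."""
    counts = Counter()
    for query, _hits in log:
        counts.update(query.lower().split())
    return counts


def zero_result_queries(log):
    """Return (query, count) pairs for queries with no results, most frequent first."""
    counts = Counter(query for query, hits in log if hits == 0)
    return counts.most_common()


if __name__ == "__main__":
    print(term_frequency(QUERY_LOG).most_common(3))
    print(zero_result_queries(QUERY_LOG))
```

At the scale the slides describe, the same counting would more plausibly run as a batch job (slide 22 names Pig for log analysis); the in-memory version only shows what is being measured.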

Editor's notes

  1. How do you gain insight? The search box is the UI for data these days. Feed improvements back into the system for users. Extract key metrics for business understanding.
  2. Challenges: Many of these are intense calculations or iterative. Many are subjective and require a lot of experimentation.
  3. Single nucleotide polymorphisms (SNPs) are used as markers in linkage and association studies to detect which regions in the human genome may be involved in disease. Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study.
  4. Make into images?
  5. Search; storage and processing; experiment management; tools; NLP; statistical analysis; scalable; low cost; production monitoring; provisioning; bulk and near real-time; handle volume in sub-second processing.
  6. Solr takes care of leader election, etc., so no more master/slave. 1 second (default) soft commits for NRT updates. 1 minute (default) hard commits (no searcher reopen). Transaction logs for recovery.
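
Note 6 distinguishes soft commits (visibility for near-real-time search), hard commits without a searcher reopen (durability), and transaction logs (recovery). The toy model below is a sketch of how those three ideas relate. It is not Solr's code or API: `ToyIndex` and all of its attributes are hypothetical names, and timing (the 1 second and 1 minute defaults) is left out.

```python
class ToyIndex:
    """Toy model of soft vs. hard commits. Illustrative only, not Solr's implementation."""

    def __init__(self):
        self.tlog = []        # updates logged since the last hard commit
        self.pending = []     # accepted updates not yet visible to search
        self.searchable = []  # documents visible to search
        self.durable = []     # documents flushed to stable storage

    def add(self, doc):
        # Every update is logged first, then buffered until the next soft commit.
        self.tlog.append(doc)
        self.pending.append(doc)

    def soft_commit(self):
        # Makes buffered documents searchable; nothing is flushed to stable storage.
        self.searchable.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        # Flushes logged updates to stable storage and truncates the log.
        # Visibility is unchanged, modelling "no searcher reopen".
        self.durable.extend(self.tlog)
        self.tlog.clear()

    def recover(self):
        # After a crash: start from durable state and replay the transaction log.
        restored = ToyIndex()
        restored.durable = list(self.durable)
        restored.tlog = list(self.tlog)
        restored.searchable = self.durable + self.tlog
        return restored


if __name__ == "__main__":
    idx = ToyIndex()
    idx.add("doc-a")
    idx.soft_commit()   # doc-a becomes searchable but is not yet durable
    idx.add("doc-b")
    idx.hard_commit()   # doc-a and doc-b are durable; doc-b is still not searchable
    idx.add("doc-c")    # only in the transaction log
    print(idx.recover().searchable)
```

The point of the split is that frequent soft commits keep search results fresh cheaply, while the less frequent hard commits bound how much of the log has to be replayed on recovery.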