Lucene in the Cloud: Leveraging the Power of Search and Big Data to Shed Light on Government Spending

Seshubabu Simhadri
Chief Technology Officer, GCE

Confidential, Do Not Disclose. Property of Global Computer Enterprises, Inc.

Overview

Background

What is USASpending.gov?

Moving to Our Big Data cloud

Some of the design decisions
   Tool Selection
   Cluster Design
   Hardware Design

Limitations and enhancements

What is USASpending.gov?

U.S. Government Spending vs. Other Entities

Distribution of U.S. Government Spending

What can users do on the site?

• Analytics
   • Stats
   • Top-K

• Free Text Search (With Auto Suggestions)

• Large Data Feeds

• APIs

Who are the users of the site?

• Public

• Media

• Congress

• Value Added Resellers

GCE Big Data and Analytics Cloud

Leveraging the industry-leading open source platform to deliver cost savings and scalability within a Cloud computing model

What’s Inside the GCE Cloud?

Start by Looking at the Usual Suspects

•  Hadoop
   − For indexing and downloads

•  Distributed Solr
   − Analytics
   − Free text search

•  Drupal static content

•  Visualization

Solr Node Sizing

The greatest challenge is how to optimally design a node: which combination of CPUs, memory, and shard size delivers the desired performance?

Solr Node Sizing

Multiple index types
       Different types of spending
       Varying sizes

Break the complete dataset into shards as small as required to meet the response times
      Choose shard size based on response times

A single Solr instance with multiple cores, or multiple Solr instances each with a single core?

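The sizing rule above (shrink shards until queries meet the response-time target) can be sketched as below. This is a minimal illustration, not GCE's actual procedure: the latency measurements, the 200 ms target, the document counts, and both function names are hypothetical.

```python
import math


def largest_shard_within_target(measurements, target_ms):
    """Return the largest tested shard size (in documents) whose measured
    latency is at or below target_ms, or None if no tested size qualifies.

    measurements: list of (docs_per_shard, observed_latency_ms) pairs,
    assumed to come from load tests on the candidate node hardware.
    """
    qualifying = [docs for docs, latency in measurements if latency <= target_ms]
    return max(qualifying) if qualifying else None


def shards_needed(total_docs, docs_per_shard):
    """Number of shards required to hold total_docs at the chosen shard size."""
    return math.ceil(total_docs / docs_per_shard)


# Hypothetical load-test results: (documents per shard, latency in ms).
measurements = [
    (1_000_000, 40),
    (5_000_000, 180),
    (10_000_000, 420),
    (20_000_000, 900),
]

shard_size = largest_shard_within_target(measurements, target_ms=200)
print(shard_size, shards_needed(120_000_000, shard_size))
```

Because the deck mentions multiple index types of varying sizes, a measurement like this would presumably be repeated per index type rather than applied once to the whole dataset.
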
Solr Cluster Design

How do you design the cluster: which ones are individual nodes and which ones are aggregators?

Solr Cluster Design

Should all shards be treated equally?

User → Aggregator Nodes → Shards

Different requirements for nodes collecting the data and nodes serving a specific dataset

Aggregator Nodes 1, 2, 3, ..., m
  Large Solr Instances, No local index

Shard Nodes 1, 2, 3, ..., 100, ..., n
  Small Solr Instance with index

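One way to picture the User → Aggregator → Shards flow is Solr's distributed search, where a request to one node carries a `shards` parameter listing the nodes that hold the index. The sketch below only builds such a request URL and sends nothing. The host names, the `spending` core name, and the query are hypothetical, and the deck does not say whether GCE issued requests exactly this way.

```python
from urllib.parse import urlencode


def build_distributed_query(aggregator_url, shard_addresses, query, rows=10):
    """Build a Solr select URL that asks the aggregator to fan the query
    out to the listed shards and merge their results.

    shard_addresses are given as host:port/path without a URL scheme,
    which is the form Solr's `shards` parameter expects.
    """
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        "shards": ",".join(shard_addresses),
    }
    return f"{aggregator_url}/select?{urlencode(params)}"


# Hypothetical shard nodes, each a small Solr instance holding one index slice.
shards = [f"shard{i}.example.internal:8983/solr/spending" for i in range(1, 4)]

# Hypothetical aggregator: a larger Solr instance with no local index.
url = build_distributed_query(
    "http://aggregator1.example.internal:8983/solr/spending",
    shards,
    "agency:defense",
)
print(url)
```

Keeping the aggregator free of local index data means its memory goes to merging and sorting results from many shards, which matches the slide's split between "large" aggregators and "small" shard instances.
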
What configuration did we choose?

Separate Solr instances

Multiple hard drives per server

Solid state disks

Infiniband

Solr Enhancements

Enhanced Faceting: Enabling aggregation by more than one field

Will be contributed to Solr project

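The deck does not describe how the enhanced faceting is implemented. The sketch below only illustrates what "aggregation by more than one field" means, using plain Python over in-memory records instead of Solr. The field names (`agency`, `state`, `amount`) and the sample values are hypothetical.

```python
from collections import defaultdict


def aggregate_by_fields(records, group_fields, value_field):
    """Sum value_field for every distinct combination of group_fields.

    Returns a dict mapping a tuple of field values to the summed amount.
    """
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[field] for field in group_fields)
        totals[key] += record[value_field]
    return dict(totals)


# Hypothetical spending records.
records = [
    {"agency": "DoD", "state": "VA", "amount": 100.0},
    {"agency": "DoD", "state": "VA", "amount": 50.0},
    {"agency": "DoD", "state": "MD", "amount": 25.0},
    {"agency": "HHS", "state": "MD", "amount": 75.0},
]

totals = aggregate_by_fields(records, ["agency", "state"], "amount")
for key, total in sorted(totals.items()):
    print(key, total)
```

In a sharded cluster, each shard would compute partial totals like these and the aggregator would add them up per key, which is why this kind of aggregation fits the aggregator-plus-shards design.
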
Solr Data Importer: Why Not?

As the number of shards increases, managing the SQL queries inside Solr becomes a challenge

External Data Importer Using Hadoop

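The slide says only that an external importer built on Hadoop replaced SQL managed inside Solr. As a rough sketch of one step such an importer might perform, the code below assigns each record to a shard by a stable hash of its ID, so routing lives in one place instead of in per-shard SQL. The hashing scheme, the `id` field, and the function names are assumptions for illustration, not GCE's design.

```python
import zlib
from collections import defaultdict


def shard_for(record_id, num_shards):
    """Map a record ID to a shard number with a stable hash.

    zlib.crc32 is used instead of hash() because it yields the same
    value on every run and every machine.
    """
    return zlib.crc32(record_id.encode("utf-8")) % num_shards


def partition_records(records, num_shards):
    """Group records into per-shard batches ready for indexing."""
    batches = defaultdict(list)
    for record in records:
        batches[shard_for(record["id"], num_shards)].append(record)
    return dict(batches)


# Hypothetical records; in a Hadoop job this grouping would be the
# partitioning step between map and reduce.
records = [{"id": f"award-{n}", "amount": n * 10.0} for n in range(1000)]
batches = partition_records(records, num_shards=8)
print({shard: len(batch) for shard, batch in sorted(batches.items())})
```
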
Utilizing Large Commodity Servers

Solr in the Cloud required building a cost-effective and high-performance infrastructure

Small vs. large Commodity servers

Disadvantages of higher capacity servers

Failure of one node results in failure of multiple shards - careful design is required

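The risk described above can be made concrete with a small placement model. The sketch below counts which shards become unavailable when one server fails, first with every shard stored once and then with a second copy on a different server. Keeping shard copies is an assumption introduced here to show one possible "careful design"; the deck does not say how GCE mitigated the problem, and the server names and counts are hypothetical.

```python
def shards_lost(placement, failed_server):
    """Return the shards with no surviving copy after failed_server goes down.

    placement: dict mapping shard number -> list of servers holding a copy.
    """
    return sorted(
        shard
        for shard, servers in placement.items()
        if all(server == failed_server for server in servers)
    )


def place_with_copies(num_shards, servers, copies=2):
    """Place each shard on `copies` distinct servers, round-robin.

    Copies land on distinct servers as long as copies <= len(servers).
    """
    return {
        shard: [servers[(shard + offset) % len(servers)] for offset in range(copies)]
        for shard in range(num_shards)
    }


# Eight shards packed four per server on two large servers, one copy each.
packed = {shard: [["server-a", "server-b"][shard // 4]] for shard in range(8)}
print(shards_lost(packed, "server-a"))

# The same eight shards, two copies each, spread over three servers.
replicated = place_with_copies(8, ["server-a", "server-b", "server-c"])
print(shards_lost(replicated, "server-a"))
```
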
Summary

Sharded architecture

Multiple Solr instances per server each handling small datasets

Aggregator nodes + shards

Hadoop for data indexing and data feeds

Large Commodity Servers
   •  48-core
   •  256GB RAM
   •  SSD
   •  Infiniband

We’re hiring!

Come build the future of Big Data

GCECloud.com

Questions?
 ssimhadri at GCECloud.com

Visit us at www.GCECloud.com

How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud
