Presented by Seshu Simhadri | Global Computer Enterprises - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
A leader in bringing innovative technologies to the Federal Government, GCE looks to open source tools to drive down cost and provide the foundation for building value-added services for its customers. This talk will discuss GCE’s innovative use of Lucene/Solr combined with the GCE Big Data Cloud to open up access to Federal spending data. This data is in wide use across the Federal government, the Federal contracting community, the media and press, and Capitol Hill. GCE has used this toolset to deliver the type of capability that users typically find only in consumer web applications. This session will highlight the technical side of the challenge of implementing these tools across a large user community and data set in a Cloud environment.
How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud
1. Lucene in the Cloud: Leveraging the Power of Search and Big Data to Shed Light on Government Spending
Seshubabu Simhadri
Chief Technology Officer, GCE
Confidential, Do Not Disclose. Property of Global
Computer Enterprises, Inc..
2. Background
What is USASpending.gov?
Moving to Our Big Data cloud
Some of the design decisions
Tool Selection
Cluster Design
Hardware Design
Limitations and enhancements
Overview
3. What is USASpending.gov?
4. U.S. Government Spending vs. Other Entities
5. Distribution of U.S. Government Spending
6. • Analytics
• Stats
• Top-K
• Free Text Search (with auto-suggestions)
• Large Data Feeds
• APIs
What can users do on the site?
8. Leveraging the industry-leading open source platform to deliver cost savings and scalability within a Cloud computing model
GCE Big Data and Analytics Cloud
9. Start by looking at the usual suspects:
• Hadoop
− For indexing and downloads
• Distributed Solr
− Analytics
− Free text search
• Drupal static content
• Visualization
What’s Inside the GCE Cloud?
10. The greatest challenge is how to optimally design a node: which combination of CPUs, memory, and shard size delivers the desired performance?
Solr Node Sizing
11. Multiple index types
Different types of spending
Varying sizes
Break the complete dataset into shards as small as required to meet the response times
Choose shard size based on response times
A single Solr instance with multiple cores, or multiple Solr instances each with a single core?
Solr Node Sizing
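The sizing rule on this slide (shrink shards until response times are met) can be sketched as a small calculation. This is only an illustration under stated assumptions: the benchmark figures, the latency budget, and the function names are hypothetical and are not taken from the talk.

```python
def largest_shard_within_budget(measurements, budget_ms):
    """Return the largest benchmarked shard size whose latency fits the budget.

    measurements: list of (docs_per_shard, observed_latency_ms) pairs,
    assumed to come from load tests (hypothetical data here).
    Returns None if no benchmarked size meets the budget.
    """
    fitting = [docs for docs, latency in measurements if latency <= budget_ms]
    return max(fitting) if fitting else None


def shards_needed(total_docs, docs_per_shard):
    """Number of shards required so that no shard exceeds docs_per_shard."""
    if total_docs < 0 or docs_per_shard <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_shard > 0")
    # Integer ceiling division; always keep at least one shard.
    return max(1, -(-total_docs // docs_per_shard))


# Hypothetical benchmark results: (documents per shard, latency in ms).
benchmarks = [(1_000_000, 120), (5_000_000, 450), (10_000_000, 1100)]
shard_size = largest_shard_within_budget(benchmarks, budget_ms=500)
shard_count = shards_needed(42_000_000, shard_size)
```

With these made-up numbers the 5,000,000-document shard is the largest one inside a 500 ms budget, so a 42,000,000-document index would be split into 9 shards.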
12. How do you design the cluster: which ones are individual nodes and which ones are aggregators?
Solr Cluster Design
13. Should all shards be treated equal?
User → Aggregator Nodes → Shards
Different requirements for nodes collecting the data and nodes serving a specific dataset
Aggregator Nodes 1, 2, 3, …, m: large Solr instances, no local index
Shard Nodes 1, 2, 3, …, 100, …, n: small Solr instances with index
Solr Cluster Design
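One way to picture the User → Aggregator → Shards flow is Solr's standard distributed search, where a request to one instance carries a `shards` parameter listing the instances that hold the index. The sketch below only builds such a request URL. The host names are hypothetical, and the talk does not say whether GCE's aggregators were wired exactly this way.

```python
from urllib.parse import urlencode


def distributed_query_url(aggregator, shard_hosts, query, rows=10):
    """Build a Solr select URL that fans a query out to the given shards.

    aggregator and shard_hosts use the "host:port/solr" form that Solr's
    `shards` parameter expects (no scheme prefix).
    """
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        "shards": ",".join(shard_hosts),
    }
    # Keep ':', '/' and ',' unescaped so the shard list stays readable.
    return f"http://{aggregator}/select?{urlencode(params, safe=':/,')}"


# Hypothetical hosts: one aggregator without a local index, two shard nodes.
url = distributed_query_url(
    "agg1:8983/solr",
    ["shard1:8983/solr", "shard2:8983/solr"],
    "contracts",
)
```

The aggregator merges the per-shard results, which is why the slide gives it a different hardware profile from the nodes that hold an index.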
14. Separate Solr instances
Multiple hard drives per server
Solid state disks
Infiniband
What configuration did we choose?
15. Enhanced Faceting: enabling aggregation by more than one field
Will be contributed to the Solr project
Solr Enhancements
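To make "aggregation by more than one field" concrete, the plain-Python sketch below totals a value over every combination of two fields. It illustrates the idea only. It is not GCE's Solr enhancement, and the field names (`agency`, `state`, `amount`) are hypothetical.

```python
from collections import defaultdict


def aggregate_by_fields(records, fields, value_field):
    """Sum value_field for each distinct combination of the given fields."""
    totals = defaultdict(float)
    for record in records:
        # The grouping key is the tuple of field values, e.g. ("DOD", "VA").
        key = tuple(record[name] for name in fields)
        totals[key] += record[value_field]
    return dict(totals)


# Hypothetical spending records.
records = [
    {"agency": "DOD", "state": "VA", "amount": 100.0},
    {"agency": "DOD", "state": "VA", "amount": 50.0},
    {"agency": "DOD", "state": "MD", "amount": 25.0},
    {"agency": "HHS", "state": "MD", "amount": 10.0},
]
totals = aggregate_by_fields(records, ["agency", "state"], "amount")
```

In a sharded setup each shard would compute such partial totals and the aggregator would add them together per key.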
16. When the number of shards increases, managing the SQL inside Solr becomes a challenge
External data importer using Hadoop
Solr Data Importer: Why Not?
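An external importer has to decide which shard each record belongs to before indexing it. A minimal sketch of that routing step follows, assuming a stable hash of the document id. The hashing scheme, the `id` field, and the function names are assumptions, since the talk does not describe how records were assigned to shards.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number with a stable hash (CRC32)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(records, num_shards, id_field="id"):
    """Group records into per-shard batches, ready to be indexed separately."""
    batches = defaultdict(list)
    for record in records:
        batches[shard_for(str(record[id_field]), num_shards)].append(record)
    return dict(batches)


# Hypothetical input: 100 award records routed across 8 shards.
records = [{"id": f"award-{i}"} for i in range(100)]
batches = partition(records, num_shards=8)
```

In a Hadoop job the same function could serve as the partitioner, so that each reducer builds or feeds exactly one shard.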
17. Solr in the Cloud required building a cost-effective and high-performance infrastructure
Small vs. large commodity servers
Utilizing Large Commodity Servers
18. Failure of one node results in failure of multiple shards; careful design is required
Disadvantages of higher capacity servers
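The failure concern can be checked on paper before any hardware is bought. The sketch below assumes each shard may be copied to more than one server and reports which shards become unavailable when a server fails. Replication and the round-robin placement are assumptions for illustration, as the talk does not state how GCE addressed this.

```python
def place_shards(num_shards, servers, replicas=2):
    """Assign each shard to `replicas` distinct servers, round-robin."""
    if not 1 <= replicas <= len(servers):
        raise ValueError("replicas must be between 1 and the number of servers")
    return {
        shard: [servers[(shard + r) % len(servers)] for r in range(replicas)]
        for shard in range(num_shards)
    }


def shards_unavailable(placement, failed_servers):
    """Return the shards whose every copy sits on a failed server."""
    failed = set(failed_servers)
    return [
        shard
        for shard, hosts in placement.items()
        if all(host in failed for host in hosts)
    ]


# Hypothetical cluster: three large servers hosting six shards.
servers = ["server-a", "server-b", "server-c"]
single_copy = place_shards(6, servers, replicas=1)
two_copies = place_shards(6, servers, replicas=2)
```

With one copy per shard, losing `server-a` takes shards 0 and 3 offline. With two copies placed on different servers, the same failure leaves every shard reachable.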
19. Sharded architecture
Multiple Solr instances per server, each handling small datasets
Aggregator nodes + shards
Hadoop for data indexing and data feeds
Large Commodity Servers
• 48-core
• 256GB RAM
• SSD
• Infiniband
Summary
20. Come build the future of Big Data
GCECloud.com
We’re hiring!