Solr Under the Hood at S&P Global - Sumit Vadhera, S&P Global
1. SOLR Under the Hood
(Our Experience)
Sumit Vadhera
Senior Manager, S&P Global Market Intelligence
#Activate18 #ActivateSearch
2. Agenda
• A little bit about myself
• A little bit about S&P Global
• How we use SOLR
• SOLR-based search challenges
• Our journey to the cloud so far
• Next steps
• Q&A
3. A little bit about myself
• Sumit Vadhera (Senior Manager, Database Engineering)
• Big Data Solution Architect with 12+ years of experience
• Certified experience ranging from various RDBMS to NoSQL & Big Data technologies
• Barely knew about SOLR search until I joined S&P Global
• Now manage Big Data & NoSQL solutions (building & designing architecture), including the query platform
4. A little bit about S&P Global
• The Market Intelligence platform delivers deep industry data across a broad range of sectors to create cutting-edge insights. The platform digs deeper to deliver solutions that are sector-specific, data-rich, and hyper-targeted for your evolving business needs.
• Our Global Coverage currently Includes:
56,000+ banks
58,000+ asset management companies
11,000+ specialty finance companies
18,000+ investment banks and broker dealers
25,000+ insurance companies
90,000+ real estate companies
240,000+ tech, media & telecommunication companies
26,000+ oil & gas companies
19,000+ electric, natural gas & water utilities
30,000+ mining & exploration companies
7. How we use SOLR
• Primary use case:
Data is more than our business; it's our passion. Some of the datasets we provide:
Financials
Estimates
Ownership
Key developments
Private company data
Transactions
Professionals
Corporate Actions
Events and Transcripts
• We use SOLR to power some of our critical datasets, along with a lot of custom code
8. How we use SOLR cont..
• The universal search engine for our platform, powered by SOLR, provides:
Text (keyword) & relevancy-based search capabilities to our users
Page combinations, which let users type the name of any CIQ page rather than navigating all of the links.
For example, a user can simply type "IBM Key Stats" in the search box and immediately navigate to the exact page desired.
Autosuggest for different objects
Faceted search (see the query sketch after this slide)
Type-ahead search (filter-based text)
Advanced search
Speech-to-text transcripts
• An indexing/querying client interacting with SOLR serves datasets indexed from various sources, including our data pipeline, and exposes them to users
• Multiple (hybrid) SOLR clusters hold terabytes of data, using hybrid sharding techniques, both application- and cloud-based
• Leveraging Lucidworks customer support extensively
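A minimal sketch (not our production code) of the kind of faceted, relevancy-ranked query described above, assuming a local SOLR instance; the collection name and field names (name, page_title, sector, region) are illustrative:

import requests

# Assumed endpoint; collection and field names are hypothetical.
SOLR_SELECT = "http://localhost:8983/solr/companies/select"

params = {
    "q": "IBM key stats",                 # free-text user query
    "defType": "edismax",                 # extended DisMax parser for relevancy tuning
    "qf": "name^4 page_title^2 body",     # boost matches on name and page title
    "facet": "true",
    "facet.field": ["sector", "region"],  # facet counts drive the filter UI
    "rows": 10,
}

resp = requests.get(SOLR_SELECT, params=params).json()
for doc in resp["response"]["docs"]:
    print(doc.get("name"), "->", doc.get("page_title"))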
11. SOLR-Based Search Challenges (Legacy)..
Scale at the time: roughly 40-50 million documents ingested overall, 1-2 million new documents per month, a transaction rate of approx. 300-350 per minute, and an average of 5 million queries per week.
• Performance challenges (bottlenecks to overall query traffic)
• Timeouts in platform applications as complex queries choked entire clusters and created bottlenecks
• Relevance performance
• Indexing lags causing near-real-time data lags on the platform; manual exception handling
• Fragmentation inside SOLR cores was a primary factor
• Optimization downtime
• Analyzing & extracting SOLR query logs stored in an RDBMS
• Re-indexing process
• GC issues with customized code and a customized indexing solution
• Security and product bugs
• Single point of failure in the master-slave setup
• Document exceptions (Tika parser)
12. What we did..
• Extensive GC tuning
• Extensive JVM tuning
• I/O tuning (trying out local disks)
• Query tuning (see the sketch after this slide), not just limited to:
Moving non-scoring queries to the filter cache and improving use of the fieldValueCache with date-descending sorts
Caching time-range queries by decreasing granularity from seconds to days to speed up autowarm times
Changing (custom) scoring algorithms & using the edismax parser to support multiple (foreign) languages
Cleaning up date-range & phrase queries
• Turning off term vectors and switching to docValues
• Adding more searchers (horizontal scalability)
• Automating optimization & recycling SOLR more frequently during off-hours
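A hedged sketch of the query-tuning ideas above, with assumed collection and field names: non-scoring clauses go into fq so they hit the filter cache, and the date range is rounded to day granularity (NOW/DAY) so the same filter string recurs and autowarms cheaply:

import requests

# Assumed endpoint; collection and field names are hypothetical.
SOLR_SELECT = "http://localhost:8983/solr/documents/select"

params = {
    "q": "quarterly earnings",                   # scoring, relevancy-ranked part
    "defType": "edismax",
    "qf": "title^3 body",
    "fq": [
        "doc_type:transcript",                   # non-scoring clause -> filter cache
        "pub_date:[NOW/DAY-30DAYS TO NOW/DAY]",  # day granularity, not seconds
    ],
    "sort": "pub_date desc",                     # sort fields should use docValues
    "rows": 20,
}

docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]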
13. Storage
[Photo: a 5-megabyte hard drive from 1956 being loaded onto a plane. Cost: more than USD $100,000.]
14. What has changed and is changing fast
[Charts: cost of 1 TB of storage, 1985 vs. 2018 (storage is nearly free), and cost of 1 GHz of processing power, 1985 vs. 2018 (processing power doubles each year).]
"The significant problems we face today cannot be solved at the same level of thinking we were at when we created them."
- Albert Einstein
15. Our Journey to the Cloud So Far..
Today we run the latest SOLR Cloud architecture on hybrid cloud infrastructure.
16. Our Journey to the Cloud So Far, cont..
Key benefits we see as of today:
• No single point of failure
• Increased availability (HA) and reduced turnaround time (TAT)
• Significant query performance gains and improved relevancy for page searches (searching at scale)
• Improvements to indexing and decreased incremental lags (indexing at scale)
• Banana dashboards to identify bottlenecks
• Leveraging Fusion for security and auditing (authorization/authentication)
• Indexing pipelines with automatic failure detection & near-real-time data on the platform
• Type-ahead, facet, and search supporting highlighting, plus recent- & related-terms custom searches (see the type-ahead sketch after this slide)
• Moving SOLR log data off the async Sonic message queue to new pipelines integrating with Elasticsearch and Kibana
• Improved search through multiple filters
• Quicker alert setup across our variety of searches
• Support for natural language search, screening & mappings
• Improved platform search serving screening questions, quick navigation to individual workflows, and surfacing pages and documents
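A minimal type-ahead sketch against SOLR's Suggester component, assuming a /suggest handler with a dictionary named nameSuggester has been configured in solrconfig.xml (all names here are hypothetical):

import requests

# Assumed endpoint; the collection and suggester dictionary are illustrative.
SOLR_SUGGEST = "http://localhost:8983/solr/companies/suggest"

resp = requests.get(SOLR_SUGGEST, params={
    "suggest": "true",
    "suggest.dictionary": "nameSuggester",
    "suggest.q": "Intern",  # partial user input, completed as the user types
}).json()

# Response nests suggestions under dictionary name and query string.
for s in resp["suggest"]["nameSuggester"]["Intern"]["suggestions"]:
    print(s["term"])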
17. Next steps..
Search, now Data Science
• Further improving the relevancy of search results (really, the presentation of our information as a whole) makes our platform more essential to our clients
• Data science as a whole continues creating models that feed back into improving relevancy
• Continue leveraging new features, enhancements & Lucidworks (LW) support for SOLR Cloud
• Create scalable, extensible, and transparent data pipelines
• Expand data glue to query Lucene
• Continue leveraging and expanding our analytical search capabilities
• Use further machine learning with SOLR to handle rule-based tasks like data extraction and cleaning
• Metadata-driven models based on search
• Use ML/AI in search