Solr Under the Hood at S&P Global - Sumit Vadhera, S&P Global
1. SOLR Under the Hood
(Our Experience)
Sumit Vadhera
Senior Manager, S&P Global Market Intelligence
#Activate18 #ActivateSearch
2. Agenda
• A little bit about myself
• A little bit about S&P Global
• How we use SOLR
• SOLR-based search challenges
• Our journey to the cloud so far
• Next steps
• Q&A
3. A little bit about myself
• Sumit Vadhera (Senior Manager, Database Engineering)
• Big Data Solution Architect with 12+ years of experience
• Certified experience ranging from various RDBMS to NoSQL & Big Data technologies
• Barely knew about SOLR search until I joined S&P Global
• Now manage Big Data & NoSQL solutions (building & designing architecture), including the query platform
4. A little bit about S&P Global
• The Market Intelligence platform delivers deep industry data across a broad range of sectors to create cutting-edge insights. The platform digs deeper to deliver solutions that are sector-specific, data-rich, and hyper-targeted for your evolving business needs.
• Our Global Coverage currently Includes:
56,000+ banks
58,000+ asset management companies
11,000+ specialty finance companies
18,000+ investment banks and broker dealers
25,000+ insurance companies
90,000+ real estate companies
240,000+ tech, media & telecommunication companies
26,000+ oil & gas companies
19,000+ electric, natural gas & water utilities
30,000+ mining & exploration companies
7. How we use SOLR
• Primary use case:
Data is more than our business; it's our passion. Some of the datasets we provide:
Financials
Estimates
Ownership
Key developments
Private company data
Transactions
Professionals
Corporate Actions
Events and Transcripts
• We use SOLR to power some of our critical datasets, along with a lot of custom code
8. How we use SOLR cont..
• The universal search engine for our platform, powered by SOLR, provides:
Text (keyword) & relevancy-based search capabilities to our users
Page combinations, which let users type the name of any CIQ page rather than navigating all of the links.
For example, a user can simply type "IBM Key Stats" in the search box and immediately navigate to the exact page desired.
Autosuggest for different objects
Faceted search (see the query sketch after this slide)
Type-ahead search (filter-based text)
Advanced search
Speech-to-text transcripts
• An indexing/querying client interacting with SOLR serves datasets indexed from various sources, including our data pipeline, and exposes them to users
• Multiple (hybrid) SOLR clusters hold terabytes of data, using hybrid sharding techniques, both application- and cloud-based
• Leveraging Lucidworks customer support extensively
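A minimal sketch (not our production code) of the kind of faceted, relevancy-ranked query described above, assuming a local SOLR instance; the collection name and field names (name, page_title, sector, region) are illustrative:

import requests

# Assumed endpoint; collection and field names are hypothetical.
SOLR_SELECT = "http://localhost:8983/solr/companies/select"

params = {
    "q": "IBM key stats",                 # free-text user query
    "defType": "edismax",                 # extended DisMax parser for relevancy tuning
    "qf": "name^4 page_title^2 body",     # boost matches on name and page title
    "facet": "true",
    "facet.field": ["sector", "region"],  # facet counts drive the filter UI
    "rows": 10,
}

resp = requests.get(SOLR_SELECT, params=params).json()
for doc in resp["response"]["docs"]:
    print(doc.get("name"), "->", doc.get("page_title"))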
11. SOLR-Based Search Challenges (Legacy)..
Scale at the time: roughly 40-50 million documents ingested overall, 1-2 million new documents per month, a transaction rate of approx. 300-350 per minute, and an average of 5 million queries per week.
• Performance challenges (bottlenecks to overall query traffic)
• Timeouts in platform applications as complex queries choked entire clusters and created bottlenecks
• Relevance performance
• Indexing lags causing near-real-time data lags on the platform; manual exception handling
• Fragmentation inside SOLR cores was a primary factor
• Optimization downtime
• Analyzing & extracting SOLR query logs stored in an RDBMS
• Re-indexing process
• GC issues with customized code and a customized indexing solution
• Security and product bugs
• Single point of failure in the master-slave setup
• Document exceptions (Tika parser)
12. What we did..
• Extensive GC tuning
• Extensive JVM tuning
• I/O tuning (trying out local disks)
• Query tuning (see the sketch after this slide), not just limited to:
Moving non-scoring queries to the filter cache and improving use of the fieldValueCache with date-descending sorts
Caching time-range queries by decreasing granularity from seconds to days to speed up autowarm times
Changing (custom) scoring algorithms & using the edismax parser to support multiple (foreign) languages
Cleaning up date-range & phrase queries
• Turning off term vectors and switching to docValues
• Adding more searchers (horizontal scalability)
• Automating optimization & recycling SOLR more frequently during off-hours
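A hedged sketch of the query-tuning ideas above, with assumed collection and field names: non-scoring clauses go into fq so they hit the filter cache, and the date range is rounded to day granularity (NOW/DAY) so the same filter string recurs and autowarms cheaply:

import requests

# Assumed endpoint; collection and field names are hypothetical.
SOLR_SELECT = "http://localhost:8983/solr/documents/select"

params = {
    "q": "quarterly earnings",                   # scoring, relevancy-ranked part
    "defType": "edismax",
    "qf": "title^3 body",
    "fq": [
        "doc_type:transcript",                   # non-scoring clause -> filter cache
        "pub_date:[NOW/DAY-30DAYS TO NOW/DAY]",  # day granularity, not seconds
    ],
    "sort": "pub_date desc",                     # sort fields should use docValues
    "rows": 20,
}

docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]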
13. Storage
[Photo: a 5-megabyte hard drive from 1956 being loaded onto a plane. Cost: more than USD $100,000.]
14. What has changed and is changing fast
[Charts: cost of 1 TB of storage, 1985 vs. 2018 (storage is nearly free), and cost of 1 GHz of processing power, 1985 vs. 2018 (processing power doubles each year).]
"The significant problems we face today cannot be solved at the same level of thinking we were at when we created them."
- Albert Einstein
15. Our Journey to the Cloud So Far..
Today we run the latest SOLR Cloud architecture on hybrid cloud infrastructure.
16. Our Journey to the Cloud So Far, cont..
Key benefits we see as of today:
• No single point of failure
• Increased availability (HA) and reduced turnaround time (TAT)
• Significant query performance gains and improved relevancy for page searches (searching at scale)
• Improvements to indexing and decreased incremental lags (indexing at scale)
• Banana dashboards to identify bottlenecks
• Leveraging Fusion for security and auditing (authorization/authentication)
• Indexing pipelines with automatic failure detection & near-real-time data on the platform
• Type-ahead, facet, and search supporting highlighting, plus recent- & related-terms custom searches (see the type-ahead sketch after this slide)
• Moving SOLR log data off the async Sonic message queue to new pipelines integrating with Elasticsearch and Kibana
• Improved search through multiple filters
• Quicker alert setup across our variety of searches
• Support for natural language search, screening & mappings
• Improved platform search serving screening questions, quick navigation to individual workflows, and surfacing pages and documents
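A minimal type-ahead sketch against SOLR's Suggester component, assuming a /suggest handler with a dictionary named nameSuggester has been configured in solrconfig.xml (all names here are hypothetical):

import requests

# Assumed endpoint; the collection and suggester dictionary are illustrative.
SOLR_SUGGEST = "http://localhost:8983/solr/companies/suggest"

resp = requests.get(SOLR_SUGGEST, params={
    "suggest": "true",
    "suggest.dictionary": "nameSuggester",
    "suggest.q": "Intern",  # partial user input, completed as the user types
}).json()

# Response nests suggestions under dictionary name and query string.
for s in resp["suggest"]["nameSuggester"]["Intern"]["suggestions"]:
    print(s["term"])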
17. Next steps..
Search, now Data Science
• Further improving the relevancy of search results (really, the presentation of our information as a whole) makes our platform more essential to our clients
• Data science as a whole continues creating models that feed back into improving relevancy
• Continue leveraging new features, enhancements & Lucidworks (LW) support for SOLR Cloud
• Create scalable, extensible, and transparent data pipelines
• Expand data glue to query Lucene
• Continue leveraging and expanding our analytical search capabilities
• Use further machine learning with SOLR to handle rule-based tasks like data extraction and cleaning
• Metadata-driven models based on search
• Use ML/AI in search