SlideShare ist ein Scribd-Unternehmen logo
1 von 32
SEARCHING BILLIONS
OF PRODUCT LOGS IN
REALTIME
RyanTabora -Think Big Analytics
NoSQL Search Roadshow - June 6, 2013
WHO AM I?
RyanTabora
Think Big Analytics - Senior Data Engineer
Lover of dachshunds, bass, and zombies
OVERVIEW
Primers
What are product logs?
How do they apply to big data?
Real use case
Real issues and designs
Conclusion
PRODUCT LOGS?
• Device data
• IT, Energy, Healthcare, Manufacturing,Telecom ...
• These devices are pushing data back home (pull works too!)
• As more devices are sold/installed, more and more data
comes back to ‘home base’
• RealtimeVisualization
• Realtime Response
• Ad Hoc Analysis
• Full Historical Capture
• Blended Data Sets
POWER OF DEVICE DATA
DEVICE
DATA
TRADITIONAL APPROACHES
• SQL: PostGres, MySQL, Oracle, Microsoft
• SQL provides many of the search features required for typical
search applications
• Joins, regex, group by, sorting, etc
• But the these technologies can only scale so far...
• Hadoop
• HBase/Cassandra/Accumulo
• Search features are very limited
• HBase row scans, primary key index
• Cassandra limited secondary indexing
NEWTECHNIQUES
STORING DATA
• What is an index?
• Lucene
• Paralleling Index Creation
• MapReduce/Flume/Storm
• RealTime Search
• Searching before it hits disk
NEWTECHNIQUES
INDEXING DATA
• Solr/ElasticSearch
• Both build on top of Lucene
• Search servers
• RESTful HTTP APIs
• Easy to administer
• Add powerful text/numerical search capabilities
NEWTECHNIQUES
SEARCHING DATA
BASIC SEARCH FEATURES
• Boolean logic (AND, OR + -)
• Sorting and Group By
• Range queries
• Phrase/Prefix/Fuzzy queries
ADVANCED SEARCH
FEATURES
• Custom ranking/scoring
• More like this
• Auto suggest
• Faceting/Highlighting
• Geo-spacial search
SCALING SEARCH
• ElasticSearch and SolrCloud both have distributed features
built in
• Auto-sharding
• Replication
• Query routing
• Transaction log
USE CASE
Problem
Sample Solution
Core Design Issues
Other Solutions
THE PROBLEM
Home
Base
NetApp
Filer
NetApp
FilerDevice
Client A
NetApp
Filer
NetApp
FilerDevice
Client B
NetApp
Filer
NetApp
FilerDevice
Client C
Log
Log
Log
Logs
REST
API
Full SQL
Access
Flat File
Access
Latest
All
Applications
Engineers
& Analysts
SEARCH APPLICATION
FEATURES
• Find last three days of raw logs from an entire cluster
• Group capacity available grouped by machine serial number
and show the largest capacities first
• Search all device header lines for “FAILURE”
• View all hard disk objects that have product number 2341AB
• Find all motherboards with an associated customer ticket
SAMPLE SOLUTION
Logs
Ingestion
Parsing/Loading
CustomRESTfulSearchAPI
QueriesIndexing
HDFS
MapReduce
INGESTION
PARSING, LOADING,AND
INDEXING
Load HBase with parsed objects
Store HBase ROW_ID
Store pointer to raw file in HDFS
Index a number of desired fields
/ingestion/sequencefile1
/ingestion/sequencefile2
/ingestion/sequencefile3
1534 4562 5323 7232
4601 5105
0
0
0 1492 2987 4767 5987
INSIDE OF HBASE
...... ......
...... ...
...... .........
object5
…...
object4
...
object3object2
...
object1rowkey
THE SOLR DOCUMENT
2343sfOffset
/ingest/file2sequenceFile
1333-2241-3411cluster_id
42ADFF-BZMM
...
configs.log
...
2013/05/12
WARNING: DISK DEAD
...
header
contents
file_name
date_sent
system_id
rowkey
Solr Document
SEARCH APPLICATION
Search Application
Query
Data locations
Stored Object
User
Query Results
2
1
3
4
5
8
6
7
Raw Data
Location
Raw Data
Row_id
CORE DESIGN ISSUES
• Changing the Solr schema (manual reindex)
• Elastic shard scaling (manual reindex)
• No distributed joining (denormalizing the data)
• Replication*
• Manually managing Solr partitioning/sharding*
• Write durability*
SOLRCLOUD
• Automatic shard creation, routing
• Replication
• Limited to a fixed number of shards defined on initial creation
• ZooKeeper for coordination
• Large community
ELASTICSEARCH
• Similar feature set to Solr
• Purpose built for easily managing a distributed index
• Rapidly growing community
• Custom built coordination mechanism
• JSON based API
• Integrates Cassandra and Solr
• Automatic indexing in Solr/storing in Cassandra
• Automatic partitioning
• Automatic reindexing
• Not limited to fixed number of shards
• Proprietary and costs money
DATASTAX ENTERPRISE
+
• Collecting and analyzing device data/product logs can be a
very difficult challenge
• You can use NoSQL and search technologies like Solr or
ElasticSearch in unison...
• ...but it is not always easy to integrate search with NoSQL
CONCLUSION
QUESTIONS?
• Feel free to reach out if you have any questions or need help
with big data/search!
• http://ryantabora.com
• http://thinkbiganalytics.com
• @ryantabora
• ryan.tabora@thinkbiganalytics.com
BONUS SLIDES
HBASE AND SOLR
• Automatic partitioning/reindexing
• Automatic index updates on HBase inserts/deletes
• Mapping HBase cells to a Solr schema
• No perfect commercial/open source solution yet
• Many many many more...
HBASE + SOLR
AUTOMATIC INDEXING
• HBase coprocessors are like storedprocs/triggers
• New, powerful, and dangerous
• Triggers on HBase puts/deletes
• Mapping data to a schema?
HBASE + SOLR
WRITE DURABILITY
Solr
Shard 1
Solr
Shard 3
Solr
Shard 2
HBase Table - SOLR_QUEUE
MapReduce Indexing Application
Solr Queue
Reader
Create SolrDocument from raw ASUP1
Get oldest SolrDocument from HBase
Queue Table
3
Use custom hash algorithm to determine
which shard to add SolrDocument to
4
Query Solr, if SolrDocument
was added, then remove it
from the SOLR_QUEUE
5
SolrDocument
SolrDocument
Add SolrDocument to HBase for durability2
HBASE + SOLR
ELASTIC SHARDING
• HBase’s distributing mechanism uses the concept of regions to
split data across many nodes
• Region splitting can be automatic or manual (performance
degradation as regions split)
• Piggybacking Solr sharding on HBase Region splitting

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformIke Ellis
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbenchRan Wei
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Mats Uddenfeldt
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Dataconomy Media
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatternsAnurag S
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsClaudiu Barbura
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...Alluxio, Inc.
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetCarl W. Handlin
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Future of pandas
Future of pandasFuture of pandas
Future of pandasJeff Reback
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarNilesh Shah
 

Was ist angesagt? (20)

Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platform
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatterns
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internals
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache Superset
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Future of pandas
Future of pandasFuture of pandas
Future of pandas
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 

Ähnlich wie Searching Billions of Product Logs in Real Time (Use Case)

20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with HadoopJayant Shekhar
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014NoSQLmatters
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudCAMMS
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupJohannes Moser
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Lucas Jellema
 
Elasticsearch and the Database Market
Elasticsearch and the Database MarketElasticsearch and the Database Market
Elasticsearch and the Database MarketObjectRocket
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureVenu Anuganti
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack IntroductionVikram Shinde
 
Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopAnu Ravindranath
 

Ähnlich wie Searching Billions of Product Logs in Real Time (Use Case) (20)

20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Apache drill
Apache drillApache drill
Apache drill
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
Sebastian Cohnen – Building a Startup with NoSQL - NoSQL matters Barcelona 2014
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB Meetup
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
 
Elasticsearch and the Database Market
Elasticsearch and the Database MarketElasticsearch and the Database Market
Elasticsearch and the Database Market
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoop
 

Kürzlich hochgeladen

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Searching Billions of Product Logs in Real Time (Use Case)

  • 1. SEARCHING BILLIONS OF PRODUCT LOGS IN REALTIME RyanTabora -Think Big Analytics NoSQL Search Roadshow - June 6, 2013
  • 2. WHO AM I? RyanTabora Think Big Analytics - Senior Data Engineer Lover of dachshunds, bass, and zombies
  • 3. OVERVIEW Primers What are product logs? How do they apply to big data? Real use case Real issues and designs Conclusion
  • 4. PRODUCT LOGS? • Device data • IT, Energy, Healthcare, Manufacturing,Telecom ... • These devices are pushing data back home (pull works too!) • As more devices are sold/installed, more and more data comes back to ‘home base’
  • 5. • RealtimeVisualization • Realtime Response • Ad Hoc Analysis • Full Historical Capture • Blended Data Sets POWER OF DEVICE DATA DEVICE DATA
  • 6. TRADITIONAL APPROACHES • SQL: PostGres, MySQL, Oracle, Microsoft • SQL provides many of the search features required for typical search applications • Joins, regex, group by, sorting, etc • But the these technologies can only scale so far...
  • 7. • Hadoop • HBase/Cassandra/Accumulo • Search features are very limited • HBase row scans, primary key index • Cassandra limited secondary indexing NEWTECHNIQUES STORING DATA
  • 8. • What is an index? • Lucene • Paralleling Index Creation • MapReduce/Flume/Storm • RealTime Search • Searching before it hits disk NEWTECHNIQUES INDEXING DATA
  • 9. • Solr/ElasticSearch • Both build on top of Lucene • Search servers • RESTful HTTP APIs • Easy to administer • Add powerful text/numerical search capabilities NEWTECHNIQUES SEARCHING DATA
  • 10. BASIC SEARCH FEATURES • Boolean logic (AND, OR + -) • Sorting and Group By • Range queries • Phrase/Prefix/Fuzzy queries
  • 11. ADVANCED SEARCH FEATURES • Custom ranking/scoring • More like this • Auto suggest • Faceting/Highlighting • Geo-spacial search
  • 12. SCALING SEARCH • ElasticSearch and SolrCloud both have distributed features built in • Auto-sharding • Replication • Query routing • Transaction log
  • 13. USE CASE Problem Sample Solution Core Design Issues Other Solutions
  • 14. THE PROBLEM Home Base NetApp Filer NetApp FilerDevice Client A NetApp Filer NetApp FilerDevice Client B NetApp Filer NetApp FilerDevice Client C Log Log Log Logs REST API Full SQL Access Flat File Access Latest All Applications Engineers & Analysts
  • 15. SEARCH APPLICATION FEATURES • Find last three days of raw logs from an entire cluster • Group capacity available grouped by machine serial number and show the largest capacities first • Search all device header lines for “FAILURE” • View all hard disk objects that have product number 2341AB • Find all motherboards with an associated customer ticket
  • 18. PARSING, LOADING,AND INDEXING Load HBase with parsed objects Store HBase ROW_ID Store pointer to raw file in HDFS Index a number of desired fields /ingestion/sequencefile1 /ingestion/sequencefile2 /ingestion/sequencefile3 1534 4562 5323 7232 4601 5105 0 0 0 1492 2987 4767 5987
  • 19. INSIDE OF HBASE ...... ...... ...... ... ...... ......... object5 …... object4 ... object3object2 ... object1rowkey
  • 21. SEARCH APPLICATION Search Application Query Data locations Stored Object User Query Results 2 1 3 4 5 8 6 7 Raw Data Location Raw Data Row_id
  • 22. CORE DESIGN ISSUES • Changing the Solr schema (manual reindex) • Elastic shard scaling (manual reindex) • No distributed joining (denormalizing the data) • Replication* • Manually managing Solr partitioning/sharding* • Write durability*
  • 23. SOLRCLOUD • Automatic shard creation, routing • Replication • Limited to a fixed number of shards defined on initial creation • ZooKeeper for coordination • Large community
  • 24. ELASTICSEARCH • Similar feature set to Solr • Purpose built for easily managing a distributed index • Rapidly growing community • Custom built coordination mechanism • JSON based API
  • 25. • Integrates Cassandra and Solr • Automatic indexing in Solr/storing in Cassandra • Automatic partitioning • Automatic reindexing • Not limited to fixed number of shards • Proprietary and costs money DATASTAX ENTERPRISE +
  • 26. • Collecting and analyzing device data/product logs can be a very difficult challenge • You can use NoSQL and search technologies like Solr or ElasticSearch in unison... • ...but it is not always easy to integrate search with NoSQL CONCLUSION
  • 27. QUESTIONS? • Feel free to reach out if you have any questions or need help with big data/search! • http://ryantabora.com • http://thinkbiganalytics.com • @ryantabora • ryan.tabora@thinkbiganalytics.com
  • 29. HBASE AND SOLR • Automatic partitioning/reindexing • Automatic index updates on HBase inserts/deletes • Mapping HBase cells to a Solr schema • No perfect commercial/open source solution yet • Many many many more...
  • 30. HBASE + SOLR AUTOMATIC INDEXING • HBase coprocessors are like storedprocs/triggers • New, powerful, and dangerous • Triggers on HBase puts/deletes • Mapping data to a schema?
  • 31. HBASE + SOLR WRITE DURABILITY Solr Shard 1 Solr Shard 3 Solr Shard 2 HBase Table - SOLR_QUEUE MapReduce Indexing Application Solr Queue Reader Create SolrDocument from raw ASUP1 Get oldest SolrDocument from HBase Queue Table 3 Use custom hash algorithm to determine which shard to add SolrDocument to 4 Query Solr, if SolrDocument was added, then remove it from the SOLR_QUEUE 5 SolrDocument SolrDocument Add SolrDocument to HBase for durability2
  • 32. HBASE + SOLR ELASTIC SHARDING • HBase’s distributing mechanism uses the concept of regions to split data across many nodes • Region splitting can be automatic or manual (performance degradation as regions split) • Piggybacking Solr sharding on HBase Region splitting