Large Scale Log Analysis with HBase and Solr at Amadeus
Martin Alig, ETH Zurich
aligma@student.ethz.ch
Overview
- Problem
- Solution - Overview
- HBase
- Solr
- Solution - Details
- Results




Monday, 16 July 2012
Problem

- Amadeus is the world's leading technology provider
  to the travel industry, providing marketing,
  distribution and IT services worldwide.
- The Amadeus computer reservation system (CRS)
  processed 850 million billable travel transactions in
  2010.
- The current logging framework produces 100,000 to
  1,000,000 messages per second.




Problem - Log Messages

- Messages average 1 KB in size.
- A message can be anything: XML, EDIFACT, hex
  dump, ...
- A few fixed attributes are given per message:
  timestamp, source, various IDs.
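To get a feel for the scale, the message rate from the previous slide and the 1 KB average size can be combined into a rough daily volume. This is only a back-of-envelope sketch: it assumes 1 KB means 1,000 bytes, that the rate is sustained around the clock, and it ignores compression and replication.

```python
# Back-of-envelope log volume from the slide figures:
# 100,000 to 1,000,000 messages per second at about 1 KB each.
# Assumptions: 1 KB = 1,000 bytes, rate sustained for a full day.
BYTES_PER_MESSAGE = 1_000
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(messages_per_second: int) -> float:
    """Return the uncompressed daily volume in terabytes (10**12 bytes)."""
    return messages_per_second * BYTES_PER_MESSAGE * SECONDS_PER_DAY / 1e12


low = daily_volume_tb(100_000)
high = daily_volume_tb(1_000_000)
print(f"{low:.2f} TB/day to {high:.2f} TB/day")  # 8.64 TB/day to 86.40 TB/day
```

Under these assumptions the raw stream is on the order of 100 MB/s to 1 GB/s.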




Problem - Current Solution

- Write log messages to plain text files.
- Split, compress and copy to SAN.




      Queries? Search? Statistics?




Solution Overview

- Apache HBase for storage and instant random
  access.
- Apache Hadoop MapReduce for complex queries.
- Apache Solr as the full-text search engine for queries
  on the log messages.




Apache HBase

- Open source, non-relational, distributed database.
- Modeled after Google's Bigtable.
- Runs on top of the Hadoop Distributed File System
  (HDFS).




HBase - Terms

- Region
    - Contiguous ranges of rows stored together
    - Dynamically split / merged and distributed
- RegionServer (slave)
    - Serves regions, i.e. the data for reads and writes
- HMaster (master)
    - Responsible for coordination
    - Assigns regions to RegionServers, detects failures
    - Admin functions




HBase - Architecture

                         ZooKeeper
                                         HMaster
       Client            ZooKeeper
                                         HMaster
                         ZooKeeper




       RegionServer     RegionServer   RegionServer



                           HDFS



HBase - Data Access

- Java API
- REST
- Apache Avro, Apache Thrift
- Hadoop MapReduce




HBase - Secondary Indexes

- No native support for secondary indexes.
- Several options:
    - Client-managed: write the value to the data table and the
      index entry to an index table.
    - Coprocessors that automatically create the secondary index.
    - Periodic update: use a MapReduce job to build the index.
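The client-managed option can be sketched with two in-memory dictionaries standing in for the data table and the index table. This is a toy model, not the HBase client API. The index row key layout `<attribute>=<value>|<data row key>` and the attribute names are assumptions made for illustration; the slides do not specify them.

```python
# Toy stand-ins for two HBase tables (plain dicts, not the HBase API).
# Assumed index row key layout: "<attribute>=<value>|<data row key>".
data_table: dict[str, dict[str, str]] = {}
index_table: dict[str, str] = {}

INDEXED_ATTRIBUTES = ("source", "corr_id")  # hypothetical attribute names


def put_with_index(row_key: str, record: dict[str, str]) -> None:
    """Write the record, then one index entry per indexed attribute."""
    data_table[row_key] = record
    for attr in INDEXED_ATTRIBUTES:
        if attr in record:
            index_table[f"{attr}={record[attr]}|{row_key}"] = row_key


def lookup(attr: str, value: str) -> list[str]:
    """Return data row keys whose attribute equals value (prefix match)."""
    prefix = f"{attr}={value}|"
    return sorted(rk for key, rk in index_table.items() if key.startswith(prefix))


put_with_index("20120515-0001", {"source": "sitst201", "corr_id": "000100E1A1EU42"})
put_with_index("20120515-0002", {"source": "sitmt301", "corr_id": "09B5840E"})
print(lookup("source", "sitst201"))  # ['20120515-0001']
```

In this scheme every indexed attribute costs one extra write per record, and the client is responsible for keeping both tables consistent.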




HBase - Coprocessors

- Run arbitrary code on any node:
    - Observer: RegionObserver, MasterObserver and WALObserver
      provide hooks for code execution
      (prePut, postPut, preGet, postGet, ...).
    - Endpoint: installed on nodes, executed on client request.
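The observer idea can be illustrated with a small in-memory model: a region calls registered hooks before and after each write. Real HBase coprocessors are Java classes loaded by the RegionServer, so the classes and method names below (`ToyRegion`, `CollectingObserver`, `pre_put`, `post_put`) are hypothetical stand-ins that only mirror the hook pattern.

```python
class CollectingObserver:
    """Hypothetical observer that records every row after it is stored."""

    def __init__(self) -> None:
        self.seen: list[tuple[str, str]] = []

    def pre_put(self, row_key: str, value: str) -> None:
        pass  # nothing to do before the write in this sketch

    def post_put(self, row_key: str, value: str) -> None:
        # A real deployment could forward the row to a search index here.
        self.seen.append((row_key, value))


class ToyRegion:
    """In-memory stand-in for a region; not the HBase API."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}
        self.observers: list[CollectingObserver] = []

    def put(self, row_key: str, value: str) -> None:
        for observer in self.observers:
            observer.pre_put(row_key, value)
        self.rows[row_key] = value
        for observer in self.observers:
            observer.post_put(row_key, value)


region = ToyRegion()
observer = CollectingObserver()
region.observers.append(observer)
region.put("row-1", "Message sent [len=354]")
print(observer.seen)  # [('row-1', 'Message sent [len=354]')]
```

The point of the pattern is that the client only issues the put; the follow-up work runs on the server side next to the data.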




Apache Solr

- Apache Lucene plus many features, such as:
    - Distributed index
    - Distributed search
    - ...
- Apache Lucene is a high-performance, full-featured
  text search engine library.
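Full-text engines such as Lucene answer queries from an inverted index, which maps each term to the documents containing it. The sketch below shows only that core idea. The tokenizer (lowercase, split on non-alphanumeric characters) is a simplifying assumption, and `ToyInvertedIndex` is a hypothetical name, not a Lucene or Solr class.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (an assumption)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class ToyInvertedIndex:
    """Maps each token to the set of document ids that contain it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> set[str]:
        """Return ids of documents containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = ToyInvertedIndex()
index.add("msg-1", "Query [SAP=1ASICDBETK, TRXNB=1, CorrID=000100E1A1EU42]")
index.add("msg-2", "Message received [CorrID=09B5840E]")
print(index.search("corrid 000100e1a1eu42"))  # {'msg-1'}
```

How well such tokenization works on XML, EDIFACT or hex dumps depends heavily on the analyzer chosen, which is one reason log messages are a harder case than natural-language text.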




Solution - Details

Client
    Insert log messages, create secondary indexes for
    predefined attributes.

HBase
    Use coprocessor functionality to index log messages
    in Solr after insert.

Solr


Solution - Cluster Configuration
Client          ZooKeeper       NameNode
                                SecondaryNameNode
                                HMaster

DataNode        DataNode        ...     DataNode
RegionServer    RegionServer            RegionServer
Solr            Solr                    Solr

Solution - HBase & MapReduce

- MapReduce is very well integrated with HBase.
- HBase is easy to use as a data source, a data sink or both.
- Helper classes are provided.




Solution - Problems

- Can Solr keep up with HBase?
- Is Solr full-text search practical for log messages
  (XML, other formats, ...)?




Results

- Not many yet.
- Generic experiments with random data.
- Experiments with real log data have just started.




Results - Write Random Data - HBase Only
- Insert random data, 1 KB records.
- Cluster configuration:
    - 5 nodes:
        - RAM: 24 GiB
        - CPU: Intel Xeon L5520, 2.26 GHz
        - HD: 2x 15k RPM SAS 73 GB (RAID 1)
    - Node 1: master (NameNode, HMaster, ZooKeeper)
    - Nodes 2-5: slaves (DataNode, RegionServer)
- Client on a separate node.
- Experiment executed with and without secondary
  indexes (5 additional indexes).


Results - Write Random Data - HBase Only

Configuration             Avg. inserts/sec
No secondary indexes      ~30,000
Secondary indexes         ~6,000 (not counting index inserts)
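One possible reading of these two numbers is plain write amplification. The sketch below assumes that each record with 5 secondary indexes triggers one data write plus one write per index; that assumption is an interpretation and not something the slides state.

```python
# Figures from the slide; index inserts are not counted in the indexed rate.
RATE_NO_INDEX = 30_000    # records/sec without secondary indexes
RATE_WITH_INDEX = 6_000   # records/sec with secondary indexes
NUM_INDEXES = 5           # additional indexes per record

# Assumption: one data write plus one index write per index, per record.
writes_per_record = 1 + NUM_INDEXES
total_writes_per_sec = RATE_WITH_INDEX * writes_per_record
slowdown = RATE_NO_INDEX / RATE_WITH_INDEX

print(total_writes_per_sec)  # 36000
print(slowdown)              # 5.0
```

Under that assumption the cluster performs roughly 36,000 writes per second in the indexed case, close to the ~30,000 of the unindexed case, so the fivefold drop in record throughput would mostly reflect the extra index writes.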




Results - Write Read Data - HBase & Solr

- No real numbers yet.
- First tests: a single Solr instance indexes ~1,000 log
  messages per second.




Questions




HBase - Architecture




[Figure] Source: HBase - The Definitive Guide

HBase - Key Design




[Figure] Source: HBase - The Definitive Guide

HBase - Hardware

- Master
    - RAM: 24 GB
    - CPU: dual quad-core
    - Disks: 4 x 1 TB SATA, RAID 0+1
- Slave
    - RAM: 24 GB or more
    - CPU: dual quad-core
    - Disks: 6+ x 1 TB SATA, JBOD




HBase - Monitoring

- Ganglia is a scalable distributed monitoring system
  for high-performance computing systems such as
  clusters and grids.
- HBase provides metrics for Ganglia.




Log Message Example (1)

      2012/05/15 04:33:04.783757 sitst201 srvT2M-838059 Trace
      name: all0302
      Message sent [con=19104962 (FE_EXT_TCIL-ISO9735_ETK-
      310_OPK2_ETK-REQ), cxn=1498840662
      (172.17.39.174:13101), addr=0x1db58830, len=354,
      CorrID=000100E1A1EU42,
      MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]
      UNB^]IATB^_1^]1AETH^_^_LY^]CDBETICKET^_^_LY^]1205
      15^_0433^]00JNQPH79K0001^]^]^]O^UNH^]1^]TKCREQ^
      _08^_5^_1A^]000100E1A1EU42^DCX^]134^]<DCC
      VERS="1.0"><MW><UKEY VAL="EXRU$3013#GJ12V4K#1IZ"
      TRXNB="1"/><$



Log Message Example (2)

      2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace
      name: all0302
      Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ,
      TRXNB=1, CorrID=000100E1A1EU42,
      MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]
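The fixed attributes mentioned earlier (timestamp, source, various IDs) can be pulled out of such a message with a regular expression. This is a sketch only: the header layout is inferred from the three examples in these slides, the field names `host`, `process`, `level` and `name` are guesses rather than Amadeus terminology, and the message is joined onto one line for parsing.

```python
import re

# Header layout inferred from the example messages; field names are guesses.
HEADER = re.compile(
    r"(?P<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<host>\S+) (?P<process>\S+) (?P<level>\S+)\s+name: (?P<name>\S+)"
)
# Only the two IDs that appear in every example are extracted here.
KEY_VALUE = re.compile(r"(CorrID|MsgID)=([0-9A-Za-z]+)")


def parse(message: str) -> dict[str, str]:
    """Extract the header attributes plus CorrID and MsgID, if present."""
    match = HEADER.match(message)
    if match is None:
        raise ValueError("unrecognised log header")
    fields = match.groupdict()
    fields.update(dict(KEY_VALUE.findall(message)))
    return fields


example = (
    "2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace name: all0302 "
    "Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ, TRXNB=1, "
    "CorrID=000100E1A1EU42, MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]"
)
fields = parse(example)
print(fields["timestamp"], fields["host"], fields["CorrID"])
```

Attributes extracted this way are natural candidates for the predefined secondary indexes, while the free-form remainder of the message is what a full-text index would cover.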




Log Message Example (3)

      2012/05/15 04:32:42.289282 sitmt301 muxT2-332108 Trace
      name: all0302
      Message received [con=17697 (inSrvT2_TCIL_1),
      cxn=1626671045 (194.156.170.210:8000),
      addr=0x13e9b830, len=1710, CorrID=09B5840E,
      MsgID=OX7E09RYABBLS61HR2DXTL]
      +----- ADDR -----+--------------- HEX ---------------+----- ASCII ----+---- EBCDIC ----+
      0000000013e9b830 554e421d 49415442 1f311d31 4153494c UNB.IATB.1.1ASIL .+.............<
      0000000013e9b840 53533243 53544e1d 3141304c 53534352 SS2CSTN.1A0LSSCR ......+....<....
      0000000013e9b850 591d3132 30353135 1f303433 321d3030 Y.120515.0432.00 ................
      0000000013e9b860 39 ...
