Running a Realtime Stats
   Service on MySQL
       Cybozu Labs, Inc.
         Kazuho Oku
Background




Apr. 23 2009   Running Realtime Stats Service on MySQL
Who am I?

    Name: Kazuho Oku
    Original Developer of Palmscape / Xiino
          The oldest web browser for Palm OS
    Working at Cybozu Labs since 2005
          Research subsidiary of Cybozu, Inc.
          Cybozu is a leading groupware vendor in Japan
          My weblog: tinyurl.com/kazuho



Introduction of Pathtraq




What is Pathtraq?

    Started in Aug. 2007
    Web ranking service
          One of Japan’s largest
                 10,000 users submit access information
                 1,000,000 access information records per day
          like Alexa, but semi-realtime, and per-page




What is Pathtraq? (cont'd)

    Automated Social News Service
          find what's hot
          like Google News + Digg
          calculate relevance from access stats
    Search by...
          no filtering (all the Internet)
          by category
          by keyword
          by URL (per-domain, etc.)
How to Provide Real-time Analysis?

    Data Set (as of Apr. 23 2009)
          # of URLs: 147,748,546
          # of total accesses: 413,272,527
    Sharding is not a good option
          since we need to join the tables and aggregate
                prefix-search by URL, search by keyword, then join
                 with access data table
          core tables should be stored on RAM
                not on HDD, due to lots of random access

Our Decision was to...

    Keep URL and access stats on RAM
          compression for size and speed
    Create a new message queue
    Limit Pre-computation Load
    Create our own cache, with locks
          to minimize database access
    Fulltext-search database on SSD

Our Servers

    Main Server
          Opteron 2218 x2, 64GB Mem
          MySQL, Apache
    Fulltext Search Server
          Opteron 240EE, 2GB Mem, Intel SSD
          MySQL (w. Tritonn/Senna)
    Helper Servers
          for Content Analysis
          for Screenshot Generation
The Long Tail of the Internet
                y = C x^{-0.44}
                # of URLs with 1/10 the hits: x2.75
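
The x2.75 figure can be checked from the exponent alone. The sketch below assumes that x is the number of hits per URL and y the number of URLs with that many hits; the axes are not labeled in this transcript, so that reading is an assumption, chosen because it reproduces the number on the slide.

```python
# Assumed reading of the slide: y = C * x**(-0.44), where
# x = hits per URL and y = number of URLs with that many hits.
EXPONENT = -0.44

def url_count_ratio(hit_factor: float) -> float:
    """How many times more URLs there are when hits are scaled by hit_factor."""
    return hit_factor ** EXPONENT

# Going down to one tenth of the hits multiplies the URL count by 10**0.44.
ratio = url_count_ratio(1 / 10)
print(f"URLs with 1/10 the hits: x{ratio:.2f}")
```

The constant C cancels out of the ratio, so only the exponent matters.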




Compressing URLs




Compressing URLs

    The Challenges:
          URLs are too short for gzip, etc.
          URLs should be prefix-searchable in compressed
           form
                How to run like 'http://www.mysql.com/%' on a
                 compressed URL?

    The Answer:
          Static PPM + Range Coder


Static PPM

    PPM: Prediction by Partial Matching
          What is the next character after ".co"?
                The answer is "m"!
          PPM is used by 7-zip, etc.
    Static PPM is PPM with static probabilistic
     model
          Many URLs (or English words) have common
           patterns
          Suitable for short texts (like URLs)
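
A toy illustration of a static prediction table, as a rough sketch only: it counts which character follows each fixed 3-character context in a few hypothetical URLs. Real PPM also falls back to shorter contexts when a context has not been seen; that part is omitted here. The sample URLs and function names are made up for the example and are not the Pathtraq table.

```python
from collections import Counter, defaultdict

ORDER = 3  # characters of context; ".co" is a 3-character context

def build_model(samples, order=ORDER):
    """Count which character follows each `order`-character context."""
    model = defaultdict(Counter)
    for text in samples:
        for i in range(order, len(text)):
            model[text[i - order:i]][text[i]] += 1
    return model

def predict(model, context):
    """Return (most likely next char, its probability) for a seen context."""
    counts = model[context]
    char, hits = counts.most_common(1)[0]
    return char, hits / sum(counts.values())

# Hypothetical training URLs (not Pathtraq data).
SAMPLE_URLS = [
    "http://www.mysql.com/",
    "http://example.com/",
    "http://www.example.co.jp/",
    "http://dev.mysql.com/doc/",
]

# The table is built once and then stays fixed: that is the "static" part.
model = build_model(SAMPLE_URLS)
next_char, probability = predict(model, ".co")
print(next_char, probability)
```

Because the table is fixed in advance, even a very short string benefits from it immediately, which is what makes the approach suit URLs.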
Range Coder

    A fast variant of arithmetic compression
          similar to Huffman encoding, but better
          If the probability of the next character being "m" is
           75%, it will be encoded into 0.42 bits
    Compressed strings preserve the sort
     order of uncompressed form
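
The 0.42-bit figure is the ideal code length for a symbol with probability 0.75, that is -log2(0.75). A short check follows; the function name is made up for this example. A Huffman code cannot spend less than one whole bit per symbol, which is where an arithmetic-style coder gains.

```python
import math

def code_length_bits(probability: float) -> float:
    """Ideal cost, in bits, of coding a symbol that has the given probability."""
    return -math.log2(probability)

bits_for_m = code_length_bits(0.75)
print(f"{bits_for_m:.2f} bits")
```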



Create Compression Functions

    Build prediction table from stored URLs
    Implement range coder
          took an open-source impl. and optimized it
                the original impl. appended some unnecessary bits at the tail
                use SSE instructions for faster operation
                coderepos.org/share/browser/lang/cplusplus/range_coder

    Link the coder and the table to create
     MySQL UDFs

Rewriting the Server Logic

    Change schema
         before: url varchar(255) not null      # with unique index
         after:  urlc varbinary(767) not null   # with unique index

    Change prefix-search form
         before: url like 'http://example.com/%'
         after:  url_compress('http://example.com/') <= urlc and
                   urlc < url_compress('http://example.com0')
         Note: "0" is the character that follows '/' in ASCII
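
The upper bound of the range is the prefix with its last byte incremented. A minimal sketch of that step on uncompressed byte strings follows; `prefix_range` is a name made up for this example. In the SQL form, both bounds would then go through url_compress, which works only because the compressed form preserves sort order.

```python
def prefix_range(prefix: bytes):
    """Return (lower, upper) such that lower <= s < upper holds exactly for
    the byte strings s that start with prefix. upper is None if unbounded."""
    trimmed = prefix.rstrip(b"\xff")  # a trailing 0xff byte cannot be incremented
    if not trimmed:
        return prefix, None
    upper = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, upper

lower, upper = prefix_range(b"http://example.com/")
print(lower, upper)
```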
Compression Ratio

    Compression ratio: 37%
          Size of prediction table: 4MB
    Benchmark of the compression functions
          compression: 40MB/sec. (570k URLs/sec.)
          decompression: 19.3MB/sec. (280k URLs/sec.)
          fast enough since searchable in compressed form
    Prefix-search became faster
          shorter indexes lead to faster operation

Re: InnoDB Compression

    URL compression can coexist with
     InnoDB compression
                though we aren't using InnoDB compression in our
                 production environment

                Compression                                     Table Size
                N/A                                                    100%
                URL compression                                         57%
                InnoDB compression                                      50%
                using both                                              33%
Compressing the Stats Table

    Used to have two int columns: at, cnt
          it was a waste of space, since...
                most cnt values are very small numbers
                most accesses to each URL occur within a short period (e.g.
                 the day the blog entry was written)
                the at field has to be part of the indexes

                 at (hours since epoch)    cnt (# of hits)
                 330168                    1
                 330169                    2
                 330173                    1
                 330197                    1

Compressing the Stats Table (cont'd)

    Merge the rows into a sparse array
          example on the prev. page becomes:
             (offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1
    Then compress the array
          the example becomes a blob of 8 bytes
          originally it was 8 bytes x 4 rows, plus the index
    And store the array in a single column
          fewer rows lead to smaller table, faster access
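
A sketch of the row-merging step, using the four example rows from the previous slide. The array is laid out newest hour first, which is an assumption based on the offset being the latest hour. Run-length encoding stands in for the compression step, since the slides do not say which scheme produces the 8-byte blob. Function names are made up for this example.

```python
def to_sparse_array(rows):
    """Merge (at, cnt) rows into (offset, counts), newest hour first."""
    by_hour = dict(rows)
    offset, oldest = max(by_hour), min(by_hour)
    # Hours with no row become explicit zeros.
    counts = [by_hour.get(hour, 0) for hour in range(offset, oldest - 1, -1)]
    return offset, counts

def run_length(counts):
    """Collapse repeated values into [value, repeat] pairs."""
    runs = []
    for value in counts:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

ROWS = [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]
offset, counts = to_sparse_array(ROWS)
runs = run_length(counts)
print(offset, runs)
```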

Compressing the Stats Table (cont'd)

    Write MySQL UDFs to access the sparse
     array
                cnt_add(column,at,cnt)
                      -- adds cnt on given index (at)
                cnt_between(column,from,to)
                      -- returns # of hits between given hours
                and more...

    We use int[N] arrays for vectorized calc.
          especially when creating access charts
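
To make the intended semantics of the two UDFs concrete, here is an in-memory stand-in. This is not the UDF implementation: the real functions operate on the compressed column value, and whether the bounds of cnt_between are inclusive is an assumption made here.

```python
class HourlyCounts:
    """In-memory stand-in for the sparse-array column of one URL."""

    def __init__(self):
        self._hits = {}  # hour since epoch -> hit count

    def cnt_add(self, at, cnt):
        """Add cnt hits to the hour `at`."""
        self._hits[at] = self._hits.get(at, 0) + cnt

    def cnt_between(self, start, end):
        """Total hits for start <= hour <= end (inclusive bounds assumed)."""
        return sum(c for hour, c in self._hits.items() if start <= hour <= end)

stats = HourlyCounts()
for at, cnt in [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]:
    stats.cnt_add(at, cnt)
stats.cnt_add(330169, 1)  # one more hit in an hour that already has a count
print(stats.cnt_between(330168, 330173))
```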

Create a new Message Queue




Q4M

    A simple, reliable, fast message queue
          runs as a pluggable storage engine of MySQL
          GPL License; q4m.31tools.com
          presented yesterday at MySQL Conference :-p
                slides at tinyurl.com/q4m2009

    Used for relaying messages between our
     servers


Limiting Pre-computation Load




Limit # of CPU-intensive Pre-computations

    Use cron & setlock
          setlock is part of daemontools by djb
    setlock
          serializes processes by using flock
          -n option: use trylock; if locked, do nothing

   # use only one CPU core for pre-computation
   */2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries
   5 0 * * *   setlock /tmp/tasks.lock precompute_yesterday_data
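
For readers without daemontools, the same serialization can be approximated with flock directly. The sketch below is not setlock itself; `run_exclusive` is a hypothetical helper, and `blocking=False` plays the role of the -n option (skip the task if the lock is already held). It relies on the POSIX-only `fcntl` module.

```python
import fcntl

def run_exclusive(lock_path, task, blocking=True):
    """Run task() while holding an exclusive flock on lock_path.

    Returns True if the task ran, False if the lock was busy and
    blocking is False (the equivalent of `setlock -n`).
    """
    with open(lock_path, "a") as lock_file:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        try:
            fcntl.flock(lock_file, flags)
        except BlockingIOError:
            return False  # someone else holds the lock: do nothing
        try:
            task()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
        return True
```

With this shape, the every-two-minutes job would call it with `blocking=False` and the daily job with the default, so the daily job waits its turn while the frequent job simply skips a run.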



Limit # of Disk-intensive Pre-computations

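    Divide pre-computation into blocks and
     sleep depending on the elapsed time


   use List::Util qw(max);
   use Time::HiRes qw(time sleep);   # sub-second timing and sleeping

   my $LOAD = 0.25;   # fraction of wall-clock time spent computing

   while (1) {
     my $start = time();
     precompute_block();
     # rest long enough that busy time / total time stays near $LOAD
     sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD);
   }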

Creating our own Cache System




The Problem

    Query cache is flushed on table update
          access stats can be (should be) cached for a
           certain period
    Memcached has a thundering-herd
     problem
          all clients try to read the database when a
           cached-entry expires
          critical for us since our queries do joins,
           aggregations, and sort operations

Swifty and KeyedMutex

    Swifty is a mmap-based cache
          cached data shared between processes
          lock-free on read, flock on write
          notifies a single client that the accessed entry is
           going to expire within a few seconds
          notified client can start updating a cache entry
           before it expires
    KeyedMutex
          a daemon used to block multiple clients issuing
           the same SQL query
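
The idea of blocking duplicate queries can be sketched inside a single process with one lock per key. This is only an analogy: KeyedMutex is a separate daemon that coordinates multiple client processes, and Swifty adds expiry and early-refresh notification, none of which is modeled here. The class name is made up for this example.

```python
import threading

class SingleFlightCache:
    """Cache where only one caller per key computes a missing value."""

    def __init__(self):
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key, compute):
        if key in self._values:
            return self._values[key]
        with self._lock_for(key):
            # Re-check: another caller may have filled the entry while we waited.
            if key not in self._values:
                self._values[key] = compute()
            return self._values[key]
```

Callers that arrive while the first one is still computing wait on the per-key lock and then read the cached result, so the expensive query runs once instead of once per client.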
Swifty and KeyedMutex (cont'd)

    Source code is available:
          coderepos.org/share/browser/lang/c/swifty
          coderepos.org/share/browser/lang/perl/Cache-Swifty
          coderepos.org/share/browser/lang/perl/KeyedMutex




Fulltext-search on SSD




Senna / Tritonn

    Senna is an FTS engine popular in Japan
          might not work well with European languages
    Tritonn is a replacement for MyISAM FTS
          uses Senna as backend
          faster than MyISAM FTS
    Wrote patches to support SSD
          during our transition from RAM to SSD
          patches accepted in Senna 1.1.4 / Tritonn 1.0.12

FTS: RAM-based vs. SSD-based

    Size of FTS data:    20GB
    Downgraded hardware to see if SSD-based FTS
     is feasible
    Speed dropped to ¼
          but search latency is well below one second

                   Old Hardware                New Hardware
        CPU        Opteron 2218 (2.6GHz) x2    Opteron 240 (1.4GHz)
        Memory     32GB                        2GB
        Storage    7200rpm SATA HDD            SSD (Intel X25-M)
Summary




Summary

    Use UDFs for optimization
    Sometimes it is easier to scale UP
          esp. when you can estimate your data growth
    Use SSD for FTS
          Baidu (China's leading search engine) uses SSD
    Most of the things introduced are OSS
          We plan to open-source our URL compression
           table as well

We are Looking for...

    If you are interested in localizing
     Pathtraq to your country, please contact
     us
          we do not have resources outside of Japan
                to translate the web interface
                to ask people to install our browser extension
                to follow local regulations, etc.




Thank you for listening

                   tinyurl.com/kazuho



Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

Running a Realtime Stats Service on MySQL

  • 1. Running a Realtime Stats Service on MySQL Cybozu Labs, Inc. Kazuho Oku
  • 2. Background 2 Apr. 23 2009 Running Realtime Stats Service on MySQL
  • 3. Who am I?
        Name: Kazuho Oku
        Original developer of Palmscape / Xiino
            The oldest web browser for Palm OS
        Working at Cybozu Labs since 2005
            Research subsidiary of Cybozu, Inc.
            Cybozu is a leading groupware vendor in Japan
        My weblog: tinyurl.com/kazuho
  • 4. Introduction of Pathtraq
  • 5. What is Pathtraq?
        Started in Aug. 2007
        Web ranking service
            One of Japan's largest
                10,000 users submit access information
                1,000,000 access information entries per day
            like Alexa, but semi-realtime, and per-page
  • 6. What is Pathtraq? (cont'd)
        Automated Social News Service
            find what's hot
            like Google News + Digg
            calculate relevance from access stats
        Search by...
            no filtering (all the Internet)
            by category
            by keyword
            by URL (per-domain, etc.)
  • 7.
  • 8.
  • 9. How to Provide Real-time Analysis?
        Data Set (as of Apr. 23 2009)
            # of URLs: 147,748,546
            # of total accesses: 413,272,527
        Sharding is not a good option
            since we need to join the tables and aggregate
                prefix-search by URL, search by keyword, then join with access data table
        Core tables should be stored in RAM
            not on HDD, due to lots of random access
  • 10. Our Decision was to...
        Keep URL and access stats in RAM
            compression for size and speed
        Create a new message queue
        Limit pre-computation load
        Create our own cache, with locks
            to minimize database access
        Fulltext-search database on SSD
  • 11. Our Servers
        Main Server
            Opteron 2218 x2, 64GB Mem
            MySQL, Apache
        Fulltext Search Server
            Opteron 240EE, 2GB Mem, Intel SSD
            MySQL (w. Tritonn/Senna)
        Helper Servers
            for Content Analysis
            for Screenshot Generation
  • 12. The Long Tail of the Internet
        $y = C x^{-0.44}$
        # of URLs with 1/10 hits: x2.75
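The x2.75 figure follows from the exponent if x is read as the hit count and y as the number of URLs with that many hits. The chart axes did not survive in this transcript, so that reading is an assumption:

```latex
\frac{y(x/10)}{y(x)}
  = \frac{C\,(x/10)^{-0.44}}{C\,x^{-0.44}}
  = 10^{0.44}
  \approx 2.75
```

In other words, each time the hit count drops by a factor of ten, there are roughly 2.75 times as many URLs at that level.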
  • 13. Compressing URLs
  • 14. Compressing URLs
        The Challenges:
            URLs are too short for gzip, etc.
            URLs should be prefix-searchable in compressed form
                How to run LIKE 'http://www.mysql.com/%' on a compressed URL?
        The Answer:
            Static PPM + Range Coder
  • 15. Static PPM
        PPM: Prediction by Partial Matching
            What is the next character after ".co"?
                The answer is "m"!
            PPM is used by 7-zip, etc.
        Static PPM is PPM with a static probabilistic model
            Many URLs (or English words) have common patterns
            Suitable for short texts (like URLs)
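A minimal sketch of the idea behind a static prediction table, assuming a fixed context length of three characters. The function names and the sample URLs are made up for illustration. A real PPM model also falls back to shorter contexts when a context is unseen, which this sketch leaves out:

```python
from collections import Counter, defaultdict


def build_table(urls, order=3):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for url in urls:
        for i in range(order, len(url)):
            table[url[i - order:i]][url[i]] += 1
    return table


def predict(table, context):
    """Return (char, probability) pairs for a context, most likely first."""
    counts = table.get(context, Counter())
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ch, n / total) for ch, n in counts.most_common()]


# Made-up training data; the real table is built from the stored URLs.
urls = [
    "http://www.mysql.com/",
    "http://example.com/a",
    "http://example.com/b",
    "http://www.example.co.jp/",
]
table = build_table(urls)
print(predict(table, ".co"))  # [('m', 0.75), ('.', 0.25)]
```

Because the table is built once and never updated while coding, the compressor and decompressor can share it without transmitting any model data per URL.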
  • 16. Range Coder
        A fast variant of arithmetic coding
            similar to Huffman coding, but better
            If the probability of the next character being "m" is 75%, it is encoded in 0.42 bits
        Compressed strings preserve the sort order of the uncompressed form
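Both claims on this slide can be checked with a toy arithmetic coder. The sketch below uses exact fractions and a per-character (order-0) model instead of the fixed-width integer arithmetic and PPM table of a real range coder, so it only illustrates the principle. `EOS`, `make_model`, and `encode` are hypothetical names:

```python
import math
from collections import Counter
from fractions import Fraction

EOS = ""  # hypothetical end-of-string marker; sorts before every character


def bits_for(probability):
    """Ideal code length in bits for a symbol with the given probability."""
    return -math.log2(probability)


def make_model(weights):
    """Give each symbol a slice of [0, 1) proportional to its weight, in sorted order."""
    total = sum(weights.values())
    model, low = {}, Fraction(0)
    for sym in sorted(weights):
        width = Fraction(weights[sym], total)
        model[sym] = (low, width)
        low += width
    return model


def encode(text, model):
    """Narrow [0, 1) once per symbol and return the lower bound of the final interval."""
    low, width = Fraction(0), Fraction(1)
    for sym in list(text) + [EOS]:
        sym_low, sym_width = model[sym]
        low += width * sym_low
        width *= sym_width
    return low


print(round(bits_for(0.75), 2))  # 0.42

urls = ["http://b/", "http://a/b", "ftp://x", "http://a/"]
weights = Counter("".join(urls))
weights[EOS] = 1
model = make_model(weights)
by_code = sorted(urls, key=lambda u: encode(u, model))
print(by_code == sorted(urls))  # True
```

The order is preserved because the symbol slices are laid out in sorted order and the end marker takes the lowest slice, so a string always codes below any longer string it is a prefix of.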
  • 17. Create Compression Functions
        Build prediction table from stored URLs
        Implement range coder
            took an open-source impl. and optimized it
                original impl. added some unnecessary bits at the tail
                use SSE instructions for faster operation
            coderepos.org/share/browser/lang/cplusplus/range_coder
        Link the coder and the table to create MySQL UDFs
  • 18. Rewriting the Server Logic
        Change schema
            from: url varchar(255) not null    # with unique index
            to:   urlc varbinary(767) not null # with unique index
        Change prefix-search form
            from: url like 'http://example.com/%'
            to:   url_compress('http://example.com/')<=urlc and urlc<url_compress('http://example.com0')
            Note: "0" is the next character after '/'
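The rewrite from LIKE to a range comparison can be sketched on plain strings. `prefix_range` is a hypothetical helper, and the URLs are sample data. It assumes a non-empty prefix whose last character is not the highest possible code point:

```python
import bisect


def prefix_range(prefix):
    """Return (low, high) such that low <= s < high holds exactly for strings starting with prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Bump the last character to the next code point: '/' becomes '0'.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


urls = sorted([
    "http://example.com/",
    "http://example.com/a",
    "http://example.com.evil.test/",
    "http://example.org/",
])
low, high = prefix_range("http://example.com/")
matches = urls[bisect.bisect_left(urls, low):bisect.bisect_left(urls, high)]
print(high)     # http://example.com0
print(matches)  # ['http://example.com/', 'http://example.com/a']
```

On the slide both bounds are passed through url_compress() before the comparison, which works only because the compressed form keeps the sort order of the original strings.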
  • 19. Compression Ratio
        Compression ratio: 37%
        Size of prediction table: 4MB
        Benchmark of the compression functions
            compression: 40MB/sec. (570k URLs/sec.)
            decompression: 19.3MB/sec. (280k URLs/sec.)
            fast enough since searchable in compressed form
        Prefix-search became faster
            shorter indexes lead to faster operation
  • 20. Re InnoDB Compression
        URL compression can coexist with InnoDB compression
            though we aren't using InnoDB compression in our production environment

        Compression           Table Size
        N/A                   100%
        URL compression       57%
        InnoDB compression    50%
        using both            33%
  • 21. Compressing the Stats Table
        Used to have two int columns: at, cnt
            it was a waste of space, since...
                most cnt values are very small numbers
                most accesses to each URL occur within a short period (ex. the day the blog entry was written)
            at field should be part of the indexes

        at (hours since epoch)    cnt (# of hits)
        330168                    1
        330169                    2
        330173                    1
        330197                    1
  • 22. Compressing the Stats Table (cont'd)
        Merge the rows into a sparse array
            example on the prev. page becomes:
            (offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1
        Then compress the array
            the example becomes a blob of 8 bytes
            originally was 8 bytes x 4 rows with index
        And store the array in a single column
            fewer rows lead to smaller table, faster access
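The merge step can be sketched as follows. The slides do not describe the actual byte layout of the compressed blob, so the run-length step here is only an assumed stand-in for "compress the array", and `to_sparse` and `run_length` are hypothetical names:

```python
def to_sparse(rows):
    """Merge (hour, count) rows into (offset, counts), newest hour first."""
    by_hour = dict(rows)
    offset = max(by_hour)
    oldest = min(by_hour)
    counts = [by_hour.get(hour, 0) for hour in range(offset, oldest - 1, -1)]
    return offset, counts


def run_length(counts):
    """Collapse runs of equal values into [value, repeat] pairs."""
    runs = []
    for value in counts:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


# The four rows from the stats table example.
rows = [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]
offset, counts = to_sparse(rows)
print(offset)              # 330197
print(run_length(counts))  # [[1, 1], [0, 23], [1, 1], [0, 3], [2, 1], [1, 1]]
```

Four indexed rows collapse into one value per URL, which is where the savings in row count and index size come from.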
  • 23. Compressing the Stats Table (cont'd)
        Write MySQL UDFs to access the sparse array
            cnt_add(column,at,cnt) -- adds cnt on given index (at)
            cnt_between(column,from,to) -- returns # of hits between given hours
            and more...
        We use int[N] arrays for vectorized calc.
            especially when creating access charts
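A rough model of what the two UDFs do, written against an uncompressed in-memory array. The class name and the inclusive range in `cnt_between` are assumptions; the real functions operate on the compressed column value inside MySQL:

```python
class HitCounts:
    """Hourly hit counts stored as an offset plus a dense list, newest hour first."""

    def __init__(self):
        self.offset = None  # hour (since epoch) of counts[0]
        self.counts = []

    def cnt_add(self, at, cnt):
        """Add cnt hits to the hour `at`, growing the array in either direction."""
        if self.offset is None:
            self.offset, self.counts = at, [0]
        if at > self.offset:
            # A newer hour arrived: prepend zeros and move the offset forward.
            self.counts = [0] * (at - self.offset) + self.counts
            self.offset = at
        index = self.offset - at
        if index >= len(self.counts):
            self.counts.extend([0] * (index - len(self.counts) + 1))
        self.counts[index] += cnt

    def cnt_between(self, start, end):
        """Return the number of hits in hours start..end (inclusive, an assumption)."""
        return sum(
            count
            for i, count in enumerate(self.counts)
            if start <= self.offset - i <= end
        )


hits = HitCounts()
for at, cnt in [(330173, 1), (330197, 1), (330168, 1), (330169, 2)]:
    hits.cnt_add(at, cnt)
print(hits.cnt_between(330168, 330173))  # 4
print(hits.cnt_between(330174, 330197))  # 1
```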
  • 24. Create a new Message Queue
  • 25. Q4M
        A simple, reliable, fast message queue
            runs as a pluggable storage engine of MySQL
        GPL License; q4m.31tools.com
            presented yesterday at MySQL Conference :-p
            slides at tinyurl.com/q4m2009
        Used for relaying messages between our servers
  • 26. Limiting Pre-computation Load
  • 27. Limit # of CPU-intensive Pre-computations
        Use cron & setlock
            setlock is part of daemontools by djb
        setlock
            serializes processes by using flock
            -n option: use trylock; if locked, do nothing

        # use only one CPU core for pre-computation
        */2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries
        5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data
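The trylock behaviour of `setlock -n` can be imitated with a non-blocking flock. This is a sketch for Unix-like systems, not a reimplementation of setlock, and `try_lock` and `run_exclusively` are hypothetical helpers:

```python
import fcntl
import os


def try_lock(path):
    """Return an open, exclusively locked file descriptor, or None if the lock is held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def run_exclusively(path, task):
    """Run task() only if the lock is free; return True if it ran."""
    fd = try_lock(path)
    if fd is None:
        return False  # someone else is running: do nothing, like setlock -n
    try:
        task()
        return True
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Because both cron entries share one lock file, at most one pre-computation runs at a time. The frequent job skips its turn when the lock is taken, while the job started without -n waits for it.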
  • 28. Limit # of Disk-intensive Pre-computations
        Divide pre-computation into blocks and sleep depending on the elapsed time

        use List::Util qw(max);
        use Time::HiRes qw(time sleep);   # sub-second time() and sleep()

        my $LOAD = 0.25;                  # fraction of wall-clock time spent working
        while (1) {
            my $start = time();
            precompute_block();
            # sleep in proportion to how long the block took
            sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD);
        }
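The sleep formula is easier to check in isolation. A small sketch with hypothetical function names and an injectable clock for testing:

```python
import time


def throttle_delay(elapsed, load):
    """Seconds to sleep so that work takes up `load` of the wall-clock time."""
    return max(elapsed, 0.0) * (1.0 - load) / load


def run_throttled(blocks, load=0.25, sleep=time.sleep, clock=time.monotonic):
    """Run each block, then pause in proportion to how long it took."""
    for block in blocks:
        start = clock()
        block()
        sleep(throttle_delay(clock() - start, load))


print(throttle_delay(1.0, 0.25))  # 3.0
```

With a load of 0.25, a block that takes one second is followed by three seconds of sleep, so the disk is busy for a quarter of the time however slow the individual blocks turn out to be.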
  • 29. Creating our own Cache System
  • 30. The Problem
        Query cache is flushed on table update
            access stats can be (should be) cached for a certain period
        Memcached has a thundering-herd problem
            all clients try to read the database when a cached entry expires
            critical for us since our queries do joins, aggregations, and sort operations
  • 31. Swifty and KeyedMutex
        Swifty is a mmap-based cache
            cached data shared between processes
            lock-free on read, flock on write
            notifies a single client that the accessed entry is going to expire within a few seconds
                notified client can start updating a cache entry before it expires
        KeyedMutex
            a daemon used to block multiple clients issuing the same SQL queries
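The early-expiry notification can be sketched with an in-process cache. This is not Swifty's API: Swifty shares an mmap region between processes and is lock-free on read, whereas the hypothetical class below uses a plain dict and a thread lock to show only the "tell exactly one caller to refresh" logic:

```python
import threading
import time


class EarlyRefreshCache:
    def __init__(self, ttl, refresh_window, clock=time.monotonic):
        self.ttl = ttl                        # seconds an entry stays valid
        self.refresh_window = refresh_window  # seconds before expiry to ask for a refresh
        self.clock = clock
        self.lock = threading.Lock()
        self.entries = {}                     # key -> [value, expires_at, refresher_chosen]

    def set(self, key, value):
        with self.lock:
            self.entries[key] = [value, self.clock() + self.ttl, False]

    def get(self, key):
        """Return (value, should_refresh); value is None on a miss."""
        with self.lock:
            entry = self.entries.get(key)
            now = self.clock()
            if entry is None or now >= entry[1]:
                return None, True
            value, expires_at, chosen = entry
            if not chosen and expires_at - now <= self.refresh_window:
                entry[2] = True     # only the first caller in the window is notified
                return value, True
            return value, False
```

Every other caller keeps getting the cached value while the notified one recomputes it. On a true miss all callers are told to refresh, which is the case a keyed mutex covers by letting only one of them run the query.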
  • 32. Swifty and KeyedMutex (cont'd)
        Source code is available:
            coderepos.org/share/browser/lang/c/swifty
            coderepos.org/share/browser/lang/perl/Cache-Swifty
            coderepos.org/share/browser/lang/perl/KeyedMutex
  • 33. Fulltext-search on SSD
  • 34. Senna / Tritonn
        Senna is an FTS engine popular in Japan
            might not work well with European languages
        Tritonn is a replacement for MyISAM FTS
            uses Senna as backend
            faster than MyISAM FTS
        Wrote patches to support SSD
            during our transition from RAM to SSD
            patches accepted in Senna 1.1.4 / Tritonn 1.0.12
  • 35. FTS: RAM-based vs. SSD-based
        Size of FTS data: 20GB
        Downgraded hardware to see if SSD-based FTS is feasible
            Speed became ¼
            but latency of searches is well below one second

                    Old Hardware                New Hardware
        CPU         Opteron 2218 (2.6GHz) x2    Opteron 240 (1.4GHz)
        Memory      32GB                        2GB
        Storage     7200rpm SATA HDD            SSD (Intel X25-M)
  • 36. Summary
  • 37. Summary
        Use UDFs for optimization
        Sometimes it is easier to scale UP
            esp. when you can estimate your data growth
        Use SSD for FTS
            Baidu (China's leading search engine) uses SSD
        Most of the things introduced are OSS
            We plan to open-source our URL compression table as well
  • 38. We are Looking for...
        If you are interested in localizing Pathtraq to your country, please contact us
            we do not have resources outside of Japan
                to translate the web interface
                to ask people to install our browser extension
                to follow local regulations, etc.
  • 39. Thank you for listening
        tinyurl.com/kazuho