openTSDB - Metrics for a distributed world
Oliver Hankeln / gutefrage.net
@mydalon
Wednesday, 30 October 2013

8,656 views


These are the slides for my talk on openTSDB at IPC13/WTC13 in Munich. openTSDB is the software that we at gutefrage.net use to store about 200 million data points per day in several thousand time series.
I talk about how openTSDB stores the data so that it can be queried efficiently afterwards. Some cultural issues and some myths are also covered.

Published in: Technology

  1. openTSDB - Metrics for a distributed world. Oliver Hankeln / gutefrage.net, @mydalon. Wednesday, 30 October 2013.
  2. Who am I? Senior Engineer, Data and Infrastructure, at gutefrage.net GmbH. Was doing software development before. DevOps advocate.
  3. Who is Gutefrage.net? Germany's biggest Q&A platform. #1 German site (mobile), about 5M unique users. #3 German site (desktop), about 17M unique users. More than 4 million page impressions (PI) per day. Part of the Holtzbrinck group. Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...).
  4. What you will get: Why we chose openTSDB. What is openTSDB? How does openTSDB store the data? Our experiences. Some advice.
  5. Why we chose openTSDB
  6. We were looking at some options (Munin, Graphite, openTSDB, Ganglia). Scales well: Munin no, Graphite sort of, openTSDB yes, Ganglia yes. Keeps all data: Munin no, Graphite no, openTSDB yes, Ganglia no. Creating metrics: easy in all four.
  7. We have a winner! Scales well: Munin no, Graphite sort of, openTSDB yes, Ganglia yes. Keeps all data: Munin no, Graphite no, openTSDB yes, Ganglia no. Creating metrics: easy in all four. Bingo!
  8. Separation of concerns
  9. Separation of concerns:
        $ unzip|strip|touch|finger|grep|mount|fsck|more|yes|fsck|fsck|fsck|umount|sleep
      The UI was not important for our decision. Alerting is not what we are looking for in our time series database.
  10. The ecosystem: The app feeds metrics in via RabbitMQ. We base Icinga checks on the metrics. We are evaluating Skyline and Oculus by Etsy for anomaly detection. We deploy sensors via Chef.
  11. openTSDB: Written by Benoît Sigoure at StumbleUpon. Open source (get it from GitHub). Uses HBase (which is based on HDFS) as storage. Distributed system (multiple TSDs).
  12. The big picture. [Diagram: UI, tcollector and API; four TSDs; HBase, labeled "This is really a cluster".]
  13. Putting data into openTSDB:
        $ telnet tsd01.acme.com 4242
        put proc.load.avg5min 1382536472 23.2 host=db01.acme.com
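The telnet session on slide 13 can also be scripted. The following is a minimal Python sketch, not part of the talk: `format_put` and `send_lines` are hypothetical helper names, and the host and port are simply the ones from the slide. It builds the same `put` line and, if called, writes it to a TSD over a plain TCP socket.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one 'put' line in the text format shown on slide 13."""
    tag_part = " ".join(f"{name}={val}" for name, val in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_part}\n"


def send_lines(host, port, lines):
    """Send pre-formatted lines to a TSD (hypothetical helper, not called here)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


line = format_put("proc.load.avg5min", 1382536472, 23.2,
                  {"host": "db01.acme.com"})
# To actually send it, assuming a TSD listens where the slide says:
# send_lines("tsd01.acme.com", 4242, [line])
```

Keeping formatting and sending separate makes the line format easy to check without a running TSD.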
  14. It gets even better: tcollector is a Python script that runs your collectors. It handles the network connection, starts your collectors at set intervals, does basic process management, adds the host tag, and does deduplication.
  15. A simple tcollector script:
        #!/usr/bin/php
        <?php
        # Cast a die
        $die = rand(1,6);
        echo "roll.a.d6 " . time() . " " . $die . "\n";
  16. What was that HDFS again? HDFS is a distributed filesystem suitable for petabytes of data on thousands of machines. Runs on commodity hardware. Takes care of redundancy. Used by e.g. Facebook, Spotify, eBay, ...
  17. Okay... and HBase? HBase is a NoSQL database / data store on top of HDFS. Modeled after Google's BigTable. Built for big tables (billions of rows, millions of columns). Automatic sharding by row key.
  18. How openTSDB stores the data
  19. Keys are key! Data is sharded across regions based on the row key. You query data based on the row key. You can query row key ranges (e.g. A...D). So: think about key design.
  20. Take 1. Row key format: timestamp, metric id.
  21. Take 1. Row key format: timestamp, metric id. Rows: (1382536472, 5) = 17. Servers: Server A, Server B.
  22. Take 1. Row key format: timestamp, metric id. Rows: (1382536472, 5) = 17; (1382536472, 6) = 24. Servers: Server A, Server B.
  23. Take 1. Row key format: timestamp, metric id. Rows: (1382536472, 5) = 17; (1382536472, 6) = 24; (1382536472, 8) = 12; (1382536473, 5) = 134; (1382536473, 6) = 10; (1382536473, 8) = 99. Servers: Server A, Server B.
  24. Take 1. Row key format: timestamp, metric id. Rows: (1382536472, 5) = 17; (1382536472, 6) = 24; (1382536472, 8) = 12; (1382536473, 5) = 134; (1382536473, 6) = 10; (1382536473, 8) = 99; (1382536474, 5) = 12; (1382536474, 6) = 42. Servers: Server A, Server B.
  25. Solution: Swap timestamp and metric id. Row key format: metric id, timestamp. Rows: (5, 1382536472) = 17; (6, 1382536472) = 24; (8, 1382536472) = 12; (5, 1382536473) = 134; (6, 1382536473) = 10; (8, 1382536473) = 99; (5, 1382536474) = 12; (6, 1382536474) = 42. Servers: Server A, Server B.
  26. Solution: Swap timestamp and metric id. Row key format: metric id, timestamp. Rows: (5, 1382536472) = 17; (6, 1382536472) = 24; (8, 1382536472) = 12; (5, 1382536473) = 134; (6, 1382536473) = 10; (8, 1382536473) = 99; (5, 1382536474) = 12; (6, 1382536474) = 42. Servers: Server A, Server B.
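The transcript does not spell out why Take 1 fails, since the point was made by the diagram. A common reading is that with the timestamp first, every new write sorts after all existing keys and therefore lands on the same server. The Python sketch below illustrates that reading under stated assumptions: the split points are made up, regions are modeled as sorted key ranges, and `region_for` is a hypothetical helper, not HBase code.

```python
from bisect import bisect_right
from collections import Counter


def region_for(key, split_points):
    """Index of the sorted key range (region) that contains `key`."""
    return bisect_right(split_points, key)


metrics = [5, 6, 8]
new_timestamps = [1382536472, 1382536473, 1382536474]

# Take 1: key = (timestamp, metric id). Assumed split at an older timestamp.
ts_first_splits = [(1382530000, 0)]
ts_first = Counter(
    region_for((ts, m), ts_first_splits)
    for ts in new_timestamps for m in metrics
)

# Take 2: key = (metric id, timestamp). Assumed split between metrics 6 and 8.
metric_first_splits = [(7, 0)]
metric_first = Counter(
    region_for((m, ts), metric_first_splits)
    for ts in new_timestamps for m in metrics
)

print("timestamp first:", dict(ts_first))      # all writes hit one region
print("metric id first:", dict(metric_first))  # writes are spread out
```

In this toy setup all nine timestamp-first writes go to region 1, while the metric-first writes split six to three across both regions.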
  27. Take 2. Metric ID first, then timestamp. Searching through many rows is slower than searching through fewer rows. (Obviously.) So: put multiple data points into one row.
  28. Take 2 continued. Row 5, 1382608800: +23 = 17; +35 = 1; +94 = 23; +142 = 42. Row 5, 1382612400: +13 = 3; +25 = 44; +88 = 12; +89 = 2.
  29. Take 2 continued. Row key: 5, 1382608800 with +23 = 17; +35 = 1; +94 = 23; +142 = 42. Row key: 5, 1382612400 with +13 = 3; +25 = 44; +88 = 12; +89 = 2.
  30. Take 2 continued. Row key: 5, 1382608800; cell names: +23, +35, +94, +142; values: 17, 1, 23, 42. Row key: 5, 1382612400; cell names: +13, +25, +88, +89; values: 3, 44, 12, 2.
  31. Take 2 continued. Row key: 5, 1382608800; cell names: +23, +35, +94, +142; data points: 17, 1, 23, 42. Row key: 5, 1382612400; cell names: +13, +25, +88, +89; data points: 3, 44, 12, 2.
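The grouping on slides 28 to 31 can be sketched in a few lines of Python. This assumes the cell name is the offset in seconds from the row's base timestamp, which is how the "+23"-style labels read. The binary encoding of the cell name is not shown on the slides and is not modeled here. `split_timestamp` and `add_point` are hypothetical names.

```python
SECONDS_PER_ROW = 3600  # one row per metric and hour, as on the slides


def split_timestamp(ts):
    """Return (row base timestamp, offset within the row) for a Unix timestamp."""
    base = ts - ts % SECONDS_PER_ROW
    return base, ts - base


def add_point(rows, metric_id, ts, value):
    """Store a value in the row for (metric id, hour) under its offset."""
    base, offset = split_timestamp(ts)
    rows.setdefault((metric_id, base), {})[offset] = value


rows = {}
samples = [
    (1382608823, 17), (1382608835, 1), (1382608894, 23), (1382608942, 42),
    (1382612413, 3), (1382612425, 44), (1382612488, 12), (1382612489, 2),
]
for ts, value in samples:
    add_point(rows, 5, ts, value)
```

The eight samples end up in two rows of four cells each, matching the two rows drawn on the slides.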
  32. Where are the tags stored? They are put at the end of the row key. Both tag names and tag values are represented by IDs.
  33. The Row Key: 3 bytes metric ID. 4 bytes timestamp (rounded down to the hour). 3 bytes tag ID. 3 bytes tag value ID. Total: 7 bytes + 6 bytes * number of tags.
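The layout on slide 33 can be written out as a short byte-packing sketch in Python. The field widths come from the slide. The big-endian byte order and the sorting of tag pairs are assumptions made for this illustration, and `row_key` is a hypothetical function, not openTSDB's own code.

```python
import struct


def row_key(metric_id, ts, tags):
    """Pack metric ID, hour-aligned timestamp and (tag ID, value ID) pairs.

    Byte order (big-endian) and tag ordering (sorted) are assumptions.
    """
    base = ts - ts % 3600                     # round down to the hour
    key = metric_id.to_bytes(3, "big")        # 3 bytes metric ID
    key += struct.pack(">I", base)            # 4 bytes timestamp
    for tag_id, value_id in sorted(tags):
        key += tag_id.to_bytes(3, "big")      # 3 bytes tag ID
        key += value_id.to_bytes(3, "big")    # 3 bytes tag value ID
    return key


one_tag = row_key(5, 1382536472, [(1, 2)])
two_tags = row_key(5, 1382536472, [(1, 2), (3, 4)])
print(len(one_tag), len(two_tags))  # 7 + 6 * number of tags
```

With one tag the key is 13 bytes and with two tags 19 bytes, in line with the slide's formula.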
  34. Let's look at some graphs
  35. Busting some myths
  36. Myth: Keeping data is expensive. Gartner found the price for enterprise SSDs at $1/GB in 2013. A data point gets compressed to 2-3 bytes. A metric that you measure every second then uses disk space for 18.9 cents per year. Usually it is even cheaper.
  37. If your work costs $50 per hour and it takes you only one minute to think about and configure your RRD compaction setting, you could have collected that metric on a second-by-second basis for 4.4 YEARS instead.
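The arithmetic behind slides 36 and 37 can be checked with a few lines of Python. The slides do not show how they get from 2-3 bytes per point to 18.9 cents. The quoted figure matches about 6 bytes of disk per data point at 10^9 bytes per GB, so that value is used here as an assumption. Replication overhead would be one possible explanation, but the slides do not say.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 samples at one per second
PRICE_PER_GB = 1.00                  # USD, figure quoted on slide 36
BYTES_PER_POINT = 6                  # assumption chosen to match 18.9 cents


def yearly_cost(bytes_per_point=BYTES_PER_POINT, price_per_gb=PRICE_PER_GB):
    """Disk cost in USD of one metric sampled every second for a year."""
    return SECONDS_PER_YEAR * bytes_per_point / 1e9 * price_per_gb


cost = yearly_cost()                 # about 0.189 USD
one_minute_of_work = 50 / 60         # USD, at 50 USD per hour
years = one_minute_of_work / cost    # about 4.4

print(f"{cost * 100:.1f} cents per year, {years:.1f} years per minute of work")
```

At 2 or 3 bytes per point without any overhead, the same function gives roughly 6 to 9.5 cents per year, which fits the slide's remark that it is usually even cheaper.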
  38. Myth: the number of metrics is too limited. Don't confuse the Graphite metric count with the openTSDB metric count. 3 bytes of metric ID = 16.7M possibilities. 3 bytes of tag value ID = 16.7M possibilities. => at least 280 trillion metrics (Graphite counting).
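The numbers on slide 38 follow directly from the ID widths. A quick Python check, counting one metric ID combined with one tag value ID as the slide does:

```python
ID_BYTES = 3

ids_per_field = 2 ** (8 * ID_BYTES)   # 16,777,216, i.e. about 16.7M
combinations = ids_per_field ** 2     # metric ID x tag value ID

print(f"{ids_per_field:,} IDs per field, {combinations:,} combinations")
```

The product is 2^48, about 281 trillion, which is where "at least 280 trillion" comes from.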
  39. Cultural issues
  40. Tools shape culture shapes tools. It is time for a new monitoring culture! Embrace machine learning! Monitor everything in your organisation! Throw off the shackles of fixed intervals! Come, join the revolution!
  41. Our experiences
  42. What works well: We store about 200M data points in several thousand time series with no issues. tcollector decouples measurement from storage. Creating new metrics is really easy. You are free to choose your rhythm.
  43. Challenges: The UI is seriously lacking. No annotation support out of the box. No metadata for time series. Only 1 s time resolution (and only 1 value per second per time series).
  44. Salvation is coming: OpenTSDB 2 is around the corner. Millisecond precision. Annotations and metadata. Improved API. Improved UI.
  45. Friendly advice: Pick a naming scheme and stick to it. Use tags wisely (not more than 6 or 7 tags per data point). Use tcollector. Wait for openTSDB 2 ;-)
  46. Questions? Please contact me: oliver.hankeln@gutefrage.net, @mydalon. I'll upload the slides and tweet about it.
