ELK stack at weibo.com

  1. 1. real-time log search & analysis ELKstack@weibo.com
  2. 2. about me • Perler, SA @ weibo.com, renren.com, china.com... • Author of 《网站运维技术与实践》 • Translator of 《Puppet 3 Cookbook》 • weibo account: @ARGV
  3. 3. agenda • ELKstack situation • ELKstack usecase • from ELK to ERK • performance tuning of LERK
  4. 4. ERK situation • datanode * 26: • 2.4GHz*8, 42G, 300G*10 RAID5 • logtype * 25, 7 days, 65 billion events, 60k fields • size 8TB/day, indexing 190k eps • rsyslog/logstash * 10 • custom plugins of rsyslog/logstash/kibana • user: qa team, app/server dev team, are team • ops: ME*0.8
  5. 5. kopf stats monitor & setting modify
  6. 6. bigdesk real-time node stats
  7. 7. zabbix trapper monitor and alert KPI of ELK
  8. 8. But, Why ELK ?
  9. 9. First, what can log do? • Identify problems • data-driven development/testing/operations • audit • Laws of Marcus J. Ranum • Monitor • "Monitoring is the aggregation of health and performance data, events, and relationships delivered via an interface that provides a holistic view of a system's state to better understand and address failure scenarios." @etsy
  10. 10. difficulties of LA(1) • timestamp + data = log • OK, what happened between 23:12 and 23:29 yesterday?
  11. 11. difficulties of LA(2) • text is unstructured data
  12. 12. difficulties of LA(2) • grep/awk only run on a single host
  13. 13. difficulties of LA(3) • complex formats are hard to visualize
  14. 14. So... • We need a real-time big-data search platform. • But, splunk is expensive. • So, spell OSS pls.
  15. 15. ELKstack Beginner
  16. 16. Hello World # bin/logstash -e 'input{stdin{}}output{stdout{codec=>rubydebug}}' Hello World { "message" => "Hello World", "@version" => "1", "@timestamp" => "2014-08-07T10:30:59.937Z", "host" => "raochenlindeMacBook-Air.local" }
  17. 17. How Powerful • $ ./bin/logstash -e 'input{generator{count=>100000000000}}output{stdout{codec=>dots}}' | pv -abt > /dev/null • 15.1MiB 0:02:21 [ 112kiB/s]
  18. 18. How scaling
  19. 19. Talk is cheap, show me the case!
  20. 20. application log by php
  21. 21. logstash.conf
  22. 22. Kibana3 backend dev and ops use it to identify API and app errors
  23. 23. and Kibana4 OK, K4 needs a pretty color scheme by now
  24. 24. PHP slowlog
  25. 25. after multiline codec ops use it to check php slow-function stacks within IDCs and hosts
  26. 26. drill-down one host
  27. 27. Nginx errorlog
  28. 28. grok {
        match => { "message" => "(?<datetime>\d{4}/\d\d/\d\d \d\d:\d\d:\d\d) \[(?<errtype>\w+)\] \S+: \*\d+ (?<errmsg>[^,]+), (?<errinfo>.*)$" }
      }
      mutate {
        gsub => [ "errmsg", "too large body: \d+ bytes", "too large body" ]
      }
      if [errinfo] {
        ruby {
          code => "event.append(Hash[event['errinfo'].split(', ').map{|l| l.split(': ')}])"
        }
      }
      grok {
        match => { "request" => '"%{WORD:verb} %{URIPATH:urlpath}(?:\?%{NGX_URIPARAM:urlparam})?(?: HTTP/%{NUMBER:httpversion})"' }
      }
      kv {
        prefix      => "url_"
        source      => "urlparam"
        field_split => "&"
      }
      date {
        locale => 'en'
        match  => [ "datetime", "yyyy/MM/dd HH:mm:ss" ]
      }
  29. 29. performance tuning and troubleshooting based on multi-dimensional reports
  30. 30. different top-N results in another time range
  31. 31. app crash: app devs focus on crash stacks from which system functions were filtered out.
  32. 32. New release, Ad-hoc filter, Focus crash
  33. 33. Query helper for QA and NOC, decrease MTTI for complaints
  34. 34. H5 devs focus on the performance timeline of index.html
  35. 35. probability distribution of response time: no more averages, no more guessing
  36. 36. from ELK to ERK
  37. 37. someone's children
  38. 38. My Poor Child
  39. 39. WHY?
  40. 40. compare logstash vs rsyslog • Design: multithreads + SizedQueue / multithreads + mainQ • Lang: JRuby / C • Syntax: DSL / RainerScript • ENV: jre1.7 / within rhel6 • Queue: rely on external system / async queue • regexp: ruby / ERE • output: java to ES / HTTP to ES • plugins: 182 / 57 • monitor: NO! / pstats
  41. 41. problem of Logstash • poor performance of Input/syslog: use input/tcp + filter/grok instead • poor performance of Filter/geoip: we developed filter/geoip2 • high CPU cost of Filter/grok: I wrote my own filter/ruby with split instead • OOM in Input/tcp (prior to 1.4.2) • OOM in Output/elasticsearch (prior to 1.5.0) • retry in Output/elasticsearch overlaps with the stud SizedQueue retry (still true as of now)
  42. 42. problem of LogStash(1) • LogStash::Inputs::Syslog • logstash pipeline: • input thread -> filterworker threads * Num -> output thread • But what's inside Inputs::Syslog: • TCPServer/accept -> client thread -> filter/grok -> filter/date -> filterworker threads • So grok and date end up running in only one thread! • A pure TCPServer can process 50k qps, but only 6k after filter/grok, and then 700 after filter/date!
  43. 43. problem of LogStash(1) • LogStash::Inputs::Syslog • Solution: input { tcp { port => 514 } } filter { grok { match => ["message", "%{SYSLOGLINE}"] } syslog_pri { } date { match => ["timestamp", "ISO8601"] } } • 30k eps in `logstash -w 20` testing.
  44. 44. problem of LogStash(2) • LogStash::Filters::Grok • What's Grok: • pre-defined patterns: NUMBER \d+, so you write %{NUMBER:score} instead of (?<score>\d+) • regexps cost LOTS of CPU.
  45. 45. problem of LogStash(2) • LogStash::Filters::Grok • solution: • avoid grok if you can define a separator for your log format: filter { ruby { init => "@kname = ['datetime','uid','limittype','limitkey','client','clientip','request_time','url']" code => "event.append(Hash[@kname.zip(event['message'].split('|'))])" } mutate { convert => ["request_time", "float"] } } • Result: CPU util drops by about 20%
  46. 46. problem of LogStash(3) • LogStash::Filters::GeoIP • 7k eps, even with `logstash -w 30` • The new MaxMindDB format brings a great performance improvement, but LogStash can't distribute it for license reasons.
  47. 47. problem of LogStash(3) • LogStash::Filters::GeoIP • solution: • use MaxMind::DB::Writer to convert the internal ip.db into ip.mmdb, 300MB -> 50MB • JRuby can java_import maxminddb-java. • 28k eps with LogStash::Filters::MaxMindDB
  48. 48. problem of LogStash(4) • LogStash::Outputs::Elasticsearch • 3 bugs so far: 1. OOM in logstash 1.4.2 (ftw-0.0.39) 2. the retry by Manticore (logstash 1.5.0beta1) duplicated the stud retry in the pipeline, which could cause an infinite resend loop 3. logstash 1.5.0rc1 can't record the 429 code; who knows what "got response of . source:" means? • 1 and 3 were solved in the newest logstash 1.5.0rc3.
  49. 49. problem of LogStash(5) • LogStash::Pipeline • no supervisor for filterworkers. If all filterworkers hit an exception, logstash blocks but stays alive! • If you use filter/ruby to reference `event['field']` as introduced before, check the field first! if [url] { ruby { code => "event['urlpath']=event['url'].split('?')[0]" } }
  50. 50. problem of LogStash(6) • LogStash::Pipeline • new events created by `yield` should go through the remaining filters, but (prior to logstash 1.5.0) they went straight to the output thread. • `yield` is used in filter/split and filter/clone
  51. 51. Rsyslog tuning • action with a LinkedList queue • imfile with an appropriate persistStateInterval (to avoid too much duplication after a restart) • omfwd with a small rebindInterval (when the target is behind LVS) • an appropriate global maxMessageSize • appropriate queue.size and queue.highwatermark • recommend the CEE log format, used together with mmjsonparse • separator-based log formats can be processed with mmfields • make the best use of RainerScript • concatenate JSON strings with the property replacer • developed rsyslog-mmdblookup for IP lookups (a sketch of several of these knobs follows below)
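A minimal sketch of what several of those knobs look like in a rsyslog v8 RainerScript config; the file path, forwarding target, and every numeric value here are illustrative assumptions, not the production settings:

    # raise the message size limit globally
    global(maxMessageSize="64k")

    # tail an application log, persisting read state regularly so a restart re-reads little
    module(load="imfile")
    input(type="imfile" file="/data/logs/app.log" tag="app:" persistStateInterval="1000")

    # forward over TCP with an in-memory LinkedList queue and periodic reconnects (useful behind LVS)
    action(type="omfwd" target="10.0.0.1" port="5514" protocol="tcp"
           rebindInterval="10000"
           queue.type="LinkedList" queue.size="100000" queue.highwatermark="80000")

The sizes only matter relative to your own message rate; the point of the slide is to set them deliberately rather than keep the defaults.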
  52. 52. problem of rsyslog(1) • I found an experimental `foreach` in rsyslog 8.7, great! But when I processed my JSON-array logs from apps, there were 3 bugs: 1. foreach doesn't check the types of its parameters; 2. action() doesn't copy the msg, only a reference. If you omfwd each item inside foreach, crash... The test suite only uses omfile, which is synchronous. 3. omelasticsearch has an uninitialized variable when the errorfile option is enabled. There will be a new copymsg option for action() in rsyslog 8.10, supposed to be released on May 20.
  53. 53. problem of rsyslog(2) • Not so many message-modification plugins. • mmexternal can fork too many subprocesses in v8 (but not in v7), and the processing speed is only 2k eps! • We have finished a new rsyslog-mmdblookup plugin; it will run in the production env on May 15.
  54. 54. input( type="imtcp" port="514" )
      template( name="clientlog" type="list" ) {
        constant(value="{\"@timestamp\":\"")
        property(name="timereported" dateFormat="rfc3339")
        constant(value="\",\"host\":\"")
        property(name="hostname")
        constant(value="\",\"mmdb\":")
        property(name="!iplocation")
        constant(value=",")
        property(name="$.line" position.from="2")
      }
      ruleset( name="clientlog" ) {
        action(type="mmjsonparse")
        if ($parsesuccess == "OK") then {
          foreach ($.line in $!msgarray) {
            if ($.line!rtt == "-") then {
              set $.line!rtt = 0;
            }
            set $.line!urlpath = field($.line!url, 63, 1);
            set $.line!urlargs = field($.line!url, 63, 2);
            set $.line!from = "";
            if ( $.line!urlargs != "***FIELD NOT FOUND***" ) then {
              reset $.line!from = re_extract($.line!urlargs, "from=([0-9]+)", 0, 1, "");
            } else {
              unset $.line!urlargs;
            }
            action(type="mmdb" key=".line!clientip" fields=["city","isp","country"] mmdbfile="./ip.mmdb")
            action(type="omelasticsearch" server="1.1.1.1" bulkmode="on" template="clientlog"
                   queue.size="10000" queue.dequeuebatchsize="2000")
          }
        }
      }
      if ($programname startswith "mweibo_client") then {
        call clientlog
        stop
      }
  55. 55. ES tuning • DO NOT believe the articles online!! • DO test with your own dataset; start from one node, one index, one shard, zero replicas. • use unicast with a bigger fd.ping_timeout • doc_values, doc_values, doc_values!!! • increase the gateway, recovery and allocation settings • increase refresh_interval and flush_threshold_size • increase store.throttle.max_bytes_per_sec • upgrade to 1.5.1 at least • scale: use total_shards_per_node • use bulk! no multithreaded client, no async • use curator for _optimize • no _all for fixed-format logs (a template sketch follows below)
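As a rough illustration of the doc_values and refresh/flush advice above, a logstash index template can be adjusted roughly like this (a sketch only: the template name, field choice and values are assumptions, not the template used at weibo.com):

    curl -XPUT 'http://localhost:9200/_template/logstash_tuning' -d '{
      "template": "logstash-*",
      "settings": {
        "index.refresh_interval": "30s",
        "index.translog.flush_threshold_size": "1gb"
      },
      "mappings": {
        "_default_": {
          "properties": {
            "@timestamp": { "type": "date", "doc_values": true }
          }
        }
      }
    }'

In ES 1.x doc_values must be declared per field in the mapping, which is why it belongs in the template rather than in elasticsearch.yml.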
  56. 56. problem of ES(1) • OOM: • Kibana3 uses facet_filter, which means lots of hits in the QUERY phase. • There is a circuit breaker in newer versions, so you may see errors like: Data too large, data for field [@timestamp] would be larger than limit of [639015321/609.4mb]]
  57. 57. problem of ES(1) • OOM: • solution: • doc_values, doc_values, doc_values! • No bigger heap needed; 31GB is enough.
  58. 58. ES stability problem (2) • very long downtime during relocation and recovery. • default strategy: • recovery starts immediately after restart • only one shard relocation at a time • recovery throttled to 20MB/s • a replica needs to copy all files from the primary shard!
  59. 59. ES stability problem (2) • very long downtime during relocation and recovery. • solution: • gateway.*: recover only after the cluster has enough nodes • cluster.routing.allocation.*: higher concurrency • indices.recovery.*: higher limits • red to yellow: 20 min for a full restart. • Note: there is a bug that may cause the recovery process to block in the translog phase (prior to 1.5.1). A sketch of the dynamic settings follows below.
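For illustration, the recovery and allocation knobs can also be raised at runtime via the cluster settings API (the values here are only examples; the static equivalents appear in the elasticsearch.yml on slide 66):

    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 5,
        "cluster.routing.allocation.cluster_concurrent_rebalance": 5,
        "indices.recovery.max_bytes_per_sec": "2gb",
        "indices.recovery.concurrent_streams": 30
      }
    }'

The gateway.* settings are node-level and still have to go into elasticsearch.yml before a restart.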
  60. 60. problem of ES(3) • new nodes die. • default strategy of shard allocation: • try to balance the total shard count per node. • no new shard if the disk is over 90% full. • The day after scaling out, all new shards get allocated to the new node! That means it takes all the indexing load.
  61. 61. ES stability problem (3) • new nodes die. • solution: 1. finish relocation before the next new index is created. 2. set index.routing.allocation.total_shards_per_node (example below) • note1: set it a little larger than strictly needed, to leave room for recovery after a failure... • note2: DO NOT set this on old indices; your new node is busy enough already.
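A sketch of point 2, applied to a freshly created daily index (the index name and value are illustrative):

    curl -XPUT 'http://localhost:9200/logstash-nginx-2015.05.20/_settings' -d '{
      "index.routing.allocation.total_shards_per_node": 3
    }'

It is a dynamic per-index setting, so it can also be baked into the index template for all future indices.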
  62. 62. problem of ES(4) • async replica • CPU util% rises violently if one segment has some deviation; async replication does NOT validate the indexed data. • ES will remove the async replication parameter.
  63. 63. ES performance(1) • 429, 429, 429... • a single "client_net_fatal_error" log line may be larger than 1MB. • the max HTTP body ES accepts is 100MB. Be careful with bulk_size (sketch below).
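Since the batch size of the logstash elasticsearch output counts events rather than bytes, one hedged way to stay under the HTTP body limit for fat event types is simply to lower it (the host and numbers are illustrative, not the production config):

    output {
      elasticsearch {
        host     => "10.19.0.100"
        protocol => "http"
        # ~1MB events x 50 stays far below a 100MB http.max_content_length
        flush_size => 50
      }
    }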
  64. 64. ES performance(2) • index size is several times larger than the raw message size. • _source: the raw JSON • _all: terms from every field, for full-text search • multi-field: a .raw version of every field in the logstash template • So: • no _all for nginx accesslogs. • no _source for metrics/tsdb logs. • no analyzed fields for most fields; only analyze the raw message. (mapping sketches follow below)
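Two hedged mapping sketches for the _all/_source points (the template names, type names and index patterns are assumptions):

    # accesslog indices: keep _source for Kibana, drop _all
    curl -XPUT 'http://localhost:9200/_template/nginx_accesslog' -d '{
      "template": "logstash-accesslog-*",
      "mappings": { "nginx": { "_all": { "enabled": false } } }
    }'

    # metrics/tsdb indices: nothing is ever re-displayed, so _source can go too
    curl -XPUT 'http://localhost:9200/_template/metrics_tsdb' -d '{
      "template": "logstash-metrics-*",
      "mappings": { "tsdb": { "_source": { "enabled": false } } }
    }'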
  65. 65. ES performance(3) • constantly high CPU util% from segment merging (hot threads forever). • max segment: 5GB • min segment: 2MB • increase refresh_interval (1s) and flush_threshold_size (200MB).
  66. 66. cluster.name: es1003
      cluster.routing.allocation.node_initial_primaries_recoveries: 30
      cluster.routing.allocation.node_concurrent_recoveries: 5
      cluster.routing.allocation.cluster_concurrent_rebalance: 5
      cluster.routing.allocation.enable: all
      node.name: esnode001
      node.master: false
      node.data: true
      node.max_local_storage_nodes: 1
      index.routing.allocation.total_shards_per_node: 3
      index.merge.scheduler.max_thread_count: 1
      index.refresh_interval: 30s
      index.number_of_shards: 26
      index.number_of_replicas: 1
      index.translog.flush_threshold_size: 5000mb
      index.translog.flush_threshold_ops: 50000
      index.search.slowlog.threshold.query.warn: 30s
      index.search.slowlog.threshold.fetch.warn: 1s
      index.indexing.slowlog.threshold.index.warn: 10s
      indices.store.throttle.max_bytes_per_sec: 1000mb
      indices.cache.filter.size: 10%
      indices.fielddata.cache.size: 10%
      indices.recovery.max_bytes_per_sec: 2gb
      indices.recovery.concurrent_streams: 30
      path.data: /data1/elasticsearch/data
      path.logs: /data1/elasticsearch/logs
      bootstrap.mlockall: true
      http.max_content_length: 400mb
      http.enabled: true
      http.cors.enabled: true
      http.cors.allow-origin: "*"
      gateway.type: local
      gateway.recover_after_nodes: 30
      gateway.recover_after_time: 5m
      gateway.expected_nodes: 30
      discovery.zen.minimum_master_nodes: 3
      discovery.zen.ping.timeout: 100s
      discovery.zen.ping.multicast.enabled: false
      discovery.zen.ping.unicast.hosts: ["10.19.0.97","10.19.0.98","10.19.0.99"]
      monitor.jvm.gc.young.warn: 1000ms
      monitor.jvm.gc.old.warn: 10s
      monitor.jvm.gc.old.info: 5s
      monitor.jvm.gc.old.debug: 2s
  67. 67. problem of ES(1) • different results between search-time scripts and the stored source: curl 'es.domain.com:9200/logstash-accesslog-2015.04.03/nginx/_search?q=_id:AUx-QvSBS-dhpiB8_1f1&pretty' -d '{ "fields": ["requestTime"], "script_fields" : { "test1" : { "script" : "doc['requestTime'].value" }, "test2" : { "script" : "_source.requestTime" }, "test3" : { "script" : "doc['requestTime'].value * 1000" } } }'
  68. 68. NOT schema free! "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "logstash-accesslog-2015.04.03", "_type" : "nginx", "_id" : "AUx-QvSBS-dhpiB8_1f1", "_score" : 1.0, "fields" : { "test1" : [ 4603039107142836552 ], "test3" : [ -8646911284551352000 ], "requestTime" : [ 0.54 ], "test2" : [ 0.54 ] } } ] }
  69. 69. problem of ES(2) • some data can't be found! • ES needs the same mapping type for the same field name within the same _type of the same index. • My "client_net_fatal_error" log data changed after one release: • {"reqhdr":{"Host":"api.weibo.cn"}} • {"reqhdr":"{\"Host\":\"api.weibo.cn\"}"} • Set the mapping of the "reqhdr" object to {"enabled":false}: the string can then only be viewed in the _source JSON, not searched. (a template sketch follows below)
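A sketch of that workaround as an index template, so the next day's index stores the reqhdr blob without parsing or indexing it (the template name and index pattern are assumptions; the type name comes from the log described above):

    curl -XPUT 'http://localhost:9200/_template/mweibo_reqhdr' -d '{
      "template": "logstash-mweibo-*",
      "mappings": {
        "client_net_fatal_error": {
          "properties": {
            "reqhdr": { "type": "object", "enabled": false }
          }
        }
      }
    }'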
  70. 70. problem of ES(3) •some data can't be found! Again! •There was a default setting `ignore_above:256` in logstash template. curl 10.19.0.100:9200/logstash-mweibo-2015.05.18/mweibo_client_crash/_search?q=_id:AU1ltyTCQC8tD04iYBIe&pretty -d '{ "fielddata_fields" : ["jsoncontent.content", "jsoncontent.platform"], "fields" : ["jsoncontent.content","jsoncontent.platform"] }' ... "fields" : { "jsoncontent.content" : [ "dalvik.system.NativeStart.main(Native Method)nCaused by: java.lang.ClassNotFoundException: Didn't find class "com.sina.weibo.hc.tracking.manager.TrackingService" on path: DexPathList[[zip file "/data/app/com.sina.weibo-1.apk", zip file "/data/data/com.sina.weibo/code_cache/secondary- dexes/com.sina.weibo-1.apk.classes2.zip", zip file "/data/data/com.sina.weibo/app_dex/dbcf1705b9ffbc30ec98d1a76ada120909.jar"],nativeLibraryDirectories=[/data/a pp-lib/com.sina.weibo-1, /vendor/lib, /system/lib]]" ], "jsoncontent.platform" : [ "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI", "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI" ] }
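The crash stacks above are longer than 256 characters, so with the default logstash template their not_analyzed (.raw) copies are silently dropped and never show up in terms aggregations. One hedged fix is to override ignore_above for just that field in a template (the field, type and threshold here are only examples):

    curl -XPUT 'http://localhost:9200/_template/mweibo_crash_content' -d '{
      "template": "logstash-mweibo-*",
      "mappings": {
        "mweibo_client_crash": {
          "properties": {
            "jsoncontent": {
              "properties": {
                "content": { "type": "string", "index": "not_analyzed", "ignore_above": 20000 }
              }
            }
          }
        }
      }
    }'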
  71. 71. kibana custom development • upgraded the elastic.js version in K3 to support the ES 1.2 API, so we can use the aggs API to implement new panels (percentile panel, range panel, and cardinality histogram panel). • "export as csv" for the table panel. • map provider setting for bettermap. • term_stats for the map panel. • china map. • query helper. • script fields for the terms panel. • OR filtering. • more at <https://github.com/chenryn/kibana>
  72. 72. see also •《Elasticsearch Server (2nd edition)》 •《Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management》 •《Data Analysis with Open Source Tools》 •《Web Operations: Keeping the Data On Time》 •《The Art of Capacity Planning》 •《大规模 Web 服务开发技术》 • https://codeascraft.com/ • http://calendar.perfplanet.com • http://kibana.logstash.es
  73. 73. –JordanSissel@logstash.net “If a newbie has a bad time, it's a bug.”

Editor's notes

  • Reading the /_cluster/stats API is resource-intensive and not suitable for long-term monitoring. Long-term monitoring should use the /_nodes/_local API to fetch only the local node's stats.
  • Whether it is log processing and analysis or numeric monitoring and analysis, the goal should be to become exactly this kind of interface that boosts operations productivity.
  • Timestamp + double = metric monitoring
    Monitoring samples at fixed points; offline analysis platforms require predefined rules.
    But when ops face this kind of question, all they have is a scope: no rules and no concrete failure point. What they need is the ability to drill down quickly within that scope to find the actual point.
    This is one big difference between log analysis for operations and other "similar" monitoring systems.
  • On a single host this job is done by awk, grep, sort and uniq, one pipe after another.
  • But on a large cluster you can't play it that way.
  • The more complex case is multiple log lines mapping to one event.
  • And the platform has to be fine-grained and easy to use, because the users may well be customer support staff.
    Splunk already has a market cap above ten billion dollars. That is the commercial prospect of the machine-data analytics field.
  • These are test figures from my personal MacBook Air.
  • Every layer scales out statelessly.
  • Ops people should find this configuration syntax acceptable, because Puppet, which is also written in Ruby, uses the same style of DSL design.
  • The queries use filter conditions such as _type, urlpath and errorcode to produce different histogram aggregation results.
  • This example shows the two most basic visualization panel styles:
    Histogram, i.e. the trend over time;
    Terms, i.e. top-N ranking.
    The first example simply demonstrates how a log goes, step by step, from text to charts.
    Time trends are what metric systems are used for most, so why use ELK?
    Suppose a site has 2000 APIs, and the usual monitoring dimensions include status code, response-time range, UA, geographic region and ISP. Multiply them together and how many tens of thousands of items is that?
    In ELK, every Kibana refresh is computed in real time from ES, and the query can be changed at will. In the K3 screenshot on the previous page there are 8 query boxes at the top, each holding a different query. Whenever there is a need, you can keep modifying them or add more query boxes; that is the flexibility.
    As for K4, each panel is bound to a single query; clicking the pencil icon jumps back to the Discover page to modify the query. The functionality is still there, only the page layout has changed.
  • This uses a layered "mille-feuille" chart. Each layer is a top-N count of the functions at the same depth of the slow-function stack. The idea is similar to the flame graphs and hierarchical statistics agentzh often uses.
    With this view you not only know which bottom-level function occurs most, but also which call chain has the widest impact. In this screenshot, for example, most slow requests are the recommendation page being slow while curl-ing the platform.
    Look at the host ranking in the lower left: nine of the top ten are in the yhg IDC, yet the very first one is a device in the xd IDC. Something is clearly wrong. Click that hostname and the page refreshes into the view below.
  • The hostname is added to the dashboard as a filter condition and applied to every visualization panel on the dashboard. You can see that the layered chart now looks completely different from the earlier global view. Looking at the details, the most common slow call has become gethostbyname being too slow while connecting to memcached. The problem is located immediately.
    This is an example of a change in a single dimension revealing the root cause.
    As more panels are added to a dashboard, more dimensions become available for diagnosing problems, so you can locate them from several angles. The next example is the nginx error log.
  • Going back one time range, every dimension changes dramatically. So it is obvious what caused the abnormally large number of nginx errors during that earlier period.
  • The previous examples were all server-side logs, but any log can be processed the same way, client-side logs for instance. This is what our client crash logs look like. When a new version ships, developers care a lot about where the new version's problems are.
    When collecting this log we run a few lines of ruby in logstash to strip the system library functions from the stack, and then rank our own company's code separately.
    You can see there is a beta version; click on it.
  • The ranking changes, and the new version's situation is clear at a glance: you know which function is at fault.
    Plenty of specialized software handles crash logs. The point of this example is not that this approach is superior, but that it is a fairly general way of doing log analysis.
  • This is a small UI change to make queries easier for customer support (so they don't have to know how to write uid:"123"). You can filter by uid and see, along the timeline, which stages this user went through and what errors were reported.
  • Everything so far has been text-processing tricks. ELK also supports more statistics-oriented uses.
    How are the response times of one API distributed across all requests? Everyone knows the average is unreliable. You might count by ranges, 0-100, 100-1000, but are those ranges appropriate? And the per-range counts certainly change at peak time: is that the normal effect of higher request volume, or is something already wrong?
    With this histogram analysis, the inflection point (in other words the alert threshold) becomes obvious. And given these two arrays from different time ranges, you can run a t-test or a Shapiro-Wilk test to check whether the two distributions are similar, and thereby judge whether peak time is just normal fluctuation or already anomalous.
  • ELK provides logstash-forwarder, written in golang, but it only supports TCP transport over SSL, cannot send directly to a queue, and has no compression.
  • Tuning is relatively convenient because logstash names its threads; with top you can see which thread is the bottleneck: input, filter or output.
    Geoip2 uses the maxminddb-java package; in JRuby you can simply java_import it.
  • Likewise, each action's name is visible in top, and each ruleset's consumption shows up in the pstats records, which is what we build monitoring and alerting on.
  • Doc_values pre-generate the fielddata on disk. Saving memory both guarantees stability and improves performance.
    Multicast does not cross switches, and on a public cloud it may even be flagged as malicious scanning.
    For bulk, a POST body of 10-15MB per request is about right; watch your per-event log size, because bulk_size is a number of events, not bytes.
  • Fields with the same name under different types of the same index are actually handled according to whichever mapping was written first. That makes searching a mess.
