2. About Me
● Chih-Hsuan Hsu (Joe)
● PilotTV Data Engineer
● Interested in SMACK/ELK architecture
● Translator of technical books
○ Spark學習手冊
○ Cassandra技術手冊
● LinkedIn: www.linkedin.com/in/joechh
● Mail: joechh731126@gmail.com
5. New ETL
● Implemented in Java
● The ETL produces JSON-formatted streaming data
● The JSON stream is sent to the Kafka cluster via the Kafka Producer API
● Problem: a Kafka Producer built with the default settings has far too little throughput
[Diagram: New ETL with Kafka Producer]
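As a sketch of where the low throughput comes from, the 0.8.2-era producer configuration with its defaults might be assembled like this (the broker list is a hypothetical placeholder; in a real job these properties would be wrapped in a `kafka.producer.ProducerConfig` and handed to `kafka.javaapi.producer.Producer`):

```java
import java.util.Properties;

public class ProducerDefaults {
    // Kafka 0.8.2 producer config with the out-of-the-box values.
    // Synchronous, uncompressed sends are what made throughput so low.
    static Properties defaultConfig() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical hosts
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("producer.type", "sync");       // default: one blocking send per message
        props.put("compression.codec", "none");   // default: no compression on the wire
        props.put("request.required.acks", "0");  // default: fire-and-forget
        return props;
    }

    public static void main(String[] args) {
        System.out.println(defaultConfig().getProperty("producer.type"));
    }
}
```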
6. Experimenting with the Kafka Producer parameters (0.8.2)

Parameter                      Default   Available options
producer.type                  sync      sync, async
compression.codec              none      none, gzip, snappy
batch.num.messages             200       unlimited
request.required.acks          0         -1, 0, 1
queue.buffering.max.messages   10000     unlimited

https://kafka.apache.org/082/documentation.html
12. Kafka Producer: request.required.acks (cont.)
● request.required.acks = 1 can still lose data
● Being in sync with the leader node does not by itself give fault tolerance
[Diagram: the producer has sent messages 5 and 6; the leader replica holds 1-6 while follower replicas 1 and 2 still hold only 1-4]
15. Kafka Producer: request.required.acks (cont.)
● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2
[Diagram: the producer sends messages 5 and 6; the leader replica holds 1-6 while follower replicas 1 and 2 still hold only 1-4]
16. Kafka Producer: request.required.acks (cont.)
● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2
[Diagram: messages 5 and 6 are replicated from the leader to both followers; only after all replicas are in sync is the ack returned]
17. Kafka Producer: request.required.acks (cont.)
● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2
[Diagram: follower replicas 1 and 2 now also hold messages 5 and 6, matching the leader]
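Put together, the durable setup these slides describe might be summarized as the following properties (illustrative; `request.required.acks` is a producer setting, while `min.insync.replicas` is a topic/broker setting, and the replication factor is chosen when the topic is created):

```properties
# Producer side (0.8.2 key): wait until all in-sync replicas have the message
request.required.acks=-1

# Topic side: keep enough in-sync copies that one node can fail
# (topic created with a replication factor >= 2)
min.insync.replicas=2
```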
20. Kafka Producer: experiment conclusions

Parameter                      Final value   Available options
producer.type                  async         sync, async
compression.codec              snappy        none, gzip, snappy
batch.num.messages             1000          unlimited
request.required.acks          0             -1, 0, 1
queue.buffering.max.messages   20000         unlimited

● You still have to test against your actual workload and your tolerance for data loss
● Latency vs. throughput
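As a sketch, the final values from the table above translate into 0.8.2 producer properties like this (broker list and serializer omitted; they stay the same as in the default config):

```java
import java.util.Properties;

public class TunedProducerConfig {
    // Final parameter values from the throughput experiments (Kafka 0.8.2 keys).
    static Properties tunedConfig() {
        Properties props = new Properties();
        props.put("producer.type", "async");                // batch sends on a background thread
        props.put("compression.codec", "snappy");           // cheap CPU cost, decent ratio
        props.put("batch.num.messages", "1000");            // more messages per batched send
        props.put("request.required.acks", "0");            // throughput over durability
        props.put("queue.buffering.max.messages", "20000"); // larger async send buffer
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedConfig().getProperty("producer.type"));
    }
}
```

Note the trade-off the slide states: `acks=0` favors throughput and accepts possible data loss.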
21. Spark Streaming
● Implemented in Scala
● Client-side join of multiple Kafka streams
● Join results are written (upsert) to SQL Server to keep the existing architecture
[Diagram: New ETL with Kafka Producer → streaming join → upsert to SQL Server]
26. Another pitfall when using Spark with an RDB
● SQL Server allows at most 32,767 concurrent connections
● Don't ask me how I know that............
● Whichever connection pool you use, watch the total connection count
● Total connections = connection pool size * Spark executor count
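The arithmetic above is worth making explicit, since every executor opens its own pool. A tiny sketch (pool and executor sizes are hypothetical):

```java
public class ConnectionBudget {
    // Total connections held against SQL Server when each Spark executor
    // keeps its own connection pool: poolSize * executorCount.
    static int totalConnections(int poolSizePerExecutor, int executorCount) {
        return poolSizePerExecutor * executorCount;
    }

    public static void main(String[] args) {
        // 50-connection pools on 100 executors stay under the 32,767 cap...
        System.out.println(totalConnections(50, 100));
        // ...but 500-connection pools on 100 executors blow straight past it.
        System.out.println(totalConnections(500, 100) > 32767);
    }
}
```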
28. Some useful spark-submit configs
Ref: https://spark.apache.org/docs/2.0.0-preview/configuration.html
● --supervise
● spark.streaming.backpressure.enabled
● spark.streaming.backpressure.initialRate
● spark.streaming.kafka.maxRatePerPartition
● spark.executor.extraJavaOptions
○ -XX:+UseConcMarkSweepGC
● spark.cleaner.referenceTracking.cleanCheckpoint
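These options could be collected in a `spark-defaults.conf` (or passed as `--conf` flags); the rate values below are hypothetical and need tuning per workload, and `--supervise` itself is a spark-submit flag rather than a conf key:

```properties
# Illustrative spark-defaults.conf values; rates are placeholders
spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.initialRate=10000
spark.streaming.kafka.maxRatePerPartition=5000
spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC
spark.cleaner.referenceTracking.cleanCheckpoint=true
```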
30. However...... Cassandra is out! (in this project)
User: Joe.......... we want to build a dashboard. We need ad-hoc queries that can run arbitrary operations on arbitrary columns, so we can't discuss the possible queries with you in advance~~
Joe:
36. ES-side tuning for bulk loading
● Trade-off options during a bulk load:
○ Don't care about freshness of the latest data -> refresh less often (relax index.refresh_interval)
○ Don't care about fault tolerance or read speed -> set the replica count to 0
○ Don't care about the IO taken by segment merges (finish ASAP) -> don't throttle merge IO
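For the first two trade-offs, the dynamic index settings might be updated like this before the bulk load (index name is a placeholder; ES 2.x REST API):

```
PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1",      // or e.g. "30s"; default is "1s"
    "number_of_replicas": 0        // no replica copies while loading
  }
}
```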
40. ES search query tuning
● Optimize (force merge) cold indices, possibly down to a single segment
● Use two kinds of cache to speed up queries
○ Filter cache: caches filter results for reuse by later queries
○ Shard cache: caches whole query results; identical queries are answered straight from it
● Don't forget to revert the temporary settings made for bulk-load mode
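The first and last bullets might look like this against the ES 2.x REST API (index name is a placeholder; `_forcemerge` replaced the older `_optimize` endpoint):

```
POST /my_index/_forcemerge?max_num_segments=1

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s",      // restore the default refresh
    "number_of_replicas": 1        // restore replicas after the bulk load
  }
}
```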
45. Summary
● Discussed component versions
○ Kafka: 0.8.2
○ Spark: 2.0.2
○ Cassandra: 3.10
○ Elasticsearch: 2.4.5
○ Logstash: 2.4.1
○ Kibana: 4.6.4
[Diagram: New ETL with Kafka Producer]