Kafka & Hadoop in Rakuten

  1. Kafka & Hadoop in Rakuten. Apr 21st, 2021. Yongduck Lee, Cloud Platform Dept., Rakuten Group, Inc.
  2. What is Apache Kafka? Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. https://kafka.apache.org
     • Unified platform for handling real-time data feeds
     • High throughput to support high-volume event streams
     • Graceful handling of large data backlogs
     • Low-latency delivery to handle more traditional messaging use cases
     • Fault tolerance in the presence of machine failures
     • No in-process cache of the data
  3. What is Elasticsearch? Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning-fast search, fine-tuned relevancy, and powerful analytics that scale with ease. https://www.elastic.co/elasticsearch/
     [Diagram labels: Client; Cluster A, Cluster B, Cluster C; Master Nodes, Data Nodes (primary, replica), ML Nodes, Coordinating Nodes, Transform Nodes, Remote Cluster Nodes]
  4. What is Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. https://hadoop.apache.org
  5. Data Pipeline Concept
     Stages: Data Provider, Data Collection, Data Wrangling, Data Process & Analysis, Visualization
     Data providers (heterogeneous data: unstructured, semi-structured, structured, ……):
     • System/Network: application logs, access logs, transaction logs, OS logs/network traffic
     • User Behaviors: purchase, page view, click, RQ/LDTime, geo location, review, product search
     • Service Platform (Event / Product / Profile Info): email, campaign, questionnaires, product/item, demography
     Data collection: real-time collection, CDC, full dump
     Data wrangling: enrichment, normalization, cleaning
     Near real-time or batch
     • Near-time consumers: data investigation, reporting, historical data
     • Realtime data (5-15 sec): realtime dashboards, traffic anomalies, initial research, recent data
     Data users (actors): INFRA, RCMD, RANKING, Log Management, Data Analysis, ….
  6. Data Pipeline Concept
     • Sub-second, interactive investigation of data as time series
     • Complex analytics, data processing, AI, etc. over large datasets; may take from seconds to days to run, depending on workload and processing framework
  7. Kafka in Rakuten. We have been providing a Kafka service from Kafka 0.8 to 2.4 with PLAINTEXT, SASL_PLAINTEXT, and SASL_SSL, handling around 1.3 million messages/sec (10 GB/sec in/out) at peak time on a normal day. During the 2021 Super Sale, we handled more than 2.5 times the messages and traffic.
     • 5 Mar 2021, 22:45: 62 Kafka clusters (7,440 cores, 21 TB memory, 4,972 topics)
     • 8 Apr 2021: 77 Kafka clusters (7,904 cores, 22 TB memory, 5,091 topics)
     [Chart label: 7.440 K]
  8. Kafka in Rakuten
     [Map labels: NA, EU, JP; 69, 4, 4]
     • Near real-time one-way mirroring
     • Cross-DC Active/Active or Active/Hot Standby Kafka using MirrorMaker2 + Kafka Connect
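
The Active/Active mirroring mentioned on the slide above depends on keeping mirrored topics from bouncing back to their origin. A minimal sketch of that idea follows, assuming MirrorMaker 2's default replication policy, which names a mirrored topic `<source alias>.<topic>`. The cluster aliases `jp` and `eu`, the topic names, and the helper functions are hypothetical, and the cycle check is simplified to a single hop; it is not the deck's actual configuration.

```python
# Simplified model of MirrorMaker 2 topic naming under the default
# replication policy: mirrored topics are prefixed with the source alias.
SEPARATOR = "."  # default separator between cluster alias and topic name


def remote_topic(source_alias: str, topic: str) -> str:
    """Name a topic gets on the target cluster after mirroring."""
    return f"{source_alias}{SEPARATOR}{topic}"


def should_replicate(topic: str, target_alias: str) -> bool:
    """Skip topics that originated on the target (one-hop cycle check)."""
    return not topic.startswith(f"{target_alias}{SEPARATOR}")


def mirror(topics, source_alias: str, target_alias: str):
    """Return the topic names that would appear on the target cluster."""
    return [
        remote_topic(source_alias, t)
        for t in topics
        if should_replicate(t, target_alias)
    ]


# Hypothetical example: the "eu" cluster holds its own "orders" topic
# and a copy of the "jp" cluster's topic, named "jp.orders".
eu_topics = ["orders", "jp.orders"]
mirrored_to_jp = mirror(eu_topics, source_alias="eu", target_alias="jp")
```

Here only `orders` is mirrored to `jp` (as `eu.orders`); `jp.orders` is skipped because it already came from `jp`.
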
  9. Elasticsearch in Rakuten. We have been providing an ES service from 2.X to 7.X with Basic & Commercial subscriptions, indexing hundreds of thousands of docs/sec for near-real-time log management & monitoring and for user behavior & KPI analysis. During the 2021 Super Sale, we handled more than 2 times the docs and traffic.
     • 47 ES clusters (5,960 cores, 6.4 TB memory, 71 TB indices)
  10. Hadoop in Rakuten. We provide HDP2 & HDP3 clusters in the JP/EU/US regions. Our use case is aggressively multi-tenant: tenants use the clusters as a data lake and for data analysis, backup & archiving, etc. CPU-intensive, memory-intensive, and disk-intensive workloads all run on the clusters at the same time, yet we provide a stable, high-performance service, drawing on extensive Hadoop administration experience going back to the first generation of Apache Hadoop.
     • As of 8 Apr 2021: 72K vcores, 442 TB memory (RAM), 130 PB disk, 1K nodes
  11. Hadoop in Rakuten
  12. Challenge on Kafka
     Mirroring throughput between regions or zones
     - Temporary network failure
     - High latency
     - Location of MirrorMaker (pros & cons)
     Approach (Parallelism):
     - Increase the # of partitions
     - Relocate MirrorMaker to the source side and increase producer parallelism
     Instability or broken cluster
     - High load during rebalancing or recovery
     - Rack-awareness
     - Major/minor upgrade or patching
     Approach (Scale-Out rather than Scale-Up):
     - Reduce the size of data replicated during recovery or rebalancing by using small servers with properly sized disk, CPU, and memory for Java/Scala
     JDK & cross-realm issues
     - Consuming & producing between clusters with a different realm or service name
     - JDK specification for Kerberos authentication
     Approach (Global KDC and Proper Streaming Solution):
     - Use a streaming framework (Spark, Flink, and so on)
     - Use middleware that supports different service names and versions (NiFi)
     - Use a global KDC and one realm for Kafka clusters
     OOM on brokers or ZooKeeper
     - Many consumers or producers
     - Large message size
     - Z-node creation
     Approach (Confirm Use-Case and Dedicated ZK):
     - Guide users through proactive, professional consultation
     - Authorization on ZK nodes
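
Why high latency between regions calls for more parallelism can be shown with a rough back-of-the-envelope model. The sketch below assumes each producer connection can keep only a fixed number of bytes unacknowledged at a time, so its throughput is capped at that amount divided by the round-trip time. The function names and all numbers (4 MB in flight, 200 ms and 2 ms round trips, a 100 MB/sec target) are illustrative assumptions, not measurements from the deck.

```python
import math


def max_bytes_per_sec(in_flight_bytes: int, rtt_ms: int) -> float:
    """Upper bound for one connection: bytes in flight per round trip."""
    return in_flight_bytes * 1000 / rtt_ms


def producers_needed(target_bytes_per_sec: int,
                     in_flight_bytes: int,
                     rtt_ms: int) -> int:
    """Parallel producers required to reach the target under this model."""
    per_producer = max_bytes_per_sec(in_flight_bytes, rtt_ms)
    return math.ceil(target_bytes_per_sec / per_producer)


# Illustrative numbers only (assumptions, not figures from the slides).
IN_FLIGHT = 4_000_000   # bytes a single producer keeps unacknowledged
TARGET = 100_000_000    # desired mirroring throughput in bytes/sec

cross_region = producers_needed(TARGET, IN_FLIGHT, rtt_ms=200)
same_zone = producers_needed(TARGET, IN_FLIGHT, rtt_ms=2)
```

Under these assumptions, the same target needs several producers across regions but a single one within a zone, which is consistent with the slide's advice to increase partitions and producer parallelism for cross-region mirroring.
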
  13. Challenge on Elasticsearch
     Mixed indexing and query patterns
     • Doc/sec (100K doc/sec ~ 1K doc/sec)
     • Size per index (1 TB/hour ~ 1 GB/hour)
     • Short- or long-term queries
     Unbalanced shard distribution
     • Total # of shards per node
     • Balance of high- and low-load shards per node
     Too many indices and shards
     • Long retention
     • Many shards per index for load distribution
     Arbitrary docs indexed into ES
     • Arbitrary # of JSON fields
     • Invalid data that does not match the data type
     • Too many JSON fields per doc
     Fast queries in the middle of high indexing load
     OOM on data nodes and coordinating nodes
     Hard to scale out only for high-load indices
     ……
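
One common way to contain the "arbitrary docs" problem listed above is an index template that caps the field count, turns off dynamic mapping, and tolerates malformed values. The sketch below only builds such a template body as a Python dict; it assumes the composable index template shape available from Elasticsearch 7.8, and the index pattern, field names, limits, and the `guarded_template` helper are hypothetical. It is not the template used in the deck.

```python
import json


def guarded_template(pattern: str, max_fields: int, shards: int) -> dict:
    """Build an index template body that limits mapping growth."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "index.number_of_shards": shards,
                # Reject mappings that grow beyond this many fields.
                "index.mapping.total_fields.limit": max_fields,
                # Skip values that do not parse as the mapped type
                # instead of rejecting the whole document.
                "index.mapping.ignore_malformed": True,
            },
            "mappings": {
                # Unknown fields stay in _source but are not indexed.
                "dynamic": False,
                "properties": {
                    "@timestamp": {"type": "date"},
                    "message": {"type": "text"},
                    "status": {"type": "integer"},
                },
            },
        },
    }


# Hypothetical usage: a template for application log indices.
template_body = guarded_template("app-logs-*", max_fields=200, shards=3)
template_json = json.dumps(template_body, indent=2)
```

The resulting JSON could be sent to the index template API; whether to ignore or reject unknown fields (`"dynamic": "strict"`) is a design choice that depends on how much control the platform has over its tenants' data.
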
  14. Challenge on Elasticsearch
     [Diagram labels: Client; Coordinating Nodes; Master Nodes; Data Nodes split into Hot and Cold for each of HL Group, ML Group, LL Group; Routing; SEH Template; IDX Move/Merge/READ-ONLY]
  15. Challenge on Elasticsearch
     [Diagram labels: Client; Coordinating Nodes; Master Nodes; Data Nodes split into Hot and Cold for each of HL Group, ML Group, LL Group; SEH Template; IDX Move/READ-ONLY]
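
The "IDX Move/READ-ONLY" step in the two diagrams above can be approximated with shard allocation filtering. The sketch below assumes data nodes carry a custom attribute named `box_type` with the value `hot` or `cold` (a widespread convention, but an assumption here), and that indices are moved by age. The age threshold and the `settings_for_index_age` helper are hypothetical; the deck does not state how its own move is triggered.

```python
# Index setting that pins shards to nodes with a matching custom attribute.
# "box_type" is an assumed attribute name, set per node by the operator.
ALLOCATION_KEY = "index.routing.allocation.require.box_type"


def settings_for_index_age(age_days: int, hot_days: int) -> dict:
    """Index settings for an index of the given age.

    Young indices stay writable on hot nodes; older ones are moved to
    cold nodes and blocked for writes.
    """
    if age_days < hot_days:
        return {ALLOCATION_KEY: "hot", "index.blocks.write": False}
    return {ALLOCATION_KEY: "cold", "index.blocks.write": True}


# Hypothetical policy: keep indices on hot nodes for 3 days.
fresh = settings_for_index_age(age_days=1, hot_days=3)
aged = settings_for_index_age(age_days=10, hot_days=3)
```

Applying the returned dict through the update index settings API would make Elasticsearch relocate the shards on its own; the write block keeps late writes from landing on the cold tier.
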
  16. Challenge on Hadoop
     Aggressive multi-tenancy on one big cluster
     - Job pending or execution delay
     - NameNode slowdown
     - ZooKeeper timeout
     - NameNode heap
     - Localization issues
     - Large # of files
     High performance & low cost
     - CPU-intensive
     - Memory-intensive
     - Disk-intensive
     Approaches
     - Preemption
     - Federation
     - ZooKeeper separation
     - Continuous balancing
     - Dedicated nodes with labeling
     - Heterogeneous, proper node design based on needs
     - Utilizing SSD & HDD
     - On-premise training course
     - NameNode RPC QoS
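
The link between "Large # of Files" and "NameNode Heap" on the slide above can be made concrete with a widely quoted rule of thumb: every file, directory, and block is an object in NameNode memory costing roughly 150 bytes. The sketch below applies that rule; the constant is an approximation, and the file counts are illustrative, not Rakuten's figures.

```python
# Rule-of-thumb cost of one namespace object (file, directory, or block)
# in NameNode heap. This is an approximation, not a measured value.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(files: int,
                        blocks_per_file: int = 1,
                        directories: int = 0) -> int:
    """Estimate NameNode heap used by the namespace."""
    objects = files + files * blocks_per_file + directories
    return objects * BYTES_PER_OBJECT


# Illustrative comparison: the same 1,000,000 blocks of data stored as
# many single-block files versus fewer files of 8 blocks each.
small_files = namenode_heap_bytes(files=1_000_000, blocks_per_file=1)
large_files = namenode_heap_bytes(files=125_000, blocks_per_file=8)
```

Under this estimate the small-file layout uses noticeably more heap for the same number of blocks, which is one reason federation and file compaction are common responses to namespace growth.
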
  17. Future Challenges
     Self-Service
     • Self-operation
     • Data profiling & governance
     • Broker-level administration
     • Active-active mirroring
     Next Generation
     • Kafka vs ???
     • Elasticsearch vs ???
     Return to Apache Hadoop
     • HDP subscription policy
     • Ambari to Chef or Ansible
     • Rakuten Distribution Hadoop
     Containerization
     • Service discovery
     • Persistent storage or local storage
     • Physical vs logical separation
