Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
9HWIOL ,
5QWH UDWLQ SDUN
.W HWDE WH FDOH
. KZLQ KDQNDU
0KHRO RR DUN
: WOLQH
  9HWIOL EL DWD SODWIRUP
  SDUN - 9HWIOL
  8 OWL WHQDQF SUREOHP
&   UH LFDWH S K RZQ
  ILOH OL WLQ
(   LQ HUW RYHU...
9HWIOL
/L 1DWD ODWIRUP
9HWIOL DWD SLSHOLQH
Cloud
Apps
S3
Suro/Kafka Ursula
SSTables
Cassandra Aegisthus
Event Data
500 bn/day, 15m
Daily
Dimensio...
9HWIOL EL DWD SODWIRUP
Data
Warehouse
Service
Tools
Gateways
Prod
Clients
Clusters
Adhoc Prod TestTest
Big Data API/Portal...
: U H FD H
•  Batch jobs (Pig, Hive)
•  ETL jobs
•  Reporting and other analysis
•  Interactive jobs (Presto)
•  Iterative...
SDUN - 9HWIOL
8L RI HSOR PHQW
•  SDUN RQ 8H R
•  HOI HUYLQ .85
•  3 OO /1. HUNHOH DWD QDO WLF WDFN
•  :QOLQH WUHDPLQ DQDO WLF
•  SDUN RQ...
SDUN RQ B. 9
•  8 OWL WHQDQW FO WHU LQ .A FOR
•  4R WLQ 8 SDUN 1U L
•  28 4D RRS & .85 +
•  1 & ODU H HF LQ WDQFH W SH
•  ...
1HSOR PHQW
S3 s3://bucket/spark/1.5/spark-1.5.tgz, spark-defaults.conf (spark.yarn.jar=1440443677)
s3://bucket/spark/1.4/s...
. YDQWD H
 . WRPDWH HSOR PHQW
  SSRUW P OWLSOH YHU LRQ
 1HSOR QHZ FR H LQ PLQ WH
&   ROO EDFN ED FR H LQ OH WKDQ D PLQ WH
8 OWL WHQDQF
UREOHP
1 QDPLF DOORFDWLRQ
Courtesy of “Dynamic allocate cluster resources to your Spark application” at Hadoop Summit 2015
1 QDPLF DOORFDWLRQ
// spark-defaults.conf
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout...
UREOHP , . 6 (+ &
. 6 (+ &
UREOHP , . 6 )+
. 6 )+
val data = sqlContext
.table("dse.admin_genie_job_d”)
.filter($"dateint">=20150601 and $"dateint"<=20150830)
data.p...
UREOHP , . 6 )& . 6 ()
. 6 )& . 6 ()
•  PSWRP
•  SDUN H HF WRU WD N UDQ RPO IDLO FD LQ MRE IDLO UH
•  0D H
•  UHHPSWH H HF WRU WD N DUH FR QWH D ...
UREOHP &, B. 9 )
B. 9 )
•  PSWRP
•  8 MRE HW WLPH R W ULQ ORFDOL DWLRQ ZKHQ U QQLQ ZLWK SDUN MRE
RQ WKH DPH FO WHU
•  0D H
•  98 ORFDOL H R...
UH LFDWH
K RZQ
UH LFDWH S K RZQ
Case Behavior
Predicates with partition cols on partitioned table Single partition scan
Predicates with p...
UH LFDWH S K RZQ IRU PHWD DWD
Analyzer
Optimizer
SparkPlanner
Parser
HiveMetastoreCatalog
getAllPartitions()
ResolveRelati...
. 6 (+
•  PSWRP
•  HU LQ D DLQ W KHDYLO SDUWLWLRQH 4LYH WDEOH L ORZ
•  0D H
•  UH LFDWH DUH QRW S KH RZQ LQWR 4LYH PHWD WR...
UH LFDWH S K RZQ IRU PHWD DWD
Analyzer
Optimizer
SparkPlanner
Parser
HiveTableScan
getPartitionsByFilter()
HiveTableScans
3LOH 7L WLQ
5QS W SOLW FRPS WDWLRQ
•  PDSUH FH LQS W ILOHLQS WIRUPDW OL W WDW Q P WKUHD
•  KH Q PEHU RI WKUHD WR H OL W DQ IHWFK EORFN...
3LOH OL WLQ IRU SDUWLWLRQH WDEOH
Partition path
Seq[RDD]
HadoopRDD
HadoopRDD
HadoopRDD
HadoopRDD
Partition path
Partition ...
. 6 ++ ( . 6 &
•  PSWRP
•  5QS W SOLW FRPS WDWLRQ IRU SDUWLWLRQH 4LYH WDEOH RQ L ORZ
•  0D H
•  7L WLQ ILOH RQ D SHU SDUWL...
E ON OL WLQ
Partition path
ParArray[RDD]
HadoopRDD
HadoopRDD
HadoopRDD
HadoopRDD
Partition path
Partition path
Partition p...
HUIRUPDQFH LPSURYHPHQW
0
2000
4000
6000
8000
10000
12000
14000
16000
1 24 240 720
seconds
# of partitions
1.5 RC2
S3 bulk ...
5Q HUW :YHUZULWH
UREOHP , 4D RRS R WS W FRPPLWWHU
•  4RZ LW ZRUN ,
•  2DFK WD N ZULWH R WS W WR D WHPS LU
•  : WS W FRPPLWWHU UHQDPH ILU W ...
R WS W FRPPLWWHU
•  4RZ LW ZRUN ,
•  2DFK WD N ZULWH R WS W WR ORFDO L N
•  : WS W FRPPLWWHU FRSLH ILU W FFH I O WD N R WS...
UREOHP , 4LYH LQ HUW RYHUZULWH
•  4RZ LW ZRUN ,
•  1HOHWH DQ UHZULWH H L WLQ R WS W LQ SDUWLWLRQ
•  UREOHP ZLWK ,
•  L HYH...
/DWFKL SDWWHUQ
•  4RZ LW ZRUN ,
•  9HYHU HOHWH H L WLQ R WS W LQ SDUWLWLRQ
•  2DFK MRE LQ HUW D QLT H ESDUWLWLRQ FDOOH EDW...
CHSSHOLQ
5S WKRQ
9RWHERRN
/L DWD SRUWDO
•  :QH WRS KRS IRU DOO EL DWD UHODWH WRRO DQ HUYLFH
•  / LOW RQ WRS RI /L 1DWD . 5
: W RI ER H DPSOH
•  Zero installation
•  Dependency management via Docker
•  Notebook persistence
•  Elastic resources
:Q HPDQ QRWHERRN
Quick facts about Titan
•  Task execution platform leveraging Apache Mesos.
•  Manages underlying EC2 instances.
•  Proces...
Notebook Infrastructure
Ephemeral ports / --net=host mode
Zeppelin
Docker
Container A
172.X.X.X
Host machine A
54.X.X.X
Host machine B
54.X.X.X
Py...
@ H 0D H
L Y SDUN
5WHUDWLYH MRE
5WHUDWLYH MRE
1. Duplicate data and aggregate them differently.
2. Merging aggregates back.
HUIRUPDQFH LPSURYHPHQW
0:00:00
0:14:24
0:28:48
0:43:12
0:57:36
1:12:00
1:26:24
1:40:48
1:55:12
2:09:36
job 1 job 2 job 3
h...
: U FRQWULE WLRQ
. 6 (
. 6 (((
. 6 (+ +
. 6 (+
. 6 ) )
. 6 )&
. 6 )
. 6
. 6 )
. 6 +
. 6 + )
. 6 ++ (
. 6
. 6 &
.
KDQN BR
Nächste SlideShare
Wird geladen in …5
×

Netflix integrating spark at petabyte scale

September 2015 Apache Big Data

Netflix integrating spark at petabyte scale

  1. 1. 9HWIOL , 5QWH UDWLQ SDUN .W HWDE WH FDOH . KZLQ KDQNDU 0KHRO RR DUN
  2. 2. : WOLQH   9HWIOL EL DWD SODWIRUP   SDUN - 9HWIOL   8 OWL WHQDQF SUREOHP &   UH LFDWH S K RZQ   ILOH OL WLQ (   LQ HUW RYHUZULWH )   CHSSHOLQ 5S WKRQ QRWHERRN   @ H FD H L Y SDUN
  3. 3. 9HWIOL /L 1DWD ODWIRUP
  4. 4. 9HWIOL DWD SLSHOLQH Cloud Apps S3 Suro/Kafka Ursula SSTables Cassandra Aegisthus Event Data 500 bn/day, 15m Daily Dimension Data
  5. 5. 9HWIOL EL DWD SODWIRUP Data Warehouse Service Tools Gateways Prod Clients Clusters Adhoc Prod TestTest Big Data API/Portal Metacat Prod
  6. 6. : U H FD H •  Batch jobs (Pig, Hive) •  ETL jobs •  Reporting and other analysis •  Interactive jobs (Presto) •  Iterative ML jobs (Spark)
  7. 7. SDUN - 9HWIOL
  8. 8. 8L RI HSOR PHQW •  SDUN RQ 8H R •  HOI HUYLQ .85 •  3 OO /1. HUNHOH DWD QDO WLF WDFN •  :QOLQH WUHDPLQ DQDO WLF •  SDUN RQ B. 9 •  SDUN D D HUYLFH •  B. 9 DSSOLFDWLRQ RQ 28 4D RRS •  :IIOLQH EDWFK DQDO WLF
  9. 9. SDUN RQ B. 9 •  8 OWL WHQDQW FO WHU LQ .A FOR •  4R WLQ 8 SDUN 1U L •  28 4D RRS & .85 + •  1 & ODU H HF LQ WDQFH W SH •  QR H / WRWDO PHPRU
  10. 10. 1HSOR PHQW S3 s3://bucket/spark/1.5/spark-1.5.tgz, spark-defaults.conf (spark.yarn.jar=1440443677) s3://bucket/spark/1.4/spark-1.4.tgz, spark-defaults.conf (spark.yarn.jar=1440304023) /spark/1.5/1440443677/spark-assembly.jar /spark/1.5/1440720326/spark-assembly.jar /spark/1.4/1440304023/spark-assembly.jar /spark/1.4/1440989711/spark-assembly.jar name: spark version: 1.5 tags: ['type:spark', 'ver:1.5'] jars: - 's3://bucket/spark/1.5/spark-1.5.tgz’ Download latest tarball From S3 via Genie
  11. 11. . YDQWD H  . WRPDWH HSOR PHQW   SSRUW P OWLSOH YHU LRQ  1HSOR QHZ FR H LQ PLQ WH &   ROO EDFN ED FR H LQ OH WKDQ D PLQ WH
  12. 12. 8 OWL WHQDQF UREOHP
  13. 13. 1 QDPLF DOORFDWLRQ Courtesy of “Dynamic allocate cluster resources to your Spark application” at Hadoop Summit 2015
  14. 14. 1 QDPLF DOORFDWLRQ // spark-defaults.conf spark.dynamicAllocation.enabled true spark.dynamicAllocation.executorIdleTimeout 5 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 500 spark.dynamicAllocation.minExecutors 3 spark.dynamicAllocation.schedulerBacklogTimeout 5 spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5 spark.dynamicAllocation.cachedExecutorIdleTimeout 900 // yarn-site.xml yarn.nodemanager.aux-services •  spark_shuffle, mapreduce_shuffle yarn.nodemanager.aux-services.spark_shuffle.class •  org.apache.spark.network.yarn.YarnShuffleService
  15. 15. UREOHP , . 6 (+ &
  16. 16. . 6 (+ &
  17. 17. UREOHP , . 6 )+
  18. 18. . 6 )+ val data = sqlContext .table("dse.admin_genie_job_d”) .filter($"dateint">=20150601 and $"dateint"<=20150830) data.persist data.count
  19. 19. UREOHP , . 6 )& . 6 ()
  20. 20. . 6 )& . 6 () •  PSWRP •  SDUN H HF WRU WD N UDQ RPO IDLO FD LQ MRE IDLO UH •  0D H •  UHHPSWH H HF WRU WD N DUH FR QWH D IDLO UH •  RO WLRQ •  UHHPSWH H HF WRU WD N KR O EH FRQ L HUH D NLOOH
  21. 21. UREOHP &, B. 9 )
  22. 22. B. 9 ) •  PSWRP •  8 MRE HW WLPH R W ULQ ORFDOL DWLRQ ZKHQ U QQLQ ZLWK SDUN MRE RQ WKH DPH FO WHU •  0D H •  98 ORFDOL H RQH MRE DW D WLPH LQFH SDUN U QWLPH MDU L EL ORFDOL LQ SDUN MRE PD WDNH ORQ EORFNLQ 8 MRE •  RO WLRQ •  WD H SDUN U QWLPH MDU RQ 413 ZLWK KL K UHSOLDFDWLRQ •  8DNH 98 ORFDOL H P OWLSOH MRE FRQF UUHQWO
  23. 23. UH LFDWH K RZQ
  24. 24. UH LFDWH S K RZQ Case Behavior Predicates with partition cols on partitioned table Single partition scan Predicates with partition and non-partition cols on partitioned table Single partition scan No predicate on partitioned table e.g. sqlContext.table(“nccp_log”).take(10) Full scan No predicate on non-partitioned table Single partition scan
  25. 25. UH LFDWH S K RZQ IRU PHWD DWD Analyzer Optimizer SparkPlanner Parser HiveMetastoreCatalog getAllPartitions() ResolveRelation What if your table has 1.6M partitions?
  26. 26. . 6 (+ •  PSWRP •  HU LQ D DLQ W KHDYLO SDUWLWLRQH 4LYH WDEOH L ORZ •  0D H •  UH LFDWH DUH QRW S KH RZQ LQWR 4LYH PHWD WRUH R SDUN RH I OO FDQ IRU WDEOH PHWD DWD •  RO WLRQ •  K RZQ ELQDU FRPSDUL RQ H SUH LRQ YLD HW DUWLWLRQ / ILOWHU LQ WR 4LYH PHWD WRUH
  27. 27. UH LFDWH S K RZQ IRU PHWD DWD Analyzer Optimizer SparkPlanner Parser HiveTableScan getPartitionsByFilter() HiveTableScans
  28. 28. 3LOH 7L WLQ
  29. 29. 5QS W SOLW FRPS WDWLRQ •  PDSUH FH LQS W ILOHLQS WIRUPDW OL W WDW Q P WKUHD •  KH Q PEHU RI WKUHD WR H OL W DQ IHWFK EORFN ORFDWLRQ IRU WKH SHFLIL H LQS W SDWK •  HWWLQ WKL SURSHUW LQ SDUN MRE RH Q W KHOS
  30. 30. 3LOH OL WLQ IRU SDUWLWLRQH WDEOH Partition path Seq[RDD] HadoopRDD HadoopRDD HadoopRDD HadoopRDD Partition path Partition path Partition path Input dir Input dir Input dir Input dir Sequentially listing input dirs via S3N file system. S3N S3N S3N S3N
  31. 31. . 6 ++ ( . 6 & •  PSWRP •  5QS W SOLW FRPS WDWLRQ IRU SDUWLWLRQH 4LYH WDEOH RQ L ORZ •  0D H •  7L WLQ ILOH RQ D SHU SDUWLWLRQ ED L L ORZ •  9 ILOH WHP FRPS WH DWD ORFDOLW KLQW •  RO WLRQ •  / ON OL W SDUWLWLRQ LQ SDUDOOHO LQ .PD RQ 0OLHQW •  / SD DWD ORFDOLW FRPS WDWLRQ IRU REMHFW
  32. 32. E ON OL WLQ Partition path ParArray[RDD] HadoopRDD HadoopRDD HadoopRDD HadoopRDD Partition path Partition path Partition path Input dir Input dir Input dir Input dir Bulk listing input dirs in parallel via AmazonS3Client. Amazon S3Client
  33. 33. HUIRUPDQFH LPSURYHPHQW 0 2000 4000 6000 8000 10000 12000 14000 16000 1 24 240 720 seconds # of partitions 1.5 RC2 S3 bulk listing SELECT * FROM nccp_log WHERE dateint=20150801 and hour=0 LIMIT 10;
  34. 34. 5Q HUW :YHUZULWH
  35. 35. UREOHP , 4D RRS R WS W FRPPLWWHU •  4RZ LW ZRUN , •  2DFK WD N ZULWH R WS W WR D WHPS LU •  : WS W FRPPLWWHU UHQDPH ILU W FFH I O WD N WHPS LU WR ILQDO H WLQDWLRQ •  UREOHP ZLWK , •  UHQDPH L FRS DQ HOHWH •  L HYHQW DO FRQ L WHQW •  3LOH9RW3R Q 2 FHSWLRQ ULQ UHQDPH
  36. 36. R WS W FRPPLWWHU •  4RZ LW ZRUN , •  2DFK WD N ZULWH R WS W WR ORFDO L N •  : WS W FRPPLWWHU FRSLH ILU W FFH I O WD N R WS W WR •  . YDQWD H , •  .YRL UH DQDQW FRS •  .YRL HYHQW DO FRQ L WHQF
  37. 37. UREOHP , 4LYH LQ HUW RYHUZULWH •  4RZ LW ZRUN , •  1HOHWH DQ UHZULWH H L WLQ R WS W LQ SDUWLWLRQ •  UREOHP ZLWK , •  L HYHQW DO FRQ L WHQW •  3LOH.OUHD 2 L W2 FHSWLRQ ULQ UHZULWH
  38. 38. /DWFKL SDWWHUQ •  4RZ LW ZRUN , •  9HYHU HOHWH H L WLQ R WS W LQ SDUWLWLRQ •  2DFK MRE LQ HUW D QLT H ESDUWLWLRQ FDOOH EDWFKL •  . YDQWD H , •  .YRL HYHQW DO FRQ L WHQF
  39. 39. CHSSHOLQ 5S WKRQ 9RWHERRN
  40. 40. /L DWD SRUWDO •  :QH WRS KRS IRU DOO EL DWD UHODWH WRRO DQ HUYLFH •  / LOW RQ WRS RI /L 1DWD . 5
  41. 41. : W RI ER H DPSOH
  42. 42. •  Zero installation •  Dependency management via Docker •  Notebook persistence •  Elastic resources :Q HPDQ QRWHERRN
  43. 43. Quick facts about Titan •  Task execution platform leveraging Apache Mesos. •  Manages underlying EC2 instances. •  Process supervision and uptime in the face of failures. •  Auto scaling
  44. 44. Notebook Infrastructure
  45. 45. Ephemeral ports / --net=host mode Zeppelin Docker Container A 172.X.X.X Host machine A 54.X.X.X Host machine B 54.X.X.X Pyspark Docker Container B 172.X.X.X Titan cluster YARN cluster Spark AM Spark AM
  46. 46. @ H 0D H L Y SDUN
  47. 47. 5WHUDWLYH MRE
  48. 48. 5WHUDWLYH MRE 1. Duplicate data and aggregate them differently. 2. Merging aggregates back.
  49. 49. HUIRUPDQFH LPSURYHPHQW 0:00:00 0:14:24 0:28:48 0:43:12 0:57:36 1:12:00 1:26:24 1:40:48 1:55:12 2:09:36 job 1 job 2 job 3 hh:mm:ss Pig Spark 1.2
  50. 50. : U FRQWULE WLRQ . 6 ( . 6 ((( . 6 (+ + . 6 (+ . 6 ) ) . 6 )& . 6 ) . 6 . 6 ) . 6 + . 6 + ) . 6 ++ ( . 6 . 6 &
  51. 51. .
  52. 52. KDQN BR

×