Apache Spark with Hortonworks Data Platform
Saptak Sen [@saptak]
Technical Product Manager
© Creative Commons Attribution-ShareAlike 3.0 Unported License
What we will cover
• Setup
• YARN Queues for Spark
• Zeppelin Notebook
• Spark Concepts
• Hands-On
Get the slides from: http://l.saptak.in/sparkseattle
What is Apache Spark?
A fast and general engine for large-scale data processing.
Setup
Download Hortonworks Sandbox
h"p://hortonworks.com/sandbox
Configure Hortonworks Sandbox
Open network settings for the VM
Open port for Zeppelin
Login to Ambari
Ambari is at port 8080. The username and password are admin/admin.
Open YARN Queue Manager
Create a dedicated queue for Spark
Set the name to root.spark, Capacity to 50%, and Max Capacity to 95%
Save and Refresh Queues
Adjust the capacity of the default queue to 50%, if required, before saving
Open Apache Spark configuration
Set the 'spark' queue as the default queue for Spark
Restart All Affected Services
SSH into your Sandbox
Install the Zeppelin Service for Ambari
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
Restart Ambari to enumerate the new service
service ambari restart
Go to Ambari to Add the Zeppelin Service
Select Zeppelin Notebook
Click Next and accept the default configurations unless you know what you are doing
Wait for Zeppelin deployment to complete
Launch Zeppelin from your browser
Zeppelin is at port 9995
Concepts
RDD
• Resilient
• Distributed
• Dataset (see the sketch below)
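A minimal sketch of the "distributed" part, assuming the Spark shell's sc object; the numbers and partition count are only an example:
rdd = sc.parallelize(range(10), 4)  # a small dataset split into 4 partitions
print rdd.getNumPartitions()        # 4: the data lives across the cluster
print rdd.glom().collect()          # the elements, grouped by partition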
Program Flow
SparkContext
• Created by your driver program
• Responsible for making your RDDs resilient and distributed
• Instantiates RDDs
• The Spark shell creates a "sc" object for you
Creating RDDs
myRDD = sc.parallelize([1,2,3,4])
myRDD = sc.textFile("hdfs:///tmp/shakepeare.txt")
# file://, s3n://
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age from users")
Creating RDDs
• Can also create from:
• JDBC
• Cassandra
• HBase
• Elasticsearch
• JSON, CSV, sequence files, object files, ORC, Parquet, Avro (see the JSON sketch below)
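As a sketch of one of these sources, here is JSON via the SQLContext. The path is hypothetical, and on Spark releases before 1.4 use sqlContext.jsonFile(path) instead of read.json:
df = sqlContext.read.json("hdfs:///tmp/events.json")  # hypothetical sample file
json_rdd = df.rdd                                     # drop down to the underlying RDD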
Passing Functions to Spark
Many RDD methods accept a function as a parameter
myRdd.map(lambda x: x*x)
is the same thing as
def sqrN(x):
    return x*x
myRdd.map(sqrN)
RDD Transformations
• map
• flatMap
• filter
• distinct
• sample
• union, intersection, subtract, cartesian (see the sketch below)
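A quick sketch of a few of these transformations, assuming the shell's sc (the data is made up):
words = sc.parallelize(["spark", "yarn", "spark", "hive"])
print words.distinct().collect()                    # de-duplicated elements
print words.filter(lambda w: w.startswith("s")).collect()
lines = sc.parallelize(["to be", "or not"])
print lines.flatMap(lambda l: l.split()).collect()  # ['to', 'be', 'or', 'not']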
Map example
myRdd = sc.parallelize([1,2,3,4])
myRdd.map(lambda x: x+x)
• Results: 2, 4, 6, 8
Transformations are lazily evaluated
RDD Actions
• collect
• count
• countByValue
• take
• top
• reduce (see the sketch below)
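A short sketch of these actions on a small, made-up RDD, assuming the shell's sc:
nums = sc.parallelize([5, 3, 1, 4, 3])
print nums.count()         # 5
print nums.take(2)         # the first two elements: [5, 3]
print nums.top(2)          # the two largest elements: [5, 4]
print nums.countByValue()  # counts per element, returned as a dict on the driver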
Reduce
sum = rdd.reduce(lambda x, y: x + y)
Aggregate
sumCount = nums.aggregate(
    (0, 0),
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
avg = sumCount[0] / float(sumCount[1])
Imports
from pyspark import SparkConf, SparkContext
import collections
Set up the Context
conf = SparkConf().setMaster("local").setAppName("foo")
sc = SparkContext(conf = conf)
First RDD
lines = sc.textFile("hdfs:///tmp/shakepeare.txt")
Extract the Data
ratings = lines.map(lambda x: x.split()[2])
Aggregate
result = ratings.countByValue()
Sort and Print the results
sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.iteritems():
    print "%s %i" % (key, value)
Run the Program
spark-submit ratings-counter.py
Key, Value RDD
# an RDD of key, value pairs where each value is 1
totalsByAge = rdd.map(lambda x: (x, 1))
With Key, Value RDDs, we can:
- reduceByKey()
- groupByKey()
- sortByKey()
- keys(), values()
rdd.reduceByKey(lambda x, y: x + y)
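A small sketch of these key, value operations; the pairs below are made up:
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
print pairs.reduceByKey(lambda x, y: x + y).collect()  # [('a', 3), ('b', 1)]
print pairs.groupByKey().mapValues(list).collect()     # [('a', [1, 2]), ('b', [1])]
print pairs.sortByKey().keys().collect()               # ['a', 'a', 'b']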
Set operations on Key, Value RDDs
With two key, value RDDs, we can:
• join
• rightOuterJoin
• leftOuterJoin
• cogroup
• subtractByKey (see the sketch below)
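A sketch with two tiny key, value RDDs (made-up data):
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
print x.join(y).collect()            # [('a', (1, 2))]
print x.leftOuterJoin(y).collect()   # [('a', (1, 2)), ('b', (4, None))]
print x.subtractByKey(y).collect()   # [('b', 4)]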
Mapping just values
When mapping just values in a Key, Value RDD, it is more efficient
to use:
• mapValues()
• flatMapValues()
as they maintain the original partitioning (see the sketch below).
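A minimal sketch; the keys, and therefore the partitioner, are untouched:
pairs = sc.parallelize([("a", 1), ("b", 2)]).partitionBy(2)
scaled = pairs.mapValues(lambda v: v * 10)  # no shuffle: partitioning is preserved
print scaled.collect()                      # [('a', 10), ('b', 20)]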
Persistence
result = input.map(lambda x: x * x)
result.persist().is_cached
print(result.count())
print(",".join(str(x) for x in result.collect()))
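A sketch of the caching lifecycle around the code above; unpersist releases the cached blocks when you are done:
result.persist()        # mark the RDD to be cached
print(result.count())   # the first action materializes and caches it
print(result.count())   # later actions reuse the cached data
result.unpersist()      # free the cache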
Hands-On
Download the Data
#remove existing copies of dataset from HDFS
hadoop fs -rm /tmp/kddcup.data_10_percent.gz
wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz \
  -O /tmp/kddcup.data_10_percent.gz
Load data in HDFS
hadoop fs -put /tmp/kddcup.data_10_percent.gz /tmp
hadoop fs -ls -h /tmp/kddcup.data_10_percent.gz
rm /tmp/kddcup.data_10_percent.gz
First RDD
input_file = "hdfs:///tmp/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(input_file)
Retrieving version and configuration information
sc.version
sc.getConf().get("spark.home")
import os
os.environ.get("PYTHONPATH")
os.environ.get("SPARK_HOME")
Count the number of rows in the dataset
print raw_rdd.count()
Inspect what the data looks like
print raw_rdd.take(5)
Spreading data across the cluster
a = range(100)
data = sc.parallelize(a)
Filtering lines in the data
normal_raw_rdd = raw_rdd.filter(lambda x: 'normal.' in x)
Count the filtered RDD
normal_count = normal_raw_rdd.count()
print normal_count
Importing local libraries
from pprint import pprint
csv_rdd = raw_rdd.map(lambda x: x.split(","))
head_rows = csv_rdd.take(5)
pprint(head_rows[0])
Using non-lambda functions in transformations
def parse_interaction(line):
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)
key_csv_rdd = raw_rdd.map(parse_interaction)
head_rows = key_csv_rdd.take(5)
pprint(head_rows[0])
Using collect() on the Spark driver
all_raw_rdd = raw_rdd.collect()
Extracting a smaller sample of the RDD
raw_rdd_sample = raw_rdd.sample(False, 0.1, 1234)
print raw_rdd_sample.count()
print raw_rdd.count()
Local sample for local processing
raw_data_local_sample = raw_rdd.takeSample(False, 1000, 1234)
normal_data_sample = [x.split(",") for x in raw_data_local_sample if "normal." in x]
normal_data_sample_size = len(normal_data_sample)
normal_ratio = normal_data_sample_size/1000.0
print normal_ratio
Subtracting one RDD from another
attack_raw_rdd = raw_rdd.subtract(normal_raw_rdd)
print attack_raw_rdd.count()
Listing the distinct protocols
protocols = csv_rdd.map(lambda x: x[1]).distinct()
print protocols.collect()
Listing the distinct services
services = csv_rdd.map(lambda x: x[2]).distinct()
print services.collect()
Cartesian product of two RDDs
product = protocols.cartesian(services).collect()
print len(product)
Filtering the normal and attack events into separate RDDs
normal_csv_data = csv_rdd.filter(lambda x: x[41]=="normal.")
attack_csv_data = csv_rdd.filter(lambda x: x[41]!="normal.")
Calculating total normal and attack event durations
normal_duration_data = normal_csv_data.map(lambda x: int(x[0]))
total_normal_duration = normal_duration_data.reduce(lambda x, y: x + y)
print total_normal_duration
attack_duration_data = attack_csv_data.map(lambda x: int(x[0]))
total_attack_duration = attack_duration_data.reduce(lambda x, y: x + y)
print total_attack_duration
Calculating average duration of normal and attack events
normal_count = normal_duration_data.count()
attack_count = attack_duration_data.count()
print round(total_normal_duration/float(normal_count),3)
print round(total_attack_duration/float(attack_count),3)
Use the aggregate() function to calculate the average in a single scan
normal_sum_count = normal_duration_data.aggregate(
(0,0), # the initial value
(lambda acc, value: (acc[0] + value, acc[1] + 1)), # combine val/acc
(lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
)
print round(normal_sum_count[0]/float(normal_sum_count[1]),3)
attack_sum_count = attack_duration_data.aggregate(
(0,0), # the initial value
(lambda acc, value: (acc[0] + value, acc[1] + 1)), # combine value with acc
(lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])) # combine accumulators
)
print round(attack_sum_count[0]/float(attack_sum_count[1]),3)
Creating a key, value RDD
key_value_rdd = csv_rdd.map(lambda x: (x[41], x))
print key_value_rdd.take(1)
Reduce by key and aggregate
key_value_duration = csv_rdd.map(lambda x: (x[41], float(x[0])))
durations_by_key = key_value_duration.reduceByKey(lambda x, y: x + y)
print durations_by_key.collect()
Count by key
counts_by_key = key_value_rdd.countByKey()
print counts_by_key
Combine by key to get duration and attempts by type
sum_counts = key_value_duration.combineByKey(
(lambda x: (x, 1)),
# the initial value, with value x and count 1
(lambda acc, value: (acc[0]+value, acc[1]+1)),
# how to combine a pair value with the accumulator: sum value, and increment count
(lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1]))
# combine accumulators
)
print sum_counts.collectAsMap()
Sorted average duration by type
duration_means_by_type = sum_counts.map(
    lambda (key, value): (key, round(value[0] / value[1], 3))).collectAsMap()
# Print them sorted
for tag in sorted(duration_means_by_type, key=duration_means_by_type.get, reverse=True):
    print tag, duration_means_by_type[tag]
Creating a Schema and a DataFrame
from pyspark.sql import Row
row_data = csv_rdd.map(lambda p: Row(
duration=int(p[0]),
protocol_type=p[1],
service=p[2],
flag=p[3],
src_bytes=int(p[4]),
dst_bytes=int(p[5])
)
)
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
Querying with SQL
SELECT duration, dst_bytes FROM interactions
WHERE protocol_type = 'tcp'
AND duration > 1000
AND dst_bytes = 0
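A sketch of running that query from PySpark against the temp table registered on the previous slide:
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()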
Printing the Schema
interactions_df.printSchema()
Counting events by protocol
interactions_df.select("protocol_type", "duration", "dst_bytes") \
    .groupBy("protocol_type").count().show()

interactions_df.select("protocol_type", "duration", "dst_bytes") \
    .filter(interactions_df.duration > 1000) \
    .filter(interactions_df.dst_bytes == 0) \
    .groupBy("protocol_type").count().show()
Modifying the labels
def get_label_type(label):
    if label != "normal.":
        return "attack"
    else:
        return "normal"
row_labeled_data = csv_rdd.map(lambda p: Row(
duration=int(p[0]),
protocol_type=p[1],
service=p[2],
flag=p[3],
src_bytes=int(p[4]),
dst_bytes=int(p[5]),
label=get_label_type(p[41])
)
)
interactions_labeled_df = sqlContext.createDataFrame(row_labeled_data)
Counting by labels
interactions_labeled_df.select("label", "protocol_type") \
    .groupBy("label", "protocol_type").count().show()
Group by label and protocol
interactions_labeled_df.select("label", "protocol_type") \
    .groupBy("label", "protocol_type").count().show()

interactions_labeled_df.select("label", "protocol_type", "dst_bytes") \
    .groupBy("label", "protocol_type",
             interactions_labeled_df.dst_bytes == 0).count().show()
Questions
More resources
• More tutorials: http://hortonworks.com/tutorials
• Hortonworks gallery: http://hortonworks-gallery.github.io
My coordinates
• Twitter: @saptak
• Website: http://saptak.in

Weitere ähnliche Inhalte

Was ist angesagt?

"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017Alex Borysov
 
Rtl sdr software defined radio
Rtl sdr   software defined radioRtl sdr   software defined radio
Rtl sdr software defined radioEueung Mulyana
 
What's really the difference between a VM and a Container?
What's really the difference between a VM and a Container?What's really the difference between a VM and a Container?
What's really the difference between a VM and a Container?Adrian Otto
 
Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Laurent Bernaille
 
Triple o 를 이용한 빠르고 쉬운 open stack 설치
Triple o 를 이용한 빠르고 쉬운 open stack 설치Triple o 를 이용한 빠르고 쉬운 open stack 설치
Triple o 를 이용한 빠르고 쉬운 open stack 설치SangWook Byun
 
Kubernetes Services are sooo Yesterday!
Kubernetes Services are sooo Yesterday!Kubernetes Services are sooo Yesterday!
Kubernetes Services are sooo Yesterday!CloudOps2005
 
Docker with BGP - OpenDNS
Docker with BGP - OpenDNSDocker with BGP - OpenDNS
Docker with BGP - OpenDNSbacongobbler
 
Chef and OpenStack Workshop from ChefConf 2013
Chef and OpenStack Workshop from ChefConf 2013Chef and OpenStack Workshop from ChefConf 2013
Chef and OpenStack Workshop from ChefConf 2013Matt Ray
 
Deep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksDeep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksLaurent Bernaille
 
How to use TripleO tools for your own project
How to use TripleO tools for your own projectHow to use TripleO tools for your own project
How to use TripleO tools for your own projectGonéri Le Bouder
 
Deep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksDeep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksLaurent Bernaille
 
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project CalicoKubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project CalicoKubeAcademy
 
Dockerizing the Hard Services: Neutron and Nova
Dockerizing the Hard Services: Neutron and NovaDockerizing the Hard Services: Neutron and Nova
Dockerizing the Hard Services: Neutron and Novaclayton_oneill
 
Kernel Recipes 2014 - Quick state of the art of clang
Kernel Recipes 2014 - Quick state of the art of clangKernel Recipes 2014 - Quick state of the art of clang
Kernel Recipes 2014 - Quick state of the art of clangAnne Nicolas
 
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaSOpenstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaSSadique Puthen
 
Terraform: Cloud Configuration Management (WTC/IPC'16)
Terraform: Cloud Configuration Management (WTC/IPC'16)Terraform: Cloud Configuration Management (WTC/IPC'16)
Terraform: Cloud Configuration Management (WTC/IPC'16)Martin Schütte
 

Was ist angesagt? (20)

"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017
 
Rtl sdr software defined radio
Rtl sdr   software defined radioRtl sdr   software defined radio
Rtl sdr software defined radio
 
What's really the difference between a VM and a Container?
What's really the difference between a VM and a Container?What's really the difference between a VM and a Container?
What's really the difference between a VM and a Container?
 
Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019
 
Jumbo Mumbo in OpenStack
Jumbo Mumbo in OpenStackJumbo Mumbo in OpenStack
Jumbo Mumbo in OpenStack
 
HPC on OpenStack
HPC on OpenStackHPC on OpenStack
HPC on OpenStack
 
Triple o 를 이용한 빠르고 쉬운 open stack 설치
Triple o 를 이용한 빠르고 쉬운 open stack 설치Triple o 를 이용한 빠르고 쉬운 open stack 설치
Triple o 를 이용한 빠르고 쉬운 open stack 설치
 
Kubernetes Services are sooo Yesterday!
Kubernetes Services are sooo Yesterday!Kubernetes Services are sooo Yesterday!
Kubernetes Services are sooo Yesterday!
 
Triple o overview
Triple o overviewTriple o overview
Triple o overview
 
Docker with BGP - OpenDNS
Docker with BGP - OpenDNSDocker with BGP - OpenDNS
Docker with BGP - OpenDNS
 
Chef and OpenStack Workshop from ChefConf 2013
Chef and OpenStack Workshop from ChefConf 2013Chef and OpenStack Workshop from ChefConf 2013
Chef and OpenStack Workshop from ChefConf 2013
 
Deep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksDeep dive in Docker Overlay Networks
Deep dive in Docker Overlay Networks
 
How to use TripleO tools for your own project
How to use TripleO tools for your own projectHow to use TripleO tools for your own project
How to use TripleO tools for your own project
 
Deep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksDeep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay Networks
 
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project CalicoKubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
 
Dockerizing the Hard Services: Neutron and Nova
Dockerizing the Hard Services: Neutron and NovaDockerizing the Hard Services: Neutron and Nova
Dockerizing the Hard Services: Neutron and Nova
 
Kernel Recipes 2014 - Quick state of the art of clang
Kernel Recipes 2014 - Quick state of the art of clangKernel Recipes 2014 - Quick state of the art of clang
Kernel Recipes 2014 - Quick state of the art of clang
 
The Monitoring Playground
The Monitoring PlaygroundThe Monitoring Playground
The Monitoring Playground
 
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaSOpenstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
 
Terraform: Cloud Configuration Management (WTC/IPC'16)
Terraform: Cloud Configuration Management (WTC/IPC'16)Terraform: Cloud Configuration Management (WTC/IPC'16)
Terraform: Cloud Configuration Management (WTC/IPC'16)
 

Andere mochten auch

Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...DATAVERSITY
 
Demand generation using mobile and data to boost bricks and mortar sales (1)
Demand generation using mobile and data to boost bricks and mortar sales (1)Demand generation using mobile and data to boost bricks and mortar sales (1)
Demand generation using mobile and data to boost bricks and mortar sales (1)ad:tech London
 
Retail Road to Recovery-Bricks and Mortar Retail is Back!
Retail Road to Recovery-Bricks and Mortar Retail is Back!Retail Road to Recovery-Bricks and Mortar Retail is Back!
Retail Road to Recovery-Bricks and Mortar Retail is Back!Desley Cowley
 
WSDM2014
WSDM2014WSDM2014
WSDM2014Jun Yu
 
Projecte vertebrats A
Projecte vertebrats AProjecte vertebrats A
Projecte vertebrats Ammatarin
 
A report on Nuclear energy -Globally
A report on Nuclear energy -GloballyA report on Nuclear energy -Globally
A report on Nuclear energy -GloballyPranab Ghosh
 

Andere mochten auch (9)

Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
 
kiran New
kiran Newkiran New
kiran New
 
phoebe cv one
phoebe cv onephoebe cv one
phoebe cv one
 
Demand generation using mobile and data to boost bricks and mortar sales (1)
Demand generation using mobile and data to boost bricks and mortar sales (1)Demand generation using mobile and data to boost bricks and mortar sales (1)
Demand generation using mobile and data to boost bricks and mortar sales (1)
 
Retail Road to Recovery-Bricks and Mortar Retail is Back!
Retail Road to Recovery-Bricks and Mortar Retail is Back!Retail Road to Recovery-Bricks and Mortar Retail is Back!
Retail Road to Recovery-Bricks and Mortar Retail is Back!
 
WSDM2014
WSDM2014WSDM2014
WSDM2014
 
Projecte vertebrats A
Projecte vertebrats AProjecte vertebrats A
Projecte vertebrats A
 
SIP Presenation
SIP PresenationSIP Presenation
SIP Presenation
 
A report on Nuclear energy -Globally
A report on Nuclear energy -GloballyA report on Nuclear energy -Globally
A report on Nuclear energy -Globally
 

Ähnlich wie Apache Spark with Hortonworks Data Platform - Seattle Meetup

Software Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceSoftware Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceAntonio Romeo
 
Simplify Cloud Applications using Spring Cloud
Simplify Cloud Applications using Spring CloudSimplify Cloud Applications using Spring Cloud
Simplify Cloud Applications using Spring CloudRamnivas Laddad
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScyllaDB
 
Developing and Deploying PHP with Docker
Developing and Deploying PHP with DockerDeveloping and Deploying PHP with Docker
Developing and Deploying PHP with DockerPatrick Mizer
 
Was is Docker? Or: Docker for Software Developers
Was is Docker? Or: Docker for Software DevelopersWas is Docker? Or: Docker for Software Developers
Was is Docker? Or: Docker for Software DevelopersChristian Nagel
 
Hand-On Lab: CA Release Automation Rapid Development Kit and SDK
Hand-On Lab: CA Release Automation Rapid Development Kit and SDKHand-On Lab: CA Release Automation Rapid Development Kit and SDK
Hand-On Lab: CA Release Automation Rapid Development Kit and SDKCA Technologies
 
Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy
Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy
Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy Jeffrey Holden
 
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...DevOps.com
 
Cross-compilation native sous android
Cross-compilation native sous androidCross-compilation native sous android
Cross-compilation native sous androidThierry Gayet
 
Bare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and ChefBare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and ChefMatt Ray
 
26.1.7 lab snort and firewall rules
26.1.7 lab   snort and firewall rules26.1.7 lab   snort and firewall rules
26.1.7 lab snort and firewall rulesFreddy Buenaño
 
Pursuing evasive custom command & control - GuideM
Pursuing evasive custom command & control - GuideMPursuing evasive custom command & control - GuideM
Pursuing evasive custom command & control - GuideMMark Secretario
 
Flutter Vikings 2022 - Full Stack Dart
Flutter Vikings 2022  - Full Stack DartFlutter Vikings 2022  - Full Stack Dart
Flutter Vikings 2022 - Full Stack DartChris Swan
 
Toronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB Architecture
Toronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB ArchitectureToronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB Architecture
Toronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB ArchitectureAlexandra N. Martinez
 
Cloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit KubernetesCloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit KubernetesQAware GmbH
 
Interop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in ProductionInterop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in ProductionBrian Gracely
 
Springboard & OpenCV
Springboard & OpenCVSpringboard & OpenCV
Springboard & OpenCVCruise Chen
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB
 

Ähnlich wie Apache Spark with Hortonworks Data Platform - Seattle Meetup (20)

Software Define your Current Storage with Opensource
Software Define your Current Storage with OpensourceSoftware Define your Current Storage with Opensource
Software Define your Current Storage with Opensource
 
Simplify Cloud Applications using Spring Cloud
Simplify Cloud Applications using Spring CloudSimplify Cloud Applications using Spring Cloud
Simplify Cloud Applications using Spring Cloud
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
 
Developing and Deploying PHP with Docker
Developing and Deploying PHP with DockerDeveloping and Deploying PHP with Docker
Developing and Deploying PHP with Docker
 
Was is Docker? Or: Docker for Software Developers
Was is Docker? Or: Docker for Software DevelopersWas is Docker? Or: Docker for Software Developers
Was is Docker? Or: Docker for Software Developers
 
Hand-On Lab: CA Release Automation Rapid Development Kit and SDK
Hand-On Lab: CA Release Automation Rapid Development Kit and SDKHand-On Lab: CA Release Automation Rapid Development Kit and SDK
Hand-On Lab: CA Release Automation Rapid Development Kit and SDK
 
Bct Aws-VPC-Training
Bct Aws-VPC-TrainingBct Aws-VPC-Training
Bct Aws-VPC-Training
 
CI/CD on AWS
CI/CD on AWSCI/CD on AWS
CI/CD on AWS
 
Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy
Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy
Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy
 
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
Running a Cost-Effective DynamoDB-Compatible Database on Managed Kubernetes S...
 
Cross-compilation native sous android
Cross-compilation native sous androidCross-compilation native sous android
Cross-compilation native sous android
 
Bare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and ChefBare Metal to OpenStack with Razor and Chef
Bare Metal to OpenStack with Razor and Chef
 
26.1.7 lab snort and firewall rules
26.1.7 lab   snort and firewall rules26.1.7 lab   snort and firewall rules
26.1.7 lab snort and firewall rules
 
Pursuing evasive custom command & control - GuideM
Pursuing evasive custom command & control - GuideMPursuing evasive custom command & control - GuideM
Pursuing evasive custom command & control - GuideM
 
Flutter Vikings 2022 - Full Stack Dart
Flutter Vikings 2022  - Full Stack DartFlutter Vikings 2022  - Full Stack Dart
Flutter Vikings 2022 - Full Stack Dart
 
Toronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB Architecture
Toronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB ArchitectureToronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB Architecture
Toronto Virtual Meetup #7 - Anypoint VPC, VPN and DLB Architecture
 
Cloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit KubernetesCloud-native .NET Microservices mit Kubernetes
Cloud-native .NET Microservices mit Kubernetes
 
Interop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in ProductionInterop 2017 - Managing Containers in Production
Interop 2017 - Managing Containers in Production
 
Springboard & OpenCV
Springboard & OpenCVSpringboard & OpenCV
Springboard & OpenCV
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
 

Mehr von Saptak Sen

Introduction to Apache NiFi - Seattle Scalability Meetup
Introduction to Apache NiFi - Seattle Scalability MeetupIntroduction to Apache NiFi - Seattle Scalability Meetup
Introduction to Apache NiFi - Seattle Scalability MeetupSaptak Sen
 
Data Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your DataData Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your DataSaptak Sen
 
Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and Saptak Sen
 
LINQ to HPC: Developing Big Data Applications on Windows HPC Server
LINQ to HPC: Developing Big Data Applications on Windows HPC ServerLINQ to HPC: Developing Big Data Applications on Windows HPC Server
LINQ to HPC: Developing Big Data Applications on Windows HPC ServerSaptak Sen
 
Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...Saptak Sen
 
Do You Have Big Data? (Most Likely!)
Do You Have Big Data? (Most Likely!)Do You Have Big Data? (Most Likely!)
Do You Have Big Data? (Most Likely!)Saptak Sen
 
Predictive Analytics with Microsoft Big Data
Predictive Analytics with Microsoft Big DataPredictive Analytics with Microsoft Big Data
Predictive Analytics with Microsoft Big DataSaptak Sen
 
Data Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your DataData Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your DataSaptak Sen
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 

Mehr von Saptak Sen (9)

Introduction to Apache NiFi - Seattle Scalability Meetup
Introduction to Apache NiFi - Seattle Scalability MeetupIntroduction to Apache NiFi - Seattle Scalability Meetup
Introduction to Apache NiFi - Seattle Scalability Meetup
 
Data Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your DataData Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your Data
 
Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and
 
LINQ to HPC: Developing Big Data Applications on Windows HPC Server
LINQ to HPC: Developing Big Data Applications on Windows HPC ServerLINQ to HPC: Developing Big Data Applications on Windows HPC Server
LINQ to HPC: Developing Big Data Applications on Windows HPC Server
 
Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...
 
Do You Have Big Data? (Most Likely!)
Do You Have Big Data? (Most Likely!)Do You Have Big Data? (Most Likely!)
Do You Have Big Data? (Most Likely!)
 
Predictive Analytics with Microsoft Big Data
Predictive Analytics with Microsoft Big DataPredictive Analytics with Microsoft Big Data
Predictive Analytics with Microsoft Big Data
 
Data Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your DataData Management in Microsoft HDInsight: How to Move and Store Your Data
Data Management in Microsoft HDInsight: How to Move and Store Your Data
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 

Kürzlich hochgeladen

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 

Kürzlich hochgeladen (20)

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 

Apache Spark with Hortonworks Data Platform - Seattle Meetup

  • 1. Apache Spark with Hortonworks Data Pla5orm Saptak Sen [@saptak] Technical Product Manager © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 1
  • 2. What we will do? • Setup • YARN Queues for Spark • Zeppelin Notebook • Spark Concepts • Hands-On Get the slides from: h/p://l.saptak.in/sparksea/le © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 2
  • 3. What is Apache Spark? A fast and general engine for large-scale data processing. © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 3
  • 4. Setup © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 4
  • 5. Download Hortonworks Sandbox h"p://hortonworks.com/sandbox © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 5
  • 6. Configure Hortonworks Sandbox © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 6
  • 7. Open network se,ngs for VM © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 7
  • 8. Open port for Zeppelin © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 8
  • 9. Login to Ambari Ambari is at port 8080. Username and password is admin/admin © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 9
  • 10. Open YARN Queue Manager © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 10
  • 11. Create a dedicated queue for Spark Set name to root.spark, Capacity to 50% and Max Capacity to 95% © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 11
  • 12. Save and Refresh Queues Adjust capacity of the default queue to 50% if required before saving © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 12
  • 13. Open Apache Spark configura1on © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 13
  • 14. Set the 'spark' queue as the default queue for Spark © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 14
  • 15. Restart All Affected Services © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 15
  • 16. SSH into your Sandbox © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 16
  • 17. Install the Zeppelin Service for Ambari VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - ([0-9].[0-9]).*/1/'` sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 17
  • 18. Restart Ambari to enumerate the new Services service ambari restart © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 18
  • 19. Go to Ambari to Add the Zeppelin Service © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 19
  • 20. Select Zeppelin Notebook click Next and select default configura4ons unless you know what you are doing © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 20
  • 21. Wait for Zeppelin deployment to complete © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 21
  • 22. Launch Zeppelin from your browser Zeppelin is at port 9995 © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 22
  • 23. Concepts © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 23
  • 24. RDD • Resilient • Distributed • Dataset © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 24
  • 25. Program Flow © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 25
  • 26. SparkContext • Created by your driver Program • Responsible for making your RDD's resilient and Distributed • Instan<ates RDDs • The Spark shell creates a "sc" object for you © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 26
  • 27. Crea%ng RDDs myRDD = parallelize([1,2,3,4]) myRDD = sc.textFile("hdfs:///tmp/shakepeare.txt") # file://, s3n:// hiveCtx = HiveContext(sc) rows = hiveCtx.sql("SELECT name, age from users") © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 27
  • 28. Crea%ng RDDs • Can also create from: • JDBC • Cassandra • HBase • Elas6csearch • JSON, CSV, sequence files, object files, ORC, Parquet, Avro © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 28
  • 29. Passing Func+ons to Spark Many RDD methods accept a func3on as a parameter myRdd.map(lambda x: x*x) is the same things as def sqrN(x): return x*x myRdd.map(sqrN) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 29
  • 30. RDD Transforma,ons • map • flatmap • filter • dis.nct • sample • union, intersec.on, subtract, cartesian © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 30
  • 31. Map example myRdd = sc. parallelize([1,2,3,4]) myRdd.map(lambda x: x+x) • Results: 1, 2, 6, 8 Transforma)ons are lazily evaluated © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 31
  • 32. RDD Ac&ons • collect • count • countByValue • take • top • reduce © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 32
  • 33. Reduce sum = rdd.reduce(lambda x, y: x + y) Aggregate sumCount = nums.aggregate( (0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), (lambda acc1 , acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) return sumCount[0] / float(sumCount[1]) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 33
  • 34. Imports from pyspark import SparkConf, SparkContext import collections Setup the Context conf = SparkConf().setMaster("local").setAppName("foo") sc = SparkContext(conf = conf) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 34
  • 35. First RDD lines = sc.textFile("hdfs:///tmp/shakepeare.txt") Extract the Data ratings = lines.map(lambda x: x.split()[2]) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 35
  • 36. Aggregate result = ratings.countByValue() Sort and Print the results sortedResults = collections.OrderedDict(sorted(result.items())) for key, value in sortedResults.iteritems(): print "%s %i" % (key, value) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 36
  • 37. Run the Program spark-submit ratings-counter.py © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 37
  • 38. Key, Value RDD #list of key, value RDDs where the value is 1. totalsByAge = rdd.map(lambda x: (x, 1)) With Key, Value RDDs, we can: - reduceByKey() - groupByKey() - sortByKey() - keys(), values() rdd.reduceByKey(lambda x, y: x + y) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 38
  • 39. Set opera)ons on Key, Value RDDs With two lists of key, value RDDs, we can: • join • rightOuterJoin • le/OuterJoin • cogroup • subtractByKey © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 39
  • 40. Mapping just values When mapping just values in a Key, Value RDD, it is more efficient to use: • mapValues() • flatMapValues() as it maintains the original par//oning. © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 40
  • 41. Persistence result = input.map(lambda x: x * x) result.persist().is_cached print(result.count()) print(result.collect().mkString(",")) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 41
  • 42. Hands-On © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 42
  • 43. Download the Data #remove existing copies of dataset from HDFS hadoop fs -rm /tmp/kddcup.data_10_percent.gz wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz -O /tmp/kddcup.data_10_percent.gz © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 43
  • 44. Load data in HDFS hadoop fs -put /tmp/kddcup.data_10_percent.gz /tmp hadoop fs -ls -h /tmp/kddcup.data_10_percent.gz rm /tmp/kddcup.data_10_percent.gz © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 44
  • 45. First RDD input_file = "hdfs:///tmp/kddcup.data_10_percent.gz" raw_rdd = sc.textFile(input_file) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 45
  • 46. Retrieving version and configura1on informa1on sc.version sc.getConf.get("spark.home") System.getenv().get("PYTHONPATH") System.getenv().get("SPARK_HOME") © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 46
  • 47. Count the number of rows in the dataset print raw_rdd.count() © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 47
  • 48. Inspect what the data looks like print raw_rdd.take(5) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 48
  • 49. Spreading data across the cluster a = range(100) data = sc.parallelize(a) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 49
  • 50. Filtering lines in the data normal_raw_rdd = raw_rdd.filter(lambda x: 'normal.' in x) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 50
  • 51. Count the filtered RDD normal_count = normal_raw_rdd.count() print normal_count © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 51
  • 52. Impor&ng local libraries from pprint import pprint csv_rdd = raw_rdd.map(lambda x: x.split(",")) head_rows = csv_rdd.take(5) pprint(head_rows[0]) © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 52
  • 53. Using non-lambda func1ons in transforma1on def parse_interaction(line): elems = line.split(",") tag = elems[41] return (tag, elems) key_csv_rdd = raw_rdd.map(parse_interaction) head_rows = key_csv_rdd.take(5) pprint(head_rows[0] © Crea've Commons A.ribu'on-ShareAlike 3.0 Unported License 53
• 54. Using collect() on the Spark driver
all_raw_rdd = raw_rdd.collect()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 54
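Note that collect() pulls the entire RDD back to the driver as a plain Python list, so use it with care on large datasets; a short sketch of what the result supports:
print len(all_raw_rdd)  # a local list now: len(), not count()
print all_raw_rdd[0]    # ordinary list indexing works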
• 55. Extracting a smaller sample of the RDD
raw_rdd_sample = raw_rdd.sample(False, 0.1, 1234)
print raw_rdd_sample.count()
print raw_rdd.count()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 55
• 56. Local sample for local processing
raw_data_local_sample = raw_rdd.takeSample(False, 1000, 1234)
normal_data_sample = [x.split(",") for x in raw_data_local_sample if "normal." in x]
normal_data_sample_size = len(normal_data_sample)
normal_ratio = normal_data_sample_size / 1000.0
print normal_ratio
© Creative Commons Attribution-ShareAlike 3.0 Unported License 56
• 57. Subtracting one RDD from another
attack_raw_rdd = raw_rdd.subtract(normal_raw_rdd)
print attack_raw_rdd.count()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 57
• 58. Listing the distinct protocols
protocols = csv_rdd.map(lambda x: x[1]).distinct()
print protocols.collect()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 58
• 59. Listing the distinct services
services = csv_rdd.map(lambda x: x[2]).distinct()
print services.collect()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 59
• 60. Cartesian product of two RDDs
product = protocols.cartesian(services).collect()
print len(product)
© Creative Commons Attribution-ShareAlike 3.0 Unported License 60
• 61. Filtering out the normal and attack events
normal_csv_data = csv_rdd.filter(lambda x: x[41] == "normal.")
attack_csv_data = csv_rdd.filter(lambda x: x[41] != "normal.")
© Creative Commons Attribution-ShareAlike 3.0 Unported License 61
• 62. Calculating total normal event duration and attack event duration
normal_duration_data = normal_csv_data.map(lambda x: int(x[0]))
total_normal_duration = normal_duration_data.reduce(lambda x, y: x + y)
print total_normal_duration

attack_duration_data = attack_csv_data.map(lambda x: int(x[0]))
total_attack_duration = attack_duration_data.reduce(lambda x, y: x + y)
print total_attack_duration
© Creative Commons Attribution-ShareAlike 3.0 Unported License 62
• 63. Calculating average duration of normal and attack events
normal_count = normal_duration_data.count()
attack_count = attack_duration_data.count()
print round(total_normal_duration / float(normal_count), 3)
print round(total_attack_duration / float(attack_count), 3)
© Creative Commons Attribution-ShareAlike 3.0 Unported License 63
• 64. Use the aggregate() function to calculate the average in a single scan
normal_sum_count = normal_duration_data.aggregate(
    (0, 0),  # the initial value
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),           # combine value with accumulator
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))  # combine accumulators
)
print round(normal_sum_count[0] / float(normal_sum_count[1]), 3)

attack_sum_count = attack_duration_data.aggregate(
    (0, 0),  # the initial value
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),           # combine value with accumulator
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))  # combine accumulators
)
print round(attack_sum_count[0] / float(attack_sum_count[1]), 3)
© Creative Commons Attribution-ShareAlike 3.0 Unported License 64
• 65. Creating a key, value RDD
key_value_rdd = csv_rdd.map(lambda x: (x[41], x))
print key_value_rdd.take(1)
© Creative Commons Attribution-ShareAlike 3.0 Unported License 65
• 66. Reduce by key and aggregate
key_value_duration = csv_rdd.map(lambda x: (x[41], float(x[0])))
durations_by_key = key_value_duration.reduceByKey(lambda x, y: x + y)
print durations_by_key.collect()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 66
• 67. Count by key
counts_by_key = key_value_rdd.countByKey()
print counts_by_key
© Creative Commons Attribution-ShareAlike 3.0 Unported License 67
• 68. Combine by key to get duration and attempts by type
sum_counts = key_value_duration.combineByKey(
    (lambda x: (x, 1)),                                 # create the initial accumulator: (value, count of 1)
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),  # merge a value into the accumulator: add to sum, increment count
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))  # combine accumulators across partitions
)
print sum_counts.collectAsMap()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 68
• 69. Sorted average duration by type
duration_means_by_type = sum_counts.map(
    lambda (key, value): (key, round(value[0] / value[1], 3))).collectAsMap()

# Print them sorted
for tag in sorted(duration_means_by_type, key=duration_means_by_type.get, reverse=True):
    print tag, duration_means_by_type[tag]
© Creative Commons Attribution-ShareAlike 3.0 Unported License 69
• 70. Creating a schema and a DataFrame
from pyspark.sql import Row

row_data = csv_rdd.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])
))

interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
© Creative Commons Attribution-ShareAlike 3.0 Unported License 70
• 71. Querying with SQL
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 71
• 72. Printing the Schema
interactions_df.printSchema()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 72
• 73. Counting events by protocol
interactions_df.select("protocol_type", "duration", "dst_bytes") \
    .groupBy("protocol_type").count().show()

interactions_df.select("protocol_type", "duration", "dst_bytes") \
    .filter(interactions_df.duration > 1000) \
    .filter(interactions_df.dst_bytes == 0) \
    .groupBy("protocol_type").count().show()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 73
• 74. Modifying the labels
def get_label_type(label):
    if label != "normal.":
        return "attack"
    else:
        return "normal"

row_labeled_data = csv_rdd.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5]),
    label=get_label_type(p[41])
))

interactions_labeled_df = sqlContext.createDataFrame(row_labeled_data)
© Creative Commons Attribution-ShareAlike 3.0 Unported License 74
• 75. Counting by labels
interactions_labeled_df.select("label", "protocol_type") \
    .groupBy("label", "protocol_type").count().show()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 75
• 76. Group by label and protocol
interactions_labeled_df.select("label", "protocol_type") \
    .groupBy("label", "protocol_type").count().show()

interactions_labeled_df.select("label", "protocol_type", "dst_bytes") \
    .groupBy("label", "protocol_type", interactions_labeled_df.dst_bytes == 0).count().show()
© Creative Commons Attribution-ShareAlike 3.0 Unported License 76
• 77. Questions © Creative Commons Attribution-ShareAlike 3.0 Unported License 77
• 78. More resources
• More tutorials: http://hortonworks.com/tutorials
• Hortonworks gallery: http://hortonworks-gallery.github.io

My coordinates
• Twitter: @saptak
• Website: http://saptak.in
© Creative Commons Attribution-ShareAlike 3.0 Unported License 78
• 79. © Creative Commons Attribution-ShareAlike 3.0 Unported License 79