Stream processing with R and Amazon Kinesis
Big Data Day LA 2016
Gergely Daroczi
@daroczig
July 9 2016
About me
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 2 / 28
CARD.com’s View of the World
CARD.com’s View of the World
Modern Marketing at CARD.com
Further Data Partners
card transaction processors
card manufacturers
CIP/KYC service providers
online ad platforms
remarketing networks
licensing partners
communication engines
others
Infrastructure
Why R?
Infrastructure
The Classic Era
Source: SnowPlow
The Hybrid Era
Source: SnowPlow
The Unified Era
Source: SnowPlow
Why Amazon Kinesis & Redshift?
Source: Kinesis Product Details
Intro to Amazon Kinesis Streams
Source: Kinesis Developer Guide
Intro to Amazon Kinesis Shards
Source: AWS re:Invent 2013
How to Communicate with Kinesis
Writing data to the stream:
Amazon Kinesis Streams API (!)
Amazon Kinesis Producer Library (KPL) from Java
flume-kinesis
Amazon Kinesis Agent
Reading data from the stream:
Amazon Kinesis Streams API (!)
Amazon Kinesis Client Library (KCL) from Java, Node.js, .NET, Python, Ruby
Managing streams:
Amazon Kinesis Streams API (!)
Now We Need an R Client!
Initialize connection to the SDK:
> library(rJava)
+ .jinit(classpath =
+ list.files(
+ '~/Projects/kineric/inst/java/',
+ full.names = TRUE),
+ parameters = "-Xmx512m")
What methods do we have there?
> kc <- .jnew('com.amazonaws.services.kinesis.AmazonKinesisClient')
> .jmethods(kc)
[1] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(java
[2] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(com.
[3] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.removeTagsFromS
[4] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[5] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[6] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[7] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[8] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
[9] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
[10] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
The First Steps from R
Set API endpoint:
> kc$setEndpoint('kinesis.us-west-2.amazonaws.com', 'kinesis', 'us-west-2')
What streams do we have access to?
> kc$listStreams()
[1] "Java-Object{{StreamNames: [test_kinesis],HasMoreStreams: false}}"
> kc$listStreams()$getStreamNames()
[1] "Java-Object{[test_kinesis]}"
Describe this stream:
> (kcstream <- kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription())
[1] "Java-Object{{StreamName: test_kinesis,StreamARN: arn:aws:kinesis:us-west-2:5957
> kcstream$getStreamName()
[1] "test_kinesis"
> kcstream$getStreamStatus()
[1] "ACTIVE"
> kcstream$getShards()
[1] "Java-Object{[
{ShardId: shardId-000000000000,HashKeyRange: {
StartingHashKey: 0,EndingHashKey: 17014118346046923173168730371588410572324},Sequenc
{ShardId: shardId-000000000001,HashKeyRange: {StartingHashKey: 170141183460469231731
Writing to the Stream
> prr <- .jnew('com.amazonaws.services.kinesis.model.PutRecordRequest')
> prr$setStreamName('test_kinesis')
> prr$setData(J('java.nio.ByteBuffer')$wrap(.jbyte(charToRaw('foobar'))))
> prr$setPartitionKey('42')
> prr
[1] "Java-Object{{StreamName: test_kinesis,
Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
PartitionKey: 42,}}"
> kc$putRecord(prr)
[1] "Java-Object{{ShardId: shardId-000000000001,
SequenceNumber: 49562894160449444332153346371084313572324361665031176210}}"
Get Records from the Stream
First, we need a shard iterator, which expires in 5 minutes:
> sir <- .jnew('com.amazonaws.services.kinesis.model.GetShardIteratorRequest')
> sir$setStreamName('test_kinesis')
> sir$setShardId(.jnew('java/lang/String', '0'))
> sir$setShardIteratorType('TRIM_HORIZON')
> iterator <- kc$getShardIterator(sir)$getShardIterator()
Then we can use it to get records:
> gir <- .jnew('com.amazonaws.services.kinesis.model.GetRecordsRequest')
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"
The data was pushed to the other shard:
> sir$setShardId(.jnew('java/lang/String', '1'))
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
ApproximateArrivalTimestamp: Tue Jun 14 09:40:19 CEST 2016,
Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
PartitionKey: 42}]}"
Get Further Records from the Stream
> sapply(kc$getRecords(gir)$getRecords(),
+ function(x)
+ rawToChar(x$getData()$array()))
[1] "foobar"
> iterator <- kc$getRecords(gir)$getNextShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"
Get the oldest available data again:
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
Or we can refer to the last known sequence number as well:
> sir$setShardIteratorType('AT_SEQUENCE_NUMBER')
> sir$setStartingSequenceNumber('495628941604494443321533463710843135723243616650311
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
Managing Shards
No need for both shards:
> ms <- .jnew('com.amazonaws.services.kinesis.model.MergeShardsRequest')
> ms$setShardToMerge('shardId-000000000000')
> ms$setAdjacentShardToMerge('shardId-000000000001')
> ms$setStreamName('test_kinesis')
> kc$mergeShards(ms)
What do we have now?
> kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription()$getShards()
[1] "Java-Object{[
{ShardId: shardId-000000000000,HashKeyRange: {StartingHashKey: 0,EndingHashKey: 1701
SequenceNumberRange: {
StartingSequenceNumber: 49562894160427143586954815717376297430913467927668719618,
EndingSequenceNumber: 49562894160438293959554081028945856364232263390243848194}},
{ShardId: shardId-000000000001,HashKeyRange: {StartingHashKey: 170141183460469231731
SequenceNumberRange: {
StartingSequenceNumber: 49562894160449444332153346340517833149186116289174700050,
EndingSequenceNumber: 49562894160460594704752611652087392082504911751749828626}},
{ShardId: shardId-000000000002,
ParentShardId: shardId-000000000000,
AdjacentParentShardId: shardId-000000000001,
HashKeyRange: {StartingHashKey: 0,EndingHashKey: 34028236692093846346337460743176821
SequenceNumberRange: {StartingSequenceNumber: 49562904991497673099704924344727019527
Next Steps
Ideas:
R function managing/scaling shards and attaching R processors
One R processor per shard (parallel threads on a node or cluster)
Store started/finished sequence number in memory/DB (checkpointing)
Example (Design for a simple MVP on a single node)
mcparallel starts parallel R processes to evaluate the given expression
Problems:
duplicated records after failure due to in-memory checkpoint
needs a DB for running on a cluster
hard to write good unit tests
Why not use KCL then?
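The single-node MVP design above can be sketched in a few lines of R. This is a hypothetical sketch, not working code: `get_shard_iterator`, `get_records`, `list_shard_ids` and `business_logic` are stand-ins for the rJava calls demonstrated on the previous slides.

```r
## One R processor per shard via parallel::mcparallel, with the
## checkpoint (last processed sequence number) kept in memory only.
library(parallel)

process_shard <- function(shard_id) {
    ## helpers assumed to wrap the GetShardIterator/GetRecords
    ## requests shown earlier via rJava
    iterator   <- get_shard_iterator(shard_id)
    checkpoint <- NULL
    while (!is.null(iterator)) {
        res <- get_records(iterator)
        for (record in res$records) {
            business_logic(record$data)
            ## in-memory checkpoint: lost (and records re-read) on failure
            checkpoint <- record$sequenceNumber
        }
        iterator <- res$nextShardIterator
        Sys.sleep(1)  # stay under the per-shard GetRecords limits
    }
}

## spawn one background R process per shard, then wait for all of them
jobs <- lapply(list_shard_ids('test_kinesis'),
               function(id) mcparallel(process_shard(id)))
mccollect(jobs)
```

The sketch also illustrates the problems listed above: the checkpoint lives in process memory, so a crashed worker re-reads records from the oldest available position, and scaling beyond one node would need an external checkpoint store.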
Amazon Kinesis Client Library
An easy-to-use programming model for processing data
java -cp amazon-kinesis-client-1.6.1.jar \
    com.amazonaws.services.kinesis.multilang.MultiLangDaemon \
    test-kinesis.properties
Scale-out and fault-tolerant processing (checkpointing via DynamoDB)
Logging and metrics in CloudWatch
The MultiLangDaemon spawns consumer processes written in any language;
communication happens via JSON messages sent over stdin/stdout
Only a few events/methods to care about in the consumer application:
1 initialize
2 processRecords
3 checkpoint
4 shutdown
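The test-kinesis.properties file passed to the daemon ties the stream to the consumer command. A minimal sketch (the consumer path, application name, and values are illustrative assumptions; the property names follow the KCL multilang sample configuration):

```properties
# command the MultiLangDaemon runs for each shard (hypothetical path)
executableName = ./consumer.R
# stream to consume and application name (also used for the
# DynamoDB checkpoint table)
streamName = test_kinesis
applicationName = test-kinesis-app
# where to start when there is no checkpoint yet
initialPositionInStream = TRIM_HORIZON
AWSCredentialsProvider = DefaultAWSCredentialsProviderChain
regionName = us-west-2
```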
Messages from the KCL
1 initialize:
Perform initialization steps
Write “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
2 processRecords:
Perform processing tasks (you may write a checkpoint message at any time)
Write “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
3 shutdown:
Perform shutdown tasks (you may write a checkpoint message at any time)
Write “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
4 checkpoint:
Decide whether to checkpoint again based on whether there was an error or not
Again: Why R?
R Script Interacting with KCL
#!/usr/bin/r -i
while (TRUE) {
## read and parse messages
line <- fromJSON(readLines(n = 1))
## nothing to do unless we receive a record to process
if (line$action == 'processRecords') {
## process each record
lapply(line$records, function(r) {
business_logic(r)
cat(toJSON(list(action = 'checkpoint', checkpoint = r$sequenceNumber)))
})
}
## return response in JSON
cat(toJSON(list(action = 'status', responseFor = line$action)))
}
Examples
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Kürzlich hochgeladen (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Big Data Day LA 2016 / Data Science Track - Stream processing with R and Amazon Kinesis, Gergely Daroczi, Director of Analytics, CARD.com

How to Communicate with Kinesis
Writing data to the stream:
Amazon Kinesis Streams API (!)
Amazon Kinesis Producer Library (KPL) from Java
flume-kinesis
Amazon Kinesis Agent
Reading data from the stream:
Amazon Kinesis Streams API (!)
Amazon Kinesis Client Library (KCL) from Java, Node.js, .NET, Python, Ruby
Managing streams:
Amazon Kinesis Streams API (!)
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 16 / 28
Now We Need an R Client!
Initialize connection to the SDK:
> library(rJava)
> .jinit(classpath =
+            list.files(
+                '~/Projects/kineric/inst/java/',
+                full.names = TRUE),
+        parameters = "-Xmx512m")
What methods do we have there?
> kc <- .jnew('com.amazonaws.services.kinesis.AmazonKinesisClient')
> .jmethods(kc)
 [1] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(java
 [2] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(com.
 [3] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.removeTagsFromS
 [4] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
 [5] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
 [6] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
 [7] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
 [8] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
 [9] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
[10] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 17 / 28
The First Steps from R
Set API endpoint:
> kc$setEndpoint('kinesis.us-west-2.amazonaws.com', 'kinesis', 'us-west-2')
What streams do we have access to?
> kc$listStreams()
[1] "Java-Object{{StreamNames: [test_kinesis],HasMoreStreams: false}}"
> kc$listStreams()$getStreamNames()
[1] "Java-Object{[test_kinesis]}"
Describe this stream:
> (kcstream <- kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription())
[1] "Java-Object{{StreamName: test_kinesis,StreamARN: arn:aws:kinesis:us-west-2:5957
> kcstream$getStreamName()
[1] "test_kinesis"
> kcstream$getStreamStatus()
[1] "ACTIVE"
> kcstream$getShards()
[1] "Java-Object{[ {ShardId: shardId-000000000000,HashKeyRange: { StartingHashKey: 0,EndingHashKey: 17014118346046923173168730371588410572324},Sequenc
{ShardId: shardId-000000000001,HashKeyRange: {StartingHashKey: 170141183460469231731
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 18 / 28
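The HashKeyRange values in the describeStream output determine where records land: Kinesis takes the MD5 hash of the partition key, interprets it as a 128-bit integer, and routes the record to the shard whose range contains that value. A toy sketch of the range lookup, with made-up small ranges instead of real 128-bit values and no actual MD5:

```r
# Toy illustration of shard routing by hash key range. Real ranges span
# 0 .. 2^128 - 1 and the key is hashed with MD5; the small numbers here
# are invented so the lookup logic stays readable.
shards <- data.frame(
  shard_id  = c('shardId-000000000000', 'shardId-000000000001'),
  start_key = c(0, 500),
  end_key   = c(499, 999),
  stringsAsFactors = FALSE
)

## return the id of the shard whose range contains the hashed key
route_to_shard <- function(hash_value) {
  shards$shard_id[shards$start_key <= hash_value & hash_value <= shards$end_key]
}

route_to_shard(42)   # lands on the first shard
route_to_shard(742)  # lands on the second shard
```

Because routing depends only on the hash of the partition key, records with the same key always end up in the same shard, which is why 'foobar' with key '42' later shows up on shard 1 only.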
Writing to the Stream
> prr <- .jnew('com.amazonaws.services.kinesis.model.PutRecordRequest')
> prr$setStreamName('test_kinesis')
> prr$setData(J('java.nio.ByteBuffer')$wrap(.jbyte(charToRaw('foobar'))))
> prr$setPartitionKey('42')
> prr
[1] "Java-Object{{StreamName: test_kinesis,
Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
PartitionKey: 42,}}"
> kc$putRecord(prr)
[1] "Java-Object{{ShardId: shardId-000000000001,
SequenceNumber: 49562894160449444332153346371084313572324361665031176210}}"
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 19 / 28
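For higher throughput the same SDK also offers a batch call, PutRecords, which accepts up to 500 entries per request. A hedged, untested sketch along the same rJava lines, reusing the `kc` client from above (payloads and the stream name are illustrative):

```r
# Sketch only: batch several records into a single PutRecords call via rJava.
# Assumes the AWS SDK classes are on the classpath and `kc` exists as above.
records <- .jnew('java/util/ArrayList')
for (payload in c('foo', 'bar')) {
    entry <- .jnew('com.amazonaws.services.kinesis.model.PutRecordsRequestEntry')
    entry$setData(J('java.nio.ByteBuffer')$wrap(.jbyte(charToRaw(payload))))
    entry$setPartitionKey('42')
    ## collect the entries in a Java list for the request object
    records$add(entry)
}
prs <- .jnew('com.amazonaws.services.kinesis.model.PutRecordsRequest')
prs$setStreamName('test_kinesis')
prs$setRecords(records)
kc$putRecords(prs)
```

Batching amortizes the per-request overhead, but note that PutRecords reports per-entry failures in its result, so a production writer still has to inspect the response and retry rejected entries.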
Get Records from the Stream
First, we need a shard iterator, which expires in 5 minutes:
> sir <- .jnew('com.amazonaws.services.kinesis.model.GetShardIteratorRequest')
> sir$setStreamName('test_kinesis')
> sir$setShardId(.jnew('java/lang/String', '0'))
> sir$setShardIteratorType('TRIM_HORIZON')
> iterator <- kc$getShardIterator(sir)$getShardIterator()
Then we can use it to get records:
> gir <- .jnew('com.amazonaws.services.kinesis.model.GetRecordsRequest')
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"
The data was pushed to the other shard:
> sir$setShardId(.jnew('java/lang/String', '1'))
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
ApproximateArrivalTimestamp: Tue Jun 14 09:40:19 CEST 2016,
Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
PartitionKey: 42}]}"
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 20 / 28
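In practice these reads are repeated in a loop: each getRecords() response carries a NextShardIterator to feed into the next call. A minimal, untested sketch of such a single-shard polling loop over the `kc`, `gir` and `iterator` objects created above, where business_logic() is a hypothetical placeholder for whatever the consumer does with a payload:

```r
# Sketch of a naive polling consumer for one shard; the objects `kc`,
# `gir` and `iterator` come from the rJava calls above.
while (TRUE) {
    gir$setShardIterator(iterator)
    res <- kc$getRecords(gir)
    ## decode and process each record in the batch
    sapply(res$getRecords(), function(r)
        business_logic(rawToChar(r$getData()$array())))
    ## each response hands back the iterator to use for the next call
    iterator <- res$getNextShardIterator()
    ## throttle: getRecords is limited to 5 calls per second per shard
    Sys.sleep(1)
}
```

Using NextShardIterator also sidesteps the 5-minute iterator expiry, since each call refreshes the iterator before it can go stale.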
Get Further Records from the Stream
> sapply(kc$getRecords(gir)$getRecords(),
+        function(x)
+            rawToChar(x$getData()$array()))
[1] "foobar"
> iterator <- kc$getRecords(gir)$getNextShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"
Get the oldest available data again:
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
Or we can refer to the last known sequence number as well:
> sir$setShardIteratorType('AT_SEQUENCE_NUMBER')
> sir$setStartingSequenceNumber('495628941604494443321533463710843135723243616650311
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 21 / 28
Managing Shards
No need for both shards:
> ms <- .jnew('com.amazonaws.services.kinesis.model.MergeShardsRequest')
> ms$setShardToMerge('shardId-000000000000')
> ms$setAdjacentShardToMerge('shardId-000000000001')
> ms$setStreamName('test_kinesis')
> kc$mergeShards(ms)
What do we have now?
> kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription()$getShards()
[1] "Java-Object{[
{ShardId: shardId-000000000000,HashKeyRange: {StartingHashKey: 0,EndingHashKey: 1701
SequenceNumberRange: {
StartingSequenceNumber: 49562894160427143586954815717376297430913467927668719618,
EndingSequenceNumber: 49562894160438293959554081028945856364232263390243848194}},
{ShardId: shardId-000000000001,HashKeyRange: {StartingHashKey: 170141183460469231731
SequenceNumberRange: {
StartingSequenceNumber: 49562894160449444332153346340517833149186116289174700050,
EndingSequenceNumber: 49562894160460594704752611652087392082504911751749828626}},
{ShardId: shardId-000000000002,
ParentShardId: shardId-000000000000,
AdjacentParentShardId: shardId-000000000001,
HashKeyRange: {StartingHashKey: 0,EndingHashKey: 34028236692093846346337460743176821
SequenceNumberRange: {StartingSequenceNumber: 49562904991497673099704924344727019527
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 22 / 28
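Scaling out is the opposite operation, a SplitShardRequest. An untested sketch in the same style, splitting the merged shard back in two at the midpoint of the 128-bit hash key space (the shard id and midpoint are illustrative):

```r
# Sketch only: split a shard at the middle of the 0 .. 2^128 - 1 hash
# key range; 2^127 = 170141183460469231731687303715884105728.
ss <- .jnew('com.amazonaws.services.kinesis.model.SplitShardRequest')
ss$setStreamName('test_kinesis')
ss$setShardToSplit('shardId-000000000002')
ss$setNewStartingHashKey('170141183460469231731687303715884105728')
kc$splitShard(ss)
```

As with the merge above, the parent shard is closed rather than deleted, so its remaining records stay readable until the stream's retention period expires.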
Next Steps
Ideas:
R function managing/scaling shards and attaching R processors
One R processor per shard (parallel threads on a node or cluster)
Store started/finished sequence number in memory/DB (checkpointing)
Example (Design for a simple MVP on a single node)
mcparallel starts parallel R processes to evaluate the given expression
Problems:
duplicated records after failure due to in-memory checkpoint
needs a DB for running on a cluster
hard to write good unit tests
Why not use KCL then?
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 23 / 28
Amazon Kinesis Client Library
An easy-to-use programming model for processing data:
java -cp amazon-kinesis-client-1.6.1.jar com.amazonaws.services.kinesis.multilang.MultiLangDaemon test-kinesis.properties
Scale-out and fault-tolerant processing (checkpointing via DynamoDB)
Logging and metrics in CloudWatch
The MultiLangDaemon spawns processes written in any language; communication happens via JSON messages sent over stdin/stdout.
Only a few events/methods to care about in the consumer application:
1 initialize
2 processRecords
3 checkpoint
4 shutdown
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 24 / 28
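The test-kinesis.properties file passed to the daemon wires the stream to the consumer executable. A minimal example might look like this; the key names follow the KCL multilang sample properties, while the executable path and application name are hypothetical:

```properties
# Executable the MultiLangDaemon spawns for each shard it leases
executableName = ./consumer.R
streamName = test_kinesis
# Also used as the name of the DynamoDB checkpoint table
applicationName = test-kinesis-app
AWSCredentialsProvider = DefaultAWSCredentialsProviderChain
# Where to start reading when no checkpoint exists yet
initialPositionInStream = TRIM_HORIZON
regionName = us-west-2
```

The applicationName doubles as the lease/checkpoint table name in DynamoDB, so two consumer fleets with different application names can read the same stream independently.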
Messages from the KCL
1 initialize:
Perform initialization steps
Write “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
2 processRecords:
Perform processing tasks (you may write a checkpoint message at any time)
Write “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
3 shutdown:
Perform shutdown tasks (you may write a checkpoint message at any time)
Write “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
4 checkpoint:
Decide whether to checkpoint again based on whether there is an error or not
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 25 / 28
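Concretely, one round of the protocol for a processRecords action could look like the two JSON lines below: the daemon writes the first to the consumer's STDIN, and the consumer answers with the second on STDOUT once it is done. The field names follow the KCL multilang protocol; the record reuses the sequence number from the putRecord example earlier, and "Zm9vYmFy" is the base64 encoding of "foobar":

```json
{"action": "processRecords",
 "records": [{"partitionKey": "42",
              "sequenceNumber": "49562894160449444332153346371084313572324361665031176210",
              "data": "Zm9vYmFy"}]}

{"action": "status", "responseFor": "processRecords"}
```

Record payloads travel base64-encoded because the protocol is line-oriented JSON, so arbitrary binary data cannot be embedded raw.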
Again: Why R?
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 26 / 28
R Script Interacting with KCL
#!/usr/bin/r -i
library(jsonlite)
while (TRUE) {
    ## read and parse messages
    line <- fromJSON(readLines(n = 1))
    ## nothing to do unless we receive a record to process
    if (line$action == 'processRecords') {
        ## process each record
        lapply(line$records, function(r) {
            business_logic(r)
            cat(toJSON(list(action = 'checkpoint',
                            checkpoint = r$sequenceNumber)))
        })
    }
    ## return response in JSON
    cat(toJSON(list(action = 'status', responseFor = line$action)))
}
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 27 / 28
Examples
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 28 / 28