Streams of social consciousness
Real-time data transformation
Who am I?
2000: Psycholinguist. Research/data analysis.
2008: Flex programmer. OO, enterprise.
2013: Interactive developer. Browser + server.
Marielle Lange @widged
Stream expertise
Fairly recent and rather limited:
๏ Gulp -> custom modules written by adapting other modules.
๏ Data analysis -> using streams to process large data sets.
➡ I will attempt to provide the minimal orientation needed to get started, steering clear of complex topics like back-pressure handling.
Streams for data analysis
Garden data: aggregating data scraped from a large number of websites, parsing them, normalizing them (Fahrenheit vs Celsius, March in the Northern vs Southern Hemisphere), reducing them (converting [55-65] to 55 #1, 60 #1, 65 #1), and rendering them (average vs visualisation).
Streams manage a data flow.
‣ Sources. Where data pours from (ReadStream).
‣ Sinks. Where results pour to (WriteStream).
‣ Throughs. Where data gets manipulated and transformed.
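In Node, that flow is expressed by piping a readable source through one or more transform steps into a writable sink. A minimal sketch, assuming plain text files (the file names and the upper-casing step are placeholders of mine):

var fs = require('fs'),
    stream = require('stream');

// A "through" stream that upper-cases each text chunk.
var upperCase = new stream.Transform();
upperCase._transform = function (chunk, encoding, done) {
  this.push(chunk.toString().toUpperCase());
  done();
};

fs.createReadStream('source.txt')            // source
  .pipe(upperCase)                           // through
  .pipe(fs.createWriteStream('sink.txt'));   // sink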
What are they good for?
๏ Gulp - writing your own modules.
๏ Real-time data obtained from remote servers that would be too impractical to buffer on a device with limited memory.
๏ Map-reduce types of computations - a programming model for processing and generating large data sets. A map function generates a set of intermediate key/value pairs ({word: 'hello', length: 5}) and a reduce function merges all intermediate values associated with the same intermediate key (['agile', 'greet', 'hello'] - the list of words of length 5). Great if you want to run computations on distributed systems. A stream-based sketch of this shape follows below.
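A toy illustration of that map/reduce shape with core Node object streams (the word list and the grouping logic are mine, not from the talk):

var stream = require('stream');

var words = ['agile', 'greet', 'hello', 'hi'];
var byLength = {};

// Source: emits one word at a time.
var source = new stream.Readable({ objectMode: true });
source._read = function () {
  this.push(words.length ? words.shift() : null); // null ends the stream
};

// Map: word -> { word: word, length: word.length }
var mapper = new stream.Transform({ objectMode: true });
mapper._transform = function (word, encoding, done) {
  done(null, { word: word, length: word.length });
};

// Reduce: merge all words sharing the same intermediate key (the length).
var reducer = new stream.Writable({ objectMode: true });
reducer._write = function (pair, encoding, done) {
  (byLength[pair.length] = byLength[pair.length] || []).push(pair.word);
  done();
};
reducer.on('finish', function () {
  console.log(byLength); // { '2': ['hi'], '5': ['agile', 'greet', 'hello'] }
});

source.pipe(mapper).pipe(reducer);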
Streams 101
Readable Streams
Abstraction for a source that you are reading data from:
‣ http responses, on the client
‣ http requests, on the server
‣ fs read streams
‣ zlib streams
‣ crypto streams
‣ tcp sockets
‣ child process stdout and stderr
‣ process.stdin
Notes
๏ A readable stream will not start emitting data until you indicate that you are ready to receive it.
๏ Readable streams have two "modes": a flowing mode and a non-flowing (paused) mode.

var flappyStream = readable.read();
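A small sketch contrasting the two modes ('foo.txt' is a placeholder file):

var fs = require('fs');

// Flowing mode: attaching a 'data' handler starts the flow.
fs.createReadStream('foo.txt')
  .on('data', function (chunk) { console.log('flowing:', chunk.length, 'bytes'); });

// Non-flowing (paused) mode: pull chunks explicitly with read().
var readable = fs.createReadStream('foo.txt');
readable.on('readable', function () {
  var chunk;
  while ((chunk = readable.read()) !== null) {
    console.log('pulled:', chunk.length, 'bytes');
  }
});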
Writable Streams
Abstraction for a destination that you are writing data to:
‣ http requests, on the client
‣ http responses, on the server
‣ fs write streams
‣ zlib streams
‣ crypto streams
‣ tcp sockets
‣ child process stdin
‣ process.stdout, process.stderr

writable.write(flappyBird);
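A minimal usage sketch ('out.txt' is a placeholder file):

var fs = require('fs');

var writable = fs.createWriteStream('out.txt');
writable.write('first line\n');  // queue a chunk
writable.end('last line\n');     // write a final chunk, then close
writable.on('finish', function () {
  console.log('all data flushed');
});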
Transforms
Abstraction for a stream that is both readable and writable, where the input is related to the output (a map or filter step). Conceptually, a transform exposes a writable input side and a readable output side:

transform.input.write(flappyBird);
var evilStream = transform.output.read();

Compressing a file using gzip:

var fs   = require('fs'),
    zlib = require('zlib');

var readable = fs.createReadStream('foo.txt'),
    writable = fs.createWriteStream('foo.txt.gz');

readable
  .pipe(zlib.createGzip())
  .pipe(writable);

Dominic Tarr's `through` module provides similar functionality.
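A sketch of the same idea with the `through` module (the upper-casing step is my example, not from the talk):

var through = require('through');

// through(write, end) builds a stream that is both writable and readable.
var upperCase = through(function write(data) {
  this.queue(data.toString().toUpperCase()); // push transformed data downstream
}, function end() {
  this.queue(null); // signal end of stream
});

process.stdin.pipe(upperCase).pipe(process.stdout);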
Basic API
Readable stream

var fs = require('fs');
var readable = fs.createReadStream('foo.txt');
// this is the classic api
readable
  .on('data', function (data) { console.log('Data!', data); })
  .on('error', function (err) { console.error('Error', err); })
  .on('end', function () { console.log('All done!'); });

Writable stream

var fs = require('fs');
var readable = fs.createReadStream('foo.txt'),
    writable = fs.createWriteStream('copy.txt');

// { end: false } keeps the writable open after the pipe completes,
// so the extra write is not a write-after-end error.
readable.pipe(writable, { end: false });
readable.on('end', function () {
  writable.end('an extra line');
});
Toolbox
event-stream (D. Tarr)

var fs         = require('fs'),
    JSONStream = require('JSONStream'),
    map        = require('map-stream');

var input  = fs.createReadStream('twitter-feed.json'),
    output = fs.createWriteStream('twitter-sentiments.json');

input
  .pipe(JSONStream.parse('*'))   // emit one parsed object per tweet
  .pipe(map(computeSentiments))  // computeSentiments(data, callback), see below
  .pipe(output);
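computeSentiments is left undefined on the slide; here is a sketch following map-stream's (data, callback) contract, assuming the callable-function API of thisandagain/sentiment:

var sentiment = require('sentiment');

// Hypothetical implementation: score each parsed tweet and pass a JSON
// line downstream, so the result can be written to the plain file stream.
function computeSentiments(tweet, asyncReturn) {
  if (!tweet.text) { return asyncReturn(); } // drop non-tweet objects
  var scored = { text: tweet.text, score: sentiment(tweet.text).score };
  asyncReturn(null, JSON.stringify(scored) + '\n');
}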
Stream playground (J. Resig)
Stream handbook (@Substack)
Vinyl
Rapidly define a list of files to read from with glob strings:

var vinyl      = require('vinyl-fs'),
    map        = require('map-stream'),
    JSONStream = require('JSONStream');

vinyl.src('./data/*/quad/*.comp.json', { buffer: false })
  .pipe(map(mapSource));

function mapSource(file, asyncReturn) {
  var srcStream = file.contents; // with { buffer: false }, contents is a stream
  srcStream
    .pipe(JSONStream.parse('*'))
    .pipe(SomeAnalysis)          // placeholder transform stream
    .pipe(vinyl.dest('./out'));
  asyncReturn(null, file);
}
Example
Twitter Sentiments
Register an application with the Twitter API – https://dev.twitter.com/
Create an access token.
In your project, add a file "secret_keys.js" with:

consumer_secret: "YOUR_CONSUMER_SECRET", access_token_key: "USER_ACCESS_TOKEN", access_token_secret: "USE

Takes advantage of the sentiment module: https://github.com/thisandagain/sentiment
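A sketch of what secret_keys.js could look like as a module (key names follow the snippet above; the consumer_key entry and all values are placeholder assumptions of mine):

// secret_keys.js - keep this file out of version control.
module.exports = {
  consumer_key:        "YOUR_CONSUMER_KEY",
  consumer_secret:     "YOUR_CONSUMER_SECRET",
  access_token_key:    "USER_ACCESS_TOKEN",
  access_token_secret: "USER_ACCESS_TOKEN_SECRET"
};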
Programming Style
Separation of concerns
For me, the #1 reason to use streams is that the piping structure encourages writing programs as bite-size, highly interchangeable modules.
In the early stages of writing the example program, I had:

tweets
  .pipe(map(englishOnly))
  .pipe(map(addSentiment))

Then I found out that the API gives you the option to specify a language filter. All I had to do was drop one line of code.
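The resulting change, sketched (the request-side language parameter is illustrative):

// Before: filter tweets in our own pipeline.
tweets
  .pipe(map(englishOnly))
  .pipe(map(addSentiment));

// After: ask the API for English tweets up front
// (e.g. a language parameter on the streaming request);
// the englishOnly stage simply disappears.
tweets
  .pipe(map(addSentiment));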
Functional Programming
A more functional style of programming encourages the avoidance of side effects and state mutation.

var fs  = require('fs'),
    map = require('map-stream');

var readable = fs.createReadStream('foo.txt');

readable
  .pipe(map(filterEnglish));

// Assumes the chunks are parsed objects with a `language` field
// (e.g. after a JSONStream.parse step).
function filterEnglish(data, asyncReturn) {
  if (data.language === 'en') {
    // write these data to the output stream
    asyncReturn(null, data);
  } else {
    // but don't write these.
    asyncReturn();
  }
}
๏ Single Responsibility Principle: "A function should do one thing, and do it well."
๏ Pure functions. No knowledge of the external world whatsoever; every bit of information required to run the function is explicitly passed as a parameter.
๏ Immutable data. A function returns new data that captures the transformation rather than a reference to the old data.
๏ Higher Order Functions. Functions that return functions (partials, currying). A way to capture local state. A sketch follows below.
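As an illustration of that last point (the generalised helper is my example, not from the slides): currying the language lets one factory replace many near-identical filters.

var map = require('map-stream');

// Higher-order function: returns a map-stream filter for any language.
// The language is captured in the returned function's closure.
function filterLanguage(lang) {
  return function (data, asyncReturn) {
    if (data.language === lang) {
      asyncReturn(null, data); // keep
    } else {
      asyncReturn();           // drop
    }
  };
}

// filterEnglish from the code above becomes a one-liner:
var filterEnglish = filterLanguage('en');
// readable.pipe(map(filterEnglish));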
