Introduction to Data Engineering
(with Scala)
John Nestor 47 Degrees
www.47deg.com
June 27, 2016
Galvanize
47deg.com © Copyright 2015 47 Degrees
Outline
• Introduction
• Data Engineering Requirements
• Data Engineering Design Patterns
• Recommended Data Engineering Tools and Systems
• Final Thoughts
Introduction
Typical Data Engineering Systems
• Low latency response to HTTP or REST requests
• Database reads and writes
• Run ML models
• Produce event streams for later processing
• Near real time event processing
• Simple analytics and alerts
• Analysis of server information
• Logs and metrics
• Produce data for later analysis by data scientists
Big Data
• (Much) Too big to fit on a single machine
• Must have both
• distributed computation
• distributed data (bases)
• Distributed systems means no single main memory
• Must pass data across servers
• Large number of distributed components means failure
is common
• Dealing with failure must be part of the fundamental
architecture
Fallacies of Distributed Computing
• https://blogs.oracle.com/jag/resource/Fallacies.html (Peter Deutsch)
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
Reactive Manifesto
• http://www.reactivemanifesto.org/
• Responsive - predictable latency
• Resilient - fault tolerant
• Elastic - (auto) scalability
• Message driven - basis of a distributed implementation
Data Engineering Requirements
Scalability
• New systems are getting bigger all the time
• Hardware is getting cheaper
• Business requirements to stay competitive are
increasing
• Cloud computing permits easy expansion based on
instantaneous need
• No single server is ever big enough
• Scalability goal: performance increases (close to)
linearly with the number of servers
Availability
• Systems are increasingly expected to be available 24/7
with no downtime
• Any server can fail, others must be able to take over
• No downtime for maintenance. Software upgrades
occur without shutting system down.
• Must avoid availability-killing features such as two-phase
commit
• SLAs: number of nines
• The best most achieve is 3 nines (8.8 hours of downtime per year)
• Most strive for 6 nines (about 32 seconds per year)
• 9 nines would allow only 32 msec per year; AWS S3 is designed for
99.99% availability (its 11 nines figure refers to durability)
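The downtime implied by a given number of nines is simple arithmetic. The sketch below assumes a 365-day year; the object and method names are illustrative only.

```scala
object Nines {
  // Seconds in a non-leap year (an assumption of this sketch).
  val SecondsPerYear: Double = 365.0 * 24 * 60 * 60

  /** Allowed downtime in seconds per year for `nines` nines of
    * availability, e.g. nines = 3 means 99.9% available. */
  def downtimeSecondsPerYear(nines: Int): Double =
    SecondsPerYear * math.pow(10.0, -nines)

  def main(args: Array[String]): Unit =
    for (n <- Seq(3, 4, 5, 6, 9))
      println(f"$n nines: ${downtimeSecondsPerYear(n)}%.4f seconds per year")
}
```

Each additional nine divides the allowed downtime by ten, which is why every step up the scale is much harder to reach than the last.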
Durability
• Losing data is never acceptable
• Since any single point can fail, we must replicate data
• Replication to
• main memory
• different server
• server in different zone
• across geo-distributed data centers
• AWS S3 is designed to lose, on average, at most one object out of
10,000 every 10 million years (11 nines of durability)
Latency and Bandwidth
• Latency - msec to process a single request
• More hops can increase latency
• Very fast network hardware can reduce latency
• The speed of light still sets the lower bound on latency
• Bandwidth - number of requests processed per sec
• More servers can increase bandwidth
• Latency Numbers Every Programmer Should Know
• main memory (0.0001 msec)
• different server (0.5 msec)
• across geo-distributed data centers (150 msec)
Data Engineering Design Patterns
Immutable Data
• Concurrent access to mutable data requires
synchronization. Immutable data does not.
• Data passed between servers will be immutable
• Immutable data plus functional programming results in
code that is easier to understand and test
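A minimal sketch of the idea in Scala. The `Reading` type and the function names are hypothetical, chosen only for illustration.

```scala
object ImmutableExample {
  // A case class is immutable by default: its fields are vals.
  final case class Reading(sensor: String, value: Double)

  // "Updating" returns a new value; the original is untouched,
  // so it can be shared across threads without locks.
  def calibrate(r: Reading, offset: Double): Reading =
    r.copy(value = r.value + offset)

  // A pure function over an immutable collection: easy to test,
  // because the result depends only on the argument.
  def total(readings: List[Reading]): Double =
    readings.map(_.value).sum
}
```

Because `calibrate` never modifies its argument, a caller holding the original `Reading` cannot be surprised by a change made elsewhere.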
Messaging (1 of 2)
• Message sent from A to B
• A gets ack from B
• A gets no ack from B
• message never got to B
• ack from B never got to A
• What kind?
• at most once (never resend)
• at least once (resend if no ack)
• exactly once (resend idempotently if no ack)
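At-least-once delivery can be sketched as a resend loop. This is a simulation under stated assumptions, not a network implementation: `trySend` stands in for "send and wait for an ack", and `LossyReceiver` is a hypothetical receiver B that gets every message but loses some acks.

```scala
object AtLeastOnce {
  /** Sends `msg` via `trySend`, which returns true when an ack came back.
    * Resends until acked or until `maxAttempts` is used up.
    * Returns the attempt number that was acked, or None if none was. */
  def sendWithRetry(msg: String, maxAttempts: Int)(trySend: String => Boolean): Option[Int] =
    (1 to maxAttempts).find(_ => trySend(msg))

  /** Simulated receiver B: every delivery is recorded, but the first
    * `lostAcks` acks are dropped on the way back to A. */
  final class LossyReceiver(lostAcks: Int) {
    var delivered: Vector[String] = Vector.empty
    private var acksDropped = 0

    def receive(msg: String): Boolean = {
      delivered :+= msg
      if (acksDropped < lostAcks) { acksDropped += 1; false } else true
    }
  }
}
```

With two lost acks, A sends three times and B sees the same message three times. That duplication is why at-least-once delivery needs an idempotent receiver to behave like exactly-once.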
Messaging (2 of 2)
• Idempotence
• Multiple sends have same effect
• set X to 3, NOT add 2 to X
• Attach GUID, destination must handle
• In order delivery
• Waiting for an ack before sending next increases
latency
• Attach sequence number, destination must handle
• Batching multiple messages together can help
• Design so order does not matter
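One way the destination might handle GUIDs and sequence numbers is sketched below. The `Message` and `State` types are assumptions made for this example; it also assumes each sequence number is paired with exactly one GUID.

```scala
import java.util.UUID

object IdempotentReceiver {
  final case class Message(id: UUID, seq: Long, delta: Int)

  /** Receiver state: running total, ids already accepted, the next
    * sequence number expected, and messages that arrived early. */
  final case class State(
      total: Int = 0,
      seen: Set[UUID] = Set.empty,
      nextSeq: Long = 0L,
      pending: Map[Long, Message] = Map.empty)

  /** Handles one incoming message: duplicates are dropped by id, and
    * out-of-order messages are buffered until their turn comes. */
  def receive(state: State, msg: Message): State =
    if (state.seen.contains(msg.id)) state
    else drain(state.copy(
      seen = state.seen + msg.id,
      pending = state.pending + (msg.seq -> msg)))

  // Applies buffered messages for as long as the next expected one is present.
  @annotation.tailrec
  private def drain(s: State): State =
    s.pending.get(s.nextSeq) match {
      case Some(m) =>
        drain(s.copy(
          total = s.total + m.delta,
          nextSeq = s.nextSeq + 1,
          pending = s.pending - s.nextSeq))
      case None => s
    }
}
```

Here "add `delta` to the total" is not idempotent on its own; dropping duplicates by GUID is what makes a resend harmless.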
Persistent Data (1 of 3)
• CAP theorem (pick 2)
• Consistency (ACID)
• Availability
• Partition tolerance (closely tied to fault tolerance)
• Distributed consistency solutions: 2-phase commit is
“the anti-availability protocol” (Helland)
• For very large, highly available systems, AP is the only
possible choice
Persistent Data (2 of 3)
• Detecting conflicts with Vector clocks
• Each server has own time
• Vector has one element for each server
• Forms a partial order
• Resolving conflicts (for example: 2 different phone numbers)
• Select the latest
• Ask someone
• Keep both
• CRDTs (generalization of keep both)
• conflict-free replicated data types
• merge must be commutative, associative, idempotent
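A minimal sketch of vector clock comparison, assuming clocks are maps from a server id to a logical time (all names here are illustrative). The same element-wise maximum used to merge clocks is also the merge of a grow-only counter, one of the simplest CRDTs.

```scala
object Clocks {
  type VectorClock = Map[String, Long] // server id -> logical time

  sealed trait Order
  case object Before extends Order
  case object After extends Order
  case object Equal extends Order
  case object Concurrent extends Order // neither happened first: a conflict

  /** Records a local event on `server`. */
  def tick(vc: VectorClock, server: String): VectorClock =
    vc.updated(server, vc.getOrElse(server, 0L) + 1)

  /** Compares two clocks; missing entries count as 0. */
  def compare(a: VectorClock, b: VectorClock): Order = {
    val keys = a.keySet ++ b.keySet
    val aLeB = keys.forall(k => a.getOrElse(k, 0L) <= b.getOrElse(k, 0L))
    val bLeA = keys.forall(k => b.getOrElse(k, 0L) <= a.getOrElse(k, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Equal
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }

  /** Element-wise maximum: commutative, associative and idempotent. */
  def merge(a: VectorClock, b: VectorClock): VectorClock =
    (a.keySet ++ b.keySet)
      .map(k => k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L)))
      .toMap
}
```

`Concurrent` is the case that signals a conflict: the two updates were made without knowledge of each other, so neither can simply be called the latest.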
Persistent Data (3 of 3)
• Log based stores
• Sequence of transformational steps
• Each step is immutable
• Log is append only (fast sequential write to disk)
• Database is a cache of some point in the log
• Log is primary
• Database can be deleted and recreated from log
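The relationship between log and database can be sketched as a fold over the log. The `Put` and `Delete` events are hypothetical; a real log would live on disk rather than in a `Vector`.

```scala
object LogStore {
  // Each log entry is an immutable transformation step.
  sealed trait Event
  final case class Put(key: String, value: String) extends Event
  final case class Delete(key: String) extends Event

  type Database = Map[String, String]

  /** Applies one step to a database snapshot, returning a new snapshot. */
  def applyEvent(db: Database, e: Event): Database = e match {
    case Put(k, v) => db + (k -> v)
    case Delete(k) => db - k
  }

  /** Rebuilds the database as of the first `upTo` entries of the log. */
  def replay(log: Vector[Event], upTo: Int): Database =
    log.take(upTo).foldLeft(Map.empty[String, String])(applyEvent)
}
```

Calling `replay` with different values of `upTo` yields the database at different points in the log, which is the sense in which the database is only a cache.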
Concurrency and Distribution
• Individual servers are getting ever more cores.
• Utilization is key
• Large data applications require multiple servers
• Connections between servers are frequent points of
failure
• Parallel data operations help: parallel collections, Spark
• Traditional synchronization (locks, monitors) is error-prone
and very hard to get right.
• Message-based systems (Hoare’s CSP, Hewitt’s actors)
are a better solution and work well across servers.
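As a small illustration of using multiple cores without locks, the sketch below uses only standard-library Futures on the global execution context. It is not an actor system; the `ParallelSum` object is an assumption of this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum {
  /** Splits `data` into chunks, sums each chunk in its own Future, then
    * combines the partial results. Each task works on its own immutable
    * chunk, so no locks or shared mutable state are needed. */
  def sum(data: Vector[Long], chunkSize: Int): Future[Long] = {
    val partials: Vector[Future[Long]] =
      data.grouped(chunkSize).map(chunk => Future(chunk.sum)).toVector
    Future.sequence(partials).map(_.sum)
  }

  def main(args: Array[String]): Unit = {
    val result = Await.result(sum((1L to 1000L).toVector, 100), 10.seconds)
    println(result)
  }
}
```

The only communication between tasks is the result each Future hands back, which is the same discipline that message-based systems apply across servers.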
Logging and Monitoring
• As systems involve more and more servers
• Detecting and locating failure is getting harder
• Understanding system performance and performance
tuning is getting harder
• We now produce massive amounts of logs and
monitoring data
• Making sense of this huge volume of data is hard
• For failures we need near real-time analysis
• Increasing need for data science solutions
Continuous Deployment (1 of 2)
• High availability means we can no longer shut down for
upgrades to
• Application code
• Operating system upgrades and patches
• Hardware maintenance
• Automatic server failover
• Rolling upgrades
• Backward compatibility
• Messages
• Database schemas
Continuous Deployment (2 of 2)
• Deployment of lots of small changes reduces the chance of
errors in any single deployment
• Requires comprehensive automation for testing and
deployment
• But errors still do occur
• Although we have good methods for testing individual
components, integration testing is still hard and error prone.
• Some approaches
• Roll back
• A/B testing
• Database checkpoints
Recommended Data Engineering Tools and Systems
Choices
• Open source preferred
• Personal favorites
• Widely used (best practices in leading companies)
Prefer Open Source
• “Free”
• Full source is available
• Community participation
• Can move very fast
• More responsive
• A plus if a commercial company provides
support
Programming Languages (1 of 3)
• Compiled versus interpreted
• Compiled: C, C++, Go
• Semi-compiled: Java, C#, Scala
• Interpreted: Python, Ruby, R
• Static versus dynamic type checking
• Static catches more errors at compile-time
• Statically typed code is easier to understand and maintain
• Static requires more work writing
• Garbage collection. Safety versus performance
Programming Languages (2 of 3)
• Choice of language does not matter
• I can write any algorithm in any language
• Let's avoid pointless “language religion” wars
• Choice of language matters a lot
• Language can have a big impact on performance,
productivity and reliability
• Programming languages shape the way we think
Programming Languages (3 of 3)
• Scala
• Semi-compiled: compiled to JVM bytecode, then JIT-compiled at run time.
• Statically typed, but with the concise syntax of a dynamically typed language
• Garbage collected
• Runs on JVM. Full ecosystem of libraries and tools available.
• Key features
• Functional plus immutable data (major advance in program quality)
• Scala Futures and Akka Actors (major advance in easy to
understand, easy to get correct, and fault-tolerant distributed
computation)
• Main language for Spark
• Suitable for both data engineers and data scientists (better
cooperation)
Messaging
• Kafka (written in Scala)
• Reliable buffer between producer and consumer
• Can replay
• Multiple producers and consumers
• Multiple topics
• Linearly scalable
• Kafka Streams
• Other
• Reactive streams
• Spark streaming
Databases
• Relational: Postgres (scaling can be a problem)
• Embedded: LevelDB, MapDB
• NoSQL: Cassandra, Couchbase
• Graph: Neo4j, Titan, DataStax Enterprise Graph
Analytics
• Hadoop (let it die!)
• Spark (Written in Scala, Scala API is best)
• Trend toward SQL
• Improved performance via query optimizer
• Widely understood (but poor?) programming model
• Somewhat abandoned functional programming
(RDDs)
• dataset transforms: experiment to combine functional
programming with support for query optimization
Data Center Infrastructure and Continuous Deployment
• GitHub, SBT, Artifactory, Jenkins
• Docker/Rkt, Etcd, CoreOS
• Mesos, Kubernetes
• Cloud: AWS, Google, Microsoft
Final Thoughts
• Scala is the best choice for both data engineers and
data scientists
• Spark is the best choice for data analysis
• Data will continue to grow in size and importance
• The number of servers we use will continue to grow
requiring better fault tolerance and better automation
• When data engineers and data scientists work closely
together both benefit and better results are achieved
• We need to break down traditional silos
• We need shared tools and technologies that work
well for both groups
Questions