Introduction to Data Engineering (with Scala)
- 1. Introduction to Data Engineering
(with Scala)
John Nestor 47 Degrees
www.47deg.com
June 27, 2016
Galvanize
- 2. 47deg.com © Copyright 2015 47 Degrees
Outline
• Introduction
• Data Engineering Requirements
• Data Engineering Design Patterns
• Recommended Data Engineering Tools and Systems
• Final Thoughts
- 4. Typical Data Engineering Systems
• Low latency response to HTTP or REST requests
• Database reads and writes
• Run ML models
• Produce event streams for later processing
• Near real time event processing
• Simple analytics and alerts
• Analysis of server information
• Logs and metrics
• Produce data for later analysis by data scientists
- 5. Big Data
• (Much) Too big to fit on a single machine
• Must have both
• distributed computation
• distributed data (bases)
• A distributed system has no single shared main memory
• Must pass data across servers
• Large number of distributed components means failure
is common
• Dealing with failure must be part of the fundamental
architecture
- 6. Fallacies of Distributed Computing
• Peter Deutsch: https://blogs.oracle.com/jag/resource/Fallacies.html
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
- 7. Reactive Manifesto
• http://www.reactivemanifesto.org/
• Responsive - predictable latency
• Resilient - fault tolerant
• Elastic - (auto) scalability
• Message driven - basis of a distributed implementation
- 9. Scalability
• New systems are getting bigger all the time
• Hardware is getting cheaper
• Business requirements to stay competitive are
increasing
• Cloud computing permits easy expansion based on
instantaneous need
• No single server is ever big enough
• Scalability goal: performance increases (close to)
linearly with the number of servers
- 10. Availability
• Systems are increasingly expected to be available 24/7
with no downtime
• Any server can fail; others must be able to take over
• No downtime for maintenance: software upgrades occur without shutting the system down
• Must avoid availability-killing features such as two-phase commit
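• SLAs: number of nines
• The best most achieve is 3 nines (8.8 hours per year)
• Most strive for 6 nines (about 32 seconds per year)
• 9 nines would allow about 32 msec per year (AWS S3's documented availability design target is 99.99%; its eleven-nines figure refers to durability)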
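The downtime budget behind a given number of nines is simple arithmetic: the unavailable fraction is 10^-n of a year. The following is a minimal sketch of that calculation; the object and method names are illustrative, and it assumes a 365-day year.

```scala
object Nines {
  // Seconds in a non-leap year (365 days); an assumption of this sketch.
  val SecondsPerYear: Double = 365.0 * 24 * 60 * 60

  /** Allowed downtime in seconds per year for an availability with `nines` nines,
    * e.g. nines = 3 means 99.9% availability. */
  def downtimeSecondsPerYear(nines: Int): Double =
    SecondsPerYear * math.pow(10.0, -nines)

  def main(args: Array[String]): Unit =
    for (n <- Seq(3, 6, 9))
      println(f"$n nines: ${downtimeSecondsPerYear(n)}%.3f seconds per year")
}
```

Each additional nine divides the allowed downtime by ten, which is why every extra nine is so much harder to reach than the one before it.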
- 11. Durability
• Losing data is never acceptable
• Since any single point can fail, we must replicate data
• Replication to
• main memory
• different server
• server in different zone
• across geo-distributed data centers
• AWS S3 will lose at most one object out of 32K objects every 10 million years
- 12. Latency and Bandwidth
• Latency - time (msec) to process a single request
• More hops can increase latency
• Very fast network hardware can reduce latency
• The speed of light still sets a lower bound
• Bandwidth (throughput) - number of requests processed per sec
• More servers can increase bandwidth
• Latency Numbers Every Programmer Should Know
• main memory (0.0001 msec)
• different server (0.5 msec)
• across geo-distributed data centers (150 msec)
- 14. Immutable Data
• Concurrent access to mutable data requires
synchronization. Immutable data does not.
• Data passed between servers will be immutable
• Immutable data plus functional programming results in
code that is easier to understand and test
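As a small illustration of the points above, here is a sketch in Scala. The `Reading` type and the two functions are hypothetical names chosen for this example, not part of the talk.

```scala
object ImmutableExample {
  // A case class with only `val` fields is immutable: instances never change after construction.
  final case class Reading(sensor: String, value: Double)

  // "Updating" returns a new value; the original can be shared across threads without locks.
  def calibrate(r: Reading, offset: Double): Reading =
    r.copy(value = r.value + offset)

  // A pure function over an immutable collection: no hidden state, so it is easy to test.
  def average(rs: List[Reading]): Double =
    if (rs.isEmpty) 0.0 else rs.map(_.value).sum / rs.size
}
```

Because `calibrate` never modifies its argument, a reading that has already been sent to another server cannot be changed underneath the receiver.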
- 15. Messaging (1 of 2)
• Message sent from A to B
• A gets ack from B
• A gets no ack from B
• message never got to B
• ack from B never got to A
• What kind?
• at most once (never resend)
• at least once (resend if no ack)
• exactly once (resend idempotently if no ack)
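The difference between at-most-once and at-least-once can be sketched with a sender that resends until it sees an ack. Everything here is an assumption made for illustration: the `Link` trait, the `send` helper, and the `LossyAckLink` test double that delivers every message but loses the first few acks.

```scala
object AtLeastOnce {
  // Hypothetical unreliable link: `deliver` returns true only when the ack reached the sender.
  trait Link { def deliver(msg: String): Boolean }

  /** Resend until acked or `maxAttempts` is reached. Returns the number of sends used,
    * or None if no ack ever arrived. With maxAttempts = 1 this is at-most-once. */
  def send(link: Link, msg: String, maxAttempts: Int): Option[Int] =
    (1 to maxAttempts).find(_ => link.deliver(msg))

  /** Test double: the receiver gets every message, but the first `lostAcks` acks are lost. */
  final class LossyAckLink(lostAcks: Int) extends Link {
    var received: Vector[String] = Vector.empty
    private var attempts = 0
    def deliver(msg: String): Boolean = {
      received = received :+ msg // the message itself arrives at B
      attempts += 1
      attempts > lostAcks        // the ack gets through only after `lostAcks` losses
    }
  }
}
```

When acks are lost, the receiver ends up with several copies of the same message. The sender cannot tell a lost message from a lost ack, so at-least-once delivery implies duplicates unless the receiver handles them.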
- 16. Messaging (2 of 2)
• Idempotence
• Multiple sends have the same effect as a single send
• set X to 3, NOT add 2 to X
• Attach a GUID; the destination must detect and discard duplicates
• In-order delivery
• Waiting for an ack before sending the next message increases latency
• Attach a sequence number; the destination must reorder
• Batching multiple messages together can help
• Design so order does not matter
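One way the destination might handle GUIDs and sequence numbers is sketched below. The `Msg` and `State` shapes are hypothetical, sequence numbers are assumed to start at 0, and a real receiver would also need to bound the size of the `seen` set.

```scala
object IdempotentReceiver {
  // Hypothetical message shape: `id` is the GUID, `seq` the sender's sequence number.
  final case class Msg(id: String, seq: Long, key: String, value: Int)

  // Immutable receiver state: applied data, GUIDs seen so far, next expected sequence
  // number, and early arrivals waiting for a gap to be filled.
  final case class State(
      data: Map[String, Int] = Map.empty,
      seen: Set[String] = Set.empty,
      nextSeq: Long = 0L,
      pending: Map[Long, Msg] = Map.empty)

  /** Handle one delivery: drop duplicates, buffer early arrivals, apply in sequence order. */
  def receive(state: State, msg: Msg): State =
    if (state.seen(msg.id)) state // duplicate: a resend has no further effect
    else drain(state.copy(seen = state.seen + msg.id,
                          pending = state.pending + (msg.seq -> msg)))

  // Apply buffered messages for as long as the next expected sequence number is present.
  @annotation.tailrec
  private def drain(s: State): State =
    s.pending.get(s.nextSeq) match {
      case Some(m) =>
        drain(s.copy(data = s.data + (m.key -> m.value), // "set X to 3", not "add 2 to X"
                     nextSeq = s.nextSeq + 1,
                     pending = s.pending - m.seq))
      case None => s
    }
}
```

Folding a list of deliveries through `receive` gives the same final state whether messages arrive late, out of order, or more than once.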
- 17. Persistent Data (1 of 3)
• CAP theorem (pick 2)
• Consistency (ACID)
• Availability
• Partition tolerance (closely tied to fault tolerance)
• Distributed consistency solutions: 2-phase commit is
“the anti-availability protocol” (Helland)
• For very large, highly available systems, AP is the only possible choice
- 18. Persistent Data (2 of 3)
• Detecting conflicts with vector clocks
• Each server has its own logical time
• Vector has one element for each server
• Forms a partial order
• Resolving conflicts (for example: 2 different phone numbers)
• Select the latest
• Ask someone
• Keep both
• CRDTs (generalization of keep both)
• conflict-free replicated data types
• merge must be commutative, associative, idempotent
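Both ideas fit in a few lines of Scala. This is a minimal sketch, not a production design: the `VClock` alias, the `Order` cases, and the grow-only counter (a standard textbook CRDT) are chosen here for illustration.

```scala
object Replication {
  // A vector clock maps each server id to that server's logical time (missing entries count as 0).
  type VClock = Map[String, Long]

  sealed trait Order
  case object Before extends Order
  case object After extends Order
  case object Equal extends Order
  case object Concurrent extends Order // neither happened before the other: a conflict

  /** Compare two clocks under the partial order. */
  def compare(a: VClock, b: VClock): Order = {
    val servers = a.keySet ++ b.keySet
    val aLeqB = servers.forall(s => a.getOrElse(s, 0L) <= b.getOrElse(s, 0L))
    val bLeqA = servers.forall(s => b.getOrElse(s, 0L) <= a.getOrElse(s, 0L))
    (aLeqB, bLeqA) match {
      case (true, true)   => Equal
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }

  /** Grow-only counter (G-Counter): each server increments only its own slot,
    * and merge takes the per-server maximum. */
  final case class GCounter(counts: Map[String, Long] = Map.empty) {
    def increment(server: String): GCounter =
      GCounter(counts.updated(server, counts.getOrElse(server, 0L) + 1))
    def value: Long = counts.values.sum
    def merge(other: GCounter): GCounter =
      GCounter((counts.keySet ++ other.counts.keySet).map { s =>
        s -> math.max(counts.getOrElse(s, 0L), other.counts.getOrElse(s, 0L))
      }.toMap)
  }
}
```

Because `max` is commutative, associative, and idempotent, replicas can merge in any order, any number of times, and still converge to the same counter.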
- 19. Persistent Data (3 of 3)
• Log based stores
• Sequence of transformational steps
• Each step is immutable
• Log is append only (fast sequential write to disk)
• Database is a cache of some point in the log
• Log is primary
• Database can be deleted and recreated from log
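The relationship between the log and the database can be sketched as a fold. The event types and the in-memory `Vector` standing in for the on-disk log are assumptions of this example.

```scala
object LogStore {
  // Hypothetical event types: each log entry is an immutable transformation step.
  sealed trait Event
  final case class Deposited(account: String, amount: Long) extends Event
  final case class Withdrawn(account: String, amount: Long) extends Event

  type Balances = Map[String, Long]

  /** Apply one event to the current view. */
  def applyEvent(view: Balances, e: Event): Balances = e match {
    case Deposited(acct, amt) => view.updated(acct, view.getOrElse(acct, 0L) + amt)
    case Withdrawn(acct, amt) => view.updated(acct, view.getOrElse(acct, 0L) - amt)
  }

  /** The "database" is a fold over the first `upTo` log entries,
    * so it can be deleted and rebuilt from the log at any time. */
  def replay(log: Vector[Event], upTo: Int): Balances =
    log.take(upTo).foldLeft(Map.empty[String, Long])(applyEvent)
}
```

Appending new events never changes the result of replaying an earlier prefix, which is what makes the database a safe cache of "some point in the log".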
- 20. Concurrency and Distribution
• Individual servers are getting ever more cores.
• Utilization is key
• Large data applications require multiple servers
• Connections between servers are frequent points of
failure
• Parallel data operations help: parallel collections, Spark
• Traditional synchronization (locks, monitors) is error-prone and very hard to get right.
• Message-based systems (Hoare’s CSP, Hewitt’s actors) are a better solution and work well across servers.
- 21. Logging and Monitoring
• As systems involve more and more servers
• Detecting and locating failure is getting harder
• Understanding system performance and performance
tuning is getting harder
• We now produce massive amounts of logs and
monitoring data
• Making sense of this huge volume of data is hard
• For failures we need near real-time analysis
• Increasing need for data science solutions
- 22. Continuous Deployment (1 of 2)
• High availability means we can no longer shut down for
upgrades to
• Application code
• Operating system upgrades and patches
• Hardware maintenance
• Automatic server failover
• Rolling upgrades
• Backward compatibility
• Messages
• Database schemas
- 23. Continuous Deployment (2 of 2)
• Deployment of lots of small changes reduces the chance of
errors in any single deployment
• Requires comprehensive automation for testing and
deployment
• But errors still do occur
• Although we have good methods for testing individual components, integration testing is still hard and error-prone.
• Some approaches
• Roll back
• A/B testing
• Database checkpoints
- 25. Choices
• Open source preferred
• Personal favorites
• Widely used (best practices in leading companies)
- 26. Prefer Open Source
• “Free”
• Full source is available
• Community participation
• Can move very fast
• More responsive
• A plus if a commercial company provides support
- 27. Programming Languages (1 of 3)
• Compiled versus interpreted
• Compiled: C, C++, Go
• Semi-compiled: Java, C#, Scala
• Interpreted: Python, Ruby, R
• Static versus dynamic type checking
• Static catches more errors at compile-time
• Statically typed programs are easier to understand and maintain
• Static typing requires more work when writing code
• Garbage collection: safety versus performance
- 28. Programming Languages (2 of 3)
• Choice of language does not matter
• I can write any algorithm in any language
• Let's avoid pointless “language religion” wars
• Choice of language matters a lot
• Language can have a big impact on performance,
productivity and reliability
• Programming languages shape the way we think
- 29. Programming Languages (3 of 3)
• Scala
• Semi-compiled: compiled to JVM bytecode, then JIT-compiled at run time
• Statically typed, but with the concise syntax of a dynamically typed language
• Garbage collected
• Runs on JVM. Full ecosystem of libraries and tools available.
• Key features
• Functional plus immutable data (major advance in program quality)
• Scala Futures and Akka Actors (major advance in easy to
understand, easy to get correct, and fault-tolerant distributed
computation)
• Main language for Spark
• Suitable for both data engineers and data scientists (better
cooperation)
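To make the Futures point concrete, here is a minimal sketch using only the Scala standard library (Akka is not needed for it). The two `fetch` functions are hypothetical stand-ins for calls to other servers, and the global execution context is used only for brevity.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FuturesExample {
  // Hypothetical lookups, stand-ins for requests to other servers.
  def fetchPrice(item: String): Future[Double] = Future { if (item == "book") 12.5 else 3.0 }
  def fetchQuantity(item: String): Future[Int] = Future { 4 }

  /** Start both lookups concurrently, then combine the results.
    * A failure in either lookup fails the combined future. */
  def orderTotal(item: String): Future[Double] = {
    val price = fetchPrice(item)       // started here
    val quantity = fetchQuantity(item) // runs concurrently with `price`
    for (p <- price; q <- quantity) yield p * q
  }

  def main(args: Array[String]): Unit =
    println(Await.result(orderTotal("book"), 2.seconds))
}
```

The code contains no locks or shared mutable state: the two lookups are composed as values, and blocking with `Await` happens only at the edge of the program.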
- 30. Messaging
• Kafka (written in Scala)
• Reliable buffer between producer and consumer
• Can replay
• Multiple producers and consumers
• Multiple topics
• Linearly scalable
• Kafka Streams
• Other
• Reactive streams
• Spark streaming
- 31. Databases
• Relational: Postgres (scaling can be a problem)
• Embedded: LevelDB, MapDB
• NoSQL: Cassandra, Couchbase
• Graph: Neo4j, Titan, DataStax Enterprise Graph
- 32. Analytics
• Hadoop (let it die!)
• Spark (Written in Scala, Scala API is best)
• Trend toward SQL
• Improved performance via query optimizer
• Widely understood (but poor?) programming model
• Somewhat abandoned functional programming
(RDDs)
• Dataset transforms: an experiment to combine functional programming with support for query optimization
- 33. Data Center Infrastructure and Continuous Deployment
• GitHub, SBT, Artifactory, Jenkins
• Docker/rkt, etcd, CoreOS
• Mesos, Kubernetes
• Cloud: AWS, Google, Microsoft
- 35. Final Thoughts
• Scala is the best choice for both data engineers and
data scientists
• Spark is the best choice for data analysis
• Data will continue to grow in size and importance
• The number of servers we use will continue to grow
requiring better fault tolerance and better automation
• When data engineers and data scientists work closely
together both benefit and better results are achieved
• We need to break down traditional silos
• We need shared tools and technologies that work
well for both groups