Data, dev-ops, and cloud services: Building a distributed data-platform
A lecture given to Computer Science Students at the University of Warwick, February 2012.
Building a distributed data-platform - A perspective on current trends in computing
1. Data, dev-ops, and cloud services
Building a distributed data-platform
Charles Care
Engineering Team
Kasabi / Talis
2. Talk overview
● About me...
● What Kasabi is,
● what we are trying to do
● how we are working to achieve that
● a quick walk-through
● Discussion of the Kasabi platform team
● Our technology / architecture
● Our engineering culture
● Lessons learnt
7. About Kasabi
● Data marketplace
● Bringing together data...
● owners
● consumers
● Lowering the barrier for data-driven apps to enter the market
● Enabling new opportunities for aggregating and mixing data
9. Kasabi as a data platform
[Diagram: Kasabi as a platform connecting data engineers, data enthusiasts, data owners, application developers, third-party services, and API developers]
10. About Kasabi
● Publish datasets using standard APIs
● Access data using standard APIs
● Query a dataset using SPARQL
● Search a dataset using a simple full-text search
● Define, contribute, and share your own APIs
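As a rough illustration of "Query a dataset using SPARQL", the sketch below builds a SPARQL Protocol GET request in Python. The endpoint URL, the dataset name, and the `apikey` parameter are hypothetical placeholders, not the actual Kasabi API; only the `query` parameter comes from the W3C SPARQL Protocol. No network call is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: a placeholder, not a real Kasabi URL.
EXAMPLE_ENDPOINT = "http://api.example.org/dataset/my-dataset/apis/sparql"

# A generic query that lists ten arbitrary triples from the dataset.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
""".strip()


def build_sparql_url(endpoint: str, query: str, api_key: str) -> str:
    """Return a GET URL that carries a SPARQL query.

    The SPARQL Protocol passes the query text in the `query` parameter.
    The `apikey` parameter is an assumption made for this sketch.
    """
    params = {"query": query, "apikey": api_key}
    return endpoint + "?" + urlencode(params)


url = build_sparql_url(EXAMPLE_ENDPOINT, QUERY, api_key="YOUR-KEY")
print(url)

# Decode the URL again to show that the query text survives encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

The resulting URL could then be fetched with any HTTP client; the response format depends on the service and is not shown here.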
19. Data Platform
[Diagram layers, top to bottom: load balancing and routing; update services, search services, and query services; datasets]
● Need to store and update datasets
● Access data via various services
● Must scale with load and increasing data
● Must be tolerant to failure
● Extensible
● Should be easy to add new services over time
21. Distributed Platform
[Diagram: a routing layer in front of a dynamic gossip network of Update, SPARQL, and Search service nodes, with a slot for a possible new service. Supporting services: Sequence Service, Storage Service, Monitoring Services.]
22. Distributed Platform – updates
[Diagram: routing layer, dynamic gossip network of Update, SPARQL, and Search service nodes, plus Sequence Service, Storage Service, and Monitoring Services. Annotations:]
- Updates are sequenced
- Data stored in distributed storage
23. Distributed Platform – updates
[Diagram: routing layer, dynamic gossip network of Update, SPARQL, and Search service nodes, plus Sequence Service, Storage Service, and Monitoring Services. Annotations:]
- Updates are gossiped around the network
- Here a SPARQL node realises that it should apply the update
24. Distributed Platform – query
[Diagram: routing layer, dynamic gossip network of Update, SPARQL, and Search service nodes, plus Sequence Service, Storage Service, and Monitoring Services. Annotation:]
- SPARQL queries will now reflect the update that was submitted
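The update flow on the last three slides (sequence an update, gossip it around the network, let the relevant node apply it) can be sketched roughly as below. This is a toy model, not Kasabi's implementation: every class name, node name, and dataset name is hypothetical, and each node simply forwards to all of its peers, whereas a real gossip protocol would usually pick a random subset.

```python
import itertools
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Update:
    seq: int       # position assigned by the sequence service
    dataset: str   # dataset the update belongs to
    payload: str   # description of the change


class SequenceService:
    """Hands out monotonically increasing sequence numbers."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def next_seq(self) -> int:
        return next(self._counter)


@dataclass
class Node:
    name: str
    datasets: set                                 # datasets this node serves
    peers: list = field(default_factory=list)     # nodes we gossip to
    seen: set = field(default_factory=set)        # sequence numbers received
    applied: list = field(default_factory=list)   # payloads applied locally

    def receive(self, update: Update) -> None:
        if update.seq in self.seen:
            return  # already received: stop the message circulating forever
        self.seen.add(update.seq)
        if update.dataset in self.datasets:
            # The node realises the update concerns data it serves.
            self.applied.append(update.payload)
        for peer in self.peers:
            peer.receive(update)  # simplified: forward to every peer


sequencer = SequenceService()
update_node = Node("update-1", datasets=set())
sparql_node = Node("sparql-1", datasets={"bus-stops"})
search_node = Node("search-1", datasets={"recipes"})

# Wire the three nodes into a ring.
update_node.peers = [sparql_node]
sparql_node.peers = [search_node]
search_node.peers = [update_node]

update_node.receive(Update(sequencer.next_seq(), "bus-stops", "add triple A"))
print(sparql_node.applied)  # the SPARQL node applied the update
print(search_node.applied)  # the search node saw it but had nothing to apply
```

Every node ends up having seen the update, but only the node serving the affected dataset applies it; between submission and application, queries may return stale results, which is the eventual consistency the platform has to manage.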
25. Monolithic vs distributed
● Monolithic
● Easy to synchronise events and data
● Consistent views and queries
● Less inter-process communication / less network overhead
● Easier to optimise for high throughput
● Single code-base
● Fewer processes to monitor
● Distributed
● Service-oriented - separate concerns run in isolated processes (and can be scaled independently)
● Development is component-based
– Changes are more focussed / helps avoid scope-creep
● Deployment can be localised to avoid downtime
● Failure is more likely – so you need to plan for it
● Easier to integrate out-of-the-box software – e.g. using standard Apache Solr
26. Distributed data platform
● Separate services for each API
● Communication via Gossip messages
● Have to manage eventual consistency
● Highly scalable
● Easy to add new services
● Use standard protocols and open-source components
● HTTP libraries / REST / ZeroMQ / Apache Thrift
● RDF and SPARQL using Apache Jena
● Search using Apache Solr
● Avoid modification and forks
● Deploy into Amazon EC2 (also using: S3, EMR, and ELB)
28. Consider a start-up in 2002
● Have an idea...
● Get funding (development, op-ex, cap-ex)
● Acquire servers
● Set-up your servers
– mail, web, source code repo, build systems
– development, staging, live
● Some 'cloud' services
– …, SourceForge, shared servers, etc
● Build and go to market
● Probably embedding open-source components
● Delivery based on full-stack, monolithic architectures
29. Consider a start-up in 2012
● Have an idea...
● Get funding (development capital, op-ex)
● you will probably not get cap-ex
● Use cloud services... rent rather than buy
● SaaS – Software as a Service
– Why would you run your own (chat/email etc)?
– Host your code in GitHub/BitBucket etc
● PaaS – Platform as a Service
– Do you need to control the full stack?
– Could you leverage platforms like Heroku, Joyent, AppEngine etc?
– Amazon RDS
● IaaS – Infrastructure as a Service
– Cloud services to provide 'bare metal'
● Build and go to market quickly
● scale elastically over time
30. But what about the enterprise?
● Benefits of cloud services are already transforming the enterprise
● Private clouds
● Virtual appliances
● Cloud bursting
● Independent scaling
● Separation of concerns
● SOA (service-oriented architecture)
● And in future...
● Appetite for IaaS is growing
● PaaS and SaaS will follow.
● Perimeter security will be replaced by localised security boundaries
32. How it all happens
● Constantly iterating through...
● Requirements
● Development (Test-driven)
● Testing/Review
● Deployment
● Operation
● We're an Agile, dev-ops team...
so all the above is a shared responsibility
33. Being a dev-ops team...
● Removing barriers between development and operations
● Shared responsibilities rather than distrust
● Everyone has root access
● Developers are responsible for operating the systems they build
● Everyone is free to make changes
...and responsible for managing the roll-out of those changes
● Ops/Deployment/Monitoring are automated
● Everyone should have full-stack awareness
● Read more...
● http://dev2ops.org/blog/2010/2/22/what-is-devops.html
● http://www.jedi.be/blog/
● http://en.wikipedia.org/wiki/Devops
● http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
35. Requirements and Planning
● Identification of requirements
● Planning
● Break down big changes into smaller tasks
– Can the change be deployed in small steps?
– Can the change be dark-deployed?
● Understand the wider impact
● Find middle ground between generic and specific
● Team is self-organising
● People pull work from the prioritised, planned stories
37. Writing the code
● Work on a branch
● don't know if/when you'll merge
● Test-driven
● Unit tests first
● Do acceptance tests need to change?
● What technology? Which tool-sets?
● Smoke testing
● How do you know it works?
● What's different in production?
● What are the risks of failure?
● Feature flags?
Tests run: 110, Failures: 0, Errors: 0, Skipped: 2
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39 seconds
[INFO] Finished at: Sat Feb 18 15:20:36 GMT 2012
[INFO] Final Memory: 33M/240M
[INFO] ------------------------------------------------------------------------
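The "Feature flags?" question above is one way to merge and deploy code while keeping the new behaviour switched off in production. A minimal sketch, assuming flags live in an in-memory dictionary; the class, the flag name, and the `search` function are hypothetical and only illustrate the pattern.

```python
class FeatureFlags:
    """Tiny in-memory flag store; unknown flags count as disabled."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def search(query: str, flags: FeatureFlags) -> str:
    # The new code path ships with the release but stays dark
    # until the flag is switched on.
    if flags.is_enabled("new-ranking"):
        return f"new ranking results for {query!r}"
    return f"classic results for {query!r}"


flags = FeatureFlags()
before = search("bus stops", flags)   # flag off: old behaviour
flags.set("new-ranking", True)
after = search("bus stops", flags)    # flag on: new behaviour
print(before)
print(after)
```

Defaulting unknown flags to "off" means a missing or mistyped flag fails towards the existing behaviour, which keeps the risk of a release low.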
38. Writing the code
● Avoid unnecessary scope-creep
● “I'll just fix this...”
● “It would be much cleaner if I re-factored this...”
● “It would be neat if I also added this...”
● …however, these observations can be written as new stories
● …and sometimes it's good to fix things before they cause pain
● …if extra changes are really necessary, can they be implemented separately?
● …team should be empowered to fix technical debt
● ...managing scope-creep is a shared responsibility
● Be prepared to abandon a change if it's taking too long; maybe it needs more planning?
● Should you be pairing?
● Should you demo your work?
39. Code review
● Code review is possible with tools for distributed teams (e.g. Gerrit or ReviewBoard)
● If you're not following a strict pairing policy, code-review is vital
● Useful to make others aware of changes
● Gerrit
● Build agent automatically builds your change and runs tests – verify +/- 1
● Invite others to review your code, they can give it a score between -2 and +2.
● Can only deploy code once at least one person has given a +2
● Work-flow is customisable
● Self-organising... anyone can review
$> git commit
$> git review
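The voting rule described on this slide can be sketched as a small function: the build agent's verify vote must be +1 and at least one reviewer must have given +2. The sketch below is a toy model, not Gerrit's API; the reviewer names are hypothetical, and treating a -2 as a veto is an assumption (the slide notes the work-flow is customisable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    reviewer: str
    label: str   # "Verified" (build agent) or "Code-Review" (people)
    value: int   # Verified: -1..+1, Code-Review: -2..+2


def can_deploy(votes: list) -> bool:
    """Return True when a change has passed the build and been approved."""
    verified = [v.value for v in votes if v.label == "Verified"]
    reviews = [v.value for v in votes if v.label == "Code-Review"]
    build_passed = 1 in verified and -1 not in verified
    approved = 2 in reviews
    vetoed = -2 in reviews  # assumption: a single -2 blocks the change
    return build_passed and approved and not vetoed


votes = [
    Vote("build-agent", "Verified", 1),
    Vote("alice", "Code-Review", 1),
]
print(can_deploy(votes))  # built and reviewed, but nobody has given +2 yet

votes.append(Vote("bob", "Code-Review", 2))
print(can_deploy(votes))  # now at least one reviewer has given +2
```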
42. Merge / Deployment
● Merge & Deployment
● One-click deployment
● Developer should press the button
● Code is merged into the master/release branch
● Build server automatically checks out the code and builds, tags, and uploads the release to an artefact repository
● Package is automatically deployed on all servers
– Extra orchestration for external-facing services to avoid “thundering-herd” problems
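One common way to avoid a thundering herd during deployment is to restart servers in small batches with a randomised pause between batches, so that clients do not all reconnect at once. The sketch below only computes such a schedule; the function name, server names, and timing values are hypothetical, and this is not a description of the orchestration Kasabi actually used.

```python
import random


def rollout_plan(servers, batch_size, base_delay, jitter, rng=None):
    """Return (start_offset_seconds, batch) pairs for a staggered roll-out.

    Each batch starts `base_delay` seconds after the previous one, plus a
    random extra of up to `jitter` seconds.
    """
    if rng is None:
        rng = random.Random()
    plan = []
    offset = 0.0
    for i in range(0, len(servers), batch_size):
        plan.append((offset, servers[i:i + batch_size]))
        offset += base_delay + rng.uniform(0, jitter)
    return plan


servers = ["web-1", "web-2", "web-3", "web-4", "web-5"]
# A seeded generator keeps this example repeatable.
plan = rollout_plan(servers, batch_size=2, base_delay=30.0, jitter=10.0,
                    rng=random.Random(42))
for offset, batch in plan:
    print(f"t+{offset:6.1f}s  deploy to {', '.join(batch)}")
```

A deployment tool would then walk the plan, waiting until each offset before upgrading the servers in that batch.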
45. Technical lessons learnt
● Use distributed SOA-based services to reduce tight coupling
● Monitor everything...
● Leverage cloud offerings
● wrap them with well-defined interfaces to avoid lock-in
● Design systems to scale
● Use open and unmodified components where possible
● Standard components fronting external APIs
● E.g. Jena, Solr, HAProxy, Apache
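The advice to wrap cloud offerings with well-defined interfaces can be illustrated with a small storage abstraction. In this sketch the application code depends only on a `BlobStore` interface; all names are hypothetical, and only an in-memory implementation is shown. A cloud-backed implementation (for example one built on S3) would implement the same two methods.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """The only storage interface the rest of the code is allowed to use."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        ...

    @abstractmethod
    def get(self, key: str) -> bytes:
        ...


class InMemoryBlobStore(BlobStore):
    """Stand-in implementation, useful for tests and local development."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_dataset(store: BlobStore, name: str, content: bytes) -> str:
    """Application code: knows about BlobStore, not about any vendor."""
    key = f"datasets/{name}"
    store.put(key, content)
    return key


store = InMemoryBlobStore()
key = archive_dataset(store, "bus-stops", b"<rdf/>")
print(key, store.get(key))
```

Swapping providers then means writing one new `BlobStore` subclass rather than touching every caller.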
46. Practices that have helped us
● Dev-ops culture
● Pragmatic approach to agile development
● Task allocation should be 'pull', rather than 'push'
● Teams should be self-organising
● Pairing when working on new problems
● Test-Driven-Development (TDD)
● Continuous integration
● Peer-review of code
● Continuous deployment
50. Credits
● Thanks for the invite to speak
● Thanks to Kasabi / Talis Systems Ltd
● Sign up at http://www.kasabi.com
Graphics from http://www.iconarchive.com/,
http://www.oxygen-icons.org and http://www.icons-land.com