Building a distributed data-platform - A perspective on current trends in computing

Data, dev-ops, and cloud services

Building a distributed data-platform

Charles Care

Engineering Team
Kasabi / Talis

Talk overview
● About me...
● What Kasabi is,
● what we are trying to do
● how we are working to achieve that
● a quick walk-through
● Discussion of the Kasabi platform team
● Our technology / architecture
● Our engineering culture
● Lessons learnt

Views are mine...

…and not necessarily those of
my (current/past) employers

About me...
● 2001-2004 – BSc Computer Science (Warwick)
● 2004-2008 – PhD Computer Science (Warwick)
● 2007-2011 – BT Plc
● Technical risk analyst – BT Global MPLS Network
● Software Engineer – Infrastructure for Financial Markets
● Senior Software Engineer – Central software standards
and tools
● 2011-Present – Talis/Kasabi
● Software Engineer – Semantic web platform

About Kasabi
● Data marketplace
● Bringing together data...
● owners
● consumers
● Lowering the barrier for data-driven apps to
enter the market
● Enabling new opportunities for aggregating and
mixing data

Data licensing today

● Bespoke, expensive contracts between Data Owners and Data Consumers

Kasabi as a data platform

● Data engineers
● Data enthusiasts
● Data Owners
● Application Developers
● Third-party services
● API developers

About Kasabi
● Publish datasets using standard APIs
● Access data using standard APIs
● Query a dataset using SPARQL
● Search a dataset using a simple full-text search
● Define, contribute, and share your own APIs
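
As a rough illustration of querying a dataset over a standard API, the sketch below builds a SPARQL Protocol GET request with the Python standard library. The endpoint URL and the `apikey` parameter name are hypothetical placeholders, not Kasabi's documented API; only the `query` parameter comes from the SPARQL Protocol itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; substitute the real URL of the dataset's SPARQL API.
ENDPOINT = "http://api.example.org/dataset/demo/apis/sparql"

def build_sparql_url(endpoint, query, extra_params=None):
    """Return a GET URL carrying the query in the standard 'query' parameter."""
    params = {"query": query}
    params.update(extra_params or {})
    return endpoint + "?" + urlencode(params)

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
# 'apikey' is an assumed parameter name, used here only for illustration.
url = build_sparql_url(ENDPOINT, query, {"apikey": "YOUR-KEY"})
print(url)
```

The request is only constructed, not sent, so the sketch runs without network access.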

Data marketplace

http://www.kasabi.com/

Access data using standard APIs

Current organisation
● Product development
● Data engineering
● Customer operations
● Platform development

Data Platform
● Load balancing and routing
● Update services, Search services, Query services

Datasets
● Need to store and update datasets
● Access data via various services
● Must scale with load and increasing data
● Must be tolerant to failure
● Extensible
● Should be easy to add new services over time

To distribute...

...or not to distribute

Distributed Platform
● Routing layer
● Dynamic Gossip Network
– Update services
– SPARQL services
– Search services
– New service?
● Sequence Service
● Storage Service
● Monitoring Services

Distributed Platform – updates
● Updates are sequenced
● Data stored in distributed storage


Distributed Platform – updates
● Updates are gossiped around the network
● Here a SPARQL node realises that it should apply the update


Distributed Platform – query
● SPARQL queries will now reflect the update that was submitted
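
The update walk-through (sequence, store, gossip, apply, query) can be sketched as a toy simulation. This is a minimal model under stated assumptions, not the platform's actual protocol: the class names are hypothetical, every node applies every update (the real nodes decide whether an update concerns them), and gossip is a simple random push to one peer per round.

```python
import random

class SequenceService:
    """Hands out increasing sequence numbers so updates have a total order."""
    def __init__(self):
        self._next = 1

    def next_seq(self):
        seq, self._next = self._next, self._next + 1
        return seq

class Node:
    """A service node that relays every update and applies them in order."""
    def __init__(self, name):
        self.name = name
        self.seen = {}          # seq -> payload for every update heard so far
        self.applied = []       # payloads applied locally, in sequence order
        self.last_applied = 0

    def receive(self, seq, payload):
        self.seen.setdefault(seq, payload)
        # Apply only a contiguous run, so a gap delays later updates.
        while self.last_applied + 1 in self.seen:
            self.last_applied += 1
            self.applied.append(self.seen[self.last_applied])

    def gossip_to(self, peer):
        for seq, payload in list(self.seen.items()):
            peer.receive(seq, payload)

def gossip_until_consistent(nodes, total_updates, rng, max_rounds=1000):
    """Each round, every node pushes what it knows to one random peer."""
    for round_no in range(1, max_rounds + 1):
        for node in nodes:
            peer = rng.choice([n for n in nodes if n is not node])
            node.gossip_to(peer)
        if all(n.last_applied == total_updates for n in nodes):
            return round_no
    raise RuntimeError("gossip did not converge")

sequencer = SequenceService()
nodes = [Node("update-1"), Node("sparql-1"), Node("sparql-2"), Node("search-1")]
first = (sequencer.next_seq(), "add triple A")
second = (sequencer.next_seq(), "add triple B")

nodes[1].receive(*second)                  # arrives early: buffered, not applied
applied_before_gossip = list(nodes[1].applied)
nodes[0].receive(*first)

rounds = gossip_until_consistent(nodes, 2, random.Random(42))
print(rounds, [n.applied for n in nodes])
```

Until gossip completes, nodes can disagree about which updates they have applied; that window is the eventual consistency the platform has to manage.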


Monolithic vs distributed
● Monolithic
● Easy to synchronise events and data
● Consistent views and queries
● Less inter-process communication / less network overhead
● Easier to optimise for high throughput
● Single code-base
● Fewer processes to monitor
● Distributed
● Service-oriented - separate concerns run in isolated processes (and can be scaled
independently)
● Development is component-based
– Changes are more focussed / helps avoid scope-creep
● Deployment can be localised to avoid downtime
● Failure is more likely – so you need to plan for it
● Easier to integrate out-of-the-box software – e.g. using standard Apache Solr
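
One common way to plan for failure between services is to retry a call with exponential backoff. The sketch below is a generic illustration, not something taken from the Kasabi code base; `call_with_retries` and `flaky_query` are hypothetical names, and the `sleep` parameter is injectable so the example runs instantly.

```python
import time

def call_with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); after a failure wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)

calls = {"count": 0}

def flaky_query():
    """Stand-in for a remote call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("node unavailable")
    return "result"

delays = []
result = call_with_retries(flaky_query, sleep=delays.append)
print(result, delays)
```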

Distributed data platform
● Separate services for each API
● Communication via Gossip messages
● Have to manage eventual consistency
● Highly scalable
● Easy to add new services
● Use standard protocols and open-source components
● HTTP libraries / REST / ZeroMQ / Apache Thrift
● RDF and SPARQL using Apache Jena
● Search using Apache Solr
● Avoid modification and forks
● Deploy into Amazon EC2 (also using: S3, EMR, and ELB)

Benefits of using cloud services

Consider a start-up in 2002
● Have an idea...
● Get funding (development, op-ex,
cap-ex)
● Acquire servers
● Set-up your servers
– mail, web, source code repo, build
systems
– development, staging, live
● Some 'cloud' services
– …, SourceForge, shared servers, etc
● Build and go to market
● Probably embedding open-source
components
● Delivery based on full-stack,
monolithic, architectures

Consider a start-up in 2012
● Have an idea...
● Get funding (development capital, op-ex)
● you will probably not get cap-ex
● Use cloud services... rent rather than buy
● SaaS – Software as a Service
– Why would you run your own (chat/email etc)?
– Host your code in GitHub/BitBucket etc
● PaaS – Platform as a Service
– Do you need to control the full stack?
– Could you leverage platforms like Heroku, Joyent,
AppEngine etc?
– Amazon RDS
● IaaS – Infrastructure as a Service
– Cloud services to provide 'bare metal'
● Build and go to market quickly
● scale elastically over time

But what about the enterprise?
● Benefits of cloud services are
already transforming the enterprise
● Private clouds
● Virtual appliances
● Cloud bursting
● Independent scaling
● Separation of concerns
● Service-oriented architecture (SOA)
● And in future...
● Appetite for IaaS is growing
● PaaS and SaaS will follow.
● Perimeter security will be replaced by
localised security boundaries

So how do we build this stuff...?

How it all happens
● Constantly iterating through...
● Requirements
● Development (Test-driven)
● Testing/Review
● Deployment
● Operation
● We're an Agile, dev-ops team...
so all the above is a shared responsibility

Being a dev-ops team...
● Removing barriers between development and operations
● Shared responsibilities rather than distrust
● Everyone has root access
● Developers are responsible for operating systems they build
● Everyone is free to make changes
...and responsible for managing the roll-out of those changes
● Ops/Deployment/Monitoring are automated
● Everyone should have full-stack awareness
● Read more...
● http://dev2ops.org/blog/2010/2/22/what-is-devops.html
● http://www.jedi.be/blog/
● http://en.wikipedia.org/wiki/Devops
● http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr

Requirements and Planning
● Identification of requirement
● Planning
● Break down big changes into smaller tasks
– Can the change be deployed in small steps?
– Can the change be dark-deployed?
● Understand the wider impact
● Find middle ground between generic and specific
● Team is self-organising
● People pull work from the prioritised, planned stories
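
A dark deployment can be approximated with a feature flag: the new code ships to production but stays switched off, or is switched on for a small share of callers. The sketch below is one possible shape, assuming a percentage-based rollout; `FeatureFlags`, the flag names, and the `search` function are hypothetical.

```python
import hashlib

class FeatureFlags:
    """Percentage rollout: each caller falls into a stable bucket 0-99."""
    def __init__(self, rollout):
        # rollout maps flag name -> percentage of callers that see the feature
        self.rollout = dict(rollout)

    def is_enabled(self, flag, caller_id):
        percent = self.rollout.get(flag, 0)        # unknown flags stay off
        digest = hashlib.sha256(f"{flag}:{caller_id}".encode()).digest()
        bucket = int.from_bytes(digest[:2], "big") % 100
        return bucket < percent

def search(query, caller_id, flags):
    """The new code path is deployed but only runs when the flag allows it."""
    if flags.is_enabled("new-ranking", caller_id):
        return f"new ranking for {query}"
    return f"old ranking for {query}"

flags = FeatureFlags({"new-ranking": 100, "new-index": 0})
print(search("semantic web", "user-1", flags))
```

Hashing the caller id keeps each caller on the same side of the flag between requests, which makes a partial rollout easier to reason about.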

Branch based development
● One branch per change, squash before merge

Writing the code
● Work on a branch
● don't know if/when you'll merge
● Test-driven
● Unit tests first
● Do acceptance tests need to change?
● What technology? Which tool-sets?
● Smoke testing
● How do you know it works?
● What's different in production?
● What are the risks of failure?
● Feature flags?
Tests run: 110, Failures: 0, Errors: 0, Skipped: 2

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39 seconds
[INFO] Finished at: Sat Feb 18 15:20:36 GMT 2012
[INFO] Final Memory: 33M/240M
[INFO] ------------------------------------------------------------------------

Writing the code
● Avoid unnecessary scope-creep
● “I'll just fix this...”
● “It would be much cleaner if I re-factored this...”
● “It would be neat if I also added this...”
● …however, these observations can be written as new stories
● …and sometimes it's good to fix things before they cause pain
● …if extra changes are really necessary, can they be implemented separately?
● …team should be empowered to fix technical debt
● ...managing scope-creep is a shared responsibility
● Be prepared to abandon a change if it's taking too long; maybe it needs
more planning?
● Should you be pairing?
● Should you demo your work?

Code review
● Code review is possible for distributed teams with the right
tools (e.g. Gerrit or ReviewBoard)
● If you're not following a strict pairing policy,
code-review is vital
● Useful to make others aware of changes
● Gerrit
● Build agent automatically builds your change and
runs tests – verify +/- 1
● Invite others to review your code, they can give it a
score between -2 and +2.
● Can only deploy code once at least one person has
given a +2
● Work-flow is customisable
● Self-organising... anyone can review
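
The review gate described above can be written down as a small rule. This is a sketch of the policy as the slide states it, not Gerrit's own implementation: `can_submit` is a hypothetical function, and the two extra conditions (the build must be verified +1, and a -2 blocks the change) are assumptions about a typical configuration, since the work-flow is customisable.

```python
def can_submit(verified, reviews):
    """verified: build agent score (-1, 0 or +1); reviews: scores from -2 to +2."""
    if any(score < -2 or score > 2 for score in reviews):
        raise ValueError("review scores must be between -2 and +2")
    has_approval = any(score == 2 for score in reviews)   # at least one +2
    has_veto = any(score == -2 for score in reviews)      # assumed: -2 blocks
    return verified == 1 and has_approval and not has_veto

print(can_submit(1, [1, 2]))   # approved and verified
print(can_submit(1, [1, 1]))   # nobody has given a +2 yet
```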

$> git commit
$> git review

Merge / Deployment
● Merge & Deployment
● One-click deployment
● Developer should press the button
● Code is merged into the
master/release branch
● Build server automatically checks
out the code and builds, tags, and
uploads the release to an artefact
repository
● Package is automatically
deployed on all servers
– Extra orchestration for external-facing
services to avoid “thundering-herd”
problems
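
One way to orchestrate a deployment so that external-facing servers do not all restart at once is to roll it out in small batches and wait for each batch to become healthy. The sketch below illustrates that idea only; the function names, server names, and the `deploy_one` / `wait_until_healthy` callbacks are hypothetical, and the deck does not say how the real orchestration works.

```python
def rolling_batches(servers, batch_size):
    """Yield servers in fixed-size batches, preserving order."""
    for start in range(0, len(servers), batch_size):
        yield servers[start:start + batch_size]

def deploy(servers, batch_size, deploy_one, wait_until_healthy):
    """Deploy batch by batch; the next batch starts only after a health check."""
    deployed = []
    for batch in rolling_batches(servers, batch_size):
        for server in batch:
            deploy_one(server)
        for server in batch:
            wait_until_healthy(server)
        deployed.extend(batch)
    return deployed

log = []
deployed = deploy(
    ["web-1", "web-2", "web-3"], 2,
    deploy_one=lambda s: log.append(("deploy", s)),
    wait_until_healthy=lambda s: log.append(("healthy", s)),
)
print(log)
```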

Managing infrastructure
● Puppet or Chef
● Build packages (e.g. DEB or RPM)
● Centralise configuration management
● Utilising cloud compute infrastructure
● Amazon EC2
● Amazon S3
● Elastic load balancers
● Elastic Map-Reduce
● Application monitoring
● Metrics
● Log analysis
● Internal monitoring
● External checks

Lessons learnt

(again, my views!)

Technical lessons learnt
● Use distributed SOA-based services to reduce tight
coupling
● Monitor everything...
● Leverage cloud offerings
● wrap them with well-defined interfaces to avoid lock-in
● Design systems to scale
● Use open and unmodified components where possible
● Standard components fronting external APIs
● E.g. Jena, Solr, HAProxy, Apache
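
Wrapping a cloud offering behind a well-defined interface might look like the sketch below. `BlobStore`, `InMemoryBlobStore`, and `archive_dataset` are hypothetical names chosen for illustration; an S3-backed class would implement the same two methods, and is left out here to avoid guessing at a client API.

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """The only storage interface the rest of the code is allowed to see."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryBlobStore(BlobStore):
    """Stand-in backend for tests; a cloud-backed store would sit beside it."""
    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]

def archive_dataset(store: BlobStore, name: str, triples: list) -> str:
    """Application code depends on BlobStore, never on a specific provider."""
    key = f"datasets/{name}.nt"
    store.put(key, "\n".join(triples).encode("utf-8"))
    return key

store = InMemoryBlobStore()
key = archive_dataset(store, "demo", ["<a> <b> <c> ."])
print(key, store.get(key))
```

Swapping providers then means writing one new `BlobStore` subclass rather than touching every caller.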

Practices that have helped us
● Dev-ops culture
● Pragmatic approach to agile development
● Task allocation should be 'pull', rather than 'push'
● Teams should be self-organising
● Pairing when working on new problems
● Test-Driven-Development (TDD)
● Continuous integration
● Peer-review of code
● Continuous deployment

Conclusion
● Isolate your design into components
● Empower your team to release small changes
frequently
● Leverage hosted/cloud offerings

Credits
● Thanks for the invite to speak
● Thanks to Kasabi / Talis Systems Ltd

● Sign up at http://www.kasabi.com

Graphics from http://www.iconarchive.com/,
http://www.oxygen-icons.org and http://www.icons-land.com
