Filip Rogaczewski - Atlassian Connect Team Lead.
Presentation from Gdansk University of Technology about integrating business applications in the cloud, i.e. how to integrate 50 000+ servers together.
Business Applications Integration In The Cloud
1. Previously worked in
Lufthansa, NASA, Intel
Running, biking, paragliding
Travelling
Photography
Filip Rogaczewski • frogaczewski@atlassian.com •
Spartez/Atlassian
ETI graduate
Team leader in Spartez
8. WHY
Service Oriented Architecture
Scales the application
• Loosely coupled services
• Fewer resource restrictions for services
• Communication through a well-defined API
• Allows a better choice of technology for each service
• Distinct deployment models
[Diagram: services in containers, integrated over HTTP]
9. WHY
Service Oriented Architecture
Different hardware stack for services in Facebook
Type      Role          CPU               Memory  Disk
Type I    Web           (2) Xeon E5-2670  16 GB   (1) 500 GB SATA
Type III  DB            (2) Xeon E5-2660  144 GB  3.2 TB PCI Flash
Type IV   Hadoop        (2) Xeon E5-2660  64 GB   (15) 4 TB SAS
Type V    Haystack      (2) Xeon E5-2660  96 GB   (30) 4 TB SAS
Type VI   Cache         (2) Xeon E5-2660  144 GB  (1) 2 TB SATA
Type VII  Cold storage  (2) Xeon E5-2660  144 GB  (240) 4 TB SATA
11. WHY
Service Oriented Architecture
More effective organisation
• Each team runs a single service.
• Each team is cross-functional (designers, product managers, testers, developers, ops-engineers).
• Decisions about the roadmap happen locally.
• Each team is geographically co-located: one service in the USA, a second service in Australia, a third in Poland.
• Easier to scale work, with multiple teams working at the same time.
13. WHY
In Process Integration
[Diagram: add-ons running inside the application container]
In process
• Resources are shared
• Access to all data
• Doesn’t scale
Tied to the stack
• Language
• Frameworks
No clear API boundaries
18. WHY
Integrations of multiple applications
You can sell all your products instead of one.
19. WHY
Extending with marketplace
Customers always want more features.
If you can’t give it to them, let someone else do this - marketplace.
Collect 25% of what external vendors sell through your marketplace.
21. WHY
Enterprise customers
Customers who want to integrate your product with their existing
applications
[Diagram: HR, Communication, Environment, CRM, Asset management, Supply chain, GRC, Finance]
22. WHY
Acquisitions
You buy the next fantastic company.
You want to quickly integrate this feature.
It can take a couple of months if you have an integration layer ready.
It might never be done if you don’t.
???
23. Agenda
WHY: OPPORTUNITY
HOW: CASE STUDIES
UI INTEGRATION
REST API
MESSAGING
MULTI-TENANCY
DEPLOYMENT
26. HOW
Iframe
Never embed HTML from external sites directly in the page.
When using iframes, the browser provides security:
• Don’t set the sandbox attribute to allow-forms, allow-scripts, allow-same-origin, allow-top-navigation. This security model is very difficult to manage.
Sign the URL so that the server rendering the content can authenticate the request.
Optionally pass context parameters.
Use CORS or postMessage for communication.
Iframes bring performance issues.
28. HOW
Security: How to verify this request?
https://whoslooking-stg.herokuapp.com/poller?issue_key=ACJIRA-157
&tz=Australia%2FSydney
&loc=en-US
&user_id=frogaczewski
&user_key=frogaczewski
&xdm_e=https%3A%2F%2Fecosystem.atlassian.net&xdm_c=channel-whoslooking-connect-stg__
whos-looking&cp=&lic=none
&jwt=
eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJmcm
9nYWN6ZXdza2kiLCJxc2giOiJiZjA1NmU5MjEzYjBkODIyNDA
wNzg4YmQ4MThhNDk4YmM0NGQ0OTMyYTM2MWU1Mjk1Zj
cwMTczOGRiMGRjOTA2IiwiaXNzIjoiamlyYTo1OTk3NWQ2Ny
00Y2EwLTRlOWUtOTk2MC1kMWFhYWU3NmJiMzkiLCJleHA
iOjE0MTMxMzI2NTksImlhdCI6MTQxMzEzMjQ3OX0.Da8VXjL
_9z5xyzErtaJohHKH-xx-0Rp-9MF_xtIvcaY
29. HOW
Security: URL signing requirements
1. Signature for validating who created the request.
2. Issuer: identify the application instance which issued the
request. Is this jiraForEti or is this jiraForGdanskUniversity?
3. Expiration time of the token. Time in UTC after which you
should no longer accept the token.
4. Query hash. Prevents URL tampering.
5. Id of the user for authorisation.
6. Algorithm used to sign the URL.
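The requirements above can be sketched as a set of claims built by the host before it signs the URL. This is only an illustration: the canonical form used for the query hash is a simplification assumed here and not necessarily the one Atlassian Connect uses, and `query_hash`, `build_claims` and the 180-second lifetime are hypothetical.

```python
import hashlib
import time
from urllib.parse import quote


def query_hash(method: str, path: str, params: dict) -> str:
    """Hash method, path and sorted query parameters (assumed canonical form)."""
    canonical_query = "&".join(
        f"{quote(key, safe='')}={quote(str(value), safe='')}"
        for key, value in sorted(params.items())
    )
    canonical = f"{method.upper()}&{path}&{canonical_query}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_claims(issuer, user_id, method, path, params, ttl_seconds=180, now=None):
    """Collect issuer, user id, expiration time and query hash in one dict."""
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "iss": issuer,                    # which application instance issued the request
        "sub": user_id,                   # user id, used for authorisation
        "iat": issued_at,
        "exp": issued_at + ttl_seconds,   # UTC time after which the token is rejected
        "qsh": query_hash(method, path, params),  # changes if the URL is tampered with
    }
```

Because the parameters are sorted before hashing, reordering them in the URL does not change the hash, while changing any value does.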
30. HOW
Security: Signature validation
1. Token has the following form: encodedHeader.encodedClaims.signature
2. Upon installation, host and service exchange a shared secret.
3. Service receives the public key of the host. Service has to verify the public key. Each service exposes a REST API for public key retrieval.
4. During a request, the service extracts the issuer and signature algorithm from the URL and retrieves the sharedSecret for the issuer.
5. Service signs encodedHeader.encodedClaims with the algorithm from the header and verifies that the signatures match. If yes, return content. If no, return 403 (forbidden).
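Steps 4 and 5 can be sketched with the standard library alone, assuming an HMAC-SHA256 (HS256) signature over encodedHeader.encodedClaims. The `SHARED_SECRETS` store, the secret value and the function names are hypothetical; a real service would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical store filled at installation time: issuer -> shared secret.
SHARED_SECRETS = {"jiraForEti": b"secret-exchanged-at-install"}


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    """Host side: build encodedHeader.encodedClaims.signature."""
    header = {"typ": "JWT", "alg": "HS256"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url_encode(signature)}"


def verify_token(token: str, now=None):
    """Service side: return (status_code, claims); 403 if the token cannot be trusted."""
    try:
        encoded_header, encoded_claims, encoded_signature = token.split(".")
        header = json.loads(b64url_decode(encoded_header))
        claims = json.loads(b64url_decode(encoded_claims))
        secret = SHARED_SECRETS[claims["iss"]]  # look up the secret by issuer
        if header["alg"] != "HS256":  # the only algorithm this sketch supports
            return 403, None
        expected = hmac.new(
            secret, f"{encoded_header}.{encoded_claims}".encode("ascii"), hashlib.sha256
        ).digest()
        if not hmac.compare_digest(expected, b64url_decode(encoded_signature)):
            return 403, None
        current = time.time() if now is None else now
        if current >= claims["exp"]:  # expired tokens are rejected as well
            return 403, None
    except (ValueError, KeyError, TypeError):
        return 403, None
    return 200, claims
```

The comparison uses `hmac.compare_digest` so that the time taken does not reveal how many leading bytes of a forged signature were correct.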
32. HOW
Sandboxing
An iframe instance whose parent and child reside on different
domains or hostnames constitutes a sandboxed environment. The
contained page has no access to its parent. These restrictions are
imposed by the browser's same origin policy.
There are a few limitations applicable to iframes:
• Stylesheet properties from the parent do not cascade to the
child page
• Child pages have no access to their parent's DOM and JavaScript
properties
• Likewise, the parent has no access to its child's DOM or
JavaScript properties.
33. HOW
Cross origin resource sharing (CORS)
1. Keep a whitelist of URLs of services allowed to access server resources.
2. When executing a cross-origin request, the browser sends the header:
Origin: http://service.atlassian.net
3. If the service is whitelisted, the server should return:
Access-Control-Allow-Origin: http://service.atlassian.net
4. Additional headers allow:
choosing a subset of allowed headers (Access-Control-Allow-Headers)
choosing a subset of allowed HTTP methods (Access-Control-Allow-Methods)
DO NOT USE JSONP
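The whitelist check can be sketched as a function that maps the request's Origin header to the CORS response headers. The allowed methods and headers below are assumed values chosen for illustration, and `cors_headers` is a hypothetical helper rather than part of any framework.

```python
ALLOWED_ORIGINS = {"http://service.atlassian.net"}    # the whitelist from step 1
ALLOWED_METHODS = ("GET", "POST")                     # assumed subset of HTTP methods
ALLOWED_HEADERS = ("Content-Type", "Authorization")   # assumed subset of headers


def cors_headers(request_origin):
    """Return the CORS response headers, or nothing for origins not on the whitelist."""
    if request_origin not in ALLOWED_ORIGINS:
        return {}
    return {
        # Echo the whitelisted origin instead of answering with "*".
        "Access-Control-Allow-Origin": request_origin,
        "Access-Control-Allow-Methods": ", ".join(ALLOWED_METHODS),
        "Access-Control-Allow-Headers": ", ".join(ALLOWED_HEADERS),
        # Caches must not reuse this response for a different origin.
        "Vary": "Origin",
    }
```

A request from an origin that is not whitelisted gets no CORS headers at all, so the browser blocks the calling script from reading the response.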
34. HOW
window.postMessage
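1. Create a clear JS API between parent and iframe.
2. Parent creates an event listener for a message.
window.addEventListener("message", executeXHR, false);
3. Client executes:
// Functions cannot be serialised, so no success callback is sent;
// the child listens for the parent's reply message instead.
// Second argument: the expected origin of the parent (host from the earlier example URL).
window.parent.postMessage(
  JSON.stringify({type: "request", url: "/rest/api/2/dashboard"}),
  "https://ecosystem.atlassian.net");
4. Parent executes the request on behalf of the child and posts the results back with postMessage.
5. Difficult to implement. Host should provide a library with an abstraction over the JS functions it can handle.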
36. HOW
Performance: Apdex
New Relic: measuring user satisfaction
• In Atlassian
• Satisfied 1s
• Tolerating 3s
• Our Apdex goal is 0.9
• Apdex between 0.85 and 0.93 is considered to be a good score.
• For business applications users are more tolerant than for customer applications
• Financial services are out of scope.
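As a worked illustration, the Apdex score is the number of satisfied samples plus half of the tolerating samples, divided by the total number of samples. The sketch below uses the 1 s and 3 s thresholds from the slide as defaults; the function name and the sample data are made up.

```python
def apdex(response_times_s, satisfied_s=1.0, tolerating_s=3.0):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= satisfied_s)
    tolerating = sum(1 for t in response_times_s if satisfied_s < t <= tolerating_s)
    # Samples slower than tolerating_s count as frustrated and add nothing.
    return (satisfied + tolerating / 2) / len(response_times_s)


# Two satisfied, one tolerating, one frustrated request: (2 + 0.5) / 4
print(apdex([0.5, 0.8, 2.0, 5.0]))  # 0.625
```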
37. HOW
Performance: Latency
1. Latency
Within California? 30 ms
Within Europe? 30 ms
Across Atlantic? 90 ms
US to Australia? 210 ms
EMEA to Asia Pacific? 250 ms
2. Response times of the application differ between geographical regions. A customer in the US will usually have much better performance than one in Europe.
3. Use a CDN for caching static resources (Akamai, CloudFront, EdgeCast).
4. There are enterprise-class solutions for reducing latency (Verizon Enterprise Solutions).
42. WHY
REST API
Representational state transfer.
API is Application Programming Interface.
For an API to make sense, it needs to be stable. Each service needs an API policy.
Unless a REST API creates a security risk, it can’t change without prior notice: a deprecation period during which services can switch to a valid replacement, or an announced end of life for the feature.
Unfortunately, errors are part of the API too. For instance, even bad return codes can’t change.
The API should be versioned. Don’t change the current API, release a new one.
“Be liberal with what you accept, be consistent with what you return”
Be precise with the accepted and returned content-type.
43. WHY
GET method
rest/api/issue/ should return all issues?
NO. Collections should always be paginated. Returning everything is never realistic in large systems.
rest/api/issue/ACJIRA-1 should return the details of a particular issue.
NOT all of them. Let the user define, as a query parameter, which fields should be returned. Returning everything wastes precious CPU cycles and network bandwidth.
rest/api/issue/ACJIRA-1 should return an ETag.
ETag header in response for GET:
"ETag: xyz"
Second request with header:
"If-None-Match: xyz"
304 (Not Modified) when unchanged, 200 (OK) with a new ETag when changed, or 404 (Not Found).
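The three rules (pagination, field selection, conditional GET) can be sketched as plain functions over an in-memory store. The `ISSUES` data, the hash-based ETag and the `start`/`limit` parameter names are assumptions made for this illustration.

```python
import hashlib
import json

# Hypothetical in-memory store standing in for the issue database.
ISSUES = {"ACJIRA-1": {"key": "ACJIRA-1", "summary": "Sign iframe URLs", "status": "Open"}}


def get_issue(key, fields=None, if_none_match=None):
    """Return (status, headers, body) for GET rest/api/issue/<key>."""
    issue = ISSUES.get(key)
    if issue is None:
        return 404, {}, None
    # Only the requested fields are serialised; all of them if none were named.
    selected = {name: issue[name] for name in (fields or issue) if name in issue}
    body = json.dumps(selected, sort_keys=True)
    etag = '"' + hashlib.sha256(body.encode("utf-8")).hexdigest()[:16] + '"'
    if if_none_match == etag:
        return 304, {"ETag": etag}, None  # client copy is still current
    return 200, {"ETag": etag}, body


def list_issues(start=0, limit=50):
    """Return one page of issue keys instead of the whole collection."""
    keys = sorted(ISSUES)
    return {"start": start, "limit": limit, "total": len(keys),
            "issues": keys[start:start + limit]}
```

Deriving the ETag from the serialised body means any change to a returned field produces a new ETag; a version counter stored with the issue would work as well.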
44. WHY
HATEOAS
rest/api/issue/ACJIRA-1/delete is not a valid GET usage.
Use HATEOAS (Hypermedia As The Engine Of Application State)
[
{
"href": "rest/api/issue/ACJIRA-1",
"rel": "self",
"method": "GET"
},
{
"href": "rest/api/issue",
"rel": "all-paginated",
"method": "GET"
},
{
"href": "rest/api/issue",
"rel": "create",
"method": "POST"
},
{
"href": "rest/api/issue/ACJIRA-1",
"rel": "update",
"method": "PUT"
},
{
"href": "rest/api/issue/ACJIRA-1",
"rel": "delete",
"method": "DELETE"
},
{
"href": "rest/api/issue/ACJIRA-1",
"rel": "partial-update",
"method": "PATCH"
}
]
GET (self): idempotent
GET (all-paginated): idempotent
POST (create): not idempotent
PUT (update): idempotent
DELETE (delete): idempotent
PATCH (partial-update): not idempotent
45. WHY
REST API security
Prefer the same mechanism as for UI authentication
Possible to use BasicAuth, OAuth, but only with SSL/TLS.
Always check permissions of the user.
An interesting problem to solve:
We have a project ACJIRA and a user, Filip, who can’t access the project. What return code should he get?
It should be 404 (not found).
403 (forbidden) reveals that the project exists. Projects are often named after the company for which the service is provided. Companies may not want to publicly acknowledge a relationship with another company.
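The rule fits in a few lines: a missing project and a project the user may not see produce the same answer. The `PROJECTS` data and the member names are hypothetical.

```python
# Hypothetical permission data: project key -> users allowed to see it.
PROJECTS = {"ACJIRA": {"members": {"anna"}}}


def project_status(project_key, user):
    """Return the HTTP status for a request to view a project."""
    project = PROJECTS.get(project_key)
    # Same answer for "does not exist" and "exists but is hidden from you",
    # so the response never confirms that the project exists.
    if project is None or user not in project["members"]:
        return 404
    return 200
```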
46. WHY
AaaS (API as a Service)
You don’t need to write all APIs yourself. You can integrate with
existing APIs.
API directories/marketplaces exist where you can buy APIs.
Be careful with passing the user data to external services.
48. HOW
How do I know about data change?
The CI server doesn’t execute a PUT request to /issue/ACJIRA-27 saying the build completed. How would it know who is interested?
It publishes the information that the build was completed; jira-build-monitor-service registers a listener for this information.
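The idea can be sketched with a small in-process publish/subscribe bus. This is a synchronous stand-in for a real message broker; the `EventBus` class and the topic name `build.completed` are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """The publisher only names a topic; it never needs to know who listens."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, topic, listener):
        self._listeners[topic].append(listener)

    def publish(self, topic, event):
        # Copy the list so listeners may subscribe while an event is delivered.
        for listener in list(self._listeners[topic]):
            listener(event)


bus = EventBus()
received = []
# jira-build-monitor-service registers its interest once...
bus.subscribe("build.completed", received.append)
# ...and the CI server publishes without addressing any particular service.
bus.publish("build.completed", {"issue": "ACJIRA-27", "status": "completed"})
```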
49. HOW
Messaging
There are many approaches and concepts around messaging.
The key differentiator is message delivery guarantee.
It is easy to have a 90% or 95% message delivery guarantee.
Assuring 100% message delivery is almost impossible. It may require a complete service rewrite.
It is very important to understand the use case to decide what delivery guarantee is expected.
Send messages asynchronously. Connections are precious resources for your service.
Messages are API as well. They should have a clear contract and deprecation policy. Make them granular.
Specify the content type. Be careful with content-length: a message that is too long may DoS the receiver.
Sign the request.
50. HOW
What can go wrong?
Server dies during a change.
Event sourcing: record each change in a database. If the server died, there is no change and so no message. Each change has a sequence number.
Database trigger: move the message to a queue. What if the database server dies?
Resend with a possible-duplicate flag. Is the order preserved? Who is controlling this? What if the controlling node of the publisher dies?
Server died after the change, before sending the message.
What if the message was not delivered?
Server died while processing the message?
Pull the message again with a REST request to the publisher. Parametrise the request with the last successfully processed message.
Use a queue service implementation acting as a proxy, Amazon SQS for instance.
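A minimal sketch of the recovery path described above: the publisher keeps every change with a sequence number, and the consumer pulls whatever it missed, parametrised with the last sequence number it processed. The classes are hypothetical and `events_since` stands in for a REST call to the publisher.

```python
class Publisher:
    """Records every change with a sequence number (event sourcing)."""

    def __init__(self):
        self._events = []

    def record(self, payload):
        sequence = len(self._events) + 1
        self._events.append({"seq": sequence, "payload": payload})
        return sequence

    def events_since(self, last_seq):
        """Stand-in for a REST request such as GET /events?since=<last_seq>."""
        return [event for event in self._events if event["seq"] > last_seq]


class Consumer:
    def __init__(self):
        self.last_seq = 0
        self.processed = []

    def handle(self, event):
        """Process a pushed message only if it is the next one in sequence."""
        if event["seq"] != self.last_seq + 1:
            return False  # duplicate or gap; catch_up() repairs gaps in order
        self.processed.append(event["payload"])
        self.last_seq = event["seq"]
        return True

    def catch_up(self, publisher):
        """Pull everything after the last successfully processed message."""
        for event in publisher.events_since(self.last_seq):
            self.handle(event)
```

Because the consumer remembers only the last processed sequence number, a resent duplicate is ignored and a lost message is recovered the next time `catch_up` runs.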
51. HOW
Eventually consistent
It costs a lot of money to provide a message delivery guarantee (implement all the steps from the previous slide).
Most business applications can live without reliable messaging for a while.
When running 52 000 servers or more (it will always be more), you need to acknowledge that things are going to fail and messages are not going to be delivered.
Apply a resilient architecture, which polls for data changes (event sourcing again) if the messages are not delivered.
53. HOW
How do I ensure I display proper data?
I want to display information about related pages owned only by this
customer.
I want to display information only about source code changes made by
the organisation of my current customer.
54. HOW
Multi-tenancy
The ability of a single application to serve requests from multiple customers at the same time.
When the application is written for on-premises clients, it doesn’t make sense to support multiple organisations.
When the application is written for the cloud, it doesn’t make sense to host each customer separately.
Customers with a single office use JIRA 8h a day. The server can serve other customers for the remaining 16h.
A single server can process 500 concurrent users. It can host 10 small companies.
The application should be written to run with 0 tenants and with 1000 tenants.
55. HOW
Multi-tenancy is difficult
We have data of Nike, NASA and Twitter. We can’t leak this data.
Tenant id is public.
Encrypted information about the tenant needs to be propagated with each request.
When passing this information, it must be encrypted along with a timestamp.
Tenant id must be unique and strong.
DON’T: put the hostname, organisation name or any other data into the tenant id. This data will change.
We had an error:
https://ecosystem.atlassian.net/browse/AC-811
OpenID provider for all services.
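Two of the points above can be sketched briefly: a tenant id that carries no meaning, and a store in which every read and write is scoped by tenant id so one customer's data cannot be returned to another. The names are hypothetical, and encrypting the tenant information sent with each request is deliberately left out of this sketch.

```python
import uuid


def new_tenant_id():
    """Random id: unique, hard to guess, unrelated to hostname or organisation name."""
    return str(uuid.uuid4())


class TenantStore:
    """Every operation takes a tenant id, so there is no unscoped access path."""

    def __init__(self):
        self._rows = {}

    def put(self, tenant_id, key, value):
        self._rows[(tenant_id, key)] = value

    def get(self, tenant_id, key):
        return self._rows.get((tenant_id, key))

    def keys(self, tenant_id):
        return sorted(k for (owner, k) in self._rows if owner == tenant_id)
```

Since the id is random, renaming an organisation or moving it to another hostname does not require changing the id.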
57. HOW
How do I deploy this?
52 000 servers in multiple data centers.
Differences in
- OS version (good if the OS is the same)
- hardware
- database version
- schema version
You can’t update everything at the same time:
- no expected downtime
- data centers not optimised for 100% energy utilisation
- data centers not optimised for the heat.
Services are updated independently:
- each team owns its own deployment schedule
- each team may maintain a couple of versions of its services
- experimental features may be enabled/disabled on some services
58. HOW
Fast Five - Quality at speed
Stage  Behaviour             Code                  Data schema  Activation                            Comment
1      Old                   Old                   Old          Deployment                            Code is running as is.
2      Old                   New and old together  Old          Deployment                            New code deployment.
3      Old                   New and old together  New          Deployment or Configuration           Database migration.
4      New and old together  New and old together  New          Deployment, Configuration or Context  Slowly enable the feature on all racks. Features might be enabled in various configurations.
5      New                   New                   New          Deployment                            Delete the obsolete code.
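Stage 4 ("slowly enable the feature on all racks") could be driven by configuration along these lines. This is an assumed mechanism, not one described in the slides: each rack is hashed into a stable bucket, and raising the rollout percentage enables more racks without disabling any that were already enabled. The feature and rack names are made up.

```python
import hashlib


def feature_enabled(feature, rack_id, rollout_percent):
    """Deterministically decide whether a rack runs the new behaviour."""
    digest = hashlib.sha256(f"{feature}:{rack_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


racks = [f"rack-{n}" for n in range(20)]
enabled_at_30 = {r for r in racks if feature_enabled("new-editor", r, 30)}
enabled_at_60 = {r for r in racks if feature_enabled("new-editor", r, 60)}
```

Setting the percentage to 0 corresponds to stages 1 to 3 (old behaviour everywhere) and 100 corresponds to stage 5, after which the old code path can be deleted.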
59. HOW
DEV/DOG/PROD
Deployments never go to the client first.
First versions are deployed to the development environment.
The development environment is tested with production versions of the remaining services.
Good development versions are promoted to the dogfood environment. This version is used there internally against production versions of other services.
Good dogfooding versions are promoted to the production environment. Features are slowly enabled on production.
Possible issues:
- A new service was not tested against all versions running in production.
- A couple of new services are deployed at the same time. They were never tested together. The release manager should resolve this issue and schedule the feature release.