2. ING’s Cloud Native Journey
No questions during the presentation please
15-16 April 2019 | The Hague
3. Introduction
» Thijs Ebbers, CISSP
» ING Enterprise Architect, Infrastructure Domain
» Currently working on:
• ING’s Cloud Native Journey
- Container Hosting (“Kubernetes”)
- Data Services (Object (“S3”) & File Services (“NFS”))
- and all the Risk/Security topics touched by this innovation
4. About ING
» ING is a global financial institution with a strong European base, offering retail and wholesale
banking services to customers in over 40 countries. The purpose of ING is empowering
people to stay a step ahead in life and in business.
» ING Bank has more than 52,000 employees. As at the end of 2018, we had 38.4 million retail
customers, with 12.5 million considered primary customers.
5. Just a little example…
» “A critical CVE (+patch) for a commonly used Java library has just become known
which allows for remotely exploitable privilege escalations. Your IT department
tells you 26 images in your repositories (which amount to around 200 active
containers in your production clusters) are vulnerable. Your CEO has heard
rumours and anxiously asked you for your plan to resolve the situation.”
» You respond to him that:
• your developer just tested the patch successfully
• the IT operations department will apply the patch in production starting immediately
• after the patches have been applied you will use your state-of-the-art Vulnerability Scanning
capability to scan your production environments
• he will have the report proving the company is safe on his desk within 48 hours
» Raise your hands please if this is the right approach
6. For those of you who raised your hands:
» You failed, because:
• Containers cannot be patched (they should be immutable…)
- A new build process to create a higher version of the image including the patch must be
started.
- This build process should use a pipeline which should include all relevant (automated!)
tests and deliver the new version to a repository.
- Then all environments must be redeployed with this version.
• Containers should not be scanned in runtime
- You scan the images in your build pipeline and enforce the immutability of the
containers in the production environment.
• With a properly implemented cloud native ecosystem, your CEO should have
confirmation that his company is secure (again) by the next morning at the latest
» Let’s start by giving you some insight into this new Cloud Native world:
7. The Cloud Native ecosystem Kube
» A model of a Cloud Native
ecosystem without local
persistency (“12 Factor”)
and fit for a regulated
Enterprise
» Purpose: To make you
familiar with the concepts
& terminology
» Not to be confused with the
CNCF’s Cloud Native
landscape
(https://landscape.cncf.io/)
8. Side 0 – The DevOps Team’s input:
» Container Image
• Immutable, Stateless, Short-Lived
• Base Image (the “operating system”)
- Where was it obtained ?
- Is it vulnerability free ?
- Who will provide patched versions
(in time…) ?
• Code (standard SDLC)
» Deployment Config
• YAML file containing all information needed
to deploy an image successfully
» Network Config
• All information needed to have
communication paths outside the cluster
to/from your application in place
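A minimal sketch of such a deployment config, assuming a hypothetical application and registry (names, versions and resource sizes are illustrative, and the exact fields differ per distribution):

```yaml
# Hypothetical deployment config for an image produced by the build pipeline.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-frontend            # hypothetical application name
  namespace: payments                # the DevOps team's namespace
spec:
  replicas: 3                        # never a single instance in production
  selector:
    matchLabels:
      app: payments-frontend
  template:
    metadata:
      labels:
        app: payments-frontend
    spec:
      containers:
      - name: payments-frontend
        # Immutable, scanned & signed image from the team's repository:
        image: registry.example.com/payments-frontend:1.4.2
        resources:
          requests: { cpu: 250m, memory: 256Mi }
          limits:   { cpu: 500m, memory: 512Mi }
```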
9. Side 1 – The Data Services:
» Defined by Bindings
• Data Service instance location +
secrets to connect to it + driver
(optional)
» Purposes:
• To persist your state outside the
cluster
• To push out logs & events
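One way to model such a binding is a Kubernetes Secret mounted into the container. A sketch with hypothetical names: host, user and password are illustrative, and in practice the credentials would be injected from a password vault, never hard-coded:

```yaml
# Hypothetical binding: where the Data Service lives and how to connect to it.
apiVersion: v1
kind: Secret
metadata:
  name: orders-db-binding
  namespace: payments
type: Opaque
stringData:
  host: orders-db.dataservices.example.com   # Data Service instance location
  port: "5432"
  username: payments_app
  password: "<injected-by-vault>"             # placeholder, sourced from a vault
```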
10. Side 2 – The Security Services:
» Purpose: Externalizing your
users / certificates / passwords
(Directories, PKI solutions,
Password Vaults, …)
» No interfaces for SIEM and
VS/TSCM in Runtime !
• SIEM listens on Topics
• VS/TSCM is performed during
Build (& enforce immutability
in Runtime)
11. Side 3a – The Container Hosting Platform:
» Kubernetes (“k8s”): “Distributed Linux Kernel”
• Use a (Managed) k8s Distribution*
(OpenShift/Rancher/PKS/GKE/EKS/AKS/…)
* https://www.cncf.io/certification/software-conformance/
» Node: (Physical/Virtual) machine hosting k8s
code supplying resources to the Cluster
» Cluster: Namespace manager
• Production – Non Production
• (virtual) Data Center 1 – (virtual) Data Center 2
– (virtual) Data Center n
• Payload specific Clusters
» Namespace:
• SLA on resources (CPU/Memory)
• Unit of isolation (no access by default)
» Platform: The collection of Clusters
12. Side 3b – The Container Hosting Platform:
Power failure kills 2 nodes
14. Side 3b – The Container Hosting Platform:
Chaos Monkey kills the container
15. Side 3b – The Container Hosting Platform:
Your app is effectively hosted like this temporarily
» Very Dynamic behaviour
• App stays online !
• New container instances will be
spun up
• K8s scheduler will eventually
return the situation to “normal”
» Replicas
• Enforce a minimum safe number
in your production clusters !
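That minimum safe number can be enforced with a PodDisruptionBudget next to the Deployment's replica count. A sketch with hypothetical names (on clusters of this talk's era the apiVersion was `policy/v1beta1`; it is `policy/v1` on current ones):

```yaml
# Keep at least 2 instances up during voluntary disruptions
# (node drains, cluster upgrades), on top of the Deployment's replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-frontend-pdb
  namespace: payments
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments-frontend
```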
16. Side 3c – The Container Hosting Platform:
» Services make your app available
to 1 or more namespaces in the
cluster
» Ingress/Routes make your app
available to 1 or more sources
outside the cluster (e.g. the
cluster in the other DC)
» Internal Firewalling (“network
policies”) opens network ports for
other namespaces or external IP
addresses
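A sketch of the first and last concept together, with hypothetical names: a Service exposing the app inside the cluster, and a network policy opening its port only to namespaces carrying an illustrative `team: reporting` label:

```yaml
# A Service making the app reachable inside the cluster...
apiVersion: v1
kind: Service
metadata:
  name: payments-frontend
  namespace: payments
spec:
  selector:
    app: payments-frontend
  ports:
  - port: 8080
    targetPort: 8080
---
# ...and internal firewalling: only namespaces labelled team=reporting
# may reach the app, and only on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-reporting
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-frontend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          team: reporting
    ports:
    - protocol: TCP
      port: 8080
```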
17. Side 4 – The CI/CD Platform:
» The CI/CD platform
supports/manages the creation of
deployable artifacts, either via a
pipeline or via legacy methods
(portals,…)
» The Scanning engines here provide
your VS/TSCM evidence (in
combination with immutability of
your nodes & containers…) as well
as detecting license violations and
unwanted configuration settings
18. Side 5 – The Network Platform:
» Load Balancing provides the
capabilities to balance load over
multiple clusters (and hence enables
HA/DR/LCM of clusters)
» The DMZs provide capabilities to
securely connect the applications
hosted on the Container platform to
the Internet or other insecure
networks (e.g. the Workplace areas)
» Firewalling enables access to/from
other networks e.g. Data Services,
Security Services, CD/CI, legacy
application landscapes, ...
19. Where is ING now on its Journey ?
» Everything described earlier is/will be achieved one step at a time…
» Throughout 2018 multiple ING DevOps teams enjoyed learning and
experimenting in ING’s Non-Production container environments.
» In Asia, as of last November, ING is live with its fully digital,
mobile-only bank in the Philippines: https://www.ing.com.ph/
The front-end of this bank is hosted on ING’s container hosting
platform.
» In Europe multiple ING application landscapes will start onboarding
ING’s container hosting platform in 2019.
20. Summary / Lessons Learned / Best Practices
» Plotted on the ISC2 CISSP CBK so you can tick that
nice “Group A : Multiple Domains” box ;-)
21. Asset Security
» Do not register individual containers (as they are short-lived processes).
• Register the application(s), the namespace(s), the cluster(s) with their nodes and the relations
between them in your CMDB.
» Separate Production and non-Production workloads in separate immutable
clusters (Workloads share kernel & memory space ! / CVE-2019-5736).
• Only allow verified (scanned & signed, vulnerability free) workloads on your production
cluster.
• Do not allow any valuable data to be hosted/accessible from your non-production cluster.
» Limit your consumers’ resource utilization
• Enforce CPU and memory limits on namespaces to prevent misbehaving consumers “eating”
all resources in your clusters (thereby stealing other tenants’ resources).
» Only host suitable workloads (Silver Bullets do not exist…)
• Do not allow local persistency within your stateless cluster -> Use external Data Services.
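The per-namespace limits above can be enforced with a ResourceQuota. A sketch; namespace name and sizes are hypothetical and should follow the SLA agreed with the tenant:

```yaml
# Cap what one tenant's namespace may consume in total, so a
# misbehaving consumer cannot starve the other tenants on the cluster.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```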
22. Communication and Network Security
» Automate everything!
• Data Service connectivity
• Firewall configuration
» Least privilege is key. Do not:
• Give unnecessary access to your master nodes/etcd/API server
(CVE-2018-1002105). For metrics use scrapers instead (push, not pull access)
• Open up your entire cluster to/from Data Services. Instead use
fine-grained access / enforced path / endpoint isolation (e.g. Namespace ->
Tablespace / S3 bucket / Topic)
• Connect your clusters directly to untrusted networks (e.g. the Internet
or the Workplace area). Use a proper DMZ in between
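A common way to implement least privilege at the network level is a default-deny policy per namespace, on top of which specific paths are opened explicitly. A sketch with a hypothetical namespace name:

```yaml
# Default-deny baseline: no pod in this namespace may send or receive
# any traffic until a more specific policy explicitly opens a path.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```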
23. Identity and Access Management
» Easy ☺ : Nobody (no exceptions!) should be allowed to log on to
your immutable (Production) infrastructure (not on nodes, not into
container runtimes)
• The only access is via your Deploy Pipeline or Cloud Automation
• Events/Logging/Metrics should be pushed outside your cluster for
observability
• Hosted applications must use an external user access repository (no local
persistency…)
• In case of issues in production you re-deploy the last known good version
• (You can be a bit more lenient on your non-Production cluster…)
24. Security and Risk Management
» CIA
• C – same as always (least privilege)
• I – run your images immutable, persist your state externally
• A – Resilience by replicating your runtimes over multiple nodes /
clusters / locations
» Vendor/Contractor Security
• make sure they are capable of providing a higher version of their
software(-image) in time in case of vulnerabilities, and pay attention to the
origin of the base images used… (no more patching ! -> redeploy a
version without the vulnerability)
25. Security Architecture and Engineering
» Design for failure. Your Nodes/Containers will fail !
• Disperse nodes over local/remote availability zones
• Small/More is beautiful (The impact of 1 node/container instance out of 8 failing is easier to
digest than 1 out of 2…)
• The Node is the unit of failure (hence local RAID, dual power supplies, dual NICs could be
reconsidered… (use the budget to buy more nodes…))
• Enforce a minimum replica setting for your production clusters
» Design for short lifecycles/immutability. Your Nodes/Containers will develop
vulnerabilities!
• Cycle your nodes and containers regularly. The interval should be shorter than the maximum
response time for low and medium vulnerabilities in your organisation’s security policy
(Because you won’t need to scan your runtime estate in this case…)
• Have the automation & procedures in place for an immediate emergency cycle in the case of
unmitigated high- or critical vulnerabilities
26. Security Assessment and Testing
» Automate everything!
• Code Scanning
• Scanning for Vulnerabilities, License Violations and unwanted
configuration settings
• Compliance testing / Evidence generation
» As you no longer can/should have controls in runtime, by definition
you must implement all controls & tests in your
build/development pipeline (“Shift Left”)
27. Security Operations
» Automate everything!
• Secrets Management
• Customer onboarding/Cloud provisioning
• Data Service connectivity
• Compliance testing / Evidence generation
» Logging & Monitoring is key ! (and persisted OUTSIDE your cluster !)
» Patching is no longer possible -> Redeploy
» Disaster Recovery (-testing) -> Redeploy
• either your External LoadBalancer config
• or your containers with a purposely faulty image on 1 location
» No Backups, no Restores (data is persisted OUTSIDE your clusters !)
28. Software Development Security
» Automate everything!
• Secrets Management
• Scanning for Vulnerabilities, License Violations and unwanted
configuration settings
• Image Signing
• Customer onboarding/Cloud provisioning
• Compliance testing / Evidence generation
» As you no longer can/should have controls in runtime, by definition
you must implement all controls & tests in your
build/development pipeline (“Shift Left”)
29. And in general (1/2):
» Properly implemented Cloud Native landscapes are at least as
secure (and probably more secure, time will tell…) as traditional IT
landscapes, thanks to:
• Immutability and short lifespan of both your hosted application
instances and your cluster components
• Inherent availability due to the “defined state” enforced by the Kubernetes
scheduler
• Removing the interfaces for interactive access, patching, scanning and
backup/restore will dramatically reduce the available attack surface
• The risk of sharing kernel and memory space can be adequately
mitigated
30. And in general (2/2):
» Container hosting/Kubernetes deployment is only a small part of the
ecosystem you need to implement to transform to a cloud native
enterprise
» There is no Big Bang here. Take incremental steps to improve and
consciously accept remaining risks until they can be mitigated
» Technology is only part of your challenge… (eventually technology will
work…)
» People and Process will cost you the most effort/energy/time spent…
In order to successfully complete your cloud-native journey you need:
• An Agile mindset throughout your organization (not just the developers…)
• Risk and Governance processes suitable for this new world…
- Trying to prove compliance using a traditional risk/control framework is a wasted effort…