1. Making a million firewalls sing
Scalable Networking
in
Apache CloudStack
June 19 2012
Chiradeep Vittal
2. Agenda
• Who am I & what am I doing here?
• Apache CloudStack
• Networking modes in Apache CloudStack
• Scaling challenges in Cloud Networking
• Scale Up
• Scale Out
3. Who
Chiradeep Vittal (@chiradeep)
– Founding engineer @ Cloud.com (2008)
– Architect @ Citrix Systems (2011-)
– Maintainer @ Apache CloudStack (2012-)
– Not a Network Ninja
Why
– Challenges of Cloud Networking
– Apache CloudStack
– Real World Cloud Networking
4. Apache CloudStack
Build your cloud the way the world’s most successful clouds are built
• Secure, multi-tenant cloud orchestration platform
– Turnkey platform for delivering IaaS clouds
– Over 100 commercial deployments: private and public
– Full featured GUI, end-user API and admin API
5. Apache CloudStack
• Open Source
• Apache License
• Incubating in the Apache Software Foundation since April 2012
• Open Source since May 2010
• In production since 2009
6. Apache CloudStack
• Flexibility and scale
• Hypervisor agnostic
• Flexible network topologies
• Multiple storage options
• Proven to scale to tens of thousands of hypervisors
7. Server Virtualization++ vs. Cloud
Server Virtualization++: built for traditional enterprise apps & client-server compute
• Enterprise arch for 100s of hosts
• Scale-up (server clusters)
• Apps assume reliability
• IT Mgmt-centric [1:Dozens]
• Proprietary vendor stack
Cloud: designed around big data, massive scale & next-gen apps
• Cloud architecture for 1000s of hosts
• Scale-out (multi-site server farms)
• Apps assume failure
• Autonomic [1:1,000’s]
• Open, value-added stack
Cloud compared with Server Virtualization++
• 10x more scalable
• 2-5x lower cost
• 100% more open
13. End-user experience
• Deploy a VM in a network
– VM Template = Windows 2008 with Joomla on VMware
– Service offering {m1.large} = 2 x CPU x 2.0 GHz, 8 GB RAM
– Disk Offering {Super fast}
– Network Offering {Gold} = Source NAT + LB + FW + 20 Mbps Internet access
• Network Offering Gold is realized by
– VLAN isolation
– Source NAT & FW on Juniper SRX
– LB on F5 BIG-IP
– DHCP, DNS on virtual appliance
• CloudStack orchestration:
– Pick a free VLAN, pick a free public IP, free private IP
– Pick hypervisor with spare capacity
– Pick primary storage of SSD type accessible in hypervisor cluster
– Pick a Juniper SRX and F5 with spare capacity
– Spin up a new virtual appliance, if necessary, that runs the DHCP and DNS services
– Pick a hypervisor, call hypervisor APIs to provision the virtual appliance on the selected VLAN
– Call hypervisor APIs to provision VM on selected VLAN
– Call SRX and F5 APIs to place their internal interfaces on the VLAN, public interfaces on public VLAN
– Call SRX API to provision source NAT, default FW rules
14. Networking Styles
Server Virt ++
• VLAN (or no) isolation
• Multiple service levels
• Interoperate with legacy networks at L2 or L3
• Legacy workloads requiring multicast and broadcast
• Assumes reliable infrastructure
• Difficult / expensive to scale out
• Bonding, multi-link, multi-path, redundant networks, STP
15. Networking Styles
Server Virt ++
• VLAN (or no) isolation
• Multiple service levels
• Interoperate with legacy networks at L2 or L3
• Legacy workloads requiring multicast and broadcast
• Assumes reliable infrastructure
• Difficult / expensive to scale out
• Bonding, multi-link, multi-path, redundant networks, STP
Cloud Style
• L3 isolation or overlays
• Single or few service levels
• Interoperate with legacy networks using gateways or at L3
• Workloads assume unreliable infrastructure
• Generally do not support multicast or broadcast
• Scales out massively
16. Software Defined Networking
• Built-in overlay controller (using vanilla GRE between Open vSwitch instances on the hypervisors)
Or
• Integration hooks available
– E.g., call SDN controller API to create a logical switch when a network is created
– Call SDN API when a VM NIC is added to a network
– Nicira NVP, MidoNet (more coming)
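The two hook points named on this slide can be pictured as a small interface that the orchestration layer calls. The following is only an illustrative sketch: `RecordingSdnController`, `create_logical_switch` and `attach_port` are hypothetical names invented here, not the CloudStack plugin API or any vendor's controller API.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingSdnController:
    """Hypothetical stand-in for an SDN controller API; it only records calls."""
    switches: dict = field(default_factory=dict)

    def create_logical_switch(self, network_id: str) -> None:
        # Idempotent: creating an existing switch leaves its ports untouched.
        self.switches.setdefault(network_id, [])

    def attach_port(self, network_id: str, nic_id: str) -> None:
        ports = self.switches[network_id]
        if nic_id not in ports:
            ports.append(nic_id)


class NetworkOrchestrator:
    """Calls the controller when a network is created or a NIC is added."""

    def __init__(self, controller):
        self.controller = controller

    def create_network(self, network_id: str) -> None:
        self.controller.create_logical_switch(network_id)

    def add_nic(self, network_id: str, nic_id: str) -> None:
        self.controller.attach_port(network_id, nic_id)


controller = RecordingSdnController()
orchestrator = NetworkOrchestrator(controller)
orchestrator.create_network("tenant1-web")
orchestrator.add_nic("tenant1-web", "vm1-nic0")
print(controller.switches)
```

Keeping both controller calls idempotent means a retried hook does no harm, which fits the retry-based approach described later in the deck.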
17. Physical Network
[Diagram: end users and admin/operations reach the CloudStack management server cluster (backed by MySQL) through the cloud API. Inside an availability zone, a router and edge services sit above an L3/L2 core; access-layer switches connect the servers in Pod 1, Pod 2, Pod 3 through Pod N, alongside secondary storage.]
18. Network virtualization with VLANs
[Diagram: Tenant 1 virtual network 10.1.1.0/24 with gateway address 10.1.1.1, reached from the Internet. Tenant 1 VM 1 to VM 4 have addresses 10.1.1.2 to 10.1.1.5.]
19. Network virtualization with VLANs
[Diagram: Tenant 1 virtual network 10.1.1.0/24 with gateway address 10.1.1.1 and Tenant 1 VM 1 to VM 4 at 10.1.1.2 to 10.1.1.5. Between the Internet / public network and the tenant network sit the Tenant 1 edge services appliance(s), providing NAT, DHCP and FW, with public IP addresses 65.37.141.11 and 65.37.141.36.]
20. Network virtualization with VLANs
[Diagram: as on the previous slide, Tenant 1 virtual network 10.1.1.0/24 with gateway address 10.1.1.1, Tenant 1 VM 1 to VM 4 at 10.1.1.2 to 10.1.1.5, and public IP addresses 65.37.141.11 and 65.37.141.36. The Tenant 1 edge services appliance(s) now provide NAT, DHCP, FW, load balancing and VPN.]
21. Network virtualization with VLANs
[Diagram: two tenant networks that reuse the same private range. Tenant 1 virtual network 10.1.1.0/24: gateway address 10.1.1.1, Tenant 1 VM 1 to VM 4 at 10.1.1.2 to 10.1.1.5, public IP addresses 65.37.141.11 and 65.37.141.36, edge services appliance(s) providing NAT, DHCP, FW, load balancing and VPN. Tenant 2 virtual network 10.1.1.0/24: gateway address 10.1.1.1, Tenant 2 VM 1 to VM 3 at 10.1.1.2 to 10.1.1.4, public IP addresses 65.37.141.24 and 65.37.141.80, edge services appliance providing VPN, NAT and DHCP.]
22. Scaling with VLANs
Scale out edge services using virtual appliances
[Diagram: network 10.1.1.0/24 on VLAN 100 with VM 1 to VM 4 at 10.1.1.2 to 10.1.1.5. A CloudStack (CS) virtual router holds the public IP addresses and the gateway address 10.1.1.1, and provides DHCP, DNS, NAT, load balancing and VPN.]
23. Scaling with VLANs
Scale out edge services using virtual appliances
[Diagram, left: network 10.1.1.0/24 on VLAN 100 with VM 1 to VM 4 at 10.1.1.2 to 10.1.1.5. A CS virtual router holds the public IP addresses and the gateway address 10.1.1.1, and provides DHCP, DNS, NAT, load balancing and VPN.]
Scale up using hardware devices
[Diagram, right: the same network 10.1.1.0/24 on VLAN 100 with VM 1 to VM 4 at 10.1.1.2 to 10.1.1.5. A Juniper SRX at gateway address 10.1.1.1 provides NAT, firewall and VPN; a NetScaler (65.37.141.112 / 10.1.1.112) provides load balancing; a CS virtual router provides only DHCP and DNS.]
24. Multi-tier virtual networking
[Diagram: the Internet connects through a load balancer (virtual or HW) and a virtual appliance / hardware devices to Web VM 1 to Web VM 4 on the web subnet 10.1.1.0/24, VLAN 101. VLAN 353 is also labeled.]
Network Services
• IPAM
• DNS
• LB [intra]
• S-2-S VPN
• Static Routes
• ACLs
• NAT, PF
• FW [ingress & egress]
• BGP
25. Multi-tier virtual networking
[Diagram: the Internet connects through a load balancer (virtual or HW) and a virtual appliance / hardware devices to three tiers: Web VM 1 to Web VM 4 on the web subnet 10.1.1.0/24, App VM 1 and App VM 2 on the app subnet 10.1.2.0/24, and DB VM 1 on the DB subnet 10.1.3.0/24. The tiers carry the labels VLAN 101, VLAN 353 and VLAN 2724. An MPLS VLAN is also attached.]
Network Services
• IPAM
• DNS
• LB [intra]
• S-2-S VPN
• Static Routes
• ACLs
• NAT, PF
• FW [ingress & egress]
• BGP
26. Multi-tier virtual networking
[Diagram: the Internet connects through a load balancer (virtual or HW) and a virtual appliance / hardware devices to three tiers: Web VM 1 to Web VM 4 on the web subnet 10.1.1.0/24, App VM 1 and App VM 2 on the app subnet 10.1.2.0/24, and DB VM 1 on the DB subnet 10.1.3.0/24. The tiers carry the labels VLAN 101, VLAN 353 and VLAN 2724. An IPSec or SSL site-to-site VPN connects to the customer premises, and an MPLS VLAN is also attached.]
Network Services
• IPAM
• DNS
• LB [intra]
• S-2-S VPN
• Static Routes
• ACLs
• NAT, PF
• FW [ingress & egress]
• BGP
29. Problem:
Manage Configuration of
1000s of virtual appliances (or VRF)
Dozens of HW appliances
Solution:
Database-driven state management of appliances
Message queues + Retry Logic
Idempotent updates,
Recreatable virtual appliances
Problem:
Single-tenant HW appliances
Solution:
CloudStack API layers multi-tenancy, provides abstraction
No direct access to devices
30. Problem:
Hardware appliances with no APIs
CLI only
Limited concurrent login sessions
Solution:
Recommend appliances with APIs
Integrate with Network Orchestrators
32. Layer 3 cloud networking
[Diagram: many Web VMs grouped into a Web Security Group and several DB VMs grouped into a DB Security Group.]
Ingress Rule: Allow VMs in Web Security Group access to VMs in DB Security Group on Port 3306
33. L3 isolation with distributed firewalls
[Diagram: the Internet and a load balancer connect to an L3 core; public IP addresses 65.37.141.11, 65.37.141.24, 65.37.141.36 and 65.37.141.80. Below the core, each pod has its own L2 switch and subnet gateway: Pod 1 at 10.1.0.1, Pod 2 at 10.1.8.1, Pod 3 at 10.1.16.1. Pod 1 hosts Tenant 1 VM 1 (10.1.0.2), Tenant 2 VM 1 (10.1.0.3) and Tenant 1 VM 2 (10.1.0.4).]
34. L3 isolation with distributed firewalls
[Diagram: the Internet and a load balancer connect to an L3 core; public IP addresses 65.37.141.11, 65.37.141.24, 65.37.141.36 and 65.37.141.80. Below the core, each pod has its own L2 switch and subnet gateway: Pod 1 at 10.1.0.1, Pod 2 at 10.1.8.1, Pod 3 at 10.1.16.1. Pod 1 hosts Tenant 1 VM 1 (10.1.0.2), Tenant 2 VM 1 (10.1.0.3) and Tenant 1 VM 2 (10.1.0.4). Pod 3 now hosts Tenant 1 VM 3 (10.1.16.47) and Tenant 1 VM 4 (10.1.16.85).]
35. L3 isolation with distributed firewalls
[Diagram: the Internet and a load balancer connect to an L3 core; public IP addresses 65.37.141.11, 65.37.141.24, 65.37.141.36 and 65.37.141.80. Below the core, each pod has its own L2 switch and subnet gateway: Pod 1 at 10.1.0.1, Pod 2 at 10.1.8.1, Pod 3 at 10.1.16.1. Pod 1 hosts Tenant 1 VM 1 (10.1.0.2), Tenant 2 VM 1 (10.1.0.3) and Tenant 1 VM 2 (10.1.0.4). Pod 3 hosts Tenant 2 VM 2 (10.1.16.12), Tenant 2 VM 3 (10.1.16.21), Tenant 1 VM 3 (10.1.16.47) and Tenant 1 VM 4 (10.1.16.85).]
37. A Million Firewalls?
[Diagram: a dense grid of VM boxes filling the slide.]
39. Problem:
Manage the state of 100s of thousands of firewalls
Solution:
Well-known software scaling techniques
• Message queues
• Consistency tradeoffs
• Idempotent configuration & retries
CloudStack uses
• special purpose queues
• optimized for large security groups
• eventual consistency for rule updates
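To make the combination of queues, retries and idempotent configuration concrete, here is a minimal sketch. It is not CloudStack's implementation: `HostFirewallAgent`, `drain`, the sequence-number scheme and the in-memory queue are all assumptions chosen for illustration. Each update carries the complete rule set plus a sequence number, so replays and out-of-order retries are harmless and every host converges on the newest state.

```python
from collections import deque


class HostFirewallAgent:
    """Hypothetical host-side agent holding the allowed sources for one VM."""

    def __init__(self):
        self.applied_seq = 0
        self.allowed_sources = frozenset()

    def apply(self, seq, allowed_sources):
        # Idempotent: replaying the same or an older update changes nothing.
        if seq <= self.applied_seq:
            return
        self.allowed_sources = frozenset(allowed_sources)
        self.applied_seq = seq


def drain(queue, agent, send, max_attempts=3):
    """Deliver queued updates, re-queueing failures up to max_attempts."""
    while queue:
        seq, sources, attempts = queue.popleft()
        try:
            send(agent, seq, sources)
        except ConnectionError:
            if attempts + 1 < max_attempts:
                queue.append((seq, sources, attempts + 1))


# A flaky transport that fails on the first call only.
calls = {"n": 0}


def flaky_send(agent, seq, sources):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("host unreachable")
    agent.apply(seq, sources)


agent = HostFirewallAgent()
queue = deque([
    (1, {"10.1.16.31"}, 0),
    (2, {"10.1.16.31", "10.1.45.112"}, 0),
])
drain(queue, agent, flaky_send)
print(agent.applied_seq, sorted(agent.allowed_sources))
```

In this run the first update fails, is re-queued behind the second, and is ignored when it finally arrives because a newer sequence number has already been applied.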
40. Problem:
Firewall (iptables) rules explosion on the host firewall
Allow Security Group {Web} on TCP port 3060
-A FORWARD -p tcp -m tcp --dport 3060 --src 10.1.16.31 -j ACCEPT
-A FORWARD -p tcp -m tcp --dport 3060 --src 10.1.45.112 -j ACCEPT
-A FORWARD -p tcp -m tcp --dport 3060 --src 10.1.189.5 -j ACCEPT
…
-A FORWARD -p tcp -m tcp --dport 3060 --src 10.21.9.77 -j ACCEPT
For large security groups, performance suffers
41. Problem:
Firewall (iptables) rules explosion on the host firewall
Solution:
Use ipsets:
ipset -N web_sg iptreemap
ipset -A web_sg 10.1.16.31
ipset -A web_sg 10.1.16.112
ipset -A web_sg 10.1.189.5
…
ipset -A web_sg 10.21.9.77
-A FORWARD -p tcp -m tcp --dport 3060 -m set --match-set web_sg src -j ACCEPT
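The difference in rule count can be shown by generating both forms for the same security group. The sketch below only builds command strings modeled on the two slides and does not touch a real firewall; the function names and the 1000 member addresses are made up for illustration.

```python
def per_source_rules(sources, port):
    """One iptables rule per source address (the rule-explosion form)."""
    return [
        f"-A FORWARD -p tcp -m tcp --dport {port} --src {ip} -j ACCEPT"
        for ip in sources
    ]


def ipset_rules(set_name, sources, port):
    """Commands that build one set, plus the single rule that matches it."""
    commands = [f"ipset -N {set_name} iptreemap"]
    commands += [f"ipset -A {set_name} {ip}" for ip in sources]
    rule = (
        f"-A FORWARD -p tcp -m tcp --dport {port} "
        f"-m set --match-set {set_name} src -j ACCEPT"
    )
    return commands, rule


# 1000 made-up member addresses spread over several ranges.
members = [f"10.1.{i // 250}.{i % 250 + 2}" for i in range(1000)]
flat = per_source_rules(members, 3060)
commands, rule = ipset_rules("web_sg", members, 3060)
print(len(flat), "iptables rules without ipset")
print(len(commands) - 1, "set members and 1 iptables rule with ipset")
print(rule)
```

The number of set members still grows with the group, but the iptables chain that each packet traverses stays at a single rule.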
42. Multi-tier networking with Overlay
[Diagram: the Internet connects through a load balancer and a virtual router (both virtual appliances) to three tiers: Web VM 1 to Web VM 4 on the web subnet 10.1.1.0/24, App VM 1 and App VM 2 on the app subnet 10.1.2.0/24, and DB VM 1 on the DB subnet 10.1.3.0/24. The tiers carry the labels GRE Key 101, GRE Key 353 and GRE Key 2724. An IPSec or SSL site-to-site VPN connects to the customer premises, and an MPLS VLAN is also attached.]
Network Services
• IPAM
• DNS
• LB [intra]
• S-2-S VPN
• Static Routes
• ACLs
• NAT, PF
• FW [ingress & egress]
• BGP
43. Multi-tier networking with Overlay
[Diagram: the Internet connects through a load balancer (virtual appliance) and vswitches to three tiers: Web VM 1 to Web VM 4 on the web subnet 10.1.1.0/24, App VM 1 and App VM 2 on the app subnet 10.1.2.0/24, and DB VM 1 on the DB subnet 10.1.3.0/24. The tiers carry the labels GRE Key 101, GRE Key 353 and GRE Key 2724.]
Network Services
• IPAM
• DNS
• LB [intra]
• S-2-S VPN
• Static Routes
• ACLs
• NAT, PF
• FW [ingress & egress]
• BGP
44. Check it Out
• Apache CloudStack
– http://wiki.cloudstack.org
– Download it
– Use it
– Contribute to it
• Citrix CloudPlatform
– Based on Apache CloudStack
– Commercial support
Editor's notes
Need a better slide than this
Two broad classes of workloads are emerging: traditional enterprise workloads architected with reliable infrastructure assumptions, and a new cloud style where reliability tends to be the responsibility of the application
Flexibility in CloudStack Networking means being able to support various combinations of network services being delivered to the cloud user. The cloud operator should be able to configure different levels of service with different combinations of services and offer them as packages in a catalog, much like service offerings and disk offerings
There are many ways of realizing a given service. A cloud operator may want to use one or more service providers (e.g., virtual appliances, hardware devices) to deliver these services.
The combination of services and service providers has to work in different isolation contexts in a multi-tenant cloud. Some cloud operators do not want any isolation and merely want the self-service nature of the cloud. Others want to use traditional VLAN isolation in order to interoperate with legacy services and equipment. Others want to adopt SDN approaches using overlays. By far the most scalable way is to use L3 isolation and security groups.
A cloud user wants to deploy a VM into a network with the service offering m1.large, the disk offering “Super Fast” and the “Gold” network offering. The Gold offering translates into the following combination of services: source NAT, load balancing, firewall and 20 Mbps Internet access.
The cloud operator has configured the “Gold” offering to be realized by the following service providers: isolation with VLAN, source NAT and FW on a Juniper SRX, LB on an F5, and so on.
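One way to picture such a catalog entry is as a mapping from each service to the provider that realizes it. This is a sketch only: the `NetworkOffering` class and its field and service names are hypothetical and do not mirror CloudStack's actual data model; the values come from the Gold example in the slides.

```python
from dataclasses import dataclass


@dataclass
class NetworkOffering:
    """Hypothetical catalog entry: service name -> provider that realizes it."""
    name: str
    isolation: str
    providers: dict
    internet_mbps: int

    def services(self):
        return sorted(self.providers)


gold = NetworkOffering(
    name="Gold",
    isolation="VLAN",
    providers={
        "SourceNat": "Juniper SRX",
        "Firewall": "Juniper SRX",
        "LoadBalancer": "F5 BIG-IP",
        "Dhcp": "virtual appliance",
        "Dns": "virtual appliance",
    },
    internet_mbps=20,
)
print(gold.services())
```

Separating the list of services from the providers lets an operator define another offering with the same services backed by different devices.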
When the user calls the deploy API (or clicks the last button on the deploy wizard) the following steps need to happen. CloudStack orchestrates the hypervisors, storage and network devices so that these elements deliver the chosen levels of service.
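The "pick a free resource" steps can be sketched as allocations from pools plus a capacity check. Everything here is an assumption for illustration: the `Pool` class, the `pick_host` helper, the VLAN range and the host capacities are invented, and real orchestration also has to handle storage, devices, concurrency and rollback.

```python
class Pool:
    """A set of free resources; allocate() hands out the lowest free one."""

    def __init__(self, items):
        self.free = sorted(items)

    def allocate(self):
        if not self.free:
            raise RuntimeError("pool exhausted")
        return self.free.pop(0)


def pick_host(hosts, cpus_needed, ram_gb_needed):
    """Return the first host with enough spare CPU and RAM."""
    for name, (cpus, ram_gb) in hosts.items():
        if cpus >= cpus_needed and ram_gb >= ram_gb_needed:
            return name
    raise RuntimeError("no host with spare capacity")


# All values below are made up for illustration.
vlans = Pool(range(100, 200))
public_ips = Pool(["65.37.141.11", "65.37.141.24"])
hosts = {"host-a": (1, 4), "host-b": (8, 32)}   # spare (CPUs, GB RAM)

plan = {
    "vlan": vlans.allocate(),
    "public_ip": public_ips.allocate(),
    "host": pick_host(hosts, cpus_needed=2, ram_gb_needed=8),
}
print(plan)
```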
For the two styles of cloud, the reference network architecture tends to be quite different. Server virtualization with self-service tends to rely on VLANs and related L2 techniques.
The new style of networking (called “Basic Zone” in CloudStack) uses L3 at all levels of the datacenter architecture.
CloudStack also supports L2-style networking on an L3-architected datacenter using overlays.
With VLAN or L2 isolation, each tenant gets a contiguous range of IPs in each network they create.
We can provide NAT, DHCP and FW services, for example, by starting a virtual appliance that acts as the gateway for this network and provides the edge services. The virtual appliance has one NIC on the public VLAN and one NIC on the VLAN assigned to the network.
If we want additional services such as LB and VPN, the same virtual appliance, additional appliances or hardware devices can provide them.
Every network created by any tenant can get its own unique set of services either by sharing hardware devices with other tenants or using dedicated appliances / devices. Each network gets its own VLAN
Since there are hundreds to thousands of tenants in a datacenter, we can scale out the edge services using multiple virtual appliances. Virtual appliances are cheap and disposable – if they fail, they can be recreated automatically by CloudStack.
If some tenants require more performance than a virtual appliance can offer, they can choose a network offering that is backed by more powerful hardware appliances. For example, CloudStack can orchestrate a Juniper SRX and a Citrix NetScaler device together to offer a combination of powerful firewall and load balancing services.
A 3-tier web app can be set up by a CloudStack end user simply by making API calls to instantiate different networks with different services. A virtual router or other device provides services such as inter-VLAN routing, ACLs and Internet access via source NAT. A separate LB appliance or device can provide performant LB for the web tier.
You can add the app tier and DB tier as well. These tiers don’t require load balancing.
Additionally, you can connect the entire set of networks to the customer premises through a site-to-site VPN using IPsec, or through an MPLS VLAN.
No solution to this problem. For this we turn to using L3 isolation which requires a different set of APIs and a different way of architecting the network
Related VMs are placed into security groups: for example, web VMs are placed in the web security group and DB VMs in the DB security group. By default, all ingress traffic to a VM is dropped. To allow web VMs to communicate with DB VMs, the cloud user calls an API to allow access on the database’s TCP port.
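The default-deny behavior and a group-to-group ingress rule can be modeled in a few lines. This is a conceptual sketch, not the CloudStack API: the `SecurityGroup` class and its methods are hypothetical, and port 3306 is taken from the ingress rule on the security group slide.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityGroup:
    name: str
    members: set = field(default_factory=set)      # VM names
    ingress: list = field(default_factory=list)    # (source group, port)

    def allow_from(self, source_group, port):
        self.ingress.append((source_group, port))

    def permits(self, source_vm, port):
        # Default deny: traffic passes only if some ingress rule matches.
        return any(
            source_vm in group.members and port == rule_port
            for group, rule_port in self.ingress
        )


web = SecurityGroup("web", members={"web-vm-1", "web-vm-2"})
db = SecurityGroup("db", members={"db-vm-1"})
db.allow_from(web, 3306)

print(db.permits("web-vm-1", 3306))   # source is in the web group
print(db.permits("web-vm-1", 22))     # no rule for this port
print(db.permits("other-vm", 3306))   # source is not in the web group
```

Because the rule refers to the group rather than to addresses, the set of permitted sources changes whenever VMs join or leave the web group, which is what makes the host firewalls hard to keep current at scale.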
Each pod has a different subnet. When a VM is started in a pod, it acquires a free IP in that pod’s subnet. Different tenants can end up in the same pod and hence share the same L2 subnet. Because security groups deny all by default, each VM needs a host-based firewall (embedded in the hypervisor dom0) to enforce this. This also prevents DHCP and ARP snooping. To prevent attacks, the firewall also blocks multicast and broadcast.
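Per-pod address assignment can be sketched with Python's standard `ipaddress` module. The /21 prefix length is an assumption inferred from the gateway addresses shown in the diagrams (10.1.0.1, 10.1.8.1, 10.1.16.1); the `PodSubnet` class and the placement list are hypothetical.

```python
import ipaddress


class PodSubnet:
    """Hands out free addresses from one pod's subnet (gateway excluded)."""

    def __init__(self, cidr):
        self.network = ipaddress.ip_network(cidr)
        hosts = iter(self.network.hosts())
        self.gateway = next(hosts)      # first usable address is the gateway
        self._free = hosts              # remaining addresses, in order

    def acquire(self):
        return next(self._free)


# Assumed /21 pod subnets, chosen to match gateways 10.1.0.1 and 10.1.16.1.
pods = {"pod1": PodSubnet("10.1.0.0/21"), "pod3": PodSubnet("10.1.16.0/21")}

placements = [("tenant1-vm1", "pod1"), ("tenant2-vm1", "pod1"),
              ("tenant1-vm2", "pod1"), ("tenant1-vm3", "pod3")]
for vm, pod in placements:
    print(vm, pod, pods[pod].acquire())
```

Note how two tenants receive neighboring addresses in the same pod subnet, which is why isolation has to come from the host firewall rather than from the network layout.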
As a tenant starts more VMs, the VMs can land in different pods. The cloud user cannot make any assumptions about L2 connectivity between their VMs.
As VMs get created and destroyed, CloudStack has to ensure that the configuration of the host-based firewalls (iptables) is consistent with the security group rules programmed by the cloud user.
40,000 hypervisors in a data center × 25 VMs per hypervisor = 1 million firewalls to be orchestrated by CloudStack.
If there are 1000 VMs in the web security group, their IPs cannot easily be summarized, since they are drawn from different subnets (pods). To allow web VMs on TCP port 3060, the DB VM firewall would therefore need 1000 separate iptables rules. When a packet from a web VM arrives at the DB VM firewall, up to 1000 rules might have to be checked before a match is found and the packet is let through. The sequential matching imposed by iptables can cause latency issues.
An ipset is a kernel data structure that can match an IP very efficiently against a large set of IPs. For example, using a tree structure, an IP address can be quickly tested for containment. The ipset is supplied to the iptables rule, so a single iptables rule is enough.
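The cost difference between sequential rule matching and a set lookup can be illustrated in user space. This is an analogy only: a Python `frozenset` (a hash set) stands in for the kernel's ipset, whose internal structure differs, and the 1000 member addresses are made up.

```python
import ipaddress

# Made-up membership: 1000 addresses drawn from four different subnets.
members = [ipaddress.ip_address(f"10.1.{16 * (i % 4)}.{i // 4 + 2}")
           for i in range(1000)]


def linear_match(rules, source):
    """Check rules one by one, as a chain of per-address rules would."""
    checked = 0
    for allowed in rules:
        checked += 1
        if allowed == source:
            return True, checked
    return False, checked


member_set = frozenset(members)          # stand-in for the kernel's ipset

last = members[-1]
print(linear_match(members, last))       # walks the whole list
print(last in member_set)                # one lookup, independent of position
```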
On top of the L3 networking layout, we can impose an L2 overlay using techniques such as GRE tunnels, NVGRE, VXLAN and STT. For example, instead of using VLANs for isolation, we could use GRE keys (a 32-bit ID) to scale well beyond the roughly 4,000 networks that VLANs allow. VLANs could still be used to interoperate with MPLS and legacy services.
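The scaling headroom follows directly from the width of the two identifiers. The short calculation below assumes the standard 12-bit 802.1Q VLAN ID (with IDs 0 and 4095 reserved) and the 32-bit GRE key mentioned above.

```python
VLAN_ID_BITS = 12      # IEEE 802.1Q VLAN identifier width
GRE_KEY_BITS = 32      # GRE key field width

usable_vlans = 2 ** VLAN_ID_BITS - 2   # IDs 0 and 4095 are reserved
gre_keys = 2 ** GRE_KEY_BITS

print(usable_vlans)                    # isolated networks available with VLANs
print(gre_keys)                        # isolated networks available with GRE keys
print(gre_keys // usable_vlans)        # how many times larger the key space is
```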
XenServer and KVM have Open vSwitch built in. This can be used to replace some of the traditional virtual router functions such as inter-network routing and ACLs. Most edge services would have to be provided using virtual appliances in this case, since hardware devices usually do not terminate the overlay technologies.