Many companies continue to manually create and manage their cloud infrastructure via web consoles. Documenting these procedures is challenging, especially since the interfaces are constantly evolving. Reviewing changes is also difficult, and it often amounts to having a coworker watch over your shoulder. Rolling back a bad change requires deleting your current work and attempting to manually re-create the old infrastructure from memory. Scaling the infrastructure or deploying it to new environments likewise often involves manually re-creating it.
HashiCorp's Terraform allows infrastructure to be managed as code. While a growing number of groups have started to use this tool, most are only beginning to scratch the surface of its potential. Yes, Terraform can create and manage resources in AWS and other cloud providers. But thanks to an ever-growing number of providers, it can also manage resources in many other popular services. At Yelp, we use Terraform to manage our AWS resources, DNS records in NS1, CDN configuration in Fastly and Cloudflare, and our charts and dashboards in SignalFx.
This setup allows us to maintain our infrastructure as code in a version control system that can be put through standard code review flows. If we discover an issue, we can revert to an older, working commit and restore our infrastructure to that point in time. Documentation can include code snippets that can be copied and pasted in an error-free manner. Finally, resources managed by one Terraform provider can reference information from resources managed by another. This means that launching a new AWS EC2 instance can automatically update the necessary DNS records in NS1, and then create a dashboard filled with customized charts designed to monitor the instance.
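As a hedged illustration of that cross-provider wiring, the sketch below feeds an EC2 instance's public IP into an NS1 record; the AMI, zone, and resource names are hypothetical placeholders, not our actual configuration:

```hcl
# Sketch: an NS1 record that follows an EC2 instance.
# All identifiers here are hypothetical.
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t2.micro"
}

resource "ns1_record" "web" {
  zone   = "example.com"
  domain = "web.example.com"
  type   = "A"

  answers {
    # Interpolate an attribute from a resource owned by another provider.
    answer = "${aws_instance.web.public_ip}"
  }
}
```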
My name is Nathan Handler
I work as a Site Reliability Engineer on the Operations team at Yelp
Here is my email address and Twitter handle; feel free to contact me at any time
Yelp's Mission is connecting people with great local businesses
Yelp used to be run out of traditional datacenters. Adding a new server involved manually adjusting BIND DNS records, imaging the machine, and adding the host to Puppet so it could be properly configured.
Suddenly, a wild cloud appeared!
What do we do? How do we transition our workflows to be able to handle the cloud and the challenges it provides?
We became very familiar with the AWS Console
Each time we needed to launch an instance, we would click through this 7-page form
Documenting this process is tough. We can include annotated screenshots, but they can't be easily searched and are painful to update.
Luckily, Amazon has a command line interface (CLI) that we can use to ease this process...
Unfortunately, this CLI is not the easiest to work with
Lots of arguments are required to launch an instance
It is quite easy to forget one (e.g., we often forgot to include `--iam-instance-profile`)
However, these commands can at least be documented easily in runbooks
To help deal with the long, ugly commands, we created an internal tool (which we unfortunately never got around to open sourcing) similar to awless, shown above.
With this tooling, launching a single host is straightforward and easy.
Code review is still tough. All we can do is review the command that is going to be run; we are still left guessing whether it will do the right thing.
With all three of these approaches, we are making changes in production, without a test suite, without code review, and without a version control system that we can use to easily rollback.
This is crazy!
We should be using tools that support our workflow. Our tools should not define our workflow.
What are we looking for?
We want a tool that is version control friendly. We need to be able to see who made changes, when, and why.
Changes need to be reviewable using standard code review tools. Needing someone to watch over your shoulder is not acceptable
It needs to utilize existing APIs/SDKs. A tool that uses Selenium to click through the web console would be too brittle and error-prone
Finally, the tool can't lock us into a single vendor. We want to be able to migrate to a new vendor without having to learn a new tool and port all of our infra to it.
We opted to use HashiCorp's Terraform, which checks all of those boxes
What does Terraform look like? In this simple example, we use the AWS provider to launch an EC2 instance.
For this simple case, we only specify an instance type and AMI.
However, Terraform is powerful enough to configure every aspect of these instances, including networking and attached volume settings
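A hedged reconstruction of what that simple example might look like (the AMI ID is a hypothetical placeholder):

```hcl
# Minimal sketch: launch one EC2 instance.
resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0" # hypothetical AMI ID
  instance_type = "t2.micro"
}
```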
A common question asked when I mention we are a big user of AWS is why we don't use CloudFormation.
CloudFormation can handle managing AWS resources...
...but what about NS1, SignalFx, Fastly, Cloudflare, and other non-AWS resources?
By using Terraform, we manage all of these resources with one common tool
Terraform will create and manage the underlying resources for you.
However, this just gives you a box
You still need a classic configuration management tool to add stuff to the box and manage the contents after the box is up and running
As with any tool, Terraform is not perfect. Here are a few of the gotchas that we encountered along the way
One of the first hurdles we had to overcome with Terraform was the scare factor.
In addition to being a new tool (to us), Terraform has the potential to modify (and thus break or destroy) all of our infrastructure
While it does produce a plan of the changes it intends to make, these plans can be confusing and hard to understand at first
It took us a while before we were comfortable enough with Terraform to fully embrace it and not be afraid. There is no good way around this other than making lots of changes, reviewing lots of plans, and testing in a staging environment until you build confidence
One of the key hurdles we had to solve with AWS was how to manage resources in multiple AWS accounts
While Terraform does now have a concept of workspaces, we've opted to have separate directories for our different dev, stage, and prod regions.
This makes it very obvious what environment we are modifying and means that a change in a dev region has zero chance of affecting prod
It also splits up our infrastructure, making it easier for multiple users to work on Terraform in parallel
A tfvars file specifies the AWS account being managed. Additional tooling that I'll talk about later parses this file and sets the necessary environment variables based on ~/.aws/credentials
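As a sketch, such a file might look like the following; the variable names and values are hypothetical, not our actual layout:

```hcl
# prod/us-west-2/terraform.tfvars (hypothetical names and layout)
aws_account = "prod"
aws_region  = "us-west-2"
```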
Managing DNS in Terraform feels pretty natural. For records, you specify a zone, domain, type, and ttl.
This works fine for a small number of records. However, we have a LOT of domains.
It took the provider ~15 minutes to make a simple change (as it would try to refresh its local state)
This made it slow to iterate and posed a problem during outages when trying to make DNS changes in a hurry.
We ultimately worked around this by separating out our more critical DNS infrastructure into a separate state file
Here is a sample of what managing NS1 with Terraform might look like. In this example, a zone is created and then referenced in a new www CNAME record containing multiple answers and a filter, which shows that Terraform can configure even NS1-specific aspects of records.
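A hedged reconstruction of that slide; the zone, answers, and the choice of the shuffle filter are hypothetical stand-ins:

```hcl
resource "ns1_zone" "example" {
  zone = "example.com"
}

resource "ns1_record" "www" {
  zone   = "${ns1_zone.example.zone}"
  domain = "www.${ns1_zone.example.zone}"
  type   = "CNAME"
  ttl    = 60

  # Multiple answers for the record...
  answers {
    answer = "east.example.com"
  }
  answers {
    answer = "west.example.com"
  }

  # ...plus an NS1-specific filter applied to those answers.
  filters {
    filter = "shuffle"
  }
}
```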
Last year, we created and open sourced a SignalFx Terraform provider called SignalForm.
Using SignalForm, it is possible to manage SignalFx detectors, charts, and dashboards
By managing these resources alongside our infrastructure, we can ensure that when we launch in a new region, detectors and dashboards get created automatically
We can do other interesting things as well, such as generating custom dashboards for each of our services
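A rough sketch based on the provider's documented resource types; the metric, names, and layout values are hypothetical:

```hcl
resource "signalform_time_chart" "cpu" {
  name         = "CPU Idle"
  program_text = <<-EOF
    data("cpu.total.idle").publish()
  EOF
}

resource "signalform_dashboard_group" "example" {
  name        = "Example Service"
  description = "Dashboards for a hypothetical service"
}

resource "signalform_dashboard" "example" {
  name            = "Example Service Overview"
  dashboard_group = "${signalform_dashboard_group.example.id}"

  # Place the chart on the dashboard's grid.
  chart {
    chart_id = "${signalform_time_chart.cpu.id}"
    width    = 12
    height   = 1
    row      = 0
    column   = 0
  }
}
```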
While creating these dashboards, we frequently had to run `terraform destroy` to revert to a clean slate and then re-create everything.
This was generally fine as we don't require perfect uptime from all of our dashboards
However, re-creating a dashboard changed its URL. How can we easily keep track of where to find our dashboards?
Our first solution involved a CLI tool that would output the URL for each SignalForm'ed dashboard
Eventually, we created a provider for our internal URL shortener, allowing us to generate memorable short URLs for each SignalForm'ed dashboard
There weren't really any unexpected surprises with the Fastly and Cloudflare providers.
Due to rather significant differences in capabilities and features between the two providers, we haven't created a module to wrap the two and keep the configs in sync
Instead, we manually configure each of them individually
As you can see, the Terraform Fastly provider supports fully configuring a Fastly service. In this example, we define a domain, a backend, and some custom VCL.
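A hedged sketch using the provider's `fastly_service_v1` resource; the service name, domain, backend address, and VCL file are hypothetical:

```hcl
resource "fastly_service_v1" "example" {
  name = "example-service"

  domain {
    name = "www.example.com"
  }

  backend {
    name    = "origin"
    address = "origin.example.com"
    port    = 443
  }

  # Ship a hand-written VCL file as the service's main VCL.
  vcl {
    name    = "custom"
    content = "${file("custom.vcl")}"
    main    = true
  }
}
```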
The Cloudflare provider is a lot simpler. In this example, we define a simple A record. Cloudflare does not provide the same level of customization support as Fastly.
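A sketch of such a record; the zone, name, and IP are placeholders (provider versions from this era identify the zone via `domain` rather than a zone ID):

```hcl
resource "cloudflare_record" "www" {
  domain = "example.com"  # hypothetical zone
  name   = "www"
  type   = "A"
  value  = "203.0.113.10" # documentation/example IP
  ttl    = 300
}
```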
Terraform maintains a local state file that maps its resources to real-world infrastructure
This file is updated after every change. Initially, we committed it to git alongside our Terraform code. However, it was quite common for users to forget to add the updated file after making a change. This would leave Terraform unaware of recently created infrastructure and cause it to try to re-create it (among other problems).
By its nature, this file was also problematic when multiple people attempted to make Terraform changes at the same time
We started out with local state as it is the default in Terraform and it made it easy to manually inspect the file as we were getting started
However, Terraform also supports storing this state file remotely
To make sure that we would never forget to commit the state file again, we transitioned to managing the file in S3
At the time, we had to write some tooling to handle copying our existing file to S3. Now, Terraform handles this in a rather seamless manner.
We worked around the issue of multiple people simultaneously applying Terraform by implementing locking via ZooKeeper
Terraform 0.9.0 and later have built-in locking capabilities for certain backends
In the case of S3, this locking is done via DynamoDB
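Wiring that up looks roughly like the backend block below; the bucket, key, and DynamoDB table names are hypothetical:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"          # hypothetical state bucket
    key            = "prod/us-west-2/terraform.tfstate" # hypothetical key
    region         = "us-west-2"
    dynamodb_table = "terraform-locks"                  # hypothetical lock table
  }
}
```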
We wrapped most of the core Terraform functionality inside of a Makefile
This makes it possible to do any necessary setup before applying Terraform and to take actions afterwards.
It also gives us a way to support using multiple different versions of terraform for different parts of our infrastructure
Each tool Terraform interacts with requires API keys. In some cases, like AWS, we have multiple accounts and therefore multiple API keys.
Our Makefile has a target that will set all of the necessary environment variables for Terraform. It parses Terraform variables to determine the correct AWS profile to use. Other keys are pulled from dotfiles in the user's home directory
By having Terraform manage all of our infrastructure, we had to start re-thinking permissions.
Suddenly, adding a tag to an EC2 instance required S3 access (to update the remote state file)
And launching a new machine required NS1 access to create its DNS records
Terraform providers give you the core building blocks to model your infrastructure
Modules in Terraform behave similarly to modules in other tools, allowing you to group related resources into a single reusable component
They allow you to abstract away most of the implementation details and instead focus on modeling your infrastructure in a way that lines up with concepts familiar to your developers
When you apply terraform, it will handle creating all of the necessary resources. However, you often need to be able to find and interact with these resources afterwards. A common example is needing to add the IP address of a freshly launched server to a topology file somewhere. Outputs allow you to easily expose and retrieve these key attributes from among the thousands of non-essential attributes stored in Terraform.
So what do Outputs actually look like?
In this case, we define an output named 'ip' whose value is an AWS Elastic IP address's public IP. We could just as easily have defined an output corresponding to a Spot Fleet Request ID or an EC2 instance's IP. When we run 'terraform apply', it clearly shows the outputs. We can also use the 'terraform output' command to query these values manually.
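A hedged sketch of that definition; the Elastic IP resource and its instance attachment are hypothetical:

```hcl
resource "aws_eip" "example" {
  instance = "${aws_instance.example.id}" # hypothetical instance
}

output "ip" {
  value = "${aws_eip.example.public_ip}"
}
```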
Many companies require writing up change control documentation before shipping changes in production.
A common component of such a document is a section explaining the planned rollback procedure in the event of issues
Terraform makes this relatively easy.
`git revert` and `terraform apply`. While this will restore your infrastructure to how it was before, it will not necessarily restore the systems running on that infra
Automatically applying Terraform is something we are currently working on.
It is a bit scary, as Terraform isn't great at handling errors and will happily leave the infrastructure in a partially applied state
However, it is necessary as we grow and have more people touching Terraform at the same time
Another issue with such a workflow is that the tool doing the automatic applies (e.g., Jenkins) would need full access to AWS and all other vendors we use
Automatic applies will also help clean up our infrastructure in the event that it is manually modified outside of Terraform (it will get brought back in sync)
Certain tasks are quite hard to model in Terraform natively
You can specify a `count` as a sort of loop, and use basic conditionals, as sketched below
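A minimal sketch combining the two, in the 0.11-era syntax current at the time of this talk; the variable name and AMI are hypothetical:

```hcl
variable "enable_bastion" {
  default = true
}

# count serves as both loop and conditional: here, 0 or 1 copies.
resource "aws_instance" "bastion" {
  count         = "${var.enable_bastion ? 1 : 0}"
  ami           = "ami-0123456789abcdef0" # hypothetical AMI
  instance_type = "t2.micro"
}
```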
As your Terraform code grows and becomes more complex, you might want to consider writing a traditional script (Python, Go, Bash, etc.) that outputs Terraform code
You can still use the standard Terraform tooling to apply and manage the changes; your code will just be simpler
In a perfect world, launching in a new region would be as simple as adding the region to a list.
The servers would get launched, dns records and dashboards created, and all is well in the world.
While this is all technically possible with Terraform, it is a bit challenging in practice
As noted, we were forced to separate our infrastructure into multiple state files for performance and isolation reasons
Our method for doing this makes it difficult to easily share variables among a subset of folders
We also often want to gradually launch a new region to be sure each of the various components functions correctly
While we haven't quite achieved this perfect world scenario, Terraform has greatly simplified our infra management
Questions?
We're Hiring
Here are some links if you want to learn more about Yelp