3. @NYTDevs | developers.nytimes.com
Who I Am
A software architect focusing on server
configuration and resiliency, with
sidelines in DevOps, release
engineering, and testing.
Started as a LAMP developer but has
always been a generalist interested in
all aspects of the data center.
5. Scope of this Presentation
Everything that follows pertains to the use of
Varnish to accelerate serving content on the
<www.nytimes.com> hostname, only.
There are several other Varnish clusters at
NYTimes.com.
7. NYTimes.com: Traffic
<www.nytimes.com> normal daily peak is
~75,000 requests/second – just this hostname.
● primarily APIs
● HTML traffic is ~4,000 req/sec
Traffic spikes up to 4x during a
breaking news event R.I.P. Leonard Nimoy
9. Mission Statement
“Leverage the latest technology in order to
improve the user experience, enhance our
journalism, and provide a more effective
environment for our advertisers.”
Project document
10. Improve the User Experience
Technical goals:
1. 25% improvement in browser load time, minimum.
2. ...
Sounds like a job for page caching!
13. Exception to the Rule
A complete code rewrite (almost). Why?
● < insert usual suspects here >
● Deeply embedded server-side personalization
(includes ads)
Output was simply uncacheable.
14. Never Let a Crisis Go To Waste
☒ (Test|Behavior) Driven Development
☒ Web performance was core from Day 0
☒ Async wherever, whenever
☒ New APIs
☒ CSS: LESS (then), SASS (now)
17. Changing Horses in Midstream
Site functionality that must not break:
● redirects (mobile, registration, et al.)
● user tracking
● web crawler detection
25. Cache Invalidation
Purge is not good enough (in Varnish 3).
PURGE causes cache misses on the highest-traffic content.
Needed cache re(set|build|prime).
26. NYT Homepage
● Must always be in Varnish cache.
● Every article linked to on the homepage
should already be in Varnish cache.
No cache misses = long TTL.
27. But...
Some content changes frequently.
Latest version served in real-time after every
publish action.
Short TTL = more cache misses.
PURGE = more cache misses.
28. Cache Rules Everything Around Me
CREAM: an API to re(set|build|prime) a single
cache entry.
Publish event calls API synchronously.
30. Where We Are Today: Software
~2,300 lines of VCL code
● Minimal inline C
10 VMODs
● std, utils, crashhandler, wurfl, boltsort, queryfilter
● 4 custom
31. Where We Are Today: Traffic
Of the ~4,000 page requests/second to
<www.nytimes.com>:
● ~1,500 now served by Varnish
● ~91% cache hit rate (down from ~96%)
32. Where We Are Today: Performance
Load test: ~3,000 requests/second/server with
current configuration
We could handle a 4x spike with 2 servers
We run 8 servers per data center
33. 8 Servers? Why?!
Because:
● Biggest spike ever was 10x (2012 Election Night)
● 2 hypervisors => even number of server instances
● Takes too long for us to dynamically provision
● We can afford to stay over-provisioned
Yes, this causes extra backend network traffic.
Scaled out for resilience, scaling up for performance.
34. Next Steps for Us
1. Install Varnish Cache Plus 4.
2. Utilize the Varnish Plus tools for monitoring.
3. Replace CREAM with VHA.
Good afternoon.
My name is Adam Falk.
I am one of the software architects for the Web Products Department at the New York Times.
Today I am going to be talking about how we used Varnish as the linchpin of our recent re-architecture.
Now, we only became a Varnish Plus subscriber last year, and so far we have only made use of the support.
We are still running the Open Source version, and have custom solutions that can likely be replaced by tools now available to us via the subscription.
Here is a quick bio of me. I have been at The Times for 9 years, and a software architect for 4.
As you see, I am wearing a few hats.
The title of Software Architect can mean different things to different people and organizations. My duties revolve around configuration and resiliency.
The New York Times Company employs over 400 people dedicated to our digital identity.
This number includes Web Development, Native Mobile, E-Commerce, Content Management, Analytics, Project Management, QA, Operations, and even Newsroom specialties.
As I mentioned to another attendee, we see ourselves as a technology company. Even our printing plant in College Point is a high-tech operation utilizing robotics and automation.
Please note that I am limiting this talk to just one of several Varnish clusters we use to serve the entirety of NYTimes.com.
This will only cover <www.nytimes.com>.
The NYTimes.com website has over 15 million page URLs in the wild, and we create 200 to 300 new pages every day.
This is the journalism: articles, slideshows, interactives, weekly magazine editions, and such--all accompanied by the best photos and graphics we can make. Those account for further millions of URLs, but are not served from <www.nytimes.com>.
Much of that journalism exists online only as image-scans of the printed papers going back to 1851. Replacing them with clean ASCII text is an ongoing, multi-year effort.
Peak traffic to <www.nytimes.com> for a normal day is about 75,000 requests/second. Again, this is just for the www hostname.
An aside: some of you may remember when the Syrian Electronic Army hacked our registrar and changed the nameservers for several hours. They had created a fake, defaced version of our site to serve in its place. Their servers fell over in seconds, and the world never saw it.
Now the bulk of that traffic is public APIs that need the same origin.
Most of our static assets are served by Content Delivery Networks via other hostnames, but not all of them.
A breaking news event can increase that by as much as 4x.
The image I have here is our most recent spike, not quite 2.5x. The news was Leonard Nimoy’s death. Thanks to fake celebrity death spam in social media, people seek out confirmation via obituaries on traditional journalism sites like NYTimes.com.
As a further narrowing of this talk’s scope, this presentation is going to drill down into the handling of the 4,000 HTML requests per second.
Many of you are aware of the new look of NYTimes.com that launched a year and a half ago.
It was not just a redesign, though; it was a rearchitecture.
Senior management recognized the need for it and gave it their full support.
This is a direct quote from the project document.
[READ THE QUOTE]
They also had an informed understanding of what was needed to succeed.
Technical goal #1 was a minimum 25% improvement in browser load time.
Achieving this would involve work at all stages of the browser request, but we immediately knew that server-side page caching was going to be a needed first step.
Thanks to Varnish as well as a modern JavaScript framework (BackboneJS), we achieved a 50% improvement.
We did it by building a completely new multi-cluster system beside the legacy one.
I have omitted the true complexity of the legacy infrastructure from the slide.
It is also a multi-cluster system.
The new infrastructure is smaller and simpler, and has been a complete success.
Almost nothing of the legacy application stack was retained, for several reasons. You know the usual suspects:
massive code debt, antiquated technology, antiquated development processes, etc.
The bottom line was simply that the legacy page rendering framework was based upon architectural assumptions that were obsolete.
Everything was done server-side: login detection, personalization decisions, and--most importantly--advertisement serving.
Nothing was cacheable, and no amount of iterative refactoring was going to change that.
So we left that framework as-is, continuing to adequately do its job in the legacy stack.
The new stack would share nothing.
We did salvage the rendering engine that recursively handles the modules that model our Content Management System's output. That was forked, though, and so is not shared. So: not quite a complete rewrite.
With the aforementioned support of senior management, we now had the freedom to build a system to the best of our abilities.
Testing was almost non-existent for the legacy system. We made it integral to the new process. We now have unit, integration, smoke, system, and end-to-end test suites.
Web performance was no longer an afterthought. New features could be kicked out of builds simply because they were performance dogs.
Our existing APIs suffered from the same problems as our application stack, so they were replaced. All of them utilize Varnish, as well.
I mention our CSS technologies here for two reasons:
First, they both allowed us to manage, through modularization, the complexity of our CSS codebase.
Second, it is a case study in how much better the new architecture is: we switched from LESS to SASS in just one month.
That summarizes what we did and why.
Any questions so far?
I am going to jump across three spotlights and then invite your questions, either now or later in the break room.
The first is a recommendation,
the second is an overview of our custom implementation for cache invalidation.
Lastly, I will show the Varnish stack as-is today.
Even though the initial launch carried only a fraction of site traffic (with more migrated from launch day going forward),
the new Varnish cluster needed to handle a sizeable range of existing web traffic functionality.
That meant that our VCL code was going to be fairly complex at launch, and, of course, would only increase over time.
So I made sure we followed the best practice of coding VCL with emphasis upon readability and maintainability.
We achieved that here with modularity: multiple VCL files each containing a single piece of functionality.
A file may contain only a dozen lines of code, and that is perfectly alright.
You will notice that I have 5 separate files for performing URL redirects, but you can infer each one’s functionality from only the filename.
The include statements in the default.vcl provide explicit control of load order and thus execution order.
They are all added here. It is that simple.
The power is in the clarity. Each Varnish function is executed in the explicit order.
First this vcl_recv…
…Then this one…
And so on for each Varnish function in its turn.
Unneeded functions are omitted, of course.
We achieved the Single Responsibility Principle and the benefits that accrue from it.
No matter the size of your VCL codebase, I cannot recommend this highly enough.
This has paid dividends in every metric I could apply to code.
I want to specifically mention the rapid speed at which new developers to Varnish configuration are able to understand where and how to make their first VCL code changes.
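As a sketch, a modular default.vcl along these lines might read as follows. The filenames here are illustrative, not our actual ones:

```vcl
# default.vcl -- include order is execution order.
# Varnish concatenates identically named subroutines (vcl_recv,
# vcl_fetch, ...) in the order in which the files are included.
include "backends.vcl";
include "redirect_mobile.vcl";
include "redirect_registration.vcl";
include "user_tracking.vcl";
include "crawler_detection.vcl";
include "caching_policy.vcl";
```

Each included file holds one piece of functionality, even if that is only a dozen lines, and the top-level file stays a pure table of contents.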
Any questions at this point?
As a segue into the next section, I give you my favorite programming joke.
So cache invalidation. We are still using Varnish 3.
As the New York Times style guide instructs, I will give you the most important fact first:
We created an event-driven cache reset system rather than use Varnish purge.
Why?
Traffic to NYTimes.com exhibits the usual fat-head, long-tail pattern, based on content age.
As you would expect, we focus on maximizing the performance of the head.
The primary criterion for what is head is the homepage.
It is the ultimate in editorial curation.
It also has Most Popular lists, which helps those get even more traffic.
All of this cascades into our feeds and social media channels.
It must always be in cache, as well as the pages it links to.
So a long cache TTL.
But we have a business requirement.
An article on a breaking story can be updated as often as once a minute.
The newsroom wants every new version immediately live.
A short TTL is not the solution, and neither is cache purge, because they both result in cache misses.
We do not want our visitors having to make the round-trip to the backend server cluster (if we can possibly avoid it).
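For contrast, the conventional Varnish 3 purge handling we decided against looks roughly like this (the ACL contents are illustrative). The problem is built in: the very next request for a purged URL is a guaranteed cache miss that must make that backend round-trip.

```vcl
acl purgers {
    "127.0.0.1";   # illustrative; the real list would be internal hosts
}

sub vcl_recv {
    if (req.request == "PURGE") {
        if (!client.ip ~ purgers) {
            error 405 "Not allowed.";
        }
        return (lookup);
    }
}

sub vcl_hit {
    if (req.request == "PURGE") {
        purge;                 # evict the object; next hit is a miss
        error 200 "Purged.";
    }
}

sub vcl_miss {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
}
```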
So we built CREAM.
It is a synchronous RESTful API that was added to our CMS publishing process.
CREAM sends a HTTP request for the just-updated article to every Varnish server (in parallel).
That request includes a secret header that will force a miss in all relevant caching layers for that URL.
The web producer thus incurs the entire performance hit of page regeneration and recaching.
Our users continue to receive the previously cached version until the CREAM request updates the Varnish cache as it returns normally from the backend.
The latest version is now in Varnish cache with zero impact on site visitors.
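On the Varnish side, the mechanism can be sketched in a few lines of Varnish 3 VCL using hash_always_miss. The header name, its value, and the ACL contents here are illustrative, not our actual ones:

```vcl
acl cream_callers {
    "127.0.0.1";   # illustrative; the real list would be the CMS hosts
}

sub vcl_recv {
    if (req.http.X-Cream-Secret) {
        if (client.ip ~ cream_callers &&
            req.http.X-Cream-Secret == "change-me") {
            # Fetch fresh from the backend and insert the response as
            # a new cache object. The old object keeps serving other
            # clients until the new one replaces it -- no eviction,
            # no visitor-facing miss.
            set req.hash_always_miss = true;
        } else {
            # Never trust the header from the outside world.
            unset req.http.X-Cream-Secret;
        }
    }
}
```

This is the key difference from purge: hash_always_miss refreshes the object without first removing it, so only the CREAM request itself pays the regeneration cost.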
I am going to skip asking for questions here, and push on to the end.
So here we are today:
The complex VCL I mentioned earlier comes to 2,300 lines of code.
We have kept the amount of inline C code to a minimum, mostly by just going all the way with full custom VMODs.
We have the vendor recommended VMODs and the usual suspects familiar to many of you, with a few custom VMODs for specialty processing at scale.
Of the previously referenced 4,000 page requests per second, only about 1,500 are on the new architecture today.
All new development must be done using the new architecture, so old content is being migrated in chunks--some being as large as a quarter million URLs at once.
Adding the long tail has begun to have a negative effect upon the cache hit rate. My fellow architects and I have to regularly review the set of cache TTL values in light of this changing reality.
Next is server performance.
Load testing tells us that our current configuration can handle about 3,000 requests per second per server. So I could handle even a 4x spike with only 2 servers.
But I would be a fool to do so. So 3 servers, maybe 4?
Nope. 8!
I imagine little “WTF?” thought bubbles over your heads.
Each of our data centers is built upon a foundation of 2 hypervisors running all virtualized server instances.
So I always need an even number of instances. If a hypervisor should crash, we want to be able to continue serving on the remaining one without any cascading, secondary effects to users or to the now-crisis mode data center.
Furthermore, while I have been talking on and on about a 4x spike, that is not the high water mark for this river.
The 2012 U.S. Presidential Election peaked at 15,000 page requests/second -- 10x.
So now the minimum server count is 5, but the next higher multiple of 4 is 8.
See? I’m not crazy.
As the slide says, we do lack the ability to scale up dynamically in real-time, but that is almost a non-issue since we can simply stay over-provisioned indefinitely.
We also avoid having new servers coming online with empty server caches at exactly the moment when sub-second response times are critical.
It does result in a permanent increase in requests to the backend cluster, but it is equally over-provisioned.
Our near-obsession with resilience drove us to this level of scale-out. Now we are doing the expected scaling up as we migrate those millions of page URLs into this architecture.