2. What we do
Train companies
Small business
Mobile Apps
Consumer Website
Services
3. Some vitals
• ~40 Environments
• over 1000 servers
• over 100 products
• Windows/.NET
• New Relic .NET agent / Server Monitor
• Automation is key!
4. Before New Relic
• Application errors logged to disk
• Production support team look at logs
– After production issue identified from customer
reports
– After platform release to check change in patterns
• Ad-hoc and reactive
• Errors difficult to reproduce as usually
hours/days after the event and out of context
5. Introducing New Relic at thetrainline
• Zero capital outlay, subscription model, up and
running in an hour
• Identified a product: leisure website
• Continuous delivery pipeline with blue/green
deployments to all environments
• Needed solution for continuous monitoring
6. Introducing New Relic at thetrainline
• New Relic agent / server monitor part of
webserver recipe
• Deployed with high security enabled
• Out of the box
– Near-real time error logging / alerting
– Application / end-user performance
– Deployment markers
– User funnels
7.
8. Immediate value
• Error rate as a team key performance
indicator
• Drive down error rate through weekly health
checks
• Remediate top three errors by adding directly
to dev team backlog
• Stack traces visible and actionable by
developers without further analysis
10. Taking it further
• Roll out New Relic across all machines in all
environments
– New machines created by Chef automation install
New Relic by default
– Else use SCCM to manage installation
Application/server monitoring built in and
zero effort for dev teams
11. Taking it further
Custom attributes
• Mimic high security mode in newrelic.config
– Create and deploy Chocolatey package through Chef /
SCCM
• Observations:
– New Relic .NET agent doesn’t check in to verify
highSecurity setting matches once it has started
<highSecurity enabled="true" />
12. More value…
• Use custom attributes to augment Transaction
and PageView events with more information
to form other business metrics.
• Phoenix’s real-time payments dashboard
– Spread of payment methods
– Effect of payment outages
13.
14. Users of New Relic at thetrainline
• Monitoring/Production Support for near
real time running health of system
• Product owners home in and use funnels to
prioritise product spend
• Developers get rapid feedback on new
features
• Management get a holistic view of the
system through the map feature
15. What we’d like to see
• Javascript errors in Insights
• Better Javascript stack traces
• Per application retention period in Insights
• .NET async support
16. What’s next
• More custom attributes!
• Develop and run Node web apps in
production
– use New Relic node.js agent
– different deployment model, bundle agent/config
with the app
• Monitoring RabbitMQ instances
Editor's Notes
Well, firstly I’d like to welcome you all to thetrainline offices.
I’m Paul Kiddie, a web developer on the Tango team at thetrainline. This evening we’ll be talking about how we’ve used New Relic to improve customer happiness, and how we use the insights it gives us to home in on the things that matter.
You can get me on pkiddie, or the trainline engineering twitter account at ttl underscore engineering.
So, to set the scene a little, these are our core business areas. Our development teams are aligned to each of these business areas.
We work with several train companies to provide the booking engine for them. We have an apps team responsible for the iOS and Android mobile apps. We also provide personalised booking engines for business, and underpinning all these are a set of core platform services.
I work on the consumer website [CLICK], as part of the Tango team.
Just to give you an idea of the size of our infrastructure, these are some vitals.
We have around 40 environments, including test environments, totalling over 1000 servers, which between them run approximately 100 distinct deployable applications and services.
Our stack is primarily .NET on Windows, so we’ve been using the .NET agent and server monitor for Windows.
With a sizable estate we love automation for testability and repeatability.
So, the state of the world before New Relic was a dark place. Our applications logged errors to disk, but unless a production issue was identified (most likely by customers spotting it) or we were checking after a platform release, these logs were left on disk and provided little or no value.
This was very reactive. Analysis happened hours or days after the event, usually out of context, which made errors difficult for developers to reproduce and investigate.
Picking New Relic was an easy choice for us, since there was zero capital outlay to get up and running, which we managed within an hour in our test environments.
We introduced New Relic by taking one product, the main website at www.thetrainline.com, ripping it out of the legacy platform release cycle and building a continuous delivery pipeline with blue/green deployments. This required a solution for continuous monitoring, especially during switches to new builds of the website, so that we could spot problems before the customer does by freeing errors from the logs on disk.
Through infrastructure automation, the New Relic .NET agent and the server monitor were made part of a default application server build. We deployed with high security on to provide some guarantees that no sensitive data would be sent to New Relic, and we’ve been enjoying the benefits since:
Alerting for Javascript/app errors for immediate rollback or fix then re-deploy
Performance metrics around business logic and end user speed.
Deployment markers (which appear as vertical lines on most of New Relic’s graphs)
Funnels
Like this one, which lets us home in on the parts of the booking flow we should be paying most attention to.
Note that the Insights query actually uses our own session identifier, passed in as a custom attribute. This is so we can guarantee we can reason over a user’s journey, since the New Relic session might be blocked by default browser policies, for example on iOS devices, where you’d get a new session per page view.
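As a rough illustration of that kind of query, the sketch below builds an NRQL funnel keyed on a custom session attribute rather than New Relic’s own session. The attribute name `ttlSessionId`, the step labels and the URL patterns are all hypothetical, since the real query isn’t reproduced in these notes.

```python
def funnel_query(session_attr, steps, since="1 week ago"):
    """Build an NRQL funnel query keyed on a custom session attribute.

    steps is a list of (label, url_pattern) pairs in booking-flow order.
    """
    clauses = ", ".join(
        f"WHERE pageUrl LIKE '{pattern}' AS '{label}'" for label, pattern in steps
    )
    return f"SELECT funnel({session_attr}, {clauses}) FROM PageView SINCE {since}"


# Hypothetical attribute name and URL patterns, for illustration only.
query = funnel_query(
    "ttlSessionId",
    [("Search", "%/search%"), ("Basket", "%/basket%"), ("Payment", "%/payment%")],
)
print(query)
```

Because the funnel counts unique values of the chosen attribute, swapping in our own identifier is what keeps a journey together when the browser hands out a new New Relic session on every page view.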
So now we’ve got these metrics, we now use some of them as key performance indicators for the team. For example, we have a target error rate to achieve, set per quarter.
We have weekly health checks where the New Relic data takes centre stage. We’ve hooked into the API to get end-of-week error rates over the last six months (to give values like the email gives) and plotted them.
We then take the top three errors and add them to our backlog. These backlog items contain a link back to New Relic with a stack trace [most of the time], so developers are able to FOCUS on the fix rather than doing manual, repetitive analysis, and they get a head start on investigating.
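The arithmetic behind the health check is simple enough to sketch. The example below assumes the week’s request count and per-error counts have already been pulled from the API into a plain dictionary; the shape of that dictionary and all of the figures are made up for illustration.

```python
from collections import Counter


def error_rate(error_count, request_count):
    """Return the error rate as a percentage of requests."""
    if request_count == 0:
        return 0.0
    return 100.0 * error_count / request_count


def top_errors(error_counts, n=3):
    """Return the n most frequent error classes with their counts."""
    return Counter(error_counts).most_common(n)


# Made-up weekly figures, for illustration only.
week = {
    "requests": 250_000,
    "errors": {
        "NullReferenceException": 420,
        "TimeoutException": 310,
        "HttpException": 150,
        "FormatException": 90,
        "KeyNotFoundException": 30,
    },
}

rate = error_rate(sum(week["errors"].values()), week["requests"])
backlog = top_errors(week["errors"])
print(f"error rate: {rate:.2f}%")
for name, count in backlog:
    print(f"backlog item: {name} ({count} occurrences)")
```

The three entries in `backlog` are the ones that would become backlog items for the week.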
And this is the result of our work. Unfortunately, the graph doesn’t go back any further than this otherwise you’d be seeing a headline error rate of 0.5%...
We’ve been happily using New Relic in the Tango team for months and months so the next step was to roll it out across the entire estate.
We’re big fans of Chocolatey for managing package installs in Windows, so our Chef recipes that provision our servers install our New Relic Chocolatey package by default. For legacy boxes (which we’re in the process of replacing with automated provisioning) we’re using Microsoft’s System Center to manage its install.
Either way, dev teams now get the benefit of application and server monitoring with zero effort. We’re poking in a custom application name for each of our products at deployment time so we don’t have applications reporting in as “My Application”. Now that we have it rolled out, it also means we can begin to take advantage of the map feature and view our web and service tiers.
To get more out of New Relic we’ve been taking advantage of custom attributes. We were running in high security mode across the entire estate and we needed to find a controlled way to disable it whilst eliminating any monitoring outage.
But we still wanted the assurances that high security offers, so we mimic what high security mode does within the newrelic.config (as we’re using the .NET agent). This config is part of our Chocolatey package and is rolled out through our Chef recipes during automated provisioning, or SCCM otherwise. We then promoted the changes environment by environment.
One observation we made was that running .NET agents don’t currently verify high security once they’ve started; they don’t check in periodically. This allowed us to decouple the disabling of high security at New Relic’s end from the high security setting in the config, since these changes required an iisreset.
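One way to keep those assurances honest is to check each packaged newrelic.config before it is promoted. The sketch below is only an illustration: the rule list is an assumption about which settings stand in for high security mode, and the element and attribute names should be confirmed against the .NET agent’s configuration documentation before relying on them.

```python
import xml.etree.ElementTree as ET

NS = {"nr": "urn:newrelic-config"}

# (element path, attribute, required value). These names are assumptions
# about newrelic.config, not a definitive list of what high security does.
RULES = [
    ("nr:service", "ssl", "true"),
    ("nr:requestParameters", "enabled", "false"),
    ("nr:transactionTracer", "recordSql", "obfuscated"),
]


def violations(config_xml, rules=RULES):
    """Return a description of every rule the config does not satisfy."""
    root = ET.fromstring(config_xml)
    problems = []
    for path, attr, required in rules:
        element = root.find(path, NS)
        actual = element.get(attr) if element is not None else None
        if actual != required:
            problems.append(f"{path}@{attr}: expected {required!r}, found {actual!r}")
    return problems


# Sample config with one deliberately unsafe setting.
SAMPLE = """<configuration xmlns="urn:newrelic-config">
  <service licenseKey="REPLACE_ME" ssl="true" />
  <requestParameters enabled="true" />
  <transactionTracer enabled="true" recordSql="obfuscated" />
</configuration>"""

print(violations(SAMPLE))
```

A check like this could run as part of building the Chocolatey package, so that a config which drifts from the intended settings never reaches an environment.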
By using custom attributes we are able to augment the existing events in Insights (PageViews and Transactions) with extra information.
This information can form the basis of business metrics. One good example of this is Phoenix’s real-time payments dashboard. The Phoenix team are responsible for delivering a lot of the platform services the website relies on.
This shows us the revenue we are delivering and other, more insightful information around the use of different payment methods, information that wasn’t available before, at least not in near-real time.
These headline figures provide glanceable information, and allow us to ask questions like: if we improve error rates across the tiers, how do we affect our transaction rates?
A similar dashboard to this allows us to assess the effect of a payment outage on the business.
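To make the idea concrete, here is a minimal sketch of the kind of breakdown such a dashboard shows: the share of each payment method across events that carry a custom attribute. The attribute name `paymentMethod`, the method names and the events themselves are hypothetical, and in practice the grouping would be done by an Insights query rather than in application code.

```python
from collections import Counter


def payment_method_share(events, attribute="paymentMethod"):
    """Return each payment method's share of events, as a percentage."""
    counts = Counter(e[attribute] for e in events if attribute in e)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {method: 100.0 * n / total for method, n in counts.items()}


# Made-up Transaction events carrying a hypothetical custom attribute.
events = [
    {"name": "Payment", "paymentMethod": "card"},
    {"name": "Payment", "paymentMethod": "card"},
    {"name": "Payment", "paymentMethod": "paypal"},
    {"name": "Search"},  # no payment attribute, so it is ignored
    {"name": "Payment", "paymentMethod": "card"},
]

share = payment_method_share(events)
print(share)
```

Comparing the same breakdown before and during an incident is one way to see the effect of a payment outage.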
We have a breadth of users of New Relic at thetrainline.
Monitoring and production support use it to monitor the running health of the system. They’ve created Insights dashboards reporting page by page performance, to detect performance issues quickly, and drill down further.
Product owners can use the funnels view to determine which parts of the booking flow need attention and prioritise where development effort should go.
Developers get rapid feedback on new features
Management get a holistic view of thetrainline’s systems through custom Insights dashboards and using the map feature.
So just some of what we’d like to see and start doing is:
Reporting Javascript errors in Insights. We’ve seen some progress there as one of the latest agent updates gave us app errors in there, which means we can start using Insights rather than the API to plot trends over time.
Our Javascript is minified but we do offer source maps – we’d love New Relic to be able to use these to provide more context on Javascript errors.
Several of our apps are more critical than others – so tuning our data retention in Insights per application would be great.
Most importantly, a lot of our code is moving to use async/await in .NET, and whilst there are workarounds, New Relic doesn’t support it natively.
So, what’s next for us at thetrainline?
Teams are just getting started with what custom attributes can offer.
As a company, we’re going more polyglot, and in the web team we’ll be running Node apps in production and taking full advantage of the New Relic Node.js agent. This has a different deployment model to the .NET agent (which is a server-level install). Instead, the Node.js agent is installed per application, so it will be a challenge to assure that the high security settings are right in each app’s config.
Many of our services are moving to use RabbitMQ, which means we need to understand at a glance the state of the system. We’re hoping New Relic can help here too!
Thanks for listening!