Ciaran O'Connell, Senior Director Engineering, Houghton Mifflin Harcourt
* How is infrastructure provisioned?
* Do individual teams have control, or is it centralised?
* How are costs managed across the organisation?
2. “Every day, we serve 13,000 districts, 3 million teachers and 50 million students. Education is our work.”
What We Do
• Learning Platform
• Instructional Content
• Comprehensive Assessment and Intervention products
• Professional Services
Our Mission: “Changing people’s lives by
fostering passionate curious learners”
5. “We make Bedrock safe for engineers, not engineers safe for Bedrock”
6. Defining Your Principles
• Control your Infrastructure
• Control your Deployments
• Control your Applications
• Trust your System
• Control your Destiny
7. Behavioural Design
Deliberately create and deliver tools that make the “right” thing the “easy” thing.
When you cannot make the right thing “easy”, you must make the wrong thing the increasingly difficult course of action to take.
8. Infrastructure as Code
“Enabling control, transparency and predictability”
Mesos Cluster
Aurora Scheduler
SERVICES RUNNING IN ISOLATED DOCKER CONTAINERS SHARING KERNEL RESOURCES
AURORA RUNS THESE CONTAINERS AND KEEPS THEM RUNNING FOREVER
CLUSTERING WITH MESOS ALLOWS US TO PROGRAM INSTANCES AS POOLS OF RESOURCES
INFRASTRUCTURE IS PROVISIONED USING TERRAFORM WITH REPOS HOSTED IN ARTIFACTORY
9. Building Trust in the Pipeline
• Test Automation: Automate as much as possible, including non-functional tests. Computers perform repetitive tasks; people solve problems.
• Speed & Stability: Fail fast, fast feedback. Strive for quick and stable execution. Potentially every merge can go to production.
• Monitoring: Every component is monitored. Make your service highly observable.
• Security: Shift left on security. Build static and dynamic security tests into the pipeline. No bottlenecks.
• Results: Test automation reports clearly aid fast debugging. Highly visible to the team.
• Data-Driven Decisions: Implement business and technical metrics on how the service is used. Instrument user behavior for a feedback loop.
We are building a Learning Platform consisting of a sophisticated set of microservices and a front-end React application, supporting targeted group instruction to deliver student outcomes.
Dublin is the engineering hub for HMH. Established in 2007, it has been running for 11 years, and the majority of its 100+ engineers work on building out our flagship Ed Learning Platform.
Extend this – engineering is the enabler to make this happen
We were founded in 1832, over 180 years in business
The road to our future from our current state will transform the business from a publishing company focused on creating great content to a learning company focused on creating great outcomes. Great content will be essential to the Learning Company vision. But just as important will be great insight into student learning and services, and the determination to partner with educators to improve student outcomes.
Engineering was the enabler to make this happen. Just as the company had to make a transformation, we had to do the same in engineering, both architecturally and in the way we do business.
We moved to the cloud via a ‘lift and shift’, but it turned out to look almost exactly like our existing data centers, minus control of the hardware.
We had bottlenecks of handovers: JIRA requests to an Ops team.
We had a fairly standard stack, and we did a lot of the good things you’re supposed to do.
We could reliably build application stacks in AWS, and we had features like auto-scaling and automation of build and deployment, but our capabilities were limited. In particular, we were limited in spinning up new environments for new applications. When application teams wanted to set up a new environment, they had to funnel this through a central group.
This latency or chokepoint began to encourage bad behaviors in our application stacks as well.
From living in a world of handoffs, we've moved to a place where teams control their own destiny.
We spent as much time managing the hardware as we did building new features or solving customer needs.
We recognized we had a problem and that the problem came from growth.
We called this process, or project, Bedrock; we like naming things.
I think it sounds reassuring, a foundational layer to build upon.
It’s important to note that this wasn’t just software; it was about looking at how we deliver software as well.
Neither philosophy nor technology alone can resolve systemic failures.
A revolutionary change in core values and expectations had to be embraced at a grass roots level.
We kicked off a skunkworks project with four to five people
Thought leadership: a DevOps team of four was created with responsibility for building Bedrock, making Bedrock safe for engineers rather than engineers safe for Bedrock.
With so much of the process being defined in code, executed by tools and machines, it became necessary to look at the entire process from infrastructure, deployment, development, build, etc. all as a single system.
CONTROL YOUR INFRASTRUCTURE
-----------------------------------------
New tools allow infrastructure to be treated and managed as code, closely connected to the application.
Entire environments created within minutes, and torn down just as quickly. Most importantly: 100% consistency and reliability in configuration.
CONTROL YOUR DEPLOYMENTS
---------
Now, we define the deployment procedure in code, and it’s 100% repeatable. Rolling back is also repeatable if something does go wrong.
CONTROL YOUR APPLICATIONS
Many patterns have emerged to support this, but the most basic involve separation of major elements, no more tight coupling between data, business logic, etc.
Step 4: Trust the System
Starting with engineer intent, and progressing without human intervention through quality checks directly to production.
The system enforces your rules, practices, tests, etc. It can be faster to fix a bug in place than to roll back and reset. Fix fast vs. never break.
CONTROL YOUR DESTINY
------------------------------------
Empower teams. Teams must commit. Make sure that everyone knows what we mean by quality, and how we want to get there.
Empower engineering teams that can independently advance and deploy their service or application, and that take responsibility for that component across the full lifecycle.
Infrastructure:
Code-defined network configuration, provisioned automatically.
Tools automatically calculate the difference between now and new, then instantiate that difference.
Zookeeper keeps track of “Service Discovery” allowing everything to find each other.
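As a rough illustration of "calculate the difference between now and new", the sketch below compares a current and a desired set of resources and reports what would be created, updated or deleted. It is not how Terraform is implemented; `Plan`, `plan_changes` and the resource names are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Plan:
    """Hypothetical change set: what must change to reach the desired state."""
    create: Dict[str, dict]
    update: Dict[str, dict]
    delete: List[str]


def plan_changes(current: Dict[str, dict], desired: Dict[str, dict]) -> Plan:
    """Compare the current state with the desired state and return the difference."""
    # Resources that are declared but do not exist yet.
    create = {name: spec for name, spec in desired.items() if name not in current}
    # Resources that exist but whose declared spec has changed.
    update = {
        name: spec
        for name, spec in desired.items()
        if name in current and current[name] != spec
    }
    # Resources that exist but are no longer declared.
    delete = sorted(name for name in current if name not in desired)
    return Plan(create=create, update=update, delete=delete)


# Illustrative data only: these resource names are assumptions.
current_state = {"web": {"instances": 2}, "old-db": {"size": "small"}}
desired_state = {"web": {"instances": 3}, "cache": {"size": "small"}}
plan = plan_changes(current_state, desired_state)
```

Only the difference is acted on, so applying the same desired state twice yields an empty plan the second time.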
Modern Application Tooling:
Clustering with Mesos allows us to program the individual instances as a single pool of resources.
Resources requested as quota, shared across any number of applications.
Need more, add more, high efficiency.
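The pooling idea above can be sketched in a few lines. This is a toy model, not the Mesos or Aurora API: `ResourcePool` and its methods are hypothetical, and only CPU is tracked to keep it short.

```python
from typing import Dict


class ResourcePool:
    """Toy model of a cluster treated as one pool of CPU (hypothetical, not Mesos)."""

    def __init__(self) -> None:
        self.total_cpus = 0.0
        self.allocated: Dict[str, float] = {}

    def add_instance(self, cpus: float) -> None:
        """'Need more, add more': a new instance simply grows the pool."""
        self.total_cpus += cpus

    def available(self) -> float:
        """CPU not yet granted to any application."""
        return self.total_cpus - sum(self.allocated.values())

    def request(self, app: str, cpus: float) -> bool:
        """Grant quota from the shared pool if enough capacity remains."""
        if cpus > self.available():
            return False
        self.allocated[app] = self.allocated.get(app, 0.0) + cpus
        return True


# Two 4-CPU instances form an 8-CPU pool shared by any number of applications.
pool = ResourcePool()
pool.add_instance(4)
pool.add_instance(4)
first = pool.request("identity", 6)   # fits, even though no single instance has 6 CPUs
second = pool.request("reports", 4)   # rejected: only 2 CPUs remain
pool.add_instance(4)                  # add capacity to the pool
third = pool.request("reports", 4)    # now fits
```

The applications ask for quota rather than for a particular machine, which is what lets the pool be grown without changing the applications.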
Modern Application:
Abstract the hardware from the application.
Docker Containers become the standardized unit for Software Development.
Applications running in a container share low level kernel resources, start almost instantly, and are much more efficient.
Each container, while sharing some cluster resources, is logically isolated from others.
Metrics are scraped by Prometheus into InfluxDB and Grafana for visualisation: data such as CPU, memory, response latency and count of total responses vs. error responses.
Monitoring is via Elasticsearch/Kibana and the Aurora console. Runscope is configured to check the key endpoints with assertions for response status and response time as a minimum.
Tracing uses Zipkin.
Security: Checkmarx is being looked into. Utilize Artifactory technology for dynamic audits and whitelists (using, for example, https://jfrog.com/xray).
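The minimum endpoint check described above (an assertion on response status and one on response time) might look like the sketch below. It is not Runscope configuration; `CheckResult`, `evaluate_check` and the 500 ms threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    """One recorded response from a key endpoint (hypothetical shape)."""
    status: int
    elapsed_ms: float


def evaluate_check(
    result: CheckResult, expected_status: int = 200, max_ms: float = 500.0
) -> List[str]:
    """Return a list of failed assertions; an empty list means the check passed."""
    failures: List[str] = []
    if result.status != expected_status:
        failures.append(f"status {result.status} != {expected_status}")
    if result.elapsed_ms > max_ms:
        failures.append(f"response time {result.elapsed_ms}ms > {max_ms}ms")
    return failures


# Illustrative values only.
healthy = evaluate_check(CheckResult(status=200, elapsed_ms=120.0))
slow_and_broken = evaluate_check(CheckResult(status=503, elapsed_ms=900.0))
```

Returning every failed assertion, rather than stopping at the first, keeps the report useful when an endpoint is both slow and erroring.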
Test Automation:
UI: Protractor for Angular, WebdriverIO for React. Zalenium for a scalable, dockerised Selenium Grid. Galen for UI presentation.
BrowserStack for OS/browser/device testing.
API: SuperTest, Gatling.
Performance testing: Gatling plus a customized solution in Bedrock for big load generation.
Code quality: SonarQube, linting, etc.
Quality is everyone’s responsibility, not just the Quality Engineer. Close collaboration between dev and test.
Controlling Destiny: no external commitments; only the team can commit for itself.
Step 5: Control Your Destiny
Set expectations that the people who build the application are responsible for it from design to operations.
Run what you write.
Management outside the team can set goals and vision, but only the team can commit for itself. No handoff or throw-it-over-the-wall expectations can survive.
Accountability
Set expectations
Celebrate the successes and the failures
We empower engineering teams that can independently advance and deploy their service or application, and that take responsibility for that component across the full lifecycle.
An example of this would be our Identity team who take care of ingesting student, teacher and class data from thousands of educational institutions.
This work is highly pressured; it all revolves around the back-to-school period.
This team had a very bad experience in 2015 due to some of the infrastructure and application issues I spoke about earlier, but also due to product roadmap issues. We had made the fatal mistake of defining arbitrary feature targets a year in advance and then relentlessly trying to implement those targets before inevitably falling short.
In 2016 we took a different approach: we asked the team to look carefully at the market and their technological choices and to define targets for themselves. We gave them the room to define a feature set that they felt was both feasible and would actually make a meaningful difference to the customer experience.
Empowerment and Engagement: with customers, other departments, each other. Sharing of ideas etc
Accountability of engineers
Set Expectations: the higher the better
While it is a cliché, there is much truth in the idea that it is our failures that teach us.
Evolve ideas daily.
Effectively a few people would get the hardware and “set up the environment” by manually configuring the machines and network the same way you would set up a new laptop.
Most infrastructure, cloud-based or otherwise, is managed using low-complexity practices.
For years, people tried to take these complex applications and run them on top of changing infrastructure, despite unknown dependencies between the two.
Now, we define the deployment procedure in code, and it’s 100% repeatable. Rolling back is also repeatable if something does go wrong.
Content is treated as content, not as a stand-alone application.
As these segmentations became popular, each area developed a specialized and optimized tool set.
Heterogeneous technology stacks became possible, and in fact desirable. Right tool for the job.