Challenge
After moving from Rackspace to AWS in 2015, VSCO began building Node.js and Go microservices in addition to running its PHP monolith. The team containerized the microservices using Docker, but "they were all in separate groups of EC2 instances that were dedicated per service," says Melinda Lu, Engineering Manager for the Machine Learning Team. Adds Naveen Gattu, Senior Software Engineer on the Community Team: "That yielded a lot of wasted resources. We started looking for a way to consolidate and be more efficient in the AWS EC2 instances."
Solution
The team began exploring the idea of a scheduling system, and looked at several solutions, including Mesos and Swarm, before deciding to go with Kubernetes. VSCO also uses gRPC and Envoy in its cloud native stack.
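As a rough illustration of what that consolidation looks like in practice, the sketch below registers a single containerized service with Kubernetes using the Go client library (client-go). The service name, container image, replica count, and resource figures are illustrative assumptions, not details from VSCO's actual setup; the point is that declaring CPU and memory requests per container lets the Kubernetes scheduler pack many services onto a shared pool of EC2 nodes instead of dedicating instance groups per service.

```go
// Minimal sketch: create a Kubernetes Deployment for one containerized
// service via client-go. Names, image, and resource values are hypothetical.
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Load kubeconfig from the default location (~/.kube/config);
	// in-cluster config would work the same way from a CI pipeline.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	labels := map[string]string{"app": "media-service"} // hypothetical service name

	deployment := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "media-service"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "media-service",
						Image: "registry.example.com/media-service:1.0.0", // hypothetical image
						// Explicit requests and limits are what let the scheduler
						// bin-pack many services onto shared nodes rather than
						// reserving dedicated instances per service.
						Resources: corev1.ResourceRequirements{
							Requests: corev1.ResourceList{
								corev1.ResourceCPU:    resource.MustParse("250m"),
								corev1.ResourceMemory: resource.MustParse("256Mi"),
							},
							Limits: corev1.ResourceList{
								corev1.ResourceCPU:    resource.MustParse("500m"),
								corev1.ResourceMemory: resource.MustParse("512Mi"),
							},
						},
					}},
				},
			},
		},
	}

	_, err = clientset.AppsV1().Deployments("default").
		Create(context.TODO(), deployment, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```

A declarative manifest applied through a CI pipeline accomplishes the same thing; the client-go form is used here only to keep the example in Go.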
Impact
Before, deployments required "a lot of manual tweaking, in-house scripting that we wrote, and because of our disparate EC2 instances, Operations had to babysit the whole thing from start to finish," says Senior Software Engineer Brendan Ryan. "We didn't really have a story around testing in a methodical way, and using reusable containers or builds in a standardized way."

Onboarding is faster now: the time to first deploy has dropped from two days of hands-on setup to two hours. Moving to continuous integration, containerization, and Kubernetes dramatically increased velocity; the time from code-complete to deployment in production on real infrastructure went from one to two weeks to two to four hours for a typical service. Adds Gattu: "In man hours, that's one person versus a developer and a DevOps individual at the same time."

With the time for a single production deployment cut by 80%, the number of deployments has increased as well, from 1200/year to 3200/year. There have been real dollar savings too: with Kubernetes, VSCO is running at 2x to 20x greater EC2 efficiency, depending on the service, adding up to about 70% overall savings on the company's EC2 bill.

Ryan points to the company's ability to go from managing one large monolithic application to 50+ microservices with "the same size developer team, more or less. And we've only been able to do that because we have increased trust in our tooling and a lot more flexibility, so we don't need to employ a DevOps engineer to tune every service." With Kubernetes, gRPC, and Envoy in place, VSCO has seen an 88% reduction in total minutes of outage time, mainly due to the elimination of JSON-schema errors and service-specific infrastructure provisioning errors, and to faster resolution of the outages that do occur.