Centralising Operations with ChatOps
At its core ChatOps means building tools that make it easier to operate your infrastructure via a Bot than via the Terminal … by placing tools directly in the middle of the conversation everyone is pairing all the time ~ Jesse Newland
ChatOps is all about conversation-driven development. The idea, put simply, is that team members interacting with each other in a chat room can issue commands that a bot listens to and is configured to execute. These commands can range from deploying code to retrieving logs to provisioning new services. ChatOps, in a very real sense, helps to integrate people, bots and tools together in an automated and transparent way.
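To make the idea concrete, here is a minimal sketch of a bot that listens to a room and runs a command when one is addressed to it. It is not tied to any particular bot framework: the `ChatBot` class, the `examplebot` name and the `deploy` and `logs` commands are all assumptions made for illustration, and the handlers only return text where a real bot would call out to a deployment tool or log store.

```typescript
// A chat message as the bot sees it. Field names are illustrative.
interface ChatMessage {
  user: string;
  text: string;
}

// A command pairs a pattern with a handler that returns the bot's reply.
interface Command {
  pattern: RegExp;
  handle: (match: RegExpMatchArray, msg: ChatMessage) => string;
}

class ChatBot {
  private botName: string;
  private commands: Command[] = [];

  constructor(botName: string) {
    this.botName = botName;
  }

  // Register a command the bot should respond to.
  respond(pattern: RegExp, handle: Command["handle"]): void {
    this.commands.push({ pattern: pattern, handle: handle });
  }

  // Return a reply if the message is addressed to the bot, or null for
  // ordinary conversation that the bot should ignore.
  receive(msg: ChatMessage): string | null {
    const prefix = "@" + this.botName + " ";
    if (msg.text.toLowerCase().indexOf(prefix.toLowerCase()) !== 0) {
      return null;
    }
    const body = msg.text.slice(prefix.length).trim();
    for (let i = 0; i < this.commands.length; i++) {
      const match = body.match(this.commands[i].pattern);
      if (match) {
        return this.commands[i].handle(match, msg);
      }
    }
    return "Sorry " + msg.user + ", I don't know how to '" + body + "'.";
  }
}

// Hypothetical commands; real handlers would call a deploy tool or log store.
const bot = new ChatBot("examplebot");
bot.respond(/^deploy (\S+) to (\S+)$/i, (m, msg) =>
  `${msg.user} asked to deploy ${m[1]} to ${m[2]}`);
bot.respond(/^logs (\S+)$/i, (m) => `fetching recent logs for ${m[1]}`);
```

Because both the request and the reply appear in the room, everyone present sees who asked for what and what happened, which is where the transparency comes from.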
ChatOps automation at Talis
Back in 2014 our Development, Customer Services and Consulting teams were increasingly using HipChat for collaboration. We created rooms for projects as well as business functions, and through the various integrations available at the time we were able to push notifications from the third-party services we rely on directly into the relevant chat rooms. For example:
Through GitHub integration we are notified of code changes being pushed, pull requests being raised and commented on, and branches being merged.
Through Bamboo integration we are notified of passing and failing builds.
Through Zendesk integration we are notified whenever our users raise a support ticket.
Through PagerDuty integration we are alerted to potential issues with our infrastructure or services.
These integrations, along with others, provide a constant flow of information, and HipChat keeps a permanent, searchable record of it. Whilst useful, these integrations are effectively just notifications broadcast to everyone in the chat room; they are not interactive.
In June 2014 we introduced our first chat bot, Zeus, based on the popular Hubot (when we began we joked that we might need an entire pantheon, hence the naming, as you’ll see shortly). Hubot was extremely easy to extend and came with a comprehensive list of plugins that integrated with many of the services we use. It was also very simple to get up and running on Heroku. Now, instead of simply receiving notifications in HipChat, we could query those services interactively and take certain actions. In this way HipChat began to give us a common user interface to these services. Here’s a very small flavour of the functionality we enabled through Zeus for working with Zendesk, PagerDuty and GitHub:
@zeus help
Zeus list new tickets - returns a list of all new tickets
Zeus list open tickets - returns a list of all open tickets
Zeus list pending tickets - returns a list of pending tickets
Zeus pager ack <incident1> <incident2> ... <incidentN> - ack all specified incidents
Zeus pager ack <incident> - ack incident #<incident>
Zeus pager incidents - return the current incidents
Zeus pager maintenance <minutes> <service_id1> <service_id2> ... <service_idN> - schedule a maintenance window for <minutes> for specified services
Zeus pager me <schedule> <minutes> - take the pager for <minutes> minutes
Zeus pager me as <email> - remember your pager email is <email>
Zeus pager my schedule <days> - show my on call shifts for the upcoming <days> in all schedules (default 30 days)
Zeus pager resolve <incident> - resolve incident #<incident>
Zeus pager schedule <schedule> <days> - show <schedule>'s shifts for the next <days> (default 30 days)
Zeus pager schedules <search> - list schedules matching <search>
Zeus pager services - list services
Zeus pager trigger <user> <msg> - create a new incident with <msg> and assign it to <user>
Zeus pending tickets - returns a count of tickets that are pending
Zeus show [me] <repo> pulls -- Show open pulls for Zeus_GITHUB_USER/<repo>, if Zeus_GITHUB_USER is configured
Zeus show [me] <user/repo> pulls [with <regular expression>] -- Shows open pull requests for that project by filtering pull request's title.
Zeus show [me] org-pulls [for <organization>] -- Show open pulls for all repositories of an organization, default is Zeus_GITHUB_ORG
Zeus ticket <ID> - returns information about the specified ticket
Zeus who's on call - return a list of services and who is on call for them
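Several of the commands above, such as `pager ack`, accept either one argument or many. As a rough sketch of how a bot might turn that kind of chat text into structured input before calling a service, the function below parses an ack command into a list of incident IDs. The `parsePagerAck` name and `AckRequest` shape are assumptions made for this example; this is not how the PagerDuty plugin itself is implemented.

```typescript
// Result of parsing a "pager ack" command: the incident IDs to acknowledge.
interface AckRequest {
  incidents: string[];
}

// Parse text such as "pager ack 101 102 103" (bot name already stripped).
// Returns null when the text is not an ack command or names no incidents.
function parsePagerAck(text: string): AckRequest | null {
  const match = text.trim().match(/^pager ack\s+(.+)$/i);
  if (!match) {
    return null;
  }
  // Split on any run of whitespace so extra spaces between IDs are harmless.
  const incidents = match[1].split(/\s+/).filter((id) => id.length > 0);
  return incidents.length > 0 ? { incidents: incidents } : null;
}
```

Handling the single-incident and multi-incident forms with one parser means only one code path has to be checked before anything is sent to the paging service.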
We have been using tools like Puppet and Ansible to automate infrastructure provisioning and deployment. The tooling we have around these processes is something that everyone can use and extend, and it is closely aligned with our DevOps culture. But as we grew, both in terms of our team and our infrastructure, we recognised that it was becoming harder to have visibility of these operations when they were run from developer machines. It was equally painful to ensure that everyone had the correct versions of the required tools. We initially mitigated much of this through the use of Vagrant, but ultimately we wanted these important operations to be easy to perform, repeatable and consistent, with guaranteed monitoring and logging around them. HipChat, once again, seemed like a good fit for providing a common, easy-to-understand user interface to these operations.
At the beginning of 2016 we introduced our second bot, Hercules, also based on Hubot. Whereas Zeus provides an interface to the third-party services we integrate with, Hercules is focussed entirely on infrastructure and services. Over several months we’ve migrated much of our key infrastructure provisioning, as well as our deployment and release automation, to Hercules. Since these tasks are now effectively centralised, it was easy to add logging and alerting and to ensure the entire team has visibility of what is happening or being changed. For example, we now deploy new versions of our Reading Lists application through Hercules.
Whilst Hercules echoes the main events in HipChat, a detailed log is always available on our centralised logs server and accessible through Kibana.
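As a rough illustration of why centralising operations makes logging straightforward (this is not Hercules’s actual implementation), every operation can be run through a single wrapper that records who asked for it and how it ended. The `runAudited` function, the `AuditEntry` fields and the `LogSink` type are assumptions made for this sketch; in practice the sink would forward entries to whatever log pipeline feeds the central server.

```typescript
// One structured log entry per lifecycle event. Field names are illustrative.
interface AuditEntry {
  operation: string;
  requestedBy: string;
  event: "started" | "succeeded" | "failed";
  detail: string;
  at: string; // ISO 8601 timestamp
}

// Stand-in for a shipper that forwards entries to a central log server.
type LogSink = (entry: AuditEntry) => void;

// Run an operation on behalf of a user, recording its start and outcome.
// Errors are logged and then rethrown so the caller can still react.
function runAudited<T>(
  operation: string,
  requestedBy: string,
  sink: LogSink,
  task: () => T
): T {
  const record = (event: AuditEntry["event"], detail: string): void => {
    sink({
      operation: operation,
      requestedBy: requestedBy,
      event: event,
      detail: detail,
      at: new Date().toISOString(),
    });
  };
  record("started", "");
  try {
    const result = task();
    record("succeeded", String(result));
    return result;
  } catch (err) {
    record("failed", err instanceof Error ? err.message : String(err));
    throw err;
  }
}
```

Because the wrapper sits in one place, a new operation gets the same audit trail without its author having to remember to add logging.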
One interesting part of this process is the peer approval we use. Many of the operations that Hercules can perform require peer approval. This is a safety net that ensures someone else in the team knows you’re about to make a change and can be on hand to assist if there are any problems. To implement this we use the Hubot Approval middleware. Integrating it into Hercules was relatively simple: it only requires you to add your own custom middleware that looks up the approver. Below is a simplified version of what we use, which provides an easy-to-use drop-in if you want to experiment with it:
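That snippet is not reproduced in this text. As a stand-in, here is a minimal, framework-free sketch of the general idea: a gate that parks a command until a peer approves it. It is not the Talis code and does not use the Hubot Approval middleware’s API; the `ApprovalGate` class, its method names and the approver lookup function are all assumptions made for illustration. In a Hubot bot, logic like this would typically be wired in through listener middleware.

```typescript
// A command waiting for a peer to approve it.
interface PendingRequest {
  requester: string;
  command: string;
  run: () => string;
}

// Hypothetical lookup: who may approve changes requested by a given user.
type ApproverLookup = (requester: string) => string[];

class ApprovalGate {
  // Object.create(null) avoids clashes with inherited keys like "constructor".
  private pending: { [requester: string]: PendingRequest } = Object.create(null);
  private approversFor: ApproverLookup;

  constructor(approversFor: ApproverLookup) {
    this.approversFor = approversFor;
  }

  // Run the command straight away if it needs no approval; otherwise park
  // it and tell the room who can approve it.
  submit(requester: string, command: string, needsApproval: boolean,
         run: () => string): string {
    if (!needsApproval) {
      return run();
    }
    this.pending[requester] = { requester: requester, command: command, run: run };
    const peers = this.approversFor(requester);
    return `${requester}: '${command}' needs approval from one of: ${peers.join(", ")}`;
  }

  // Called when a peer approves; runs the parked command if the peer is allowed.
  approve(approver: string, requester: string): string {
    const request = this.pending[requester];
    if (!request) {
      return `${approver}: nothing pending for ${requester}`;
    }
    if (approver === requester) {
      return `${approver}: you cannot approve your own request`;
    }
    if (this.approversFor(requester).indexOf(approver) === -1) {
      return `${approver}: you are not an approver for ${requester}`;
    }
    delete this.pending[requester];
    return `approved by ${approver}: ` + request.run();
  }
}

// Hypothetical team: anyone except the requester may approve.
const team = ["alice", "bob", "carol"];
const gate = new ApprovalGate((requester) =>
  team.filter((member) => member !== requester));
```

The self-approval check duplicates what this particular lookup already guarantees, but keeping it in the gate means the safety net still holds if a different lookup is plugged in.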
ChatOps helps to integrate people, bots and tools together in a transparent way. This transparency is an important feature for us because it means that anyone in the team can join a conversation and immediately get “caught up” on everything that has been happening whether they are in the office or working from home. They no longer need to worry about having the correct tools and dependencies installed locally before performing an important operational task whether that’s deploying a new release, or deploying an entirely new stack or service. This has been an important and hugely beneficial tool for us and it’s one that we’ll continue to evolve as we grow and our needs and use cases evolve.
However, it is important to understand that ChatOps isn’t something you can just “install”; it isn’t a product. Its utility, and how it manifests, varies from team to team and organisation to organisation. Often the biggest obstacles you’ll face aren’t technical but cultural, so start small and build upon that.