Centralising Operations with ChatOps

09 September 2016

Written by

Nadeem Shabir
Automation Lead

At its core ChatOps means building tools that make it easier to operate your infrastructure via a Bot than via the Terminal … by placing tools directly in the middle of the conversation everyone is pairing all the time ~ Jesse Newland

ChatOps is all about conversation-driven development. The idea, put simply, is that team members interacting with each other in a chat room can issue commands that a bot listens to and is configured to execute. These commands can range from deploying code to retrieving logs to provisioning new services. ChatOps, in a very real sense, helps to integrate people, bots and tools together in an automated and transparent way.

ChatOps automation at Talis

Back in 2014 our Development, Customer Services and Consulting teams were increasingly using HipChat for collaboration. We created rooms for projects as well as business functions, and through the various integrations available at the time we were able to push notifications from the third-party services we rely on directly into the relevant chat rooms. For example:

Through GitHub integration we are notified when code changes are pushed, when pull requests are raised and commented on, and when branches are merged:

[Image: HipChat GitHub Integration]

Through Bamboo integration we are notified of passing / failing builds:

[Image: HipChat Bamboo Integration]

Through Zendesk integration we are notified whenever our users raise a support ticket:

[Image: HipChat Zendesk Integration]

Through PagerDuty integration we are alerted to potential issues with our infrastructure or services:

[Image: HipChat PagerDuty Integration]

These integrations, along with others, provide a constant flow of information, and HipChat keeps a permanent, searchable record of it. Whilst useful, they are effectively just notifications broadcast to everyone in the chat room; they are not interactive.

In June 2014 we introduced our first chat bot, Zeus, based on the popular Hubot (when we began we joked that we might need an entire pantheon, hence the naming, as you’ll see shortly). Hubot was extremely easy to extend and came with a comprehensive list of plugins that integrated with many of the services we use. It was also very simple to get up and running on Heroku. Now, instead of simply receiving notifications in HipChat, we could query those services interactively and take certain actions, and so HipChat began to serve as a common user interface to these services. Here’s a small flavour of the functionality we enabled through Zeus for working with Zendesk, PagerDuty and GitHub:

@zeus help

Zeus list new tickets - returns a list of all new tickets
Zeus list open tickets - returns a list of all open tickets
Zeus list pending tickets - returns a list of pending tickets
Zeus pager ack <incident1> <incident2> ... <incidentN> - ack all specified incidents
Zeus pager ack <incident> - ack incident #<incident>
Zeus pager incidents - return the current incidents
Zeus pager maintenance <minutes> <service_id1> <service_id2> ... <service_idN> - schedule a maintenance window for <minutes> for specified services
Zeus pager me <schedule> <minutes> - take the pager for <minutes> minutes
Zeus pager me as <email> - remember your pager email is <email>
Zeus pager my schedule <days> - show my on call shifts for the upcoming <days> in all schedules (default 30 days)
Zeus pager resolve <incident> - resolve incident #<incident>
Zeus pager schedule <schedule> <days> - show <schedule>'s shifts for the next <days> (default 30 days)
Zeus pager schedules <search> - list schedules matching <search>
Zeus pager services - list services
Zeus pager trigger <user> <msg> - create a new incident with <msg> and assign it to <user>
Zeus pending tickets - returns a count of tickets that are pending
Zeus show [me] <repo> pulls -- Show open pulls for Zeus_GITHUB_USER/<repo>, if Zeus_GITHUB_USER is configured
Zeus show [me] <user/repo> pulls [with <regular expression>] -- Shows open pull requests for that project by filtering pull request's title.
Zeus show [me] org-pulls [for <organization>] -- Show open pulls for all repositories of an organization, default is Zeus_GITHUB_ORG
Zeus ticket <ID> - returns information about the specified ticket
Zeus who's on call - return a list of services and who is on call for them
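
For readers who have not written a Hubot script, a command such as `pending tickets` can be wired up roughly as follows. This is a minimal sketch and not the actual Zeus code: `robot.respond` is Hubot's standard listener for messages addressed to the bot, while `sampleTickets` and `countByStatus` are hypothetical stand-ins for a real Zendesk lookup.

```coffeescript
# Description
#   Illustrative sketch of a Hubot command; not the actual Zeus script.
#
# Commands:
#   hubot pending tickets - returns a count of tickets that are pending

# Hypothetical in-memory data standing in for a call to the Zendesk API.
sampleTickets = [
  {id: 101, status: 'pending'}
  {id: 102, status: 'open'}
  {id: 103, status: 'pending'}
]

# Hypothetical helper: counts the tickets that have the given status.
countByStatus = (tickets, status) ->
  (ticket for ticket in tickets when ticket.status is status).length

module.exports = (robot) ->
  # robot.respond only fires when the message is addressed to the bot,
  # for example "@zeus pending tickets".
  robot.respond /pending tickets$/i, (res) ->
    count = countByStatus sampleTickets, 'pending'
    res.send "#{count} pending tickets"

# Exported so the helper can be exercised without a running bot.
module.exports.countByStatus = countByStatus
```

Keeping the lookup in a plain function, separate from the listener, makes it possible to check the logic without starting a bot or connecting to a chat room.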

We have been using tools like Puppet and Ansible to automate the provisioning of infrastructure and deployment. The tooling we have around these processes is something that everyone can use and extend, and it is closely aligned with our DevOps culture. But as we grew, both in terms of our team and our infrastructure, we recognised that it was becoming harder to have visibility of these operations when they were run from developer machines. It was equally painful to ensure that everyone had the correct versions of the required tools. We initially mitigated much of this through the use of Vagrant, but ultimately we wanted these important operations to be easy to perform, repeatable and consistent, with guaranteed monitoring and logging around them. HipChat, once again, seemed like a good fit for providing a common, easy-to-understand user interface to these operations.

At the beginning of 2016 we introduced our second bot, Hercules, also based on Hubot. Whereas Zeus provides an interface to the third-party services we integrate with, Hercules is focussed entirely on our infrastructure and services. Over several months we’ve migrated much of our key infrastructure provisioning, as well as our deployment and release automation, to Hercules. Because these tasks are now effectively centralised, it was easy to add logging and alerting and to ensure the entire team has visibility of what is happening or being changed. For example, here’s how we deploy a new version of our Reading Lists application:

[Image: Chat Log]

Whilst Hercules echoes the main events in HipChat, a detailed log is always available on our centralised logs server and accessible through Kibana:

[Image: Kibana Log]
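
One way to get this kind of central visibility in a Hubot-based bot is a listener middleware that records every matched command before it runs. The sketch below illustrates the idea and is not how Hercules actually logs: `buildAuditRecord` and the record's field names are hypothetical, and it assumes that something outside the bot ships the log lines to the central log server.

```coffeescript
# Description
#   Illustrative sketch: log every matched command as a JSON line.

# Hypothetical helper: builds the record that is logged for a command.
buildAuditRecord = (message, now = new Date()) ->
  timestamp: now.toISOString()
  user: message.user.name
  room: message.room
  command: message.text

module.exports = (robot) ->
  # Listener middleware runs for every message that matches a listener,
  # so all commands pass through one place before they execute.
  robot.listenerMiddleware (context, next, done) ->
    record = buildAuditRecord context.response.message
    # Assumption: these lines are collected and forwarded to the
    # central log server by something outside the bot.
    robot.logger.info JSON.stringify(record)
    next(done)

# Exported so the helper can be exercised without a running bot.
module.exports.buildAuditRecord = buildAuditRecord
```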

One interesting part of this process is peer approval, which many of the operations that Hercules can perform require. This is a safety net that ensures someone else in the team knows you’re about to make a change and can be on hand to assist if there are any problems. To implement this we use the Hubot Approval middleware. Integrating it into Hercules was relatively simple: you only need to add your own custom middleware that tells it which groups each user belongs to. Below is a simplified version of what we use, which you can use as a drop-in if you want to experiment with it:

# Description
#   Middleware that adds a group function to user object
#   This also prepopulates a structure (userGroups) that assigns every user
#   to a set of groups. hubot-approval works based on peer or specified group
#   approval so we need to set this data up.
#
#   The version property is used to ensure that we aren't setting the
#   structure on every request, and also gives us a way to reload that data
#   without restarting the bot
#
# Author:
#  Nadeem Shabir <ns@talis.com>

module.exports = (robot) ->
  # Lookup for all approvers
  userGroups = {
    '1' : {name: "Joe Bloggs",  groups:["approvers"]}
    '2' : {name: "Jane Smith",  groups:["admin", "approvers"]}
  }
  version = 1

  robot.listenerMiddleware (context, next, done) ->
    # allows us to repopulate the brain with changes to the above
    # list using the reload command without restarting the bot
    if robot.brain.get('userGroupsVersion') != version
      robot.brain.set 'userGroups', userGroups
      robot.brain.set 'userGroupsVersion', version

    # The purpose of this middleware is to add a function called
    # groups to the user object. This function returns the list
    # of groups that a user belongs to, which the hubot-approval
    # middleware then uses.
    context.response.message.user.groups = (cb) ->
      groups = robot.brain.get('userGroups')
      if groups.hasOwnProperty(context.response.message.user.id)
        cb(groups[context.response.message.user.id]['groups'])
      else
        cb([])
    next()
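
To make the idea of peer approval concrete, the sketch below applies a simple rule to data in the same shape as `userGroups` above. It is an illustration only and not hubot-approval's implementation: `groupsFor`, `canApprove`, the rule itself and the extra user "Sam Jones" are all assumptions made for the example.

```coffeescript
# Hypothetical data in the same shape as userGroups above.
userGroups =
  '1': {name: "Joe Bloggs", groups: ["approvers"]}
  '2': {name: "Jane Smith", groups: ["admin", "approvers"]}
  '3': {name: "Sam Jones", groups: []}

# Returns the groups for a user id, or an empty list for unknown users.
groupsFor = (userId) ->
  if userGroups.hasOwnProperty(userId) then userGroups[userId].groups else []

# Illustrative peer-approval rule (an assumption, not hubot-approval's code):
# the approver must be someone other than the requester and must belong
# to the required group.
canApprove = (requesterId, approverId, requiredGroup = 'approvers') ->
  approverId isnt requesterId and requiredGroup in groupsFor(approverId)

console.log canApprove('1', '2')   # true: Jane is a peer and an approver
console.log canApprove('1', '1')   # false: nobody approves their own request
console.log canApprove('1', '3')   # false: Sam is not in the approvers group
console.log canApprove('1', '9')   # false: an unknown user has no groups
```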

Final Thoughts

ChatOps helps to integrate people, bots and tools together in a transparent way. This transparency is an important feature for us because it means that anyone in the team can join a conversation and immediately get “caught up” on everything that has been happening, whether they are in the office or working from home. They no longer need to worry about having the correct tools and dependencies installed locally before performing an important operational task, whether that’s deploying a new release or deploying an entirely new stack or service. This has been an important and hugely beneficial tool for us, and it’s one that we’ll continue to evolve as we grow and as our needs and use cases change.

However, it is important to understand that ChatOps isn’t something you can just “install”; it isn’t a product. Its utility and how it manifests vary from team to team and organisation to organisation, and often the biggest obstacles you’ll face aren’t technical but cultural, so start small and build upon that.
