The Agile Infrastructure Project Part 1: Configuration Management
Tim Bell, Gavin McCance
CERN IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it
Configuration and Operations Tools
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
https://agileinf.cern.ch/jira/
IT Technical Forum – 27 Jan 2012
Project scope
• The project is reviewing the entire CERN computer-centre management toolset
  – What happens from the bare metal up
  – Asset management, inventory
  – Sysadmin tools and maintenance workflows
  – Service management and configuration tools
  – Dynamic configuration for ‘virtual’ hosts
  – Operations monitoring
  – Workflow automation and continuous deployment
Configuration and Operations Tools
Why?
• The current production system, built around the Quattor toolset, is successfully managing 10k servers
  – (CERN) Quattor + many CERN components
• Why are we changing the toolset?
What are the issues
• Incompressible technical debt
  – The cost to develop and maintain our own solution is not falling and clearly exceeds our resources
  – Small community (less funding) and a general support problem. At CERN, we’ve fallen into the “sticky hands” support model
• We need better automation and integration between the sub-components
  – Lack of automated workflow: everything is a ticket
    • emailScript™: your added value in the process is often your CERN password
  – The 15-min “CDB commit walk”: context switch cost
What are the issues
• Transferable skills and training
  – Learning curve for our tools is steep and remains high
  – It’s easier to hire people who have skills in a widely used tool than in your internal tools
    • Depending on where you look
Job adverts – indeed.com
Index of millions of worldwide job posts across thousands of job sites
[Figure: job adverts mentioning Puppet and Quattor]
These are the sort of posts our departing staff will be applying for.
Integration is hard
• IPv6, virtualisation, Windows Server all need a solution
  – We could leverage lots of open source tools
    • But piecemeal integration of these requires high investment due to our complex system
    • Years of organic growth have made the system way too ‘hairy’
    • It’s often easier to reinvent rather than integrate
  – Lack of ‘dynamic-ness’ in the infrastructure
    • We hack the config system for dynamic VMs
• It’s critical to look at the system as a whole
Where to look?
• Large ops community out there taking the “tool chain” approach, whose scaling needs match ours: O(100k) servers, many apps
• Become standard and join this community
Use Puppet for the core
• The tool space has exploded in the last few years
  – In configuration management and ops
  – Large, shared ‘tool forges’, and lots of experience
• Puppet and Chef are the clear leaders for the ‘core’ tool
  – Other tools in our ‘scope’ try to integrate with those
• Many large-scale enterprises use Puppet
  – Its declarative approach fits better with what we’re used to
  – Large installations: friendly, wide-base community and commercial support and training
  – You can buy books on it
Scaling challenges: nodes
• Currently we have O(10k) physical nodes
• IaaS approach:
  – Moving to virtual machines
  – More (smaller, load-balanced) service nodes
  – VMs for raw compute (batch or pilot jobs)
  – Homogeneous: compute + storage on the same node
• Add another computer centre and 24/48 SMT cores per node, and you get 100k – 300k virtual nodes to be managed
  – At 300k nodes, a 99.6% (1) node update success rate means 1200 manual interventions to “fix it”
(1) in a recent intervention on lxbatch
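The arithmetic behind the 1200 figure can be checked with a short sketch. The function name and the three node counts below are chosen for illustration; only the 99.6% success rate and the 10k, 100k and 300k scales come from the slide.

```python
def manual_interventions(nodes: int, success_rate: float) -> int:
    """Expected number of nodes needing a manual fix after one update round."""
    return round(nodes * (1.0 - success_rate))


# Same success rate, three infrastructure sizes mentioned on the slide.
for nodes in (10_000, 100_000, 300_000):
    print(nodes, manual_interventions(nodes, 0.996))
```

The failure rate stays at 0.4%, but the absolute number of broken nodes grows linearly with the node count, which is why a rate that is tolerable at 10k nodes becomes a staffing problem at 300k.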
Scaling challenges: people
• Many, diverse applications (“clusters”) managed by different teams
• ...and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system
Agile Infrastructure 1st Try
• First started investigating tools in September using ‘part-time’ resources from CF, DB, DSS, GT, OIS and PES
  – Trying iterative “agile-sprint” style (Scrum): short sprints, feedback, sprint review, visible
  – Take first, best-guess at architecture and tool selection, iterate
• Mixed success with this agile style
  – What works: Good visibility and reviews. Daily “scrum” meeting useful. Weekly review meeting open to management.
  – What doesn’t: The “time boxing” part of Scrum sprints is hard with part-time resources
  – The project planning now foresees more dedication of staff
Agile Infrastructure 1st Try
• We’re currently running:
  – OpenStack as cloud software for virtual machines, image management, bulk storage
    • Future IT forum presentation
  – Puppet for the configuration management core
  – …with Foreman as a dashboard
Foreman dashboard
Agile Infrastructure 1st Try
• None of the tools are “perfect” out-of-the-box
  – ...but we’d rather submit patches to a good open source tool than reimplement it
  – We’ve experienced very good community support: RFCs and patches are quickly accepted
  – Very active community: often problems are fixed and missing features implemented before you even report them
Agile Infrastructure 1st Try
• We’re currently running:
  – yum for software distribution (replacing spma)
  – git for template management: why git?
    • Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
    • Many of the tools we can benefit from also assume git
    • We should not be different from the rest of the community
Puppet
• Client/server architecture
  – “puppetmaster”: horizontally scalable Rails application
  – X509 cert authenticated nodes: integrate with CERN CA
Puppet
• Puppet runs on the client, applying the configuration changes
  – It detects the current state and only runs if there’s something to do
• It runs every few minutes
  – New configuration will be ~immediately applied (“fail-fast”)
  – This is a change from CDB where ‘latent’ changes can be stacked up
• Normal mode is client-side compile (“assume success”)
  – No more CDB commit waits
  – Change from CDB: the compilation fails later
• Good monitoring is a pre-req: puppet sends reports back to the puppetmaster
  – The Foreman tool can collect these for you
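The “detect the current state and only act if there’s something to do” behaviour can be illustrated with a minimal Python sketch. This is not Puppet code and does not use any Puppet API: `FileResource`, `apply_resource` and `run_agent` are hypothetical names, and a plain dictionary stands in for the node’s files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FileResource:
    """Desired state for one file: its path and the content it should have."""
    path: str
    content: str


def apply_resource(resource: FileResource, node_state: Dict[str, str]) -> bool:
    """Converge one resource; return True only if a change was actually made."""
    if node_state.get(resource.path) == resource.content:
        return False  # already in the desired state, nothing to do
    node_state[resource.path] = resource.content
    return True


def run_agent(catalog: List[FileResource], node_state: Dict[str, str]) -> List[str]:
    """One agent run: apply every resource and report which paths changed."""
    return [r.path for r in catalog if apply_resource(r, node_state)]


catalog = [FileResource("/etc/motd", "Welcome to a managed node\n")]
node_state: Dict[str, str] = {}  # in-memory stand-in for the node's files
first_report = run_agent(catalog, node_state)   # the file is missing, so it is changed
second_report = run_agent(catalog, node_state)  # nothing left to do
print(first_report, second_report)
```

The list returned by each run plays the role of the report sent back to the server: a run that changes nothing produces an empty report, which is what makes it safe to run the agent every few minutes.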
Puppet language
• Puppet uses its own Ruby-like language for the templates to “assert” the desired state of the nodes
  – With Ruby fall-back for hard stuff (we’ve only needed this once)
• Being declarative rather than procedural, there are quirks
  – Takes a bit of practice to ‘get it’
  – There are books, online docs, online cook-books, and a large community to help
• It dispenses with the need for ncm components
  – All the work is done by puppet on the node itself: you just provide the template part to assert what you want done
  – Less software -> easier to move to new OS versions
Externals
• Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
• Node function + hardware
  – Moving a host between clusters is a DB update
• Your configuration can use variables the node detects itself
  – e.g. reconfigure daemons based on where a newly live-migrated VM has found itself
• Query the compiled configuration of other hosts
  – e.g. Open my firewall to the lxadm nodes
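A rough Python sketch of two of these ideas: cluster membership held as data, and one host’s configuration derived by querying other hosts. It substitutes a simple in-memory dictionary for the external DB and for Puppet’s stored configuration; the host names, the `node_db` layout and both function names are assumptions for illustration, and the addresses come from the 192.0.2.0/24 documentation range.

```python
from typing import Dict, List

# Hypothetical external node database: host name -> cluster and address.
node_db: Dict[str, Dict[str, str]] = {
    "lxadm01": {"cluster": "lxadm", "ip": "192.0.2.10"},
    "lxadm02": {"cluster": "lxadm", "ip": "192.0.2.11"},
    "batch001": {"cluster": "lxbatch", "ip": "192.0.2.50"},
}


def move_host(db: Dict[str, Dict[str, str]], host: str, new_cluster: str) -> None:
    """Moving a host between clusters is a single record update."""
    db[host]["cluster"] = new_cluster


def firewall_sources(db: Dict[str, Dict[str, str]], cluster: str) -> List[str]:
    """Addresses to allow through the firewall: every host in the given cluster."""
    return sorted(rec["ip"] for rec in db.values() if rec["cluster"] == cluster)


before = firewall_sources(node_db, "lxadm")
move_host(node_db, "batch001", "lxadm")
after = firewall_sources(node_db, "lxadm")
print(before, after)
```

Because the firewall list is computed from the database rather than written into a template, moving a host updates every dependent configuration on the next run without anyone editing text.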
Moving towards PaaS
• Parametrisable recipes
  – Just fill in the blanks
• The aim is to make it easy to use “pre-canned” recipes without even touching a Puppet template
  – e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box
  – …with these parameters
• Moving us in the PaaS direction
  – Ultimately, it would be better if you never even needed to log into this node
    • (J2EE public service, IT web hosting service, MySQL service)
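The “fill in the blanks” idea can be sketched in Python as a recipe with a few parameters and sensible defaults. This is an illustration only, not a Puppet class: `DjangoServerRecipe`, its parameters and the resource strings it returns are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DjangoServerRecipe:
    """Pre-canned recipe: the user supplies only the parameters."""
    app_name: str
    server_name: str
    sso_enabled: bool = True   # default: standard SSO-enabled setup
    wsgi_processes: int = 2    # default chosen arbitrarily for the sketch

    def resources(self) -> List[str]:
        """Expand the parameters into the list of things to be managed."""
        items = [
            "package:httpd",
            "package:mod_wsgi",
            f"vhost:{self.server_name}",
            f"wsgi-app:{self.app_name}:processes={self.wsgi_processes}",
        ]
        if self.sso_enabled:
            items.append(f"sso:{self.server_name}")
        return items


recipe = DjangoServerRecipe(app_name="myapp", server_name="myapp.example.org")
print(recipe.resources())
```

The user of the recipe names the application and the host and gets the standard stack; everything else is decided once, by the recipe’s owner.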
Standard workflow
• CDB on lxadm (iterate: n minutes)
  – check out from CDB
  – update templates
  – CDB commit
  – run and check on test node
  – notify with nc-client
  – check on node(s)
• Puppet on lxadm (iterate: 1 minute)
  – check out from git
  – update templates
  – git commit and push
  – run and check on test node
  – notify with mcollective
  – check on foreman
• Puppet-apply on test node (iterate)
  – check out from git on the test node
  – update templates
  – run puppet-apply
  – check on test node
  – git commit and push
  – notify with mcollective
  – check on foreman
Modernising our processes
• Our software processes for the computer centre are fairly limited
  – fire-and-forget broadcasts to project-elfms
• …and rather manual
  – The manual test/ -> preprod/ -> prod/ template dance
  – Our toolset RPMs are ‘built on laptop’ and uploaded to ‘swrep’ by hand
• Add standard CI (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
  – ...then automate the testing
  – e.g. suitably tagged RPMs are automatically deployed to /test nodes
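A minimal Python sketch of tag-driven deployment, under stated assumptions: the `Build` record, the tag names and the `deployable` function are hypothetical and do not reflect the Koji or Jenkins APIs. It only shows the rule that a package reaches a stage when it carries that stage’s tag and its CI run passed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Build:
    """One package build as the build system might record it (hypothetical)."""
    nvr: str               # name-version-release of the RPM
    tags: FrozenSet[str]   # stage tags attached to the build
    ci_passed: bool        # outcome of the CI run for this build


def deployable(builds: Iterable[Build], stage_tag: str) -> List[str]:
    """Builds eligible for a stage: tagged for it and with a passing CI run."""
    return sorted(b.nvr for b in builds if b.ci_passed and stage_tag in b.tags)


builds = [
    Build("tool-a-1.0-1", frozenset({"test"}), ci_passed=True),
    Build("tool-b-2.3-4", frozenset({"test", "preprod"}), ci_passed=True),
    Build("tool-c-0.9-2", frozenset({"test"}), ci_passed=False),  # CI failed
]
print(deployable(builds, "test"))
print(deployable(builds, "preprod"))
```

Nothing reaches a node by hand in this model: promotion from test to preprod to prod becomes a matter of adding a tag, which is an action that can be reviewed and audited.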
Modernising our processes
• We’re working out which of the many puppet / git models suits us
  – code review, sign-off and automated notification for changes that will affect multiple clusters
  – How to automate the test/preprod/prod advancement
• Pre-req is flexible monitoring and alarming
  – you need to trust that an automation failure will be signalled to you
• Script-generated emails are banned
  – Need good monitoring to hang these notifications on
• Integrate components rather than use emailScript™
  – Script-generated tickets (where your value in the process is your password) are banned
Current tool snapshot (liable to change)
[Figure: tool diagram; the surviving labels are listed below]
• Puppet, Foreman, mcollective, yum
• Jenkins, AIMS/PXE, Foreman
• JIRA
• OpenStack Nova
• git, SVN
• Koji, Mock, Yum repo, Pulp
• Hardware database
• Lemon, Puppet stored config DB
Preliminary timelines
Year | What | Actions
2011 | | Agree overall principles
2012 | | Prepare formal project plan Establish IaaS in CERN CC Production Agile Infrastructure Monitoring Implementation as per WG Migrate lxcloud Early adopters to Agile Infrastructure
2013 | LSD 1 New Data Centre | Extend IaaS to remote CC Business Continuity Support Experiment App re-work Migrate CVI General migration to Agile with SLC6 and Windows 8
2014 | LSD 1 (to November) | Phase out Quattor/CDB/…
• Aggressive schedule if we are to make it for new data centre
Initial steps
• Decide on tools now and integrate them together to make a production setup (Q1)
  – We can still change... but we’re starting to commit…
• Looking for early adopters (from Q1)
  – In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
    • e.g. PES/OIS services: batch/VMs, JIRA, Drupal
    • https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012
  – Help with integration / coding
  – Help with ideas
  – Help with building the task list
Summary
• IT has started a new project to move our infrastructure to a new toolset based around industry standard open source components
  – Puppet for the core configuration tool
  – Better integration between components
  – Use of more modern software processes to aid deployment
  – Better monitoring
  – Engage with the community rather than re-implement
• Overall project scope is wider (future IT forums)
  – Cloud and virtualisation, improved monitoring
• Please get involved early and give feedback
• https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
Backup slides
Code ownership model
• The “sticky hands” support model (“you touched it last!”)
• We’re working out an FE-based model where
  – Code is owned by the related service Functional-Element
  – “Ownership” confers the responsibility to maintain a decent “standard config” for the computer centre, and the responsibility to roll out new versions of that code/config
  – Patches from interested people can be offered, and if you take them, you support them
    • not the guy that gave you the patch
mcollective and messaging
• mcollective is a notification framework
  – Mix of CERN’s not.d / wassh
  – It broadcasts instructions to run “pre-canned” tasks to nodes selected by a filter
    • collects the results from the nodes
    • then renders that result for the CLI
    • e.g. restart all my webservers, do a puppet run now
• It requires a messaging framework that all nodes subscribe to (to receive the notification)
  – Typically: ActiveMQ or RabbitMQ
  – Both OpenStack and our (future) monitoring system need a CC-wide messaging system as well
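The filter, run, collect and render pattern described above can be sketched in a few lines of Python. This is a local simulation, not the mcollective API and not a real message bus: `broadcast`, `render`, the node records and the task are all hypothetical.

```python
from typing import Callable, Dict, List

Node = Dict[str, str]


def broadcast(
    nodes: List[Node],
    selector: Callable[[Node], bool],
    task: Callable[[Node], str],
) -> Dict[str, str]:
    """Run a pre-canned task on every node matching the filter; collect results."""
    return {node["name"]: task(node) for node in nodes if selector(node)}


def render(results: Dict[str, str]) -> str:
    """Format the collected results for a command-line display."""
    return "\n".join(f"{name}: {outcome}" for name, outcome in sorted(results.items()))


nodes = [
    {"name": "web01", "role": "webserver"},
    {"name": "web02", "role": "webserver"},
    {"name": "db01", "role": "database"},
]


def restart_service(node: Node) -> str:
    """Stand-in for a pre-canned task; a real task would act on the node."""
    return "restarted"


results = broadcast(nodes, lambda n: n["role"] == "webserver", restart_service)
print(render(results))
```

In the real system the loop over `nodes` is replaced by publishing one message that all subscribed nodes receive, with each node evaluating the filter itself. That is why a messaging framework shared by all nodes is a prerequisite.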