This is a service for managing Firefox release operations (RelOps) hardware. It is a rewrite of build-slaveapi, based on tecken, to help migrate from Buildbot to Taskcluster.
The service consists of a Django Rest Framework web API, a Redis-backed Celery queue, and one or more Celery workers. It should be run behind a VPN.
+-----------------------------------------------------------------------------+
| VPN |
| |
+------------+ | +--------------+ +----------------+ +-----------+ +--------+ |
| | | | Roller | | Roller | | Roller +-----> | |
| TC Dash. +-------> API +-----> Queue +-----> Workers | | HW 1 | |
| | | | | | | | <-----+ | |
| <-------+ <-----+ <-----+ | +--------+ |
| | | | | | | | | |
+------------+ | +----+---+-----+ +----------------+ | | +--------+ |
| | +-----> | |
| | | | HW 2 | |
| | <-----+ | |
| | | +--------+ |
| | | |
| | | +--------+ |
| | +-----> | |
| | | | HW 3 | |
| | <-----+ | |
| +-----------+ +--------+ |
| |
| |
+-----------------------------------------------------------------------------+
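The Celery side of this layout is a standard Django + Redis setup. Below is a minimal sketch of that wiring, not the project's actual module; the broker URL and module path are assumptions for illustration (the settings module name matches the one shown in the test output later in this README).

```python
# celery_sketch.py -- minimal Redis-backed Celery app for a Django project.
# Module and env var names here are illustrative assumptions.
import os

from celery import Celery

# Point Celery at the Django settings module before loading configuration.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "relops_hardware_controller.settings")

app = Celery("relops_hardware_controller")

# Read CELERY_*-prefixed settings from Django settings, e.g.
# CELERY_BROKER_URL = "redis://redis:6379/0" for the linked redis container.
app.config_from_object("django.conf:settings", namespace="CELERY")

# Find tasks.py modules in installed apps so workers can run queued jobs.
app.autodiscover_tasks()
```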
After a Roller admin registers an action with Taskcluster, a sheriff or RelOps operator on a worker page of the Taskcluster dashboard can use the actions dropdown to trigger an action (ping, reboot, reimage, etc.) on a RelOps-managed machine.
Under the hood, the Taskcluster dashboard makes a CORS request to the Roller API, which checks the Taskcluster authorization header and scopes, then queues a Celery task for the Roller worker to run. (There is an open issue for sending notifications back to the user.)
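As a rough illustration of that flow, the sketch below authorizes a request and then queues a task. The function and task names are hypothetical, not the project's actual code; only the general shape (authorize, consult the allowed task list, `.delay()` a Celery task, return the `AsyncResult` id) is taken from the description above.

```python
# views_sketch.py -- hypothetical sketch of the authorize-then-queue flow;
# function and task names are illustrative, not the project's actual code.
from celery import shared_task
from django.http import JsonResponse


@shared_task
def ping(worker_id, worker_group):
    """Placeholder task body; the real task runs a management command."""
    return f"pinged {worker_id} in {worker_group}"


def has_valid_taskcluster_auth(request):
    """Placeholder check: the real API verifies the Hawk header and TC scopes."""
    return request.META.get("HTTP_AUTHORIZATION", "").startswith("Hawk ")


def queue_job(request, worker_id, worker_group):
    if not has_valid_taskcluster_auth(request):
        return JsonResponse({"detail": "forbidden"}, status=403)

    task_name = request.GET.get("task_name")
    if task_name != "ping":  # the real API consults settings.TASK_NAMES
        return JsonResponse({"detail": "unknown task_name"}, status=400)

    # Queue the Celery task for a Roller worker to pick up, then echo the
    # request params plus the Celery AsyncResult UUID, as in the example below.
    result = ping.delay(worker_id, worker_group)
    return JsonResponse({
        "task_name": task_name,
        "worker_id": worker_id,
        "worker_group": worker_group,
        "task_id": str(result.id),
    })
```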
The URL for worker-context Taskcluster actions that needs to be registered with Taskcluster.

URL params:

- `$worker_id`: the Taskcluster Worker ID, e.g. `ms1-10`. 1 to 128 characters long.
- `$worker_group`: the Taskcluster Worker Group, e.g. `mdc1`, usually a datacenter for RelOps hardware. 1 to 128 characters long.

Query param:

- `$task_name`: the Celery task to run. Must be in `TASK_NAMES` in `settings.py`.

Taskcluster does not POST data/body params.
Example request from Taskcluster:

POST http://localhost:8000/api/v1/workers/dummy-worker-id/group/dummy-worker-group/jobs?task_name=ping
Authorization: Hawk ...

Example response:

{"task_name":"ping","worker_id":"dummy-worker-id","worker_group":"dummy-worker-group","task_id":"e62c4d06-8101-4074-b3c2-c639005a4430"}

where `task_name`, `worker_id`, and `worker_group` are as defined in the request and `task_id` is the task's Celery `AsyncResult` UUID.
To run the service, fetch the Roller image and Redis:

docker pull mozilla/relops-hardware-controller
docker pull redis:3.2

The Roller web API and worker both run from the same Docker image.

Copy the example settings file (if you don't have the repo checked out: wget https://raw.githubusercontent.com/mozilla-services/relops-hardware-controller/master/.env-dist):

cp .env-dist .env

In production, use --env ENV_FOO=bar instead of an env var file.

Then docker run the containers:

docker run --name roller-redis --expose 6379 -d redis:3.2
docker run --name roller-web -p 8000:8000 --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller -d web
docker run --name roller-worker --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller -d worker

Check that it's running:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f45d4bcc5c3a mozilla/relops-hardware-controller "/bin/bash /app/bi..." 3 minutes ago Up 3 minutes 8000/tcp roller-worker
c48a68ad887c mozilla/relops-hardware-controller "/bin/bash /app/bi..." 3 minutes ago Up 3 minutes 0.0.0.0:8000->8000/tcp roller-web
d1750321c4df redis:3.2 "docker-entrypoint..." 9 minutes ago Up 8 minutes 6379/tcp roller-redis
curl -w '\n' -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' http://localhost:8000/api/v1/workers/tc-worker-1/group/ndc2/jobs\?task_name\=ping
<h1>Bad Request (400)</h1>
docker logs roller-web
[2018-01-10 08:27:23 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2018-01-10 08:27:23 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2018-01-10 08:27:23 +0000] [5] [INFO] Using worker: egg:meinheld#gunicorn_worker
[2018-01-10 08:27:23 +0000] [8] [INFO] Booting worker with pid: 8
[2018-01-10 08:27:23 +0000] [10] [INFO] Booting worker with pid: 10
[2018-01-10 08:27:23 +0000] [12] [INFO] Booting worker with pid: 12
[2018-01-10 08:27:23 +0000] [13] [INFO] Booting worker with pid: 13
172.17.0.1 - - [10/Jan/2018:08:31:46 +0000] "POST /api/v1/workers/tc-worker-1/group/ndc2/jobs HTTP/1.1" 400 26 "-" "curl/7.43.0"
172.17.0.1 - - [10/Jan/2018:08:31:46 +0000] "- - HTTP/1.0" 0 0 "-" "-"

Roller uses an environment variable called DJANGO_CONFIGURATION, which defaults to Prod, to pick which composable configuration to use.
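In other words, the settings are class-based and `DJANGO_CONFIGURATION=Dev` or `DJANGO_CONFIGURATION=Prod` selects a settings class at startup. The snippet below is a minimal sketch of that pattern, not the project's actual settings; the class contents are placeholders (the `TASK_NAMES` defaults mirror the ones described below).

```python
# settings_sketch.py -- minimal django-configurations pattern; values are placeholders.
from configurations import Configuration


class Base(Configuration):
    # Settings shared by every environment.
    INSTALLED_APPS = ["django.contrib.contenttypes", "django.contrib.auth"]


class Dev(Base):
    DEBUG = True
    TASK_NAMES = ["ping"]     # per the Dev default described below


class Prod(Base):
    DEBUG = False
    TASK_NAMES = ["reboot"]   # per the Prod default described below


# DJANGO_CONFIGURATION=Prod (the default) picks the Prod class when
# configurations.setup() runs in manage.py / wsgi.py.
```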
In addition to the usual Django, Django Rest Framework, and Celery settings we have:

- `TASKCLUSTER_CLIENT_ID`: the Taskcluster client ID to authenticate with
- `TASKCLUSTER_ACCESS_TOKEN`: the Taskcluster access token to use
- `CORS_ORIGIN`: which origin to allow CORS requests from (returned in the CORS access-control-allow-origin header). Defaults to `localhost` in Dev and `tools.taskcluster.net` in Prod.
- `TASK_NAMES`: list of management commands that can be run from the API. Defaults to `ping` in Dev and `reboot` in Prod.
- `BUGZILLA_URL`: URL for the Bugzilla REST API, e.g. https://landfill.bugzilla.org/bugzilla-5.0-branch/rest/
- `BUGZILLA_API_KEY`: API key for using the Bugzilla REST API
- `XEN_URL`: URL for the Xen RPC API (http://xapi-project.github.io/xen-api/usage.html)
- `XEN_USERNAME`: username to authenticate with the Xen management server
- `XEN_PASSWORD`: password to authenticate with the Xen management server
- `ILO_USERNAME`: username to authenticate with the HP iLO management interface
- `ILO_PASSWORD`: password to authenticate with the HP iLO management interface
- `FQDN_TO_SSH_FILE`: path to the JSON file mapping FQDNs to SSH usernames and key file paths (example in settings.py). Defaults to `ssh.json`. The SSH keys need to be mounted when docker is run, for example with `docker run -v host-ssh-keys:.ssh --name roller-worker`. The SSH user on the target machine should use ForceCommand to only allow the `reboot` or `shutdown` command.
- `FQDN_TO_IPMI_FILE`: path to the JSON file mapping FQDNs to IPMI usernames and passwords (example in settings.py). Defaults to `ipmi.json`.
- `FQDN_TO_PDU_FILE`: path to the JSON file mapping FQDNs to PDU SNMP sockets (example in settings.py). Defaults to `pdus.json`.
- `FQDN_TO_XEN_FILE`: path to the JSON file mapping FQDNs to Xen VM UUIDs (example in settings.py). Defaults to `xen.json`.

Note: there is a bug on file for simplifying the FQDN_TO_* settings.
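The exact shape of these JSON files is defined by the examples in settings.py; the snippet below is only a hypothetical illustration of the idea behind `FQDN_TO_SSH_FILE` (an FQDN keyed to an SSH user and key path) and of how a worker task might look an entry up. The FQDN and field names are assumptions, not the real schema.

```python
# fqdn_ssh_sketch.py -- hypothetical FQDN -> SSH credentials lookup.
# The field names below are illustrative; the real schema is the example in settings.py.
import json

EXAMPLE_SSH_JSON = """
{
  "example-host.releng.mdc1.mozilla.com": {
    "user": "reboot-user",
    "key_file": "/app/.ssh/reboot_id_rsa"
  }
}
"""


def lookup_ssh(fqdn, mapping_path=None):
    """Return the SSH user/key entry configured for an FQDN, or None."""
    if mapping_path is None:
        data = json.loads(EXAMPLE_SSH_JSON)
    else:
        with open(mapping_path) as fh:
            data = json.load(fh)
    return data.get(fqdn)


print(lookup_ssh("example-host.releng.mdc1.mozilla.com"))
```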
To list available actions/management commands:
docker run --name roller-runner --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py
Type 'manage.py help <subcommand>' for help on a specific subcommand.
Available subcommands:
[api]
file_bugzilla_bug
ilo_reboot
ipmi_reboot
ipmitool
ping
reboot
register_tc_actions
snmp_reboot
ssh_reboot
xenapi_reboot

To show help for one:
docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py ping --help
usage: manage.py ping [-h] [--version] [-v {0,1,2,3}] [--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback] [--no-color]
[-c COUNT] [-w TIMEOUT] [--configuration CONFIGURATION]
host
Tries to ICMP ping the host. Raises for exceptions for a lost packet or
timeout.
positional arguments:
host A host
optional arguments:
-h, --help show this help message and exit
...
-c COUNT stop after sending NUMBER packets
-w TIMEOUT stop after N seconds
...

And test it:
docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py ping -c 4 -w 5 localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.086 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.074 ms
--- localhost ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3141ms
rtt min/avg/max/mdev = 0.042/0.069/0.086/0.016 ms

In general, tasks should be runnable as manage.py commands, and a task should do the same thing whether it is run as a command or via the API.
Note: there is a bug on file for not requiring Redis to run management commands.
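One common way to keep the API and command-line paths identical is to have the Celery task do nothing but invoke the management command. The sketch below shows that pattern; it is not the project's actual task code, and the keyword argument names are assumed to match the command's argparse options.

```python
# tasks_sketch.py -- sketch: a Celery task that wraps a management command,
# so the API path and the manage.py path share one implementation.
from celery import shared_task
from django.core.management import call_command


@shared_task
def ping(host, count=4, timeout=5):
    # Equivalent to `manage.py ping -c 4 -w 5 <host>`; the keyword names
    # "count" and "timeout" are assumptions about the command's argparse dests.
    call_command("ping", host, count=count, timeout=timeout)
```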
- Create an SSH key and user limited to `shutdown` or `reboot` with ForceCommand on the target hardware
- Add the SSH key and user to the mounted worker SSH keys directory
- Add the machine's FQDN to any relevant `FQDN_TO_*` config files
- Check that the `TASK_NAMES` setting only includes tasks we want to register with Taskcluster
- Check `TASKCLUSTER_CLIENT_ID` and `TASKCLUSTER_ACCESS_TOKEN` are present as env vars or in settings (e.g. via taskcluster-cli login). The client will need the Taskcluster scope `queue:declare-provisioner:$provisioner_id#actions`
- Run:

  docker run --link roller-redis:redis --env-file .env mozilla/relops-hardware-controller manage.py register_tc_actions https://roller-dev1.srv.releng.mdc1.mozilla.com my-provisioner-id

  Note: an arg like `--settings relops_hardware_controller.settings` or `--configuration Dev` may be necessary to use the right Taskcluster credentials.

  Note: this does not need to be run from the Roller server, since the first argument is the URL that Taskcluster will send the action to.
- Check that the action shows up in the Taskcluster dashboard for a worker on the provisioner, e.g. https://tools.taskcluster.net/provisioners/my-provisioner-id/worker-types/dummy-worker-type/workers/test-dummy-worker-group/dummy-worker-id (this might require creating a worker)
- Run the action from the worker's Taskcluster dashboard
This is similar to prod deployment, but uses make, docker-compose, and env files to simplify starting and running things.
To build and run the web server in development mode, and have the worker reload and purge the queue on file changes, run:

make start-web start-worker

To run tests and watch for changes:

make current-shell # requires start-web / the web server to be running
docker-compose exec web bash
app@ca6a901df6b4:~$ ptw .
Running: py.test .
=========================================================== test session starts ============================================================
platform linux -- Python 3.6.3, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
Django settings: relops_hardware_controller.settings (from environment variable)
rootdir: /app, inifile: pytest.ini
plugins: flake8-0.9.1, django-3.1.2, celery-4.1.0
collected 74 items
...

- Create `relops_hardware_controller/api/management/commands/<command_name>.py` and `tests/test_<command_name>_command.py`, e.g. ping.py and test_ping_command.py (a skeleton is sketched below)
- Run `make shell` then `./manage.py` and check for the command in the `[api]` section of the output
- Add the command name to `TASK_NAMES` in `relops_hardware_controller/settings.py` to make it accessible via the API
- Add any required shared secrets, like SSH keys, to settings.py or .env-dist
- Register the action with Taskcluster
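As a starting point for a new `<command_name>.py`, the skeleton below shows the standard Django `BaseCommand` shape. It is a sketch only; the command name, arguments, and behavior of the real commands live in `relops_hardware_controller/api/management/commands/`.

```python
# relops_hardware_controller/api/management/commands/<command_name>.py
# Hypothetical skeleton for a new Roller action; argument names are illustrative.
from django.core.management.base import BaseCommand, CommandError


class Command(BaseCommand):
    help = "Describe what the action does to the target host."

    def add_arguments(self, parser):
        parser.add_argument("host", help="A host FQDN")
        parser.add_argument("-w", "--timeout", type=int, default=5,
                            help="stop after N seconds")

    def handle(self, *args, **options):
        host = options["host"]
        timeout = options["timeout"]
        if not host:
            # Raise CommandError on failure so both the CLI and the Celery
            # task (via the API) see the same error behavior.
            raise CommandError("no host given")
        self.stdout.write(f"ran action against {host} (timeout={timeout}s)")
```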
