Infrastructure Troubleshooting

Monitoring & metrics

Logging

The Heroku apps are configured to use Papertrail as a log drain. The aggregated HTTP router, dyno and app output logs can be viewed and searched at: prototype | stage | prod.

See the Heroku logging and error codes documentation for help understanding the log output.

Note

Django is configured to output only log levels WARNING and above, unless debug mode is enabled (such as in Vagrant).
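
For reference, here is a minimal sketch of how such a level filter is typically expressed in a Django settings module. The handler and logger layout below is illustrative, not Treeherder's actual logging configuration:

    # settings.py -- a minimal sketch, not Treeherder's actual logging config.
    DEBUG = False  # enabled in local development (e.g. Vagrant)

    LOGGING = {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "console": {"class": "logging.StreamHandler"},
        },
        "root": {
            "handlers": ["console"],
            # Emit everything in debug mode, otherwise only WARNING and above.
            "level": "DEBUG" if DEBUG else "WARNING",
        },
    }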

Scenarios

Regression from a deployment

  1. Check that the deployment did not involve schema migrations, using the "Compare diff" link on the Heroku app "activity" tab. (If it did, it might be easier to fix the regression in place rather than rolling back the deployment.)
  2. Revert the deployment.
  3. Notify the rest of the team, so that they do not unknowingly re-deploy the regression.
  4. If a bug has not already been filed for the regression, either file one now, or re-open the bug whose change caused the regression, explaining the issue encountered.

Web request HTTP 500s

  1. Check the New Relic "Error Analytics" section for details of the Python exception.
  2. (If needed) Search Papertrail for the exception name/message for additional information.
  3. File a bug with steps to reproduce and the exception stack trace.

Web request HTTP 503s

If the Heroku HTTP router returns an HTTP 503 for a request, it means that a web dyno was unable (or refused) to provide a response. There are several reasons this may occur, which can be distinguished by the error code Heroku logged.

To discover the error code, check either:

  • the Heroku app "metrics" tab, with the "web" dyno type selected, inspecting the timeline
  • the Papertrail logs, searching for status=503

The most common error codes we see are:

  • H12 - Request timeout

    This means that the HTTP request took longer than 30 seconds to complete, so it was interrupted by the Heroku router. This occurs when a request backlog has formed because the number or duration of web requests exceeds the capacity of the available gunicorn workers (a worked capacity example follows this list).

    This can be confirmed by looking at the New Relic "Overview" section (making sure the selector at the top of the page is set to "Web transactions time") and checking for spikes in the Request Queuing time (which measures how long requests were waiting before a gunicorn worker was able to begin processing them).

    To resolve, first try restarting the web dynos, in case one of them has an issue (e.g. a failing dyno or noisy neighbours on that instance). If that doesn't work, check the New Relic web transactions page to see whether there has been a spike in request throughput (in which case consider scaling the web dynos), or whether an external service (such as MySQL or Redis) is taking longer to respond than before.

  • H13 - Connection closed without response

    This means that the request exceeded gunicorn's configured time limit (currently set at 20 seconds, see the --timeout gunicorn argument in Procfile), so gunicorn aborted the request (a timeout configuration sketch follows this list).

    If most other web requests are succeeding, the particular API endpoint and query-parameter combination likely needs optimisation, in which case file an API bug with steps to reproduce. Otherwise, try the suggested steps for the H12 error code above.
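
To make the H12 capacity reasoning concrete, here is a rough back-of-the-envelope check. All of the numbers are illustrative assumptions, not measured values for Treeherder:

    # Back-of-the-envelope capacity check; all numbers are illustrative.
    requests_per_second = 40      # peak throughput, e.g. as reported by New Relic
    avg_response_seconds = 0.5    # average web transaction time

    # Little's law: requests in flight = arrival rate * time per request.
    in_flight = requests_per_second * avg_response_seconds  # 20 concurrent requests

    web_dynos = 3
    workers_per_dyno = 4
    capacity = web_dynos * workers_per_dyno  # 12 concurrent requests

    # When in_flight exceeds capacity, a backlog forms, request queuing time
    # grows, and eventually the router times requests out with H12s.
    print("backlog forming:", in_flight > capacity)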
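
For the H13 case, gunicorn can also take its settings from a Python configuration file instead of command-line arguments. A minimal sketch of the relevant settings follows; note that Treeherder actually sets them via Procfile arguments as described above, and the values here are illustrative:

    # gunicorn.conf.py -- illustrative only; Treeherder passes these as
    # command-line arguments in Procfile rather than using a config file.
    workers = 4    # worker processes per dyno
    timeout = 20   # seconds before a hung worker is killed and the request aborted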

Celery queue backlogs

When the RabbitMQ queues used by Celery exceed the configured threshold, CloudAMQP sends an email alert to the treeherder-internal@moco mailing list. If this occurs:

  1. Open the CloudAMQP management dashboard using the links above, then:
    • Check the "Metrics" tab, to see a timeline of total queue length.
    • Click the "RabbitMQ Manager" button, and switch to the "Queues" tab to see the per-queue breakdown and message incoming/delivery rates.
  2. Check New Relic's "Error Analytics" section, in case tasks are failing and being retried due to a Python exception.
  3. In New Relic's "Transactions" section, switch to the "Non-web" transactions view (or use the direct links above), and click the relevant Celery task to see if there has been a change in either throughput or time per task.
  4. Depending on the information discovered above, you may want to try restarting or scaling the worker dynos associated with the backlogged queues (see the sketch after this list for inspecting the workers directly).
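
If you want to check the workers directly rather than via the dashboards, Celery's inspect API reports what each worker is currently doing. A minimal sketch, assuming the broker URL is available in a BROKER_URL environment variable (the variable and app names here are assumptions, not Treeherder's actual configuration):

    # A minimal sketch; the broker URL variable and app name are assumptions.
    import os

    from celery import Celery

    app = Celery("treeherder", broker=os.environ["BROKER_URL"])
    inspector = app.control.inspect(timeout=5)

    active = inspector.active() or {}      # tasks currently executing
    reserved = inspector.reserved() or {}  # tasks prefetched but not yet started

    for worker, tasks in active.items():
        print(worker, "active:", len(tasks),
              "reserved:", len(reserved.get(worker, [])))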

New pushes/jobs not appearing

If new pushes or CI job results are not appearing in Treeherder's UI:

  1. Follow the steps in Celery queue backlogs to rule out task backlogs/Python exceptions.
  2. Check the upstream Pulse queues using Pulse Guardian (you must be a co-owner of the Treeherder queues to see them listed). If there is a Pulse queue backlog, it suggests that Treeherder's pulse_listener_{pushes,jobs} dynos have stopped consuming Pulse events, and so might need restarting (a sketch for checking queue depth directly follows this list).
  3. Failing that, the issue might lie in the services that send events to those Pulse exchanges, such as taskcluster-github or the Taskcluster systems upstream of those. Ask for help in the IRC channel #taskcluster.
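
To check a Pulse queue's depth outside of Pulse Guardian, a passive queue declaration over AMQP reports the message count without consuming anything. A minimal sketch using kombu, with placeholder credentials and queue name (both are assumptions, not real values):

    # A minimal sketch; the credentials and queue name are placeholders.
    from kombu import Connection

    PULSE_URL = "amqps://someuser:somepassword@pulse.mozilla.org:5671/"
    QUEUE_NAME = "queue/someuser/example-treeherder-queue"

    with Connection(PULSE_URL) as connection:
        queue = connection.SimpleQueue(QUEUE_NAME)
        # qsize() issues a passive declare, so it reports the backlog
        # without consuming any messages.
        print(QUEUE_NAME, queue.qsize(), "messages waiting")
        queue.close()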