My current job involves using many external APIs pretty aggressively. This has led to lots of little (and big) gotchas across a slew of services.
About two weeks ago, after dealing with a particularly annoying third party API outage, I made a quick angry tweet:
The more I use web APIs, the more unreliable I realize they are. If your product talks to other sites, you MUST code extremely defensively
— Nick Vlku (@vlkun) May 2, 2012
Different definitions of “Resiliency”
This led to quite a lot of replies on Twitter and Facebook. The surprising thing was that nearly all the replies had a different definition of “resiliency” in mind. Along with this, there were many different approaches on how to be “defensive”. I thought I’d dive into the different definitions people mentioned, focus on how as the service provider you should handle these, and how as a service consumer you should mitigate for them.
Total Outage
No site or service is infallible and everything fails. That’s totally fine and totally acceptable. The key here is how exactly you fail, how you let your users know and how users can mitigate for it.
Definition
A total outage means literally what it says. The service is not responding to any requests. This type of outage’s severity can range from no responses from the end point at all (or a hung connection) to an error page of some sort being returned.
What a Service Provider Should Do
The absolute most important thing to do here is communicate, both to your developers and to their client libraries. Leaving developers with no information, so that they have to dig through Twitter public firehose searches, is not appropriate.
Related to these things, some good fundamentals to help alleviate an outage for your developer community are:
- Communicate to your developers via an API status page
- Ideally this page should not be on the same platform as your site. If a status page goes down when other things go down, it’s not an effective status page!
- Make this status page easy to subscribe to (remember RSS feeds?)
- Create an API status Twitter account that will post updates. (This is NOT a support account, but the support account should certainly retweet it when there are outages.)
- Communicate your outage effectively to actual API clients as well!
- This means practicing proper web etiquette. If your service is having issues, don’t return your standard HTML error page with a 200 status code! (One particular big API still does this!)
- If there are errors, send them as part of the payload of the response. Make sure your HTTP status code is proper (500 if you are having a FUBAR type of outage.)
- If your API currently doesn’t have an “error” field for all responses, add one in your next revision. This will communicate to your developers that they should expect stuff there if something goes BOOM. Leave it blank in normal responses.
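As a rough sketch of the last three points, here is one way a provider could wrap every response in the same envelope. The names `build_response` and `handle_request`, and the `data`/`error` field layout, are my own assumptions for illustration, not any particular framework’s API.

```python
import json
from http import HTTPStatus


def build_response(data=None, error=None, status=HTTPStatus.OK):
    """Return (status_code, headers, body) using one consistent JSON envelope."""
    body = {
        "data": data,
        # Always present; left blank on success so clients learn to check it.
        "error": error or "",
    }
    headers = {"Content-Type": "application/json"}
    return int(status), headers, json.dumps(body)


def handle_request(fetch):
    """Call a hypothetical `fetch` callable and map any failure to a 500."""
    try:
        return build_response(data=fetch())
    except Exception as exc:
        # Assumption: the message is safe to expose. A real service would
        # probably log the details and return a generic message instead.
        return build_response(
            error=f"internal error: {exc}",
            status=HTTPStatus.INTERNAL_SERVER_ERROR,
        )
```

The point is that a failure never comes back as an HTML page with a 200: the status code says something went wrong, and the body is still machine-readable.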
Generally, as is pretty obvious, nearly all of the above focus around some form of communication. Nearly everything that a service provider can do to help with an outage will center around that.
Service Consumers, How do you mitigate?
Surprisingly, we’ve found the “Total Site Failure” error is one of the easiest to mitigate for. Obviously, the impact to your application depends on how dependent your app is on the service (e.g., a Twitter client would be completely down without Twitter, while a reader of many sites would still be up but limited.) On the client library side, we typically:
- Don’t assume all requests return successfully. This is most important. Treat the API like any unreliable dependency (think of it like the network or a disk.)
- Make sure to catch exceptions aggressively. This means wrapping all actual request calls in “catch all” exception blocks.
- Never make assumptions about how things will be returned. Some services return HTML error pages with 200 status codes when things go wrong! For instance, if you are auto-converting a response to JSON, wrap that conversion in a try/catch block!
- Whenever you make (synchronous or asynchronous) API calls, have appropriate timeouts for these calls (so your apps don’t block forever on a response that is not coming, and so you don’t have a bunch of asynchronous calls holding sockets.)
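To make the list above concrete, here is a minimal sketch using only Python’s standard library. The helper names `safe_get` and `parse_body`, the 5 second default timeout, and the assumption that a good response is a JSON object are all mine; adapt them to whatever client you actually use.

```python
import json
import socket
import urllib.error
import urllib.request


def parse_body(status, raw_bytes):
    """Return (data, error). Never raises on a malformed body."""
    if status != 200:
        return None, f"unexpected status {status}"
    try:
        data = json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, ValueError) as exc:
        # Covers the "HTML error page with a 200" case.
        return None, f"response was not valid JSON: {exc}"
    if not isinstance(data, dict):
        return None, "unexpected JSON shape"
    return data, None


def safe_get(url, timeout=5.0):
    """Fetch `url` and return (data, error) instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_body(resp.status, resp.read())
    except urllib.error.HTTPError as exc:
        return None, f"HTTP error {exc.code}"
    except (urllib.error.URLError, socket.timeout, OSError) as exc:
        return None, f"network error: {exc}"
    except Exception as exc:  # last-resort "catch all"
        return None, f"unexpected failure: {exc}"
```

Callers then check the second element of the tuple and decide how to degrade, rather than letting an exception from a third party take the whole request down.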
Your app or site should have some way of communicating to the end user that there is an outage and how exactly their functionality will be limited or crippled.
Partial Outage (relatively high error rates)
Similar to Total Outage, but a bit harder to debug — this is when the API only seems to return results sometimes.
Definition
The service is either returning bad data or no data on random requests. A variant of this is when a specific feature or endpoint is down, but the rest of the service is functioning properly.
What a Service Provider Should Do
Depending on the outage, this one might be tougher for a service provider to handle. If you can detect this outage with some monitoring software, you should adopt the same approaches as the Total Outage above. However, if this is a more subtle outage that flies under the radar, it’s important to make sure you have a clear line of communication with your API consumers. Your developer consumers could be the first to notice something is wrong, and if they have no easy way to reach out to you, they’ll just keep seeing errors and getting angrier and angrier as you don’t fix it.
This means you should have a readily available, highly visible place to report issues. I strongly suggest putting this on your developer docs home page and wherever you make developers register for a key. The reporting form should provide as much guidance as possible so that the right information gets submitted. Some things I’d put on this form:
- Description
- End-point or service used
- Client library used
- Type of error. Message returned
- When did this start? How long has it been going on?
- Rate of requests you are making
- Consistent or inconsistent failures
- Hosting environment (if possible IPs of machines that are seeing failures)
- Contact information
This gives you a good starting point to help you debug exactly how your service is failing while allowing the user to fill in information that you weren’t expecting to capture.
I’d tie this message form to a service that sends email to people responsible for maintaining the API (perhaps with a threshold before firing alerts.) The key here is to quickly address and respond to these alerts (even if to clarify to the API consumer what they’re doing wrong.)
Service Consumers, How do you mitigate?
This one is a little trickier, because it’s hard to mitigate for an error you might never see happen directly. Typically, the best way to plan for this is to just do proper defensive coding around all your calls. This means (most of these are repeats from the Total Outage scenario):
- make sure your calls capture all exceptions and log aggressively to a centralized service (like Ganglia)
- keep thresholds on those errors so you can be alerted when they hit a certain point
- never assume an API call will return valid data (even valid JSON or XML!)
- actually, never assume an API call will even return at ALL — use timeouts for your calls
- make your calls retry automatically after a failure. Pick a sane number of retries (10 or so) and ideally use exponential backoff between them (of course, waiting a fixed n seconds between retries is a good starting point.)
- even after retrying successfully — still log these failures
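For the “keep thresholds on those errors” point, one simple approach is a sliding-window counter. This is only a sketch: the class name `ErrorRateMonitor` and the default of 5 failures per 60 seconds are made-up values, and in practice you would probably feed a real monitoring system rather than roll your own.

```python
import time
from collections import deque


class ErrorRateMonitor:
    """Count failures in a sliding time window and flag when too many occur."""

    def __init__(self, threshold=5, window_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the monitor can be tested
        self._failures = deque()

    def record_failure(self):
        """Record one failure; return True when the alert threshold is reached."""
        now = self.clock()
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

Call `record_failure()` from the same place you log the exception, and page someone (or flip a “service degraded” flag) when it returns True.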
The nice thing about a partial outage is that if you practice sane retry strategies, your services will still keep working (although they might be a bit slower.)
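A retry loop with exponential backoff might look roughly like the sketch below. `call_with_retries` and its parameters are hypothetical names of mine; the cap on the delay and the small random jitter are my additions, since ten doublings from half a second would otherwise mean waiting minutes between attempts.

```python
import logging
import random
import time

log = logging.getLogger("api_client")


def call_with_retries(func, max_attempts=10, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call `func` until it succeeds or `max_attempts` is exhausted."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = func()
        except Exception as exc:
            last_exc = exc
            # Log every failure, even if a later attempt succeeds.
            log.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                break
            # Delay doubles each time: base, 2*base, 4*base, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
        else:
            if attempt > 1:
                log.info("succeeded after %d attempts", attempt)
            return result
    raise last_exc
```

Only wrap calls that are safe to repeat. Retrying a request that charges a credit card is a very different story from retrying a read.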
API Functionality Change
This isn’t exactly an outage, but multiple people responded to my tweet thinking this is what I meant. That being said, it definitely is a form of resiliency, and it can be just as catastrophic to a client as an outage.
Definition
This happens when the contract between the API consumer and the API changes in a way that breaks older functionality — could be a result of different data being returned back, different parameters expected, some data fields changed (eg: ID numbers have a new format), or endpoints disappear.
What a Service Provider Should Do
Once again, it comes down to communication. Changes that will break current functionality must be communicated aggressively and changes should be phased in over time. Ideally, your API should be revisioned so old clients can continue to work while new clients get new functionality. A good battle plan for this:
- Keep your APIs versioned. If you use different roots, dates often work better than version numbers (/v1/… vs /04212012/) because they give off more information. You can be really baller and make your API have a different root domain instead (04212012.api.yourcompany.com). The added benefit here is you gain a bit more “REST-purity” by keeping versioning information out of your API’s URL paths.
- Have a “latest” root, that always points to the latest version.
- Provide an API endpoint that gives current revision information, including what has been deprecated.
- If you are deprecating older methods, return some information back in the API for each deprecated call (possibly in the “error” field, or have a “misc” field)
- Communicate that these features are being deprecated/removed via your official API documentation and your API status Twitter account.
- Extra credit: Keep logs of who is making these calls and email those developers when you are about to deprecate them.
Finally, to help find API libraries that are deprecated, give clients the ability to send “expected version” information with each request. That way, if a client sends “<version-expected>12-04-2011</version-expected>” for a version that is going to be severely deprecated, you can enforce different degrees of degradation of service. It also gives you a really easy way to catalog who is using what old API library.
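On the provider side, the “expected version” check could be sketched like this. Everything here is an assumption for illustration: the revision dates are invented, the function name `check_expected_version` is mine, and I’m assuming clients send the date in ISO `YYYY-MM-DD` form so that day and month can’t be confused.

```python
from datetime import date

# Hypothetical revision dates; a real API would load these from config.
CURRENT_REVISION = date(2012, 4, 21)
OLDEST_SUPPORTED = date(2011, 6, 1)


def check_expected_version(expected):
    """Classify a client's expected revision, given as an ISO date string.

    Returns (state, message), where state is one of
    "current", "deprecated", "unsupported" or "unknown".
    """
    try:
        wanted = date.fromisoformat(expected)
    except (TypeError, ValueError):
        return "unknown", "missing or malformed expected version"
    if wanted < OLDEST_SUPPORTED:
        return "unsupported", f"revision {expected} is no longer supported"
    if wanted < CURRENT_REVISION:
        return "deprecated", (
            f"revision {expected} is deprecated; "
            f"current is {CURRENT_REVISION.isoformat()}"
        )
    return "current", ""
```

The message can go straight into the response’s “error” or “misc” field, and the state decides how much you degrade service for that client. Logging the state per API key gives you the catalog of who is on which old library.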
Service Consumers, How do you mitigate?
The easiest way to stay on top of this is to watch the API’s developer site. Hopefully, the API provider will give sufficient notice to let you mitigate or change things appropriately. Some other things you can do:
- If the API library provides a facility for self-reporting its “version”, have a log watcher that looks for words like “deprecated” in responses
- If the API provider provides generic endpoints (/user or /latest/user) *and* version specific endpoints (/v2/user or /12212010/user), always use the version specific endpoints. The contract will likely be more strongly enforced.
- Keep an eye out for new versions of API libraries that you use. When a new version is released, do a regression test to make sure everything that used to work still does. (You do have regression tests and a continuous integration environment, right?)
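Watching for deprecation notices doesn’t have to be fancy. Here is a minimal sketch that assumes the provider puts notices in a top-level “error” or “misc” string field of a JSON response, as suggested earlier; the function names are hypothetical and a real provider may signal deprecation somewhere else entirely.

```python
import json
import logging

log = logging.getLogger("api_client.deprecations")


def find_deprecation_notices(payload, fields=("error", "misc")):
    """Return string values from `fields` that mention deprecation."""
    notices = []
    for field in fields:
        value = payload.get(field)
        # "deprecat" matches both "deprecated" and "deprecation".
        if isinstance(value, str) and "deprecat" in value.lower():
            notices.append(value)
    return notices


def check_response(raw_body):
    """Parse a raw response body and log any deprecation notices found."""
    try:
        payload = json.loads(raw_body)
    except (TypeError, ValueError):
        return []  # not JSON; other code should handle that failure
    if not isinstance(payload, dict):
        return []
    notices = find_deprecation_notices(payload)
    for notice in notices:
        log.warning("API deprecation notice: %s", notice)
    return notices
```

Hook the logger up to whatever alerting you already have, so a deprecation warning shows up long before the endpoint actually disappears.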
Hope this helps
Well, that’s my general guide for how to handle APIs based on the lessons we’ve learned here over the past few years. Generally, all of them focus on communication, and I can’t stress enough that as long as you get that right, everything else is extra. (This communication applies to both consumers and providers btw!)
Please let me know what you think. I’m very interested in keeping this document updated with new things I’ve learned.