My current job involves using many external APIs pretty aggressively.  This has led to lots of little (and big) gotchas across a slew of services.

About two weeks ago, after dealing with a particularly annoying third-party API outage, I fired off a quick, angry tweet.

Different definitions of “Resiliency”

This led to quite a lot of replies on Twitter and Facebook.  The surprising thing was that nearly all the replies had a different definition of “resiliency” in mind.  Along with this, there were many different approaches to being “defensive”.  I thought I’d dive into the different definitions people mentioned, focusing on how you should handle each one as the service provider and how you should mitigate it as a service consumer.

  • Total Outage
  • Partial Outage
  • API Functionality Change

Total Outage

No site or service is infallible and everything fails.  That’s totally fine and totally acceptable.  The key here is how exactly you fail, how you let your users know and how users can mitigate for it.

Definition

A total outage means literally what it says: the service is not responding to any requests.  The severity can range from no response from the endpoint at all (or a hung connection) to an error page of some sort being returned.

What a Service Provider Should Do

The absolute most important thing to do here is communicate, both to your developers and to their client libraries.  Leaving your developers with no information, so that they have to dig through Twitter public firehose searches, is not appropriate.

Related to these things, some good fundamentals to help alleviate an outage for your developer community are:

  • Communicate to your developers via an API status page
    • Ideally this page should not be on the same platform as your site.  If a status page goes down when other things go down, it’s not an effective status page!
    • Make this status page easy to subscribe to (remember RSS feeds?)
  • Create an API status Twitter account that will post updates. (This is NOT a support account, but the support account should certainly retweet it when there are outages.)
  • Communicate your outage effectively to actual API clients as well!
    • This means practicing proper web etiquette.  If your service is having issues, don’t return your standard HTML error page with a 200 status code! (One particular big API still does this!)
    • If there are errors, send them as part of the payload of the response.  Make sure your HTTP status code is proper (500 if you are having a FUBAR type of outage.)
    • If your API currently doesn’t have an “error” field for all responses, add one in your next revision.  This will communicate to your developers that they should expect stuff there if something goes BOOM.  Leave it blank in normal responses.

Generally, as is pretty obvious, nearly all of the above center on some form of communication.  Nearly everything that a service provider can do to help with an outage comes back to that.
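To make the status code and “error” field advice concrete, here is a minimal sketch in Python.  The function name and the "data" field are my own inventions for illustration, not part of any particular API; the only points being shown are that the “error” field is always present and that the status code is never 200 when things have gone wrong.

```python
import json


def build_api_response(data=None, error=None):
    """Return (status_code, headers, body) for a JSON API response.

    The "error" field is always present: blank on success, filled in on failure.
    """
    status_code = 500 if error else 200  # never 200 when something went BOOM
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"data": data, "error": error or ""})
    return status_code, headers, body


# Normal response: 200 with a blank error field.
print(build_api_response(data={"user": "alice"}))
# Outage: 500 with the problem spelled out in the payload.
print(build_api_response(error="database unavailable"))
```

Because the error field is there on every response, a client library can check it unconditionally instead of guessing whether the body is an error page.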

Service Consumers, How do you mitigate?

Surprisingly, we’ve found that a total outage is one of the easiest failures to mitigate.  Obviously, the impact on your application depends on how dependent your app is on the service (e.g., a Twitter client would be completely down without Twitter, whereas a reader that aggregates many sites would still be up but limited.)  On the client library side, we typically:

  • Don’t assume all requests return successfully.  This is most important.  Treat the API like any other unreliable dependency (think of it as you would the network or a disk.)
  • Make sure to catch exceptions aggressively.  This means ultimately wrapping all actual request calls in “catch all” exception blocks.
  • Never make assumptions about how things will be returned.  Some services return HTML error pages with 200 status codes when things go wrong!  For instance, if you are auto-converting a response to JSON, wrap that process in a catch block!
  • Whenever you make (synchronous or asynchronous) API calls, set appropriate timeouts for them, so your app doesn’t block forever on a response that is not coming and so you don’t have a bunch of asynchronous calls holding sockets.
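As a rough illustration of the list above, here is a small Python sketch using only the standard library.  The fetch_json name and the opener parameter are assumptions of mine (the opener is only there so the call can be swapped out in tests); the real points are the timeout, the catch-all, and keeping the JSON conversion inside the same try block.

```python
import json
import urllib.request


def fetch_json(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return the decoded JSON response from url, or None on any failure."""
    try:
        # The timeout keeps us from blocking forever on a dead service.
        with opener(url, timeout=timeout) as response:
            body = response.read()
        # An HTML error page served with a 200 status code fails here.
        return json.loads(body)
    except Exception as exc:  # deliberately a catch-all
        print(f"API call to {url} failed: {exc!r}")
        return None
```

Returning None is just one option; the important part is that the caller gets a single, predictable failure signal that it can turn into a message for the end user.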

Your app or site should have some way of communicating to the end user that there is an outage and how exactly their functionality will be limited or crippled.

Partial Outage (relatively high error rates)

Similar to Total Outage, but a bit harder to debug — this is when the API only seems to return results sometimes.

Definition

The service is returning either bad data or no data on random requests.  A variant of this is when a specific feature or endpoint is down, but the rest of the service is functioning properly.

What a Service Provider Should Do

Depending on the outage, this one might be tougher for a service provider to handle.  If you can detect this outage with some monitoring software, you should adopt the same approaches as the Total Outage above.  However, if this is a more subtle outage that flies under the radar, it’s important to make sure you have a clear line of communication with your API consumers.  Your developer consumers could be the first to notice something is wrong, and if they have no easy way to reach out to you, they’ll just keep seeing errors and getting angrier and angrier as you don’t fix it.

This means you should have a readily available, highly visible place to report issues.  I strongly suggest putting this on your developer docs home page and wherever you make developers register for a key.  The reporting form should provide as much guidance as possible to encourage people to submit the correct information.  Some things I’d put on this form:

  • Description
  • End-point or service used
  • Client library used
  • Type of error and the message returned
  • When did this start?  How long has it been going on?
  • Rate of requests you are making
  • Consistent or inconsistent failures
  • Hosting environment (if possible IPs of machines that are seeing failures)
  • Contact information

This gives you a good starting point to help you debug exactly how your service is failing while allowing the user to fill in information that you weren’t expecting to capture.

I’d tie this message form to a service that sends email to people responsible for maintaining the API (perhaps with a threshold before firing alerts.)  The key here is to quickly address and respond to these alerts (even if to clarify to the API consumer what they’re doing wrong.)

Service Consumers, How do you mitigate?

This one is a little trickier, because it’s hard to mitigate an error you might never see happen directly.  Typically, the best way to plan for this is to just do proper defensive coding around all your calls.  This means (most of these are repeats from the Total Outage scenario):

  • make sure your calls capture all exceptions and log aggressively to a centralized service (like Ganglia)
  • keep thresholds on those errors so you can be alerted when they hit a certain point
  • never assume an API call will return valid data (even valid JSON or XML!)
  • actually, never assume an API call will even return at ALL — use timeouts for your calls
  • make your calls retry automatically after a failure: pick a sane number of retries (10 or so) and ideally use exponential backoff between them (simply waiting a few seconds between retries is a good starting point.)
  • even after retrying successfully — still log these failures

The nice thing about a partial outage is that if you practice sane retry strategies, your services will still keep working (although they might be a bit slower.)
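Here is one way the retry advice could look in Python.  Treat it as a sketch under my own assumptions: the function name, the one-second base delay, and the 60-second cap are illustrative, and the sleep and log parameters exist only so they can be replaced in tests.

```python
import time


def call_with_retries(call, max_retries=10, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, log=print):
    """Run call(), retrying with exponential backoff when it raises."""
    failures = 0
    while True:
        try:
            return call()
        except Exception as exc:
            failures += 1
            # Log every failure, even if a later retry ends up succeeding.
            log(f"attempt {failures} failed: {exc!r}")
            if failures > max_retries:
                raise
            # Wait 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
```

A call such as call_with_retries(lambda: fetch_user(42)), where fetch_user stands in for your own API call, keeps working through a flaky period at the cost of some extra latency.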

API Functionality Change

This isn’t exactly an outage, but multiple people responded to my tweet thinking this is what I meant.  That being said, it definitely is a form of resiliency, and it can be just as catastrophic to a client as an outage.

Definition

This happens when the contract between the API consumer and the API changes in a way that breaks older functionality.  It could be the result of different data being returned, different parameters being expected, some data fields changing (e.g., ID numbers have a new format), or endpoints disappearing.

What a Service Provider Should Do

Once again, it comes down to communication.  Changes that will break current functionality must be communicated aggressively and changes should be phased in over time.  Ideally, your API should be revisioned so old clients can continue to work while new clients get new functionality.  A good battle plan for this:

  • Keep your APIs revisioned.  If you use different roots, dates often work better than version numbers (/v1/… vs /04212012/) because they give off more information.  You can be really baller and give each revision of your API a different root domain instead (04212012.api.yourcompany.com); the added benefit here is that you gain a bit more “REST-purity” by keeping versioning information out of your API’s URL paths.
  • Have a “latest” root, that always points to the latest version.
  • Provide an API endpoint that gives current revision information, including what has been deprecated.
  • If you are deprecating older methods, return some information back in the API for each deprecated call (possibly in the “error” field, or have a “misc” field)
  • Communicate that these features are being deprecated/removed via your official API documentation and your API support Twitter account.
  • Extra credit:  Keep logs of who is making these calls and email those developers when a call they use is about to be deprecated.

Finally, to help find client libraries that rely on deprecated functionality, give them the ability to send “expected version” information with each request.  That way, if a client sends a “<version-expected>12-04-2011</version-expected>” that is severely out of date, you can enforce different degrees of degradation of service.  It also gives you a really easy way to catalog who is using which old API library.
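A provider-side sketch of the “expected version” idea, in Python.  Everything here is hypothetical: the revision table, the client IDs, and the function name are made up for illustration, and a real service would decide for itself how much to degrade service for old clients.

```python
from collections import Counter

# Hypothetical revision table; the keys reuse the date style from the example above.
REVISIONS = {
    "12-04-2011": "deprecated",
    "04-21-2012": "current",
}

# Who is calling with which expected version, so old clients can be cataloged.
usage = Counter()


def check_expected_version(client_id, version_expected):
    """Return text for the response's "misc" (or "error") field; "" if all is well."""
    usage[(client_id, version_expected)] += 1
    status = REVISIONS.get(version_expected)
    if status is None:
        return "unknown API revision: " + version_expected
    if status == "deprecated":
        return "API revision " + version_expected + " is deprecated; please upgrade"
    return ""
```

The usage counter doubles as the “extra credit” log: it tells you exactly which developers to email before a revision goes away.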

Service Consumers, How do you mitigate?

The easiest way to stay on top of this is to watch the API’s developer site.  Hopefully, the API provider will give sufficient notice to let you mitigate or change things appropriately.  Some other things you can do:

  • If the API library provides a facility for self-reporting its “version”, have a log watcher that looks for words like “deprecated” in responses
  • If the API provider offers generic endpoints (/user or /latest/user) *and* version-specific endpoints (/v2/user or /12212010/user), always use the version-specific endpoints.  The contract will likely be more strongly enforced.
  • Keep an eye out for new versions of the API libraries that you use.  When a new version is released, run a regression test to make sure everything that used to work still does.  (You do have regression tests and a continuous integration environment, right? ;) )
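The log watcher from the first item above can start out very small.  This Python sketch assumes the provider puts deprecation notes in an “error” or “misc” field of the decoded response, as suggested earlier; the function and logger names are my own.

```python
import logging

logger = logging.getLogger("api.deprecations")


def find_deprecation_notices(payload, fields=("error", "misc")):
    """Return any string fields in a decoded response that mention deprecation."""
    notices = []
    for field in fields:
        value = payload.get(field)
        if isinstance(value, str) and "deprecat" in value.lower():
            notices.append(value)
            # Log it so it shows up next to the rest of your API logging.
            logger.warning("API deprecation notice in %r: %s", field, value)
    return notices
```

Run it on every decoded response and you will hear about a deprecation the first time the provider mentions it, rather than on the day the endpoint disappears.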

Hope this helps

Well, that’s my general guide for how to handle APIs, based on the lessons we’ve learned here over the past few years.  Generally, all of them focus on communication, and I can’t stress enough that as long as you get that right, everything else is extra.  (This communication applies to both consumers and providers, btw!)

Please let me know what you think.  I’m very interested in keeping this document updated with new things I’ve learned.

Well, it’s been a while again.  Hi everyone (insert excuses about work, time, etc.)  Anyways…

I’ve seen a lot of posts and comments on reddit lately talking about how using an out-of-the-box framework (Rails, Django, Cake, Grails, etc) doesn’t really buy you anything, and might actually harm you in the long run.  Most of these arguments tie into the 90-10 rule, where you’ll spend 90% of your time on 10% of your code (the unique part of your site.)  At that point, you need to optimize, scale and focus your energies on these parts of the site, and the framework will inevitably get in the way (see Twitter.)

I see some merits in that argument, but I don’t know how much I buy it.  There are a few things that frameworks do get you, and in the end if you’re writing something to last, you’re going to be building a custom framework anyway!  So, here are my reasons as to why frameworks aren’t bad.

Rapid Iteration/Bootstrap

Yes, I know… when you’re Twitter, things like Rails’ ActiveRecord start to suck.  But you need to get TO Twitter before you start worrying about those types of optimizations.  Even if you know your site is going to have a steep growth/adoption curve, you still won’t gain much by writing everything from scratch.  No matter how brilliant you are (and I know you’re brilliant!), there is an incredibly high likelihood that you won’t even know exactly what the bottleneck will be once your users start using the site.  Who knows?  Premature optimization is an evil bitch.  You can spend months optimizing the database schema and perfecting your Hadoop map/reduces and then find out that your site’s JavaScript is killing computers that are slower than that quad-core under your desk.

Free Stuff

If you go it alone, you’re setting yourself up for more work in the long run.  Sure, there are libraries out there, but you are responsible for their integration.  You’re responsible for best practices.  Meanwhile, framework users can easily leverage the community for improved backends, plugins and templates.

Free Improvements and Core Testing!

A corollary to that is you get free improvements when you use a framework.  Upgrading to a newer version of Django will give you a slew of upgrades and performance improvements that cost you nothing more than a regression test of your software on the new version.  It’s a great pick-me-up for your site.  Also, you get free security patches (although at the same time, a large framework is more of a target than your own code) and the confidence of knowing that a slew of developers and users have run their own unit tests (not to mention running the framework in their production!)  And of course, if your site does become the next Twitter, just as many people will be trying to hack you as Rails.

A Common Vocabulary

Another amazingly useful thing about using a standard open source framework is you can leverage the community for help and recruiting.  Having a conversation with another developer using the same framework provides a common vocabulary that allows you to have amazingly efficient conversations.  “I am having some difficulties integrating Twitter OAuth as an authentication backend.”  “Did you try a signal?” etc etc.

Related to this, it’s much easier to hire and bootstrap new employees when you are using a framework out in the open.  Throw a Django or Rails developer into your large codebase, and sure, there will be some ramp-up time (your architecture, custom code, hacks) BUT you will have a common starting point.  Throw a person into your custom framework and the ramp-up will take considerably longer.

Finally,

You’re Going To Write One Anyway

If you decide to go it alone, you’re going to end up writing your own custom one anyway.  Now, I know for most developers it’s more fun to start your own from scratch instead of leveraging one out there.  Additionally, most developers find their own framework better.  50% of the time it’s just because they know it better.  They know how to hook into it since they wrote it.

At work, my team wrote their own custom closed source CMS and Framework.  Sure, it works great and covers our needs, but I see these ramp up problems all the time when we hire people.  I also know that, if my team doesn’t update or secure the framework, nobody else will do it.  Instead of just developing our own site now, we also develop the framework.  This is possible because we have a team of developers, so we can split our time.

But, if you’re one guy working on a side project (or the gestation period of your startup), use a framework.  Sure, you’ll end up spending 90% of your time on that 10%.  Sure, you might be cursing a ‘limitation.’  But, for the most part, the reason you only need to spend another 10% of your time on the other 90% of the code is because it’s already written for you!

I love technology and love playing with the latest and greatest (which being ‘latest’ always means there’s something new to play with.)  In the wee hours of night after work, I’ve been playing with various frameworks and languages for a few of my side projects and discovered that I have fallen into this rut where I keep reimplementing the startup idea I have in different languages.

First, I tried to create my own framework.  That, as it became painfully obvious, was not the best idea ever (even though I’ve done that for the full time job with a team.)  When reinventing the wheel though, it’s best to look out there and see what else is available.  At least be influenced by it.  This is when my ADD kicks in something fierce.

I played with the following: Wicket, Tapestry, Rails, Grails, Zope, Restlet, Django, Rails on Java, and JSF (ICK!).  The saddest part is that, depending on the framework/language, I’ve developed pretty large core pieces of functionality in many of those frameworks.  Then I threw them out!  I guess one positive of all this back-and-forth is that after playing with these frameworks, I can feel pretty confident about the one I settled on (Django), but even now I have pangs of “let me try something new!” (or flip back to Grails occasionally.)

Anyone else have these ADD issues with frameworks and languages?  How do you end up settling on one?

If anyone is interested, I can talk about a few of those above (why I went or didn’t go for them.)

Here is my first screencast, finally polished off.  In it, I discuss how to set up the Eclipse IDE to do your Python and Django development, complete with code completion, jump-to-code functionality and live breakpoints + code replace.  I’ve recently migrated away from Textmate (a large reason for this was wanting to develop on both Linux and OSX) and found that Eclipse has really sped up learning Django and some of the libraries I’ve used.
I reference a few URLs for this screencast.  They are:

I also reference a chunk of code that your manage.py should contain.  It was originally posted here in 2007, so I’ve copied that snippet to Django snippets in case that post disappears.

With that all said and everything out of the way, here is the video.  Please let me know if you have any questions! (In case you’re wondering, I used IShowU HD Pro to record this and iMovie to edit [badly].)

Ye who enter this post, abandon all hope.

First of all, let me preface with the standard Apple-fanboy disclaimer. I’m a huge Apple fan (as the ridiculous number of Apple products littered all over my apartment can attest to.) They really have pushed forward the entire cell phone industry, and have been a catalyst for the next generation of Something Great(tm.) That being said, …

The keynote left me a little “wanting.” The updates were boring, incremental and not what you’d typically expect from Apple. Apple doesn’t do incremental. It doesn’t do evolutionary. It does revolutionary. And to see a keynote that literally spent 5 minutes talking about voice-activated controls is borderline pathetic.

There was very little new announced or demoed. The hardware upgrades are what normally would be considered “bumps” in the laptop world. I still don’t understand what “2x the magic” means, but I’m going to assume the magic is RAM. I just find it awfully insulting that some of the features are not being ported over to the 3G (video recording, voice commands.) Jailbroken phones already have video recording! The one thing that would have been game-changing, a front-facing camera, didn’t seem to make it out.

Regardless, I understand that the hardware form factor is pretty much “perfect” (See? I can be fanboyish) It’s perfect because it gets out of the way. The real form factor of the iPhone is iPhone OS. And Cupertino, that’s where we have a problem. The form factor has become totally stagnant.

Everyone remembers the first time they saw the iPhone OS. It really felt like something plucked out from the future. It’s like Doc Brown showed up in his Delorean and gave us this piece of technology from 20 years in the future.

But that was 2007. This is 2009. The iPhone OS still looks like 2007, and it’s built with some (bad) 2007 assumptions. The OS does not scale for people who install tons of apps. Ever try to move an app from Page 8 to Page 1 in Springboard? Shit is broken. It’s just an inelegant solution, and I’m not sure what an elegant solution is, but if anyone can figure it out, it’s the Apple team.

Or so you’d think. The Core UI coming out of the iPhone lately has been, well, shitty. The notification system is a dramatic example of “not really thinking the UI out.”

Here’s a pic from the keynote:

Seriously?  Imagine getting 30 of these.

Imagine a spammy app. Something like Tweetie, or AIM, or a news app. Imagine getting 30 of these notifications. That design just doesn’t scale. The alternative format of collapsing the alerts is just as shitty. Instead, you get messages like “Twitter (30) – View” or “Twitter (30) / ESPN (10) / AIM (82)”, which are unusable. It just doesn’t work. It’s inelegant, and it’s almost embarrassingly implemented. Compare that with the Palm Pre:

Now that's more like it!

Now that’s a modern, 2009 system. The design is built with notifications at the core. The notifications system slides in and out, regardless of context, doesn’t distract, and can be used to read details in place of your current app (not forcing you to switch context.) Now, I’m not just putting the Pre on a pedestal; Android does this too (and really, really well.)

This completely ties into the biggest fundamental flaw in the OS: the lack of background applications (except for the Blessed Few Apple Apps.) I hate not being able to run Pandora or Slacker in the background (or AIM, or Tweetie, or push Gmail.) I (barely) understand the battery arguments, but shouldn’t the 3G S’s new super battery have made those points moot? Additionally, Apple needs to stop coddling me like a little child. If I want to run folding@home on my phone and give it 15 minutes of battery life, let me do it. If an app truly destroys the battery, the review process will ferret this out (“This RSS reader sucks battery like crazy.”) Don’t throw the baby out with the bath water!

Now of course, other phones have their flaws too. The Pre doesn’t have an on-screen keyboard (which stinks if you want to shoot off a “Coo” text message) and I’ve become super fast with the iPhone’s corrective type system. Also, not being able to write directly to the graphics layer means no awesome games like Need For Speed Underground. But, I do think those are coming. Web OS’ 1.0, so far, is far superior to iPhone’s 1.0.

In another (related) rant or observation: did anyone else notice the strange acrimony between Apple and AT&T during the keynote? There were some (snide) comments in passing about no tethering and (especially) no MMS. Take that and add in no newly discounted data plans, no nice upgrade pricing for 3G owners and rumors of Apple taking the entire $100 price cut on their end for the 3G, and it seems like AT&T is losing some of the love that they’ve had for the folks off 280. Maybe Apple is playing hardball with their exclusivity contract? Hmm…. food for thought.