Gmail was down for a while. Google describes how it happened.
In essence, a mechanism designed to throttle load on heavily used parts of the infrastructure reduced total capacity. If demand is not reduced at the same time, this leads to congestion, similar to what happens in a traffic jam.
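To make the congestion argument concrete, here is a toy model of such a cascade. It is only a sketch under my own assumptions, not a description of Google's actual architecture: the function name `simulate_cascade`, the even splitting of demand, and all the numbers are hypothetical.

```python
def simulate_cascade(capacities, total_demand, taken_offline):
    """Return how many servers are still serving once the pool settles.

    Assumptions of this toy model: demand is split evenly over the servers
    still in the pool, and any server whose share exceeds its capacity
    drops out, pushing its load onto the remaining servers.
    """
    # Maintenance removes the first `taken_offline` servers from the pool.
    pool = list(capacities[taken_offline:])
    while pool:
        share = total_demand / len(pool)
        survivors = [c for c in pool if c >= share]
        if len(survivors) == len(pool):
            return len(pool)  # every server can carry its share: stable
        pool = survivors      # overloaded servers drop out, try again
    return 0                  # nothing left: the whole pool has collapsed


# Ten servers of capacity 100 and a demand of 800 leave some headroom.
print(simulate_cascade([100] * 10, 800, taken_offline=1))
# Taking three servers offline pushes every remaining server over capacity.
print(simulate_cascade([100] * 10, 800, taken_offline=3))
```

In this model the demand never changes. The only thing that changes is how much capacity is left to carry it, which is why a routine maintenance step can tip a pool without headroom into collapse.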
One way for Gmail to reduce demand is to signal the web browser to decrease the frequency with which it polls the Gmail servers for new mail (I do not know if they already do this).
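A minimal sketch of what such client-side behaviour could look like follows. I am not claiming this is what Gmail does: the function name `next_poll_interval`, the `overloaded` flag, and the default intervals are all assumptions for illustration. The flag stands in for whatever signal the server would send, for example an HTTP 503 response.

```python
def next_poll_interval(current, overloaded, base=10.0, ceiling=300.0):
    """Return the number of seconds to wait before the next poll.

    `overloaded` is a hypothetical flag derived from the server's last
    response. `base` and `ceiling` are illustrative bounds in seconds.
    """
    if overloaded:
        # Server asked for relief: double the wait, up to the ceiling.
        return min(current * 2, ceiling)
    # Server is healthy: halve the wait, but never poll faster than base.
    return max(current / 2, base)


# Walk through a short sequence of server signals.
interval = 10.0
for overloaded in [True, True, True, False]:
    interval = next_poll_interval(interval, overloaded)
    print(interval)
```

Doubling the interval halves the load each client generates, so demand falls quickly while the servers are struggling and recovers gradually once they are not.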
On a side note: can anyone think of a way to beta test systems of this size?
5 Comments on “Engineering large digital infrastructures is not trivial”
Khyron
2 September 2009 at 06:27>Beta test what? It was an upgrade taking servers out of service. The process is probably well known, well documented and otherwise reliable. There just wasn't headroom and it led to a series of cascading failures, much like the August 2003 problems in the NE United States.
Paul Kedrosky has spoken about these tightly coupled systems and cascading failures before too. Commonly,
Khyron
2 September 2009 at 06:30>Admittedly, I think it a bad idea to perform this kind of maintenance in the middle of the day in North America (which I imagine has more Gmail users than most other regions, on a traffic basis). But hey, who am I? Besides, if they've done it before successfully, and I imagine they do it a lot, it's only a problem when something goes wrong.
pve
2 September 2009 at 06:47>Yz, you are quick to respond! Where did you pick it up?
Beta testing is of course impossible on this scale. I meant it ironically. I work with a number of projects that seem to take forever to engineer out a risk that you can only see in production. You might as well jump in.
Khyron
2 September 2009 at 07:25>Well, I bring it up because it sounds like a process failure, which you can't really beta test for.
(Ok, not entirely true, but a lot more difficult.)
When I read your question, I was thinking about a side conversation I had with Jonathan Heiliger @ GigaOm's Structure 09 conference back in June. I asked Jonathan about how they test new features, and he basically said
Khyron
2 September 2009 at 07:26>Oh, and I saw it linked back from the Google blog post. Remember who owns Blogger…