How good is Google Sites?

How good is ‘Google Sites’ as a web hosting platform?

For my contribution to the Computer Measurement Group’s (CMG) yearly conference (CMG Las Vegas 2012), I am reviewing a number of web hosting options. One of the basic options is ‘Google Sites’, which is a content management system (CMS) with hosting and a content distribution network, all rolled into one. You can have a reasonable website running in a few minutes; just add content. It is sort of an alternative to blog hosting on wordpress.com or posterous.com. And it is free.

The obvious question then is: how good is it, and what kind of load will it sustain? First some basic results: one of my sites is hosted at Google Sites, and it failed 15 out of 8611 tests in June 2012, which is an uptime better than 99.8%. The average load time of the first request is under 900 milliseconds, though it differs a bit by location. The load time of the full page is a bit longer: around 1.5 seconds to start rendering and 2.5 seconds to be fully loaded. See http://www.webpagetest.org/result/120706_VE_BTR/ for a breakdown of the site download.
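For reference, the uptime figure follows directly from those raw counts:

```python
# Availability from the monitoring figures quoted above.
failed, total = 15, 8611
print("Uptime: %.2f%%" % (100.0 * (total - failed) / total))   # -> 99.83%
```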

A more interesting question is: how does it scale? Can it handle more load than a dedicated server?

A regular dedicated server will run at more than 100 requests/second. If a web page visit results in 10 requests, this means such a server will deliver at least 10 pageviews per second, which sounds good enough for a typical blog. Most vanity bloggers will be happy to have 10 pageviews per hour :-).

Here is what I did, step by step. I started by creating a page on a fresh domain at Google Sites. With JMeter I set up a little script to poll for that page. This script was then uploaded to WatchMouse, for continuous performance evaluation, and to BlazeMeter, for load testing. After an initial trial with a single server we fired up 8 servers with 100 threads (simulated users) each.
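The actual check was a JMeter test plan; as a rough sketch, the equivalent poll in Python looks like this (the URL is a placeholder, not the real test page):

```python
# Fetch the page once and record status, size and timing -- a minimal
# stand-in for the JMeter HTTP sampler. The URL below is a placeholder.
import time
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

PAGE_URL = "http://sites.google.com/site/example/"   # placeholder test page

def poll_once():
    start = time.time()
    response = urlopen(PAGE_URL, timeout=10)
    body = response.read()
    elapsed_ms = (time.time() - start) * 1000
    return response.getcode(), len(body), elapsed_ms

if __name__ == "__main__":
    status, size, ms = poll_once()
    print("HTTP %s, %d bytes in %.0f ms" % (status, size, ms))
```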

You can see the result in the next graph: Google Sites easily handles over 150 requests per second, at a bandwidth of 3 megabytes per second. Each request is a single HTTP request.

Interestingly, Google Sites does some kind of rate limiting, as we can see in the next picture. As the number of simulated users increases, the response time increases as well, even at low volumes. There is no ‘load sensitivity point’ indicative of resource depletion.

In the next picture you can see that the response rate just levels off.

In fact, it is even likely that Google Sites is rate limiting by source IP address. While this test was run, the independent monitoring by WatchMouse showed no correlated variation.

Some final technical notes: if you want to maximize the requests/second, you need lots of threads (simulated users) with delays built in. JMeter is not good at simulating users that have no delays.
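As a rough sizing rule (a sketch with illustrative numbers, not the measured ones): each simulated user completes one request per response-plus-think-time cycle, so the required thread count follows directly from the target rate:

```python
# Rough sizing of the number of simulated users needed for a target rate.
# The response time and think time below are illustrative assumptions.
target_rps = 150        # requests per second we want to generate
response_time = 0.5     # seconds the request itself takes
think_time = 1.0        # built-in delay per simulated user between requests

rps_per_thread = 1.0 / (response_time + think_time)   # one request per cycle
threads_needed = target_rps / rps_per_thread
print("Threads needed: %d" % round(threads_needed))   # -> 225
```

With 8 load servers of 100 threads each there is ample headroom above that.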

By the way, the site under test was served by more than 40 different IP addresses. The site has low latency to places around the world: for example, locations in Ireland, China, San Francisco, and Malaysia all have connect times of less than 5 milliseconds. This substantiates the claim that Google Sites uses some kind of CDN.
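Both observations are easy to check yourself; here is a minimal sketch (the hostname is a placeholder, not the actual site under test):

```python
# Resolve the site and time a raw TCP connect, a proxy for network distance.
import socket
import time

host, port = "sites.google.com", 80   # placeholder host

# Count the distinct IPv4 addresses DNS hands out from this location.
addresses = set(info[4][0] for info in
                socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM))
print("Resolved to %d address(es)" % len(addresses))

# Time the TCP handshake to the site.
start = time.time()
sock = socket.create_connection((host, port), timeout=5)
print("Connect time: %.1f ms" % ((time.time() - start) * 1000))
sock.close()
```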

Google’s spending spree: 2.4 million servers, and counting

Google just published its Q3 financial results. You can read it yourself at http://investor.google.com/earnings/2010/Q3_google_earnings.html

So, what is Google spending on IT, and how many servers would that buy? This is one of their best kept secrets. I looked at it earlier. Let’s have a new look. Some quotes:

 Other cost of revenues, which is comprised primarily of data center operational expenses, amortization of intangible assets, content acquisition costs as well as credit card processing charges, increased to $747 million, or 10% of revenues, in the third quarter of 2010

and

 In the third quarter of 2010, capital expenditures were $757 million, the majority of which was related to IT infrastructure investments, including data centers, servers, and networking equipment.

So let us modestly assume that half the capital and half the operational expense is server related, $400 million each. Let us assume a cheap Google server costs $1,000, and the associated network, datacenter facilities and such, another $1,000. The run cost of the datacenter (power, cooling, etc.) could match that. This leads to an investment pattern of 200,000 servers per quarter, 800,000 per year. With an average lifetime of 3 years, this puts the ballpark estimate of the size of Google’s server farm at 2.4 million servers. There are entire countries that do not have that many servers. There are entire countries that do not have that many PCs.
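Spelled out (all inputs are the assumptions above, not reported figures):

```python
# The back-of-the-envelope estimate from the text.
capex_per_quarter = 400e6          # assumed server-related half of capex
cost_per_server = 1000.0           # assumed cost of a cheap Google server
infra_per_server = 1000.0          # assumed network/datacenter share per server

servers_per_quarter = capex_per_quarter / (cost_per_server + infra_per_server)
servers_per_year = 4 * servers_per_quarter
fleet_size = 3 * servers_per_year  # assumed 3-year average server lifetime

print("%d per quarter, %d per year, fleet of %d"
      % (servers_per_quarter, servers_per_year, fleet_size))
# -> 200000 per quarter, 800000 per year, fleet of 2400000
```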

Since 2004, the server farm has increased in size by a factor of 16, while revenue increased tenfold (see also my 2005 estimates). Once more, Google increases the amount of compute power that goes into a dollar of revenue, Moore’s law notwithstanding.

Engineering large digital infrastructures is not trivial

Gmail was down for a while. Google describes how it happened.

In essence, a mechanism designed to throttle load on heavily used parts of the infrastructure reduced total capacity. If demand is not then reduced, this leads to congestion, similar to what happens in a traffic jam.

One way for Gmail to reduce demand is to signal to the web browser to decrease the frequency with which it polls the Gmail servers for new mail (I do not know if they already do this).
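As a sketch of that idea (not Gmail’s actual mechanism), a client could poll on an adaptive interval and back off whenever the server signals overload:

```python
# Adaptive mail polling: back off when the server signals overload.
# check_mail() is a stand-in for the real HTTP request to the mail server.
import random
import time

def check_mail():
    """Placeholder poll: returns True if the server answered normally,
    False if it signalled overload (e.g. an HTTP 503)."""
    return random.random() > 0.1

def poll_loop(base_interval=30, max_interval=600):
    interval = base_interval
    while True:
        if check_mail():
            interval = base_interval                      # healthy: normal rate
        else:
            interval = min(max_interval, interval * 2)    # overloaded: back off
        time.sleep(interval)
```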

On a side note: can anyone think of a way to beta test systems of this size?

Which computing cloud is closer?

The ‘cloud’ stands for a worldwide infrastructure of computers that can deliver applications and content to any place on the Internet. Early examples of clouds are content distribution networks (CDNs), which serve web content from a worldwide distributed network of servers. Because the servers are closer to the user, responses are quicker. Because there are multiple servers, larger numbers of users can be served.

I have done some measurements on the proximity of a number of content distribution networks to monitoring stations around the world. Earlier I reported on the proximity of the Google App Engine (GAE) cloud. Here we do the same for more content distribution networks. Note that GAE actually lets you run applications in the cloud; here we are using only its capacity to serve content.

The WatchMouse monitoring stations are situated in 35 locations on all continents of the world. The minimum connect time to a provider, as measured from a particular location, is a good indicator of the distance to that provider.

There are other qualities a good CDN should have. It should start serving content quickly, and it should serve it with adequate speed (bandwidth). I’ll report on those measurements later.

Our Cloud Proximity Indicator is an aggregate measure, and is computed by averaging the distances to all monitoring stations.

Provider                 Cloud Proximity (msec)
Akamai (MySpace)         6
Amazon CloudFront        43
Mosso                    49
GAE                      61
CNN.com                  66
SimpleCDN                94
Single host New York     127
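As a sketch, the aggregation behind this table is simply the average, over all monitoring stations, of the minimum connect time to the provider (the per-station numbers below are placeholders, not the measured data):

```python
# Cloud Proximity Indicator: average of per-station minimum connect times.
# The values below are placeholders, not the measured data.
min_connect_ms = {
    "NL": 4.0,
    "US-East": 12.0,
    "HK": 55.0,
    "NZ": 180.0,
}

cloud_proximity = sum(min_connect_ms.values()) / len(min_connect_ms)
print("Cloud Proximity Indicator: %.0f msec" % cloud_proximity)
```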

Remarks and observations
• The round trip delay between Spain and New Zealand (opposite on the globe) is around 290 milliseconds
• Akamai is really everywhere
• Google App Engine, although fundamentally more powerful and still in beta, is pretty impressive. It is in the same league as most other CDNs.

Watching the cloud

Google App Engine is an infrastructure to deliver applications through Google’s cloud. You can drop applications written in Python into it and let Google do the hosting. I am setting up a business based on this (GriddleJuiz).
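As an illustration of how little is involved, here is a minimal sketch of such an application, using the webapp framework from the Python SDK of that time (treat the exact module names as an assumption and check the SDK documentation for the version you run):

```python
# A minimal "hello world" for Google App Engine's early Python SDK.
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write('Hello from the cloud')

application = webapp.WSGIApplication([('/', MainPage)], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()
```

Together with a small app.yaml that maps URLs to this script, that is essentially all Google needs to host and replicate the application.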

So the first obvious questions are: where is the cloud, and does it perform? With the help of my friends from WatchMouse I ran a test on one of my Google App Engine sites and compared it with a regularly hosted website. In the chart you can see some of the results: the time it takes to connect to the site from various places in the world.

The interesting observations are:

  1. Time to connect to the regular site increases with distance. We are measuring the speed of light here, sort of.
  2. The Google cloud is in more than one place. For example, it is close to the Netherlands, but it also has a presence in East Asia, near Hong Kong.
  3. The cloud is probably also close to North America, but it puzzles me why it is not nearer.
  4. The regular site is closer to monitoring station NL2, where the cloud is closer to NL4.
  5. Google does not correctly guess the location of some WatchMouse monitoring stations. E.g. DK (Denmark) is way off, and might as well be in North America. This is a common misconception among Americans 🙂

These results are pretty reproducible by the way. We have done measurements over several 24 hour cycles. The next interesting thing of course is raw performance: how many hits/second can it pull? Stay tuned for more results.

Google has solved the hard part of scalable application infrastructure: duplication over a large distance. If you can do that, you can deploy any number of servers. Yet there appears to be a lot of work left: the cloud does not always guess correctly where the user is.

Chrome: Google owns the web

In my previous post I discussed the technical qualities of Google’s new browser, Chrome. On a strategic business level, Chrome is the kick-off for a new battle for platform dominance.

How can substituting one piece of free software (the browser) for another have such business impact? To understand that, you will have to look at the business model of Microsoft, and how it is affected by the changing ecosystem.

Microsoft makes its money not from the browser (it is free), but a little from the operating system on the desktop, and quite substantially from its Office suite of applications and its back-end server and database products. The online services of Microsoft are clearly behind in the market.

The Achilles heel of this model is Office. It is essential, because it locks in users. Yet a stable Web 2.1 (see previous post) enables other parties to produce online software on a much larger scale than is currently possible. Google is already doing this with Google Docs, but between the lines of the design documentation of Chrome you can read that they have reached the limits of current browsers.

Chrome will change that, and allow for superior online alternatives to Office. That also reduces the need for running Microsoft Windows on the desktop. One of the big reasons for running a Microsoft-based IT architecture is its integration: the pieces fit together, more or less. But when essential pieces drop out, the other pieces will come under discussion as well. On the server side in particular, there are a lot of viable open source alternatives.

So Microsoft is in trouble, because Chrome has repossessed the Web. By the way, the other major collateral damage will be Firefox and other open source browsers.

Google Chrome: here is Web 2.1

Google’s new browser, Chrome, appears to be a major improvement not so much for its functionality but for its stability.

In software land, version 2 of something indicates the first serious incorporation of user feedback. In this way, Web 2.0 addressed user needs for more interactivity and multi-user, multi-site collaboration.

In software land, version 2.0 brings the new functionality, but you will have to wait for version 2.1 if you want stability. From my software years I know that this is an arduous task, and often involves major structural rethinking, even when that is hardly visible from the outside.

Looking through the design story of Chrome, it is clear that this is a major redesign that takes into account how the web is actually used today.

Of course, there is no traditional self-contained packaged software labeled Web 2.1. Making web applications work involves an entire ecosystem of software components, of which the browser is only one. Nevertheless, Chrome enables other agents in the ecosystem to assume a more dependable browser. I’ll discuss the strategic business implications of that in my next post.

Hardware can fail, you know. Things can break.

Computers are terribly reliable, in general. Today’s computers execute billions of instructions each second, with an error rate that would be inconceivable in any other technology. Yet, if you have hundreds of thousands of machines, you do need to take care of failures.

A CNET article elaborates on the Google situation (a Google cluster has several thousand machines):

In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.

These look like interesting planning assumptions for both hardware and software planners. As they say:

“Our view is it’s better to have twice as much hardware that’s not as reliable than half as much that’s more reliable,” Dean said. “You have to provide reliability on a software level. If you’re running 10,000 machines, something is going to die every day.”
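A quick back-of-the-envelope check of that last claim, assuming a cluster size of around 5,000 machines (the article only says “several thousands”):

```python
# Do Dean's figures imply daily failures at 10,000 machines?
cluster_size = 5000.0                # assumed; "several thousand" in the article
machine_failures_per_year = 1000.0   # typical first-year figure quoted above

annual_failure_rate = machine_failures_per_year / cluster_size   # ~0.2 per machine
fleet_size = 10000
failures_per_day = fleet_size * annual_failure_rate / 365.0
print("Expected machine failures per day: %.1f" % failures_per_day)   # -> ~5.5
```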

How one second delay can kill your online business

Just one second of delay in delivering a web page can have a devastating impact on your online business.

Consider this: an item on an online auction site is easily viewed 300 times.

This means that a user in search of a particular item could view 300 items before making a choice.

If each of these views incurs an additional 1 second of delay, this adds 5 minutes to the total search. That is 5 whole minutes wasted. If you can choose between two shops, one of which makes you wait idle for 5 extra minutes, which one do you choose?

Maybe this is why Google strives to perform really well, and in fact delivers its home page in 0.2 seconds on average.

The cost of delivering a single web page

The other day I wrote about Google’s technology cost, leading to an estimate of 0.5 dollar cents per delivered search result. There appeared to be a real contrast with the numbers Jim Gray was coming up with from his experience with the Terraserver. I interviewed him to dive deeper into these numbers and their composition. According to Jim, in October 2004 Terraserver was able to serve up to 3 million web pages per day (about 35 per second), each containing quite a bit of graphics. The total operating cost (excluding application development) was $387,000 per year.

A Terraserver web page thus cost 350 microdollars (about 4 hundredths of a cent) to deliver. This is an order of magnitude less than Google’s, and we are not finished. According to Jim Gray, if he were to scale up the Terraserver capacity by a factor of 100, getting it into the ballpark of Google, that cost would be even lower. In the following table I have summarised the scale factors.

Most services and capital expenditure scale linearly with the number of delivered pages, but the management of the hosting and of the application does not need to scale up. The bottom line is a page cost that dives to 64 microdollars.

As a comparison, a friend of mine runs a simple database-driven website on a shared hosting account. The yearly hosting cost is around $374, and it serves about 600,000 pages per month, resulting in a page cost of 51 microdollars.
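Both page-cost figures follow from the same arithmetic (yearly cost divided by pages served per year):

```python
# Recompute the two page costs quoted above, in microdollars per page.
terraserver_cost_per_year = 387000.0
terraserver_pages_per_year = 3e6 * 365        # 3 million pages/day

shared_cost_per_year = 374.0
shared_pages_per_year = 600000.0 * 12         # 600,000 pages/month

def microdollars_per_page(cost, pages):
    return 1e6 * cost / pages

print("Terraserver:    %.0f microdollars/page"
      % microdollars_per_page(terraserver_cost_per_year, terraserver_pages_per_year))
print("Shared hosting: %.0f microdollars/page"
      % microdollars_per_page(shared_cost_per_year, shared_pages_per_year))
# -> roughly 353 and 52 microdollars per page, in line with the figures above
```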

What can explain the difference of a factor of 100 between these page costs and Google’s? One hypothesis is that, in addition to serving up pages, Google spiders the web. However, Terraserver also brings in terabytes of geographical data every year. So either Google spends a lot more machine cycles on search results than Terraserver, or most Google hardware is not dedicated to search. Tune in next week, and maybe I’ll have some more data.