Cloud asset management – basic features

One of the fundamental characteristics of cloud computing is rapid elasticity: resources such as server instances can appear and disappear quickly.

Good IT management practice requires that we administer and monitor our resources independently of the provider of those resources. So we have services and assets on the one hand, and asset administration and monitoring on the other. Asset administration often revolves around some kind of 'configuration management database' (CMDB). The challenge in the cloud is that this database changes much more rapidly.

CMDBs are notoriously incomplete. With cloud computing this is likely to be worse. How can new tool functionality help?

In this post I want to give some examples and outline some basic features for this. My favourite way of doing this is to take a small example and then expand it to illustrate the concepts.

Take a website. Even if it is procured as a service rather than run on company-owned servers, it should be in the CMDB. Basic monitoring for it would check whether the URL still responds. If it fails to respond in time, an alert is sent.
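
To make this concrete, here is a minimal sketch of such a check in Python. The URL, the timeout and the send_alert() function are placeholders for illustration, not features of any particular monitoring product.

    # Minimal availability check: fetch a URL and alert if it does not
    # respond correctly within the timeout.
    import urllib.request

    CHECK_URL = "https://www.example.com/"   # the website asset from the CMDB
    TIMEOUT_SECONDS = 10                     # "fails to respond in time" threshold

    def send_alert(message: str) -> None:
        # Placeholder: in practice this would page an operator or open a ticket.
        print("ALERT:", message)

    def check_website(url: str = CHECK_URL) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
                return 200 <= response.status < 300
        except Exception as exc:             # timeout, DNS failure, HTTP error, ...
            send_alert(f"{url} did not respond correctly: {exc}")
            return False

    if __name__ == "__main__":
        check_website()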

One way of extending this is to figure out what the 'adjacent' assets are. Examples include DNS records and nameservers, because these are essential for the correct delivery of the service as perceived by the users. Other examples are any associated third-party content and content distribution networks.

How would this work? The asset management tool and the monitoring service are likely to be independent tools, so there could be some interaction where, driven by asset management policies, monitoring rules are added as appropriate. That is, when you list a web server in the CMDB, it is automatically added to the monitoring service, and the adjacent DNS assets are automatically added as well. This speeds up troubleshooting, as the monitoring service gives insight into the component that is failing.
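
As a sketch of what such policy-driven registration could look like: the cmdb and monitoring clients below, with their add_asset() and add_rule() methods, are hypothetical, and the adjacent nameservers are discovered with the third-party dnspython package.

    # Registering a website also registers its adjacent DNS assets.
    import dns.resolver   # third-party: pip install dnspython

    def register_website(cmdb, monitoring, hostname: str) -> None:
        # Register the primary asset and its basic availability check.
        cmdb.add_asset(kind="website", name=hostname)
        monitoring.add_rule(kind="http", target=hostname)

        # Discover the adjacent DNS assets and register them as well.
        zone = ".".join(hostname.split(".")[-2:])        # crude guess at the zone
        for record in dns.resolver.resolve(zone, "NS"):
            nameserver = str(record.target).rstrip(".")
            cmdb.add_asset(kind="nameserver", name=nameserver, parent=hostname)
            monitoring.add_rule(kind="dns", target=nameserver, zone=zone)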

Another situation is with auto-scaling servers. When load increases, auto-scaling automatically adds server instances. It would be nice if auto-scaling also registered the new instance in the CMDB as well as with the monitoring service. In fact, the instance could do this itself, by way of a 'please monitor me' message.
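
A sketch of what that message could look like on an EC2 instance: the instance metadata paths are Amazon's standard ones, but the registration endpoint and the payload format are made up for illustration.

    # 'Please monitor me': run at instance boot to register with CMDB/monitoring.
    import json
    import urllib.request

    METADATA_URL = "http://169.254.169.254/latest/meta-data/"      # EC2 instance metadata
    REGISTRY_URL = "https://cmdb.example.internal/api/register"    # hypothetical endpoint

    def metadata(path: str) -> str:
        with urllib.request.urlopen(METADATA_URL + path, timeout=2) as resp:
            return resp.read().decode()

    def please_monitor_me() -> None:
        payload = {
            "instance_id": metadata("instance-id"),
            "public_hostname": metadata("public-hostname"),
            "action": "register",   # a matching 'deregister' could run at shutdown
        }
        request = urllib.request.Request(
            REGISTRY_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=5)

    if __name__ == "__main__":
        please_monitor_me()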

Now if you think this is simple, please figure out what should happen when server instances are stopped. What do you want to happen with the CMDB entries and the monitoring data? You don't want to delete them immediately.

Finally, turning things around, the CMDB should also contain the monitoring rules themselves. Remember, they have configuration data and they cost money, so they are assets.

DNS attack measurements and graphs

As a follow-up to my earlier post, here are some more details on the denial of service attack on DNS Made Easy of August 7th, 2010.

The first graph represents the time it took to resolve a domain name to an IP address, averaged by the hour. The domains all have their records served by DNS Made Easy. Each data point represents over 100 measurements.

The second graph represents the downtime of the websites with these domain names. This includes all error types: DNS resolution problems, server failures and other problems.

These graphs clearly show the degraded performance as a result of the denial of service attack, occurring between 8 am and 4 pm (Greenwich Mean Time/UTC), which is a bit longer than the provider claims.

The third and fourth graphs represent the same reports, but for domains that are totally independent of DNS Made Easy. Each data point represents approximately 1500 measurements. A somewhat similar response-time effect is seen, although there does not appear to be much downtime, and the downtime that is visible has other causes. Please note that these graphs have different scales from the first two.

An attack on DNS is an attack on the Internet

On Saturday, August 7th, 2010, DNS provider DNS Made Easy was the target of a very large denial of service attack. As far as can be determined, the total traffic volume exceeded 40 gigabits per second, enough to saturate 1 million dial-up Internet lines. Several of DNS Made Easy's upstream providers had saturated backbone links themselves. There are indications that not only DNS Made Easy suffered from this attack, but the Internet as a whole.
An attack on DNS is an attack on the Internet in two ways. Name servers are a critical link in almost every Internet access. And, as our research shows, the consequences of this attack reached wider than its primary target.
According to DNS Made Easy, service impact was limited; according to our measurements it was around 5-10% on a global basis. In the provider's own words:
“In some regions there were no issues, in other regions  outages lasted a few minutes, while in other regions there were sporadic (up and down) outages for a couple of hours.  In Europe for instance there was never any downtime.  In Asia downtime continued longer than other regions. In United States the west coast was hit much harder and experienced issues longer than the central and east coast.”
DNS was designed from the ground up to be resilient to individual server failures. In theory this should make the loss of a few servers irrelevant. On top of this, the provider has implemented an anycast routing infrastructure, which works to ensure that DNS queries all over the world are resolved regionally. Note that because of the anycast routing of this provider, outages are related to the location where the clients (resolvers) are located, not the servers whose names are being queried.
However, measurements and analyses that I made in collaboration with WatchMouse.com have uncomfortable implications. WatchMouse regularly measures the performance, including the DNS resolve time, of thousands of sites through a network of more than 40 stations spread over all continents.
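For readers who want to get a feel for this kind of measurement, the following rough sketch times a single DNS resolution using the third-party dnspython package; the actual research aggregated many such measurements from more than 40 stations, which this does not attempt.

    # Time a single DNS resolution, in milliseconds.
    import time
    import dns.resolver   # third-party: pip install dnspython

    def resolve_time_ms(hostname: str) -> float:
        resolver = dns.resolver.Resolver()
        start = time.perf_counter()
        resolver.resolve(hostname, "A")   # raises an exception on timeout or failure
        return (time.perf_counter() - start) * 1000.0

    if __name__ == "__main__":
        print(f"example.com resolves in {resolve_time_ms('example.com'):.1f} ms")
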
In a dataset of sites whose DNS records are served by the provider, resolve times rose from a normal average of less than 100 milliseconds to over 200 milliseconds during the hours of the attack. The average failure rate in this dataset is around 1%; during the attack hours it rose to 5% and even 10%. As can be expected, these failure rates differed greatly by monitoring station, though it is hard to see a geographical pattern.
Another dataset consists of regular measurements of more than 300 sites, with a total of more than 300,000 individual measurements over a period of 8 days. In contrast to the first dataset, none of these sites has its DNS service from DNS Made Easy. These sites are operated by a wide variety of industries.
In the seven days leading up to the attack, the daily average DNS resolution time in this dataset was between 352 and 379 milliseconds. On the 7th of August, the average was 453 milliseconds, which is significantly higher. Averaged by the hour, resolution times rose to 600 and even 800 milliseconds. There are failure-rate fluctuations in this dataset, but they appear to be uncorrelated with the attack.
Note that these measurements support the provider's claim of faster resolve times: a regular DNS lookup in the second dataset takes around 350 milliseconds, while DNS Made Easy's average is less than 100 milliseconds.
In conclusion, these results are disturbing because even sites that are totally unrelated to DNS Made Easy saw their response times affected. The implication is that this denial of service attack was big enough to cause collateral damage to the rest of the Internet.

Amazon CloudFront movements

This article, a follow-up to an earlier blog post, gives a more detailed look at the proximity of the Amazon CloudFront service, derived from the time it takes to connect to the service from a number of locations.


The summary is that CloudFront is on average about 40-50 milliseconds away from a random point on the Internet. This is pretty good compared to a site located in, for example, New York (120 milliseconds), and is in the same league as other content distribution networks. In specific markets it is very close: San Francisco: 3 milliseconds, New York: 13 milliseconds, Western Europe: 1-30 milliseconds.

According to Amazon, CloudFront is in 16 locations, in contrast to the S3 storage service and the EC2 compute service, which have only 4 points of presence around the world.

The following table gives distances (in milliseconds) from selected locations of the monitoring network to Amazon CloudFront (cities annotated with CF have CloudFront locations):

Distance   City             Country          CF
1          Amsterdam        Netherlands      CF
1          Ashburn          U.S.A.           CF
2          Santa Clara      U.S.A.           CF
2          Dallas           U.S.A.           CF
3          Hong Kong        China            CF
3          Singapore        Singapore        CF
4          Cologne          Germany
5          Nagano           Japan
7          Manchester       United Kingdom
9          New York         U.S.A.           CF
11         Kuala Lumpur     Malaysia
12         London           United Kingdom   CF
20         Padova           Italy
27         Dublin           Ireland          CF
36         Bangkok          Thailand
59         Mumbai           India
75         Haifa            Israel
154        Sydney           Australia
176        Rio de Janeiro   Brazil
244        Cape Town        South Africa

As the table shows, the proximity of CloudFront is uneven around the world. 

CloudFront changes its connectivity regularly, mostly for the better. An interesting data point, for example, is that on April 8 CloudFront created a presence near Hong Kong, dropping the distance from 160 milliseconds to 4 milliseconds. The following graph gives more detail.


In line with our earlier research, this data too shows that maintaining good proximity on an ever-changing Internet is not a trivial thing to do. See for example the fluctuating proximities to New York and New Zealand.

Our research was done in collaboration with Jitscale (a cloud consultancy) and WatchMouse (a website monitoring company). Distances are measured by timing a TCP connect to an HTTP URL of an object served by CloudFront; this does not include the DNS lookup.
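
As an illustration of that method, the sketch below resolves the hostname first and then times only the TCP connect, so the DNS lookup is excluded from the measurement. The CloudFront hostname is a placeholder, not one used in the research.

    # Time a TCP connect to a CloudFront-hosted object, excluding DNS lookup.
    import socket
    import time

    HOSTNAME = "d1234567890.cloudfront.net"   # placeholder distribution hostname
    PORT = 80

    def tcp_connect_ms(hostname: str, port: int = PORT) -> float:
        ip = socket.gethostbyname(hostname)   # DNS resolved up front, not timed
        start = time.perf_counter()
        with socket.create_connection((ip, port), timeout=5):
            pass
        return (time.perf_counter() - start) * 1000.0

    if __name__ == "__main__":
        print(f"{HOSTNAME}: {tcp_connect_ms(HOSTNAME):.1f} ms")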