Google Dance - The Index Update of the Google Search Engine:
The name "Google Dance" has often been
used to describe the index update of the Google search engine. Google's
index update occurred on average once per month. During an index
update there was significant movement in search results and Google
showed new backward links for pages. However, in mid-2003 Google
started to update its index continuously. It appears that, still,
there has to be an update of the complete index once in a while
and during this time new backward links are shown. But, because
of the continuous update, the effects on search results seem to
be rather insignificant.
We will keep this site up and running because it provides
some information beyond the Google Dance. But there will no longer
be a monitoring of updated data centers during a "Dance".
The Technical Background of the Google Dance:
The Google search engine pulls its results from
more than 10,000 servers which are simple Linux PCs that are used
by Google for reasons of cost. Naturally, an index update cannot
be carried out on all those servers at the same time. One server after
the other has to be updated with the new index.
Many webmasters think that, during the Google Dance,
Google is in some way able to control if a server with the new index
or a server with an old index responds to a search query. But, since
Google's index is an inverted index, this would be very complicated. As we
will show below, there is no such control within the system. In
fact, the reason for the Google Dance is Google's way of using the
Domain Name System (DNS).
Google Dance and DNS:
Google's index is not only spread over more than
10,000 servers; these servers are also, as of now, placed in
13 different data centers. These data centers are mainly located
in the US (e.g. Santa Clara, California and Herndon, Virginia) and
in Dublin, Ireland.
In order to direct traffic to all these data centers,
Google could theoretically record all queries centrally and then
send them to the data centers. But this would obviously be inefficient.
In fact, each data center has its own IP address (numerical address
on the internet) and the way these IP addresses are accessed is
managed by the Domain Name System.
Basically, the DNS works like this: On the Internet,
data transfers always take place between IP addresses. The information
about which domain resolves to which IP address is provided by the
name servers of the DNS. When a user enters a domain into his browser,
a locally configured name server gets him the IP address for that
domain by contacting the name server which is responsible for that
domain. (The DNS is structured hierarchically. Illustrating the
whole process would go beyond the scope of this paper.) The IP address
is then cached by the name server, so that it is not necessary to
contact the responsible name server each time a connection to a
domain is established.
The records for a domain at the responsible name
server specify how long a record may be cached by a caching
name server. This is the Time To Live (TTL) of a domain. As soon
as the TTL expires, the caching name server has to fetch the record
for a domain again from the responsible name server. Quite often,
the TTL is set to one or more days. In contrast, the Time To Live
of the domain www.google.com is only five minutes. So, a name server
may only cache Google's IP address for five minutes and then has
to look up the IP address again.
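The caching behaviour described above can be sketched as a small simulation. This is only a minimal sketch and not how any real name server is implemented: the class name `CachingResolver`, the `lookup_authoritative` callback and the injectable clock are hypothetical names chosen for illustration. The callback stands in for the responsible name server and returns an IP address together with its TTL in seconds.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy model of a caching name server that honours a record's TTL."""

    def __init__(
        self,
        lookup_authoritative: Callable[[str], Tuple[str, int]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # lookup_authoritative(domain) -> (ip_address, ttl_in_seconds)
        self._lookup = lookup_authoritative
        self._clock = clock
        # domain -> (ip_address, time at which the record expires)
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, domain: str) -> str:
        now = self._clock()
        cached = self._cache.get(domain)
        if cached is not None and now < cached[1]:
            return cached[0]  # record still valid: answer from the cache
        # TTL expired (or never cached): ask the responsible name server again.
        ip, ttl = self._lookup(domain)
        self._cache[domain] = (ip, now + ttl)
        return ip
```

With a TTL of 300 seconds, as quoted above for www.google.com, such a resolver would ask the responsible name server at most once every five minutes per domain, and each of those lookups may return a different IP address.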
Each time Google's name server is contacted, it
sends back the IP address of only one data center. In this way,
Google queries are always directed to different data centers by
changing DNS records. On the one hand, the DNS records may be based
on the load of the individual data centers. In this way, Google would
conduct a simple form of load balancing by its use of the DNS. On
the other hand, the geographical location of a caching name server
may influence how often it receives the individual data centers' IP
addresses. So, the distance for data transmissions can be reduced.
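One way to picture load-based DNS answers is a name server that replies with the address of the least loaded data center. The following is a sketch under stated assumptions: the text above only says that the records may depend on load and location, so the function `pick_data_center` and the load figures are hypothetical and do not describe Google's actual policy.

```python
from typing import Dict


def pick_data_center(loads: Dict[str, float]) -> str:
    """Return the IP address of the data center with the lowest load.

    `loads` maps a data center IP address to a load figure between 0 and 1.
    If several data centers share the lowest load, the first one in the
    dictionary wins.
    """
    if not loads:
        raise ValueError("no data centers available")
    return min(loads, key=loads.get)


# Hypothetical load figures for three data center addresses.
example_loads = {
    "216.239.33.100": 0.80,
    "216.239.37.100": 0.35,
    "216.239.39.100": 0.60,
}
print(pick_data_center(example_loads))  # 216.239.37.100
```

A location-aware variant would simply restrict `loads` to the data centers near the caching name server before picking one.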
How data centers, DNS and the Google Dance are related
is easily explained. During the Google Dance, the data centers do
not receive the new index at the same time. In fact, the new index
is transferred to one data center after the other. When a user queries
Google during the Google Dance, he may get the results from a data
center which still has the old index at one point in time and from
a data center which has the new index a few minutes later. From
the user's perspective, the index update took place within some minutes.
But of course, this procedure may reverse, so that Google seemingly
switches between the old and the new index.
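The apparent switching can be illustrated with a toy simulation. It is only a sketch: the fixed rotation of DNS answers and the names `INDEX_BY_DATA_CENTER` and `observed_indexes` are assumptions made for illustration, and the real DNS answers did not necessarily follow any fixed order.

```python
from itertools import cycle, islice
from typing import Dict, List

# Hypothetical snapshot during a Dance: which index each data center serves.
INDEX_BY_DATA_CENTER: Dict[str, str] = {
    "216.239.33.100": "new",
    "216.239.35.100": "old",
    "216.239.37.100": "new",
    "216.239.39.100": "old",
}


def observed_indexes(index_by_ip: Dict[str, str], lookups: int) -> List[str]:
    """Return the index a user sees after each of `lookups` DNS refreshes.

    Assumes, for simplicity, that every refresh (one per expired TTL) hands
    out the next data center in a fixed rotation.
    """
    rotation = cycle(index_by_ip)  # iterates over the IP addresses
    return [index_by_ip[ip] for ip in islice(rotation, lookups)]


print(observed_indexes(INDEX_BY_DATA_CENTER, 6))
# ['new', 'old', 'new', 'old', 'new', 'old']
```

Each entry corresponds to one DNS refresh, so with a TTL of five minutes the user in this toy model would see the results flip every five minutes.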
Finally, it should be noted that Google did the
DNS load balancing themselves until September 2003. Since then,
they have used the services and, hence, the name servers of Akamai
Technologies, Inc.
IP Addresses and Domains of Google's Data Centers:
The progression of a Google Dance could basically
be watched by querying the IP addresses of Google's data centers.
But queries on the IP addresses are normally redirected to www.google.com.
However, Google has domains which resolve to the individual data centers'
IP addresses. These domains as well as their IP addresses are shown
in the following list.
| Domain: | IP Address: |
| www-ex.google.com | 216.239.33.100 |
| www-sj.google.com | 216.239.35.100 |
| www-va.google.com | 216.239.37.100 |
| www-dc.google.com | 216.239.39.100 |
| www-ab.google.com | 216.239.51.100 |
| www-in.google.com | 216.239.53.100 |
| www-zu.google.com | 216.239.55.100 |
| www-cw.google.com | 216.239.57.100 |
| www-fi.google.com | 216.239.41.100 |
| www-gv.google.com | 216.239.59.100 |
| www-kr.google.com | 66.102.11.100 |
| www-mc.google.com | 66.102.7.100 |
| www-lm.google.com | 66.102.9.100 |
Note: Searches at www-zu and www-sj are currently
redirected to other data centers. Since results for searches at
their IP addresses fluctuate heavily during a Google Dance, these
searches also seem to be internally routed to other data centers.
As we can see from our statistics for Google's DNS records, there
are currently no searches at www.google.com directed to www-zu and
www-sj. So, we can assume that the data centers are offline.
Those who keep an eye on Google's index updates
often think that the Google Dance is over when they see the new
index at www.google.com or when they don't see the old index at
www.google.com for some time. In fact, the update is not finished
until all the domains listed above provide results from the new
index.
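This rule is easy to state in code. The sketch below assumes that you have already determined, by hand, which index each data center domain currently shows. The function name `dance_finished` and the status values "old" and "new" are hypothetical; the short domain names are the ones listed above.

```python
from typing import Dict

# Short names of the data center domains listed in this article.
DATA_CENTER_DOMAINS = (
    "www-ex", "www-sj", "www-va", "www-dc", "www-ab", "www-in", "www-zu",
    "www-cw", "www-fi", "www-gv", "www-kr", "www-mc", "www-lm",
)


def dance_finished(status: Dict[str, str]) -> bool:
    """Return True only if every listed data center reports the new index.

    `status` maps a short domain name such as "www-ex" to "old" or "new".
    A data center that is missing from `status` counts as not yet updated.
    """
    return all(status.get(name) == "new" for name in DATA_CENTER_DOMAINS)
```

Seeing the new index at www.google.com alone corresponds to a `status` with a single "new" entry, for which the function still returns False.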
The index updates at the individual data centers seem
to happen at one point in time. As soon as one data center shows
results from the new index, it won't switch back to the old index.
This happens most likely because the index is redundant at each
data center and at first, only one part of the servers (possibly
half of them) is updated. During this period, only the other half
of the servers is active and provides search results. As soon as
the update of the first half of servers is finished, they become
active and provide search results while the other half receives
the new index. Thus, from the user's perspective, the update of
one data center happens at one point in time.
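The presumed two-step update can be modelled in a few lines. This is a sketch of the hypothesis described above, not of Google's actual procedure; the function name `visible_versions` and the four phases are assumptions made for illustration.

```python
from typing import List, Set


def visible_versions(num_servers: int) -> List[Set[str]]:
    """Index versions visible to users in each phase of a two-step update."""
    if num_servers < 2:
        raise ValueError("need at least two servers to update in halves")
    half = num_servers // 2
    versions = ["old"] * num_servers
    active = [True] * num_servers
    phases: List[Set[str]] = []

    def snapshot() -> None:
        # Only active servers answer queries, so only their index is visible.
        phases.append({v for v, a in zip(versions, active) if a})

    snapshot()  # phase 1: before the update, all servers serve the old index
    for i in range(half):
        active[i] = False  # first half goes offline to receive the new index
    snapshot()  # phase 2: second half alone serves the old index
    for i in range(num_servers):
        if i < half:
            versions[i], active[i] = "new", True  # first half comes back
        else:
            active[i] = False  # second half goes offline
    snapshot()  # phase 3: first half alone serves the new index
    for i in range(half, num_servers):
        versions[i], active[i] = "new", True
    snapshot()  # phase 4: all servers serve the new index
    return phases


print(visible_versions(10))  # [{'old'}, {'old'}, {'new'}, {'new'}]
```

In no phase do the active servers hold both versions, which is consistent with the observation that a data center never switches back to the old index.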
Finally, it should be noted that the access to the
individual data centers is generally controlled by the DNS only, but
sometimes queries are redirected. However, this is easy to detect:
When for a query at one of the domains listed above, the links to
Google's cache do not match the IP address that belongs to
the domain, then the query is redirected. If this happens, Google
is blocking - for whatever reason - the access to one data center.
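The check described above can be sketched as follows. The sketch assumes that cache links on a result page have the form `http://<IP address>/search?q=cache:...`; this link format, the regular expression and the function names are assumptions for illustration, and the page source has to be obtained by hand.

```python
import re
from typing import Set

# Assumed form of a cache link: http://<ip>/search?q=cache:...
CACHE_LINK_IP = re.compile(r"https?://(\d{1,3}(?:\.\d{1,3}){3})/search\?q=cache:")


def cache_link_ips(html: str) -> Set[str]:
    """Collect the IP addresses used in the cache links of a result page."""
    return set(CACHE_LINK_IP.findall(html))


def is_redirected(html: str, expected_ip: str) -> bool:
    """Return True if any cache link points to an IP other than expected_ip.

    If the page contains no cache links at all, nothing can be concluded
    and the function returns False.
    """
    return any(ip != expected_ip for ip in cache_link_ips(html))
```

For a query at www-ex.google.com, `expected_ip` would be 216.239.33.100 according to the list above.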
The Google Dance Test Domains www2 and www3:
The beginning of a Google Dance can always be watched
at the test domains www2.google.com and www3.google.com. Those domains
normally have stable DNS records which make the domains resolve
to only one (often the same) IP address. Before the Google Dance
begins, at least one of the test domains is assigned the IP address
of the data center that receives the new index first.
Building up a completely new index once per month
can cause quite some trouble. After all, Google has to spider several
billion documents and then process many terabytes of data. Therefore,
testing the new index is indispensable. Of course, the folks at Google
don't need the test domains themselves. Most certainly, they have
many options to check a new index internally, but they do not have
a lot of time to conduct the tests.
So, the reason for having www2 and www3 is rather
to show the new index to webmasters who are interested in their
upcoming rankings. Many of these webmasters discuss the new index
at the Google forums out on the web. These discussions can be observed
by Google employees. At that time, the general public cannot see
the new index yet, because the DNS records for www.google.com normally
do not point to the IP address of the data center that is updated
first when the update begins.
As soon as Google's test community of forum members
does not find any severe malfunctions caused by the new index, Google's
DNS records are ready to make www.google.com resolve to the data
center that is updated first. This is the time when the Google Dance
begins. But if severe malfunctions become obvious during this test
phase, there is still the possibility to cancel the update at the
other data centers. The domain www.google.com would not resolve
to the data center which has the flawed index and the general public
would not notice anything. In this case, the index could
be rebuilt or the web could be spidered again.
So, the search results which are to be seen on
www2.google.com and www3.google.com will always appear on www.google.com
later on, as long as there is a regular index update. However, there
may be minor fluctuations. On the one hand, the index at one data
center never absolutely equals the index at another data center.
We can easily check this by watching the number of results for the
same query at the data center domains listed above, which often
differ from each other. On the other hand, it is often assumed that
the iterative PageRank calculation is not yet finished when the
Google Dance begins, so that preliminary values exert influence on
rankings at that point in time.
The New PageRank Values during the Google Dance:
Most webmasters are interested in ranking changes
for their website during the Google Dance. But, besides that, many
also want to know about their new PageRank values. Normally, the
Google Toolbar fetches the PageRank values from the data center
that is specified by its IP address in the current DNS record for
www.google.com. Hence, when the Google Dance begins, the Toolbar
usually displays the old PageRank values.
Google submits PageRank values in simple text files
to the Toolbar. Formerly, this happened via XML. The switch
to text files occurred in August 2002. The PageRank files can be
requested directly from the domain www.google.com. Basically, the
URLs for those files look as follows (without line breaks):
http://www.google.com/search?client=navclient-auto&ch=0123456789&
features=Rank&q=info:http://www.domain.com/
There is only one line of text in the PageRank
files. The last number in this line is the PageRank.
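Reading the value from such a file can be sketched as follows. Since the exact layout of the line is not described here, the function only relies on the statement above that the PageRank is the last number in the line; the function name `parse_pagerank` and the sample lines in the usage note are hypothetical.

```python
import re
from typing import Optional


def parse_pagerank(line: str) -> Optional[int]:
    """Return the last number in a PageRank file line, or None if absent."""
    match = re.search(r"(\d+)\s*$", line)
    return int(match.group(1)) if match else None
```

For a hypothetical line such as `Rank_1:1:5` the function returns 5, and for a line that does not end in a number it returns None.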
The parameters in the URL shown above
are required for the display of the PageRank files in a browser.
The value "navclient-auto" for the parameter "client"
identifies the Toolbar. Via the parameter "q" the URL
is submitted. The value "Rank" for the parameter "features"
determines that the PageRank files are requested. If it is omitted,
Google's servers still transmit XML files. The parameter "ch"
transfers a checksum for the URL to Google, whereby this checksum
can only change when the Toolbar version is updated by Google.
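The structure of such a URL can be sketched with Python's standard library. The function name `pagerank_url` is hypothetical, and the checksum is not computed here: it has to be supplied by the caller, since its calculation is not described in this article. The placeholder checksum and domain are the ones from the example URL above.

```python
from urllib.parse import urlencode


def pagerank_url(page_url: str, checksum: str, host: str = "www.google.com") -> str:
    """Assemble a PageRank file URL from the parameters described above."""
    params = {
        "client": "navclient-auto",  # identifies the Toolbar
        "ch": checksum,              # checksum for the URL, supplied by the caller
        "features": "Rank",          # requests the PageRank file
        "q": "info:" + page_url,     # the URL whose PageRank is requested
    }
    # Keep ":" and "/" unescaped so the result matches the form shown above.
    return "http://" + host + "/search?" + urlencode(params, safe=":/")


print(pagerank_url("http://www.domain.com/", "0123456789"))
```

The `host` argument is included so that the same function can be pointed at an IP address instead of www.google.com.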
The PageRank files that are requested by the Google
Toolbar are cached by Internet Explorer. So, their URLs and
the checksums can simply be found by having a look at the
folder Temporary Internet Files. Knowing the checksums of your URLs,
you can view the PageRank files in your browser. Since the PageRank
files are kept in the browser cache and, thus, are clearly visible,
and as long as requests are not automated, watching the PageRank
files in a browser should not be a violation of Google's Terms of
Service. However, you should be cautious. The Toolbar submits its
own User-Agent to Google. It is:
Mozilla/4.0 (compatible; GoogleToolbar 1.1.60-deleon; OS SE 4.10)
1.1.60-deleon is a Toolbar version which may of
course change. OS is the operating system that you have installed.
So, Google is able to identify requests by browsers, if they do
not go out via a proxy and if the User-Agent is not modified accordingly.
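How a server could tell Toolbar requests from browser requests can be sketched with a simple pattern match on the User-Agent. This is an illustration only: how Google actually inspects requests is not known, and the function name `toolbar_version` and the browser User-Agent in the usage note are assumptions.

```python
import re
from typing import Optional

# Matches the version that follows "GoogleToolbar" in the User-Agent above.
TOOLBAR_PATTERN = re.compile(r"GoogleToolbar\s+([^;\s]+)")


def toolbar_version(user_agent: str) -> Optional[str]:
    """Return the Toolbar version from a User-Agent, or None for other clients."""
    match = TOOLBAR_PATTERN.search(user_agent)
    return match.group(1) if match else None
```

For the Toolbar's User-Agent shown above the function returns `1.1.60-deleon`, while for a typical browser User-Agent such as `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` it returns None.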
Now, let's see how we can get the new PageRank
values. Taking a look at IE's cache, you will notice that the PageRank
files are not requested from the domain www.google.com but from
IP addresses like 216.239.33.102. Additionally, the PageRank files'
URLs often contain a parameter "failedip" that is set
to values like "216.239.35.102;1111" (its function is
not entirely clear). However, it is pretty easy to get the new
PageRank values. Simply modify the IP addresses in the URL so that
the request goes to one of the data centers that already has the
new index. The necessary information is given above.
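Swapping the IP address in a PageRank file URL can be sketched as follows. The function name `retarget` is hypothetical, and which address of a data center answers Toolbar requests is not stated here, so the target IP is left to the caller. The sketch only rewrites the URL and does not send any request.

```python
from urllib.parse import urlsplit, urlunsplit


def retarget(url: str, data_center_ip: str) -> str:
    """Return `url` with its host replaced by `data_center_ip`.

    Path and query string (including the checksum) are left untouched.
    """
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=data_center_ip))


original = (
    "http://216.239.33.102/search?client=navclient-auto&ch=0123456789&"
    "features=Rank&q=info:http://www.domain.com/"
)
print(retarget(original, "216.239.37.100"))
```

The placeholder checksum and domain are again taken from the example URL earlier in this article.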