PR0 - Google's PageRank 0 Penalty:
By the end of 2001, the Google search engine had introduced
a new kind of penalty for websites that use questionable search
engine optimization tactics: a PageRank of 0. In search engine optimization
forums it is called PR0, and this term will also be used here. Characteristic
of PR0 is that all or at least many pages of a website show
a PageRank of 0 in the Google Toolbar, even if they have high
quality inbound links. Those pages are not completely removed from
the index, but they always appear at the end of search results and
are therefore hardly ever found.
A PageRank of 0 does not always mean a penalty. Sometimes, websites
which seem to be penalized simply lack inbound links with a sufficiently
high PageRank. But if pages of a website which formerly
ranked well in search results suddenly show the dreaded white PageRank
bar, and if there have not been any substantial changes regarding
the inbound links of that website, this is - according to the
prevailing opinion - certainly a penalty by Google.
We can do nothing but speculate about the causes
of PR0 because Google representatives rarely publish new information
on Google's algorithms. Nonetheless, because of its serious effects
on search engine optimization, we want to present a theoretical
approach to how PR0 may work.
The Background of PR0:
Spam has always been one of the biggest problems
that search engines have had to deal with. When search engines
detect spam, the usual procedure is to ban the pages, websites,
domains or even IP addresses in question from the index. But removing
websites manually from the index always requires a lot
of personnel. This is costly and definitely runs contrary to
Google's scalability goals. So, it appears to be necessary to filter
spam automatically.
Filtering spam automatically carries the risk of
penalizing innocent webmasters and, hence, the filters have to react
rather cautiously to potential spam. But then, a lot of spam can pass
the filters and some additional measures may be necessary. In order
to filter spam effectively, it might be useful to take a look at
links.
That Google uses link analysis in order to detect
spam has been confirmed more or less clearly in WebmasterWorld's
Google News Forum by a Google employee who posts as "GoogleGuy".
Over and over again, he advises webmasters to avoid "linking
to bad neighbourhoods". In the following, we want to specify
what "linking to bad neighbourhoods" means and, to be more
precise, we want to discuss how spam can be identified
by analyzing link structures. In particular, it
will be shown how entire networks of spam pages, which may even
be located on many different domains, can be detected.
BadRank as the Opposite of PageRank:
The theoretical approach to PR0 presented
here was initially brought up by Raph Levien (www.advogato.org/person/raph).
We want to introduce a technique that - just like PageRank - analyzes
link structures but that, unlike PageRank, does not determine the
general importance of a web page but rather measures its negative
characteristics. For the sake of simplicity, this technique shall
be called "BadRank".
BadRank is in principle based on "linking to bad neighbourhoods".
If one page links to another page with a high BadRank, the first
page gets a high BadRank itself through this link. The similarities
to PageRank are obvious. The difference is that BadRank is not based
on the evaluation of inbound links of a web page but on its outbound
links. In this sense, BadRank represents a reversal of PageRank.
In a direct adaptation of the PageRank algorithm,
BadRank would be given by the following formula:
BR(A) = E(A) (1-d) + d (BR(T1)/C(T1) + ... + BR(Tn)/C(Tn))
where
- BR(A) is the BadRank of page A,
- BR(Ti) is the BadRank of the pages Ti to which page A links,
- C(Ti) is here the number of inbound links of page Ti and
- d is the damping factor, which is again necessary.
In the previously discussed modifications of the
PageRank algorithm, E(A) represented the special evaluation of certain
web pages. In the BadRank algorithm, this value reflects
whether a page was detected by a spam filter or not. Without the value
E(A), the BadRank algorithm would be useless because it would be nothing
but another analysis of link structures which does not take any
further criteria into account.
By means of the BadRank algorithm, spam pages can first of all
be evaluated. A filter assigns a numeric value E(A)
to them, which can, for example, be based on the degree of spamming
or, maybe even better, on their PageRank. Here again, the sum
of all E(A) has to equal the total number of web pages. In the course
of the iterative computation, BadRank is not only transferred to pages
which link to spam pages. Rather, BadRank is able to identify regions
of the web where spam tends to occur relatively often, just as PageRank
identifies regions of the web which are of general importance.
Of course, BadRank and PageRank differ significantly, especially
because they use outbound and inbound links, respectively. Our example
shows a simple, hierarchically structured website that reflects
common link structures pretty well. Each page links to every page
which is on a higher hierarchical level and on its branch of the
website's tree structure. Each page links to the pages which are arranged
hierarchically directly below it and, additionally, pages on the
same branch and the same hierarchical level link to each other.
The following table shows the distribution of inbound
and outbound links for the hierarchical levels of such a site.
| Level | Inbound Links | Outbound Links |
| 0 | 6 | 2 |
| 1 | 4 | 4 |
| 2 | 2 | 3 |
As expected, the number of inbound links decreases
hierarchically from the index page downwards. In contrast,
we find the highest number of outbound links on the website's mid-level.
We see similar results when we add another level of pages to
our website while the linking rules described above stay the same.
| Level | Inbound Links | Outbound Links |
| 0 | 14 | 2 |
| 1 | 8 | 4 |
| 2 | 4 | 5 |
| 3 | 2 | 4 |
Again, there is a concentration of outbound links
on the website's mid-level. But most of all, the outbound links
are much more evenly distributed than the inbound links.
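The link counts in both tables can be checked with a short script. The following Python sketch assumes that the example site is a binary tree, that is, every page except those on the lowest level has exactly two subpages. This matches the numbers in the tables but is our reading of the example. The heap-style page numbering and the function names are ours and purely illustrative.

```python
from collections import defaultdict


def build_site(depth):
    """Return {page: set of link targets} for a binary tree site.

    Assumption of this sketch: pages are numbered heap-style, so page 1
    is the index page and page n has the subpages 2n and 2n + 1.
    Levels run from 0 (index page) to `depth`.
    """
    pages = range(1, 2 ** (depth + 1))
    links = {p: set() for p in pages}
    for p in pages:
        # Links to every page on a higher level of the same branch.
        ancestor = p // 2
        while ancestor >= 1:
            links[p].add(ancestor)
            ancestor //= 2
        # Links to the pages directly below.
        for child in (2 * p, 2 * p + 1):
            if child in links:
                links[p].add(child)
        # Link to the page on the same branch and the same level.
        if p > 1:
            links[p].add(p ^ 1)
    return links


def level(page):
    """Hierarchical level of a heap-numbered page (index page = 0)."""
    return page.bit_length() - 1


def counts_per_level(depth):
    """Return {level: (inbound links, outbound links)} for one page per level."""
    links = build_site(depth)
    inbound = defaultdict(int)
    for targets in links.values():
        for target in targets:
            inbound[target] += 1
    # All pages on one level have identical counts, so any page will do.
    return {level(p): (inbound[p], len(links[p])) for p in links}


for depth in (2, 3):
    print(depth + 1, "levels:", counts_per_level(depth))
```

Under these assumptions, the script yields the inbound and outbound counts listed for the three-level and the four-level site.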
If we assign a value of 100 to the index page's
E(A) in our original example, while all other values E equal 1 and
if the damping factor d is 0.85, we get the following BadRank values:
| Page | BadRank |
| A | 22.39 |
| B/C | 17.39 |
| D/E/F/G | 12.21 |
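These values can be reproduced by iterating the BadRank formula until it settles. The following Python sketch assumes the seven-page example site with index page A, pages B and C below it, D and E below B, and F and G below C. The function name, the start values and the number of iterations are our own choices and not part of the original approach.

```python
def badrank(outlinks, e, d=0.85, iterations=200):
    """Iterate BR(A) = E(A)(1-d) + d * sum(BR(Ti)/C(Ti)).

    The sum runs over the pages Ti that page A links to, and C(Ti)
    is the number of inbound links of Ti.
    """
    inbound = {page: 0 for page in outlinks}
    for targets in outlinks.values():
        for target in targets:
            inbound[target] += 1

    br = {page: 1.0 for page in outlinks}  # arbitrary start values
    for _ in range(iterations):
        # Every new value is computed from the values of the previous round.
        br = {
            page: e[page] * (1 - d)
            + d * sum(br[t] / inbound[t] for t in outlinks[page])
            for page in outlinks
        }
    return br


# Assumed link structure of the three-level example site.
site = {
    "A": ["B", "C"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "F", "G"],
    "D": ["A", "B", "E"],
    "E": ["A", "B", "D"],
    "F": ["A", "C", "G"],
    "G": ["A", "C", "F"],
}

# E equals 100 for the index page and 1 for all other pages.
e = {page: 1.0 for page in site}
e["A"] = 100.0

result = badrank(site, e)
for page in sorted(result):
    print(page, round(result[page], 2))
```

Because each page's BadRank is split evenly among its inbound links, the iteration converges for any damping factor below 1, independent of the start values.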
First of all, we see that the BadRank spreads
from the index page to all other pages of the website. The combination
of PageRank and BadRank will be discussed in detail below, but
no matter how the combination is realized, it is obvious that
both can neutralize each other very well. After all, we can assume
that a page's PageRank also decreases the lower its hierarchy
level is, so that a PR0 can easily be achieved for all pages.
If we now assume that the lowest-level
page G links to a page X with a constant BadRank BR(X)=10, where
the link from page G is the only inbound link of page X, and if
all values E for our example website equal 1, we get, at a damping
factor d of 0.85, the following values:
| Page | BadRank |
| A | 4.82 |
| B | 7.50 |
| C | 14.50 |
| D | 4.22 |
| E | 4.22 |
| F | 11.22 |
| G | 17.18 |
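This scenario can be recomputed in the same way. The Python sketch below again assumes the seven-page example site (D and E below B, F and G below C) and treats page X as an external page whose BadRank is held constant at 10. The `fixed` parameter, the function name and the iteration count are assumptions made for illustration.

```python
def badrank(outlinks, e, fixed=None, d=0.85, iterations=200):
    """Iterate the BadRank formula; pages in `fixed` keep a constant BadRank."""
    fixed = dict(fixed or {})

    # C(T): number of inbound links of every link target, including fixed pages.
    inbound = {}
    for targets in outlinks.values():
        for target in targets:
            inbound[target] = inbound.get(target, 0) + 1

    br = {page: 1.0 for page in outlinks}  # arbitrary start values
    br.update(fixed)
    for _ in range(iterations):
        new = {
            page: e[page] * (1 - d)
            + d * sum(br[t] / inbound[t] for t in outlinks[page])
            for page in outlinks
        }
        new.update(fixed)  # the external page keeps its constant value
        br = new
    return br


# Example site as before, but page G additionally links to page X.
site = {
    "A": ["B", "C"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "F", "G"],
    "D": ["A", "B", "E"],
    "E": ["A", "B", "D"],
    "F": ["A", "C", "G"],
    "G": ["A", "C", "F", "X"],
}
e = {page: 1.0 for page in site}  # no page was caught by a spam filter

result = badrank(site, e, fixed={"X": 10.0})
for page in sorted(site):
    print(page, result[page])
```

Within rounding, the result agrees with the table: the BadRank of X first reaches G, then its sibling F and its parent C, while the index page A receives comparatively little.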
In this case, we see that the distribution of the
BadRank is less homogeneous than in the first scenario. Nonetheless,
BadRank is still distributed among all pages of the website.
However, the relatively low BadRank of the index page A is
remarkable. It could be a problem to neutralize its PageRank, which
should be higher than that of the other pages. This effect
is not really desirable but it reflects the experiences of numerous
webmasters. Quite often, we can see the phenomenon that all pages
of a website except for the index page show a PR0 in the Google
Toolbar, while the index page often has a Toolbar PageRank between
2 and 4. Therefore, we can probably assume that this special variant
of PR0 is not caused by a spam filter detecting the website in
question; rather, the site received a penalty for "linking
to bad neighbourhoods". However, it is also possible that this
variant of PR0 occurs when only hierarchically lower pages of a
website get trapped in a spam filter.
The Combination of PageRank and BadRank to PR0:
If we assume that BadRank exists in the form presented
here, the question arises how BadRank and PageRank
can be combined in order to penalize as many spammers as possible
while at the same time penalizing as few innocent webmasters as
possible.
Intuitively, implementing BadRank directly in the
actual PageRank computations seems to make sense. For instance,
it is possible to calculate BadRank first and then divide a page's
PageRank by its BadRank each time in the course of the iterative
calculation of PageRank. This would have the advantage that a page
with a high BadRank could pass on just a little PageRank or none
at all to the pages it links to. After all, one can argue that if
one page links to a suspect page, all the other links on that page
may also be suspect.
However, such a direct connection between PageRank
and BadRank is very risky. Most of all, the actual influence of
BadRank on PageRank cannot be estimated in advance. It has to be
considered that we would create a lot of pages which cannot pass
on PageRank to the pages they link to. In effect, these pages behave
like dangling links, and as has been discussed in the section on outbound
links, it is absolutely necessary to avoid dangling links while
computing PageRank.
So, it would be advisable to have separate iterative
calculations for PageRank and BadRank. Combining them afterwards
can, for instance, be based on simple arithmetical operations. In
principle, a subtraction would have the desirable consequence that
relatively small BadRank values can hardly have a large influence
on relatively high PageRank values. But it would certainly be
difficult to achieve PR0 for a large number of pages by
subtraction. We would rather see a PageRank devaluation for many
pages.
Achieving the effects that we know as PR0 seems
easier by dividing PageRank by BadRank. But
this would give BadRank an extremely high importance.
Moreover, since the average BadRank equals 1, a large share of BadRank
values is smaller than 1 and, so, a normalization is necessary.
Probably, the best results would be obtained by normalizing and
scaling BadRank to values between 0 and
1 so that "good" pages have values close to 1 and "bad"
pages have values close to 0 and, subsequently, multiplying these
values by PageRank.
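How such a normalization might look is left open here. As one possible reading, the following Python sketch scales BadRank linearly by the highest BadRank that occurs, so that a page without BadRank gets the factor 1 and the worst page gets the factor 0. The linear scaling, the function names and the sample values are assumptions for illustration only.

```python
def goodness_factors(badrank):
    """Scale BadRank values to factors between 0 ("bad") and 1 ("good").

    Assumption of this sketch: linear scaling by the highest BadRank.
    """
    worst = max(badrank.values(), default=0.0)
    if worst <= 0:
        # No page has any BadRank, so nothing is devalued.
        return {page: 1.0 for page in badrank}
    return {page: 1.0 - value / worst for page, value in badrank.items()}


def combine(pagerank, badrank):
    """Multiply each page's PageRank by its goodness factor."""
    factors = goodness_factors(badrank)
    return {page: pagerank[page] * factors[page] for page in pagerank}


# Made-up values for three pages.
pagerank = {"A": 3.0, "B": 1.5, "C": 0.5}
badrank = {"A": 0.0, "B": 5.0, "C": 10.0}
combined = combine(pagerank, badrank)
print(combined)
```

A linear scaling is only one option. A steeper curve would push more pages towards 0, and which shape fits the observed PR0 effects best cannot be decided from the outside.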
A very effective and easy to realize alternative
would probably be a simple stepped evaluation of PageRank and BadRank.
It would be reasonable that a BadRank exceeding a certain value
always leads to a PR0. The same could happen when the ratio
of PageRank to BadRank is below a certain value. Additionally, it
would make sense that if BadRank and/or the ratio of BadRank
to PageRank is below a certain value, BadRank has no influence
at all.
Only if none of these cases occurs would an actual combination
of PageRank and BadRank - for instance by dividing PageRank by
BadRank - be necessary. In this way, all unwanted effects
could be avoided.
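The stepped evaluation can be sketched as a small decision function. All threshold values and parameter names in the following Python sketch are hypothetical, and so is the order in which the rules are checked. Nothing is known about the values Google might actually use.

```python
def stepped_rank(pagerank, badrank, br_max=10.0, ratio_min=0.5, br_ignore=1.0):
    """Combine PageRank and BadRank by a stepped evaluation.

    br_max, ratio_min and br_ignore are hypothetical thresholds chosen
    for this sketch only.
    """
    # Step 1: a BadRank above a certain value always leads to PR0.
    if badrank >= br_max:
        return 0.0
    # Step 2: the same happens if PageRank is too low relative to BadRank.
    if badrank > 0 and pagerank / badrank < ratio_min:
        return 0.0
    # Step 3: a sufficiently small BadRank has no influence at all.
    if badrank <= br_ignore:
        return pagerank
    # Step 4: otherwise combine both, here by dividing PageRank by BadRank.
    return pagerank / badrank


print(stepped_rank(5.0, 12.0))  # BadRank above br_max
print(stepped_rank(1.0, 4.0))   # ratio of PageRank to BadRank too low
print(stepped_rank(5.0, 0.2))   # BadRank too small to matter
print(stepped_rank(6.0, 2.0))   # actual combination by division
```

With `br_ignore` set to 1, the division in the last step is only applied to BadRank values above 1, so it can never raise a page's PageRank.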
A Critical View on BadRank and PR0:
How Google would realize the combination of PageRank
and BadRank is of rather minor importance. However, a separate computation
and a subsequent combination of both has the consequence that it
may not be possible to see the actual effect of a high BadRank by
looking at the Toolbar. If a page has a high PageRank in the original
sense, the influence of its BadRank can be negligible. But if another
page links to it, this could have quite serious consequences for
that other page.
An even bigger problem is the direct reversal
of the PageRank algorithm as we have presented it here: Just as
an additional inbound link for a page can do nothing but increase
this page's PageRank, an additional outbound link can only increase
its BadRank. This is because BadRank values are added up in
the BadRank formula. So, it does not matter how many "good"
outbound links a page has - one link to a spam page can be enough
to lead to a PR0.
Admittedly, this problem may appear in exceptional
cases only. In our direct reversal of the PageRank algorithm, the
BadRank of a page is divided by the number of its inbound links, so
a single link to a page with a high BadRank transfers only a part of
that BadRank. Google's Matt Cutts' remark on this issue is: "If
someone accidentally does a link to a bad site, that may not hurt
them, but if they do twenty, that's a problem." (searchenginewatch.com)
However, as long as all links are weighted uniformly
within the BadRank computation, there is another problem. If two
pages differ widely in PageRank and both have a link to the same
page with a high BadRank, this may lead to the page with the higher
PageRank suffering far less from the transferred BadRank than the
page with the low PageRank. We have to hope that Google knows how
to deal with such problems. Nevertheless, it should be noted that,
regarding the procedure presented here, outbound links can do nothing
but harm.
Of course, all statements regarding how PR0 works
are pure speculation. But in principle, an analysis of link structures
similar to the PageRank technique is exactly the way one would expect
Google in particular to deal with spam.