Additional Factors Influencing PageRank:
It has been widely discussed if additional criteria
beyond the link structure of the web have been implemented in the
PageRank algorithm since the scientific work on PageRank has been
published by Lawrence Page and Sergey Brin. Lawrence Page himself
outlines the following potential influencing factors in his patent
specifications for PageRank:
- Visibility of a link
- Position of a link within a document
- Distance between web pages
- Importance of a linking page
- Up-to-dateness of a linking page
First of all, the implementation of additional
criteria in PageRank would result in a better approximation of human
usage regarding the Random Surfer Model. Considering the visibility
of a link and its position within a document implies that a user
does not click on links completely at haphazard, but rather follows
links which are highly and immediately visible regardless of their
anchor text. The other criteria would give Google more flexibility
in determing in how far an inbound link of a page should be considered
important, than the methods which have been described so far.
Whether or not the above mentioned factors are
actually implemented in PageRank can not be proved empirically and
shall not be discussed here. It shall rather be illustrated in which
way additional influencing factors can be implemented in the PageRank
algorithm and which options the Google search engine thereby gets
in terms of influencing PageRank values.
Modification of the PageRank Algorithm:
To implement additional factors in PageRank, the
original PageRank algorithm has again to be modified. Since we have
to assume that PageRank calculations are still based on numerous
iterations and for the purpose of short computation times, we have
to consider to keep the number of database queries during the iterations
as small as possible. Therefore, the following modification of the
PageRank algorithm shall be assumed:
PR(A) = (1-d) + d (PR(T1)×L(T1,A) +
... + PR(Tn)×L(Tn,A))
Here, L(Ti,A) represents the evaluation of a link
which points from page Ti to page A. L(Ti,A) withal replaces the
PageRank weighting of page Ti by the number of outbound links on
page Ti which was given by 1/C(Ti). L(Ti,A) may consist of several
factors, each of them having to be determined only once and then
being merged to one value before the iterative PageRank calculation
begins. So, the number of database queries during the iterations
stays the same, although, admittedly, a much larger database has
to be queried at each step in comparison to the computation by use
of the original algorithm, since now there is an evaluation of each
link instead of an evaluation of pages (by the number of their outbound
links).
Different Evaluation of Links within
a Document:
Two of the criteria for the evaluation of links
mentioned by Lawrence Page in his PageRank patent specifications
are the visibilty of a link and its position within a document.
Regarding the Random Surfer Model, those criteria reflect the probability
for the random surfer clicking on a link on a specific web page.
In the original PageRank algorithm, this probability is given by
the term (1/C(Ti)), whereby the probability is equal for each link
on one page.
Assigning different probabilities to each link
on a page can, for instance, be realized as follows:
We
take a look at a web consisting of three pages A, B anc C, where
each of these pages has outbound links to both of the other pages.
Links are weighted by two evaluation criteria X
and Y. X represents the visibility of a link. X equals 1 if a link
is not particularly emphasized, and 2 if the link is, for instance,
bold or italic. Y represents the position of a link within a document.
Y equals 1 if the link is on the lower half of the page, and 3 if
the link is on the upper half of the page.
If we assume a multiplicative correlation between
X and Y, the links in our example are evaluated as follows:
- X(A,B) × Y(A,B) = 1 × 3 = 3
- X(A,C) × Y(A,C) = 1 × 1 = 1
- X(B,A) × Y(B,A) = 2 × 3 = 6
- X(B,C) × Y(B,C) = 2 × 1 = 2
- X(C,A) × Y(C,A) = 2 × 3 = 6
- X(C,B) × Y(C,B) = 2 × 1 = 2
For the purpose of determinig the single factors
L, the evaluated links must not simply be weighted by the number
of outbound links on one page, but in fact by the total of evaluated
links on the page. Thereby, we get the following weighting quotients
Z(Ti) for the single pages Ti:
- Z(A) = X(A,B) × Y(A,B) + X(A,C) × Y(A,C) = 4
- Z(B) = X(B,A) × Y(B,A) + X(B,C) × Y(B,C) = 8
- Z(C) = X(C,A) × Y(C,A) + X(C,B) × Y(C,B) = 8
The evaluation factors L(T1,T2) for a link which
is pointing from page T1 to page T2 are hence given by
L(T1,T2) = X(T1,T2) × Y(T1,T2) / Z(T1)
Their values regarding our example are as follows:
- L(A,B) = 0.75
- L(A,C) = 0.25
- L(B,A) = 0.75
- L(B,C) = 0.25
- L(C,A) = 0.75
- L(C,B) = 0.25
At a damping factor d of 0.5, we get the following
equations for the calculation of PageRank values:
- PR(A) = 0.5 + 0.5 (0.75 PR(B) + O.75 PR(C))
- PR(B) = 0.5 + 0.5 (0.75 PR(A) + 0.25 PR(C))
- PR(C) = 0.5 + 0.5 (0.25 PR(A) + 0.25 PR(B))
Solving these equations gives us the follwing PageRank
values for our example:
- PR(A) = 819/693
- PR(B) = 721/693
- PR(C) = 539/693
First of all, we see that page A has the highest
PageRank of all three pages. This is caused by page A receiving
the relatively higher evaluated link from page B as well as from
page C.
Furthermore, we see that even by the evaluation
of single links, the sum of the PageRank values of all pages equals
3 (2079/693) and thereby the total number of pages. So, the PageRank
values computed by our modified PageRank algorithm can be used for
the general ranking of web pages by Google without any normalisation
being needed.
Different Evaluation of Links by
Page Specific Criteria:
Besides the unequal evaluation of links within
a document, Lawrence Page mentions the possibility of evaluating
links according to criteria which are based upon the linking page.
At first glance, this does not seem necessary since it is the main
principle of PageRank to rank pages the higher, the more high ranking
pages link to them. But, at the time of their scientific work on
PageRank, Page and Brin have already recognized that their algorithm
is vulnerable to artificial inflation of PageRank.
An artificial influence on PageRank might be exerted
by webmasters who generate a multitude of web pages whose links
distribute PageRank in a way that single pages within that system
receive a special importance. Those pages can have a high PageRank
without being linked to from other pages with high PageRank. So,
not only the concept of PageRank is undermined, but also the search
engine's index is spammed with an innumerable amount of web pages
which were solely created to influence PageRank.
In his patent specifications for PageRank, Lawrence
Page presents the evaluation of links by the distance between pages
as a means to avoid the artificial inflation of PageRank, because
the bigger the distance between two pages, the less likely has one
webmaster control over both. A criterium for the distance between
two pages may be if they are on the same domain or not. In this
way, internal links would be weighted less than external links.
In the end, any general measure of the distance between links can
be used to determine such a weighting. This comprehends if pages
are on the same server or not and also the geographical distance
between servers.
As another indicator for the importance of a document,
Lawrence Page mentions the up-to-dateness of the documents which
link to it. This argument considers that the information on a page
is less likely outdated, the more pages which have been modified
recently link to it. In contrast, the original PageRank concept,
just like any method of measuring link popularity, favours older
documents which gained their inbound links in the course of their
existence and have at a higher probability been modified less recently
than new documents. Basically, recently modified documents may be
given a higher evaluation by weighting the factor (1-d). In this
way, both those recently modified documents and the pages they link
to receive a higher PageRank. But, if a page has been modified recently,
is not necessarily an indicator for the importance of the information
presented on it. So, as suggested by Lawrence Page, it is advisable
not to favour recently modified pages but only their outbound links.
Finally, Page mentions the importance of the web
location of a page as an indicator of the importance of its outbound
links. As an example for an important web location he names the
root page of a domain, but, in the end, Google could exert influence
on PageRank absolutely arbitrarily.
To implement the evaluation of the linking page
into PageRank, the evaluation factor of the modified algorithm must
consist of several single factors. For a link that points from page
Ti to page A, it can be given as follows:
L(Ti,A) = K(Ti,A) × K1(Ti) × ... ×
Km(Ti)
where K(Ti,A) is the above presented weighting
of a single link within a page by its visibility or position. Additionally,
an evaluation of page Ti by m criteria which are represented by
the factors Kj(Ti) takes place.
To implement the evaluation of the linking pages,
not only the algorithm but also the proceedings of PageRank calculation
have to be modified. This shall be illustrated by an example.
We
take a look at a web consisting of three pages A, B and C, whereby
page A links to the pages B and C, page B links to page C and page
C links to page A.
The outbound links of one page are evaluated equally,
so there is no weighting by visibilty or position. But now, the
pages are evaluated by one criterium. In this way, an inbound link
from page C shall be considered four times as important as an inbound
link from one of the other pages.
After weighting by the number of pages, we get
the following evaluation factors:
- K(A) = 0.5
- K(B) = 0.5
- K(C) = 2
At a damping factor d of 0.5, the equations for
the computation of the PageRank values are given by
- PR(A) = 0.5 + 0.5 × 2 PR(C)
- PR(B) = 0.5 + 0.5 × 0.5 × 0.5 PR(A)
- PR(C) = 0.5 + 0.5 (0.5 PR(B) + 0.5 × 0.5 PR(A))
Solving the equations gives us the follwing PageRank
values:
- PR(A) = 4/3
- PR(B) = 2/3
- PR(C) = 5/6
At the current modifications of the PageRank algorithm,
the accumulated PageRank of all pages no longer equals the number
of pages. The reason therefore is that the weighting of the page
evaluation by the number of pages was not appropriate. To determine
the proper weighting, the web's linking structure would have to
be anticipated, which is not possible in case of the actual WWW.
Therefore, the PageRank calculated by an evaluation of linking pages
has to be normalized if there shall not be any unfounded effects
on the general ranking of pages by Google. Within the iterative
calculation, a normalization would have to take place after each
iteration to minimize unintentional distortions.
In the case of a small web, the evaluation of pages
often causes severe distortions. In the case of the actual WWW,
these distortions should normally equalise by the number of pages.
Indeed, it is to be expected that the evaluation of the distance
between pages will cause distortions on PageRank, since pages with
many inbound links surely tend to be linked to from different geographical
regions. But such effects can be anticipated by experience from
previous calculation periods, so that a normalisation would only
have to be marginal.
In either case, implementing additional factors
in PageRank is possible. Indeed, the computation of PageRank values
would take more time.
|