Additional Factors
Influencing PageRank
It has been widely discussed if additional
criteria beyond the link structure of the web have been
implemented in the PageRank algorithm since the scientific
work on PageRank has been published by Lawrence Page and
Sergey Brin. Lawrence Page himself outlines the following
potential influencing factors in his patent specifications
for PageRank:
Visibility of a link
Position of a link within a
document
Distance between web
pages
Importance of a linking
page
Up-to-dateness of a linking
page
First of all, the implementation
of additional criteria in PageRank would result in a better
approximation of human usage regarding the Random Surfer Model.
Considering the visibility of a link and its position within a
document implies that a user does not click on links completely
at haphazard, but rather follows links which are highly and
immediately visible regardless of their anchor text. The other
criteria would give Google more flexibility in determing in how
far an inbound link of a page should be considered important,
than the methods which have been described so far.
Whether or not the above
mentioned factors are actually implemented in PageRank can not
be proved empirically and shall not be discussed here. It shall
rather be illustrated in which way additional influencing
factors can be implemented in the PageRank algorithm and which
options the Google search engine thereby gets in terms of
influencing PageRank values.
Modification of the PageRank
Algorithm
To implement additional factors in PageRank,
the original PageRank algorithm has again to be modified. Since
we have to assume that PageRank calculations are still based on
numerous iterations and for the purpose of short computation
times, we have to consider to keep the number of database
queries during the iterations as small as possible. Therefore,
the following modification of the PageRank algorithm shall be
assumed:
PR(A) = (1-d) + d (PR(T1)xL(T1,A)
+ ... + PR(Tn)xL(Tn,A))
Here, L(Ti,A) represents the
evaluation of a link which points from page Ti to page A.
L(Ti,A) withal replaces the PageRank weighting of page Ti by
the number of outbound links on page Ti which was given by
1/C(Ti). L(Ti,A) may consist of several factors, each of them
having to be determined only once and then being merged to one
value before the iterative PageRank calculation begins. So, the
number of database queries during the iterations stays the
same, although, admittedly, a much larger database has to be
queried at each step in comparison to the computation by use of
the original algorithm, since now there is an evaluation of
each link instead of an evaluation of pages (by the number of
their outbound links).
Different Evaluation of Links within a
Document
Two of the criteria for the evaluation of
links mentioned by Lawrence Page in his PageRank patent
specifications are the visibilty of a link and its position
within a document. Regarding the Random Surfer Model, those
criteria reflect the probability for the random surfer clicking
on a link on a specific web page. In the original PageRank
algorithm, this probability is given by the term (1/C(Ti)),
whereby the probability is equal for each link on one
page.
Assigning different probabilities
to each link on a page can, for instance, be realized as
follows:
We take a look at a web
consisting of three pages A, B anc C, where each of these pages
has outbound links to both of the other pages. Links are
weighted by two evaluation criteria X and Y. X represents the
visibility of a link. X equals 1 if a link is not particularly
emphasized, and 2 if the link is, for instance, bold or italic.
Y represents the position of a link within a document. Y equals
1 if the link is on the lower half of the page, and 3 if the
link is on the upper half of the page. If we assume a
multiplicative correlation between X and Y, the links in our
example are evaluated as follows:
X(A,B) x Y(A,B) = 1 x 3 =
3
X(A,C) x Y(A,C) = 1 x 1 =
1
X(B,A) x Y(B,A) = 2 x 3 =
6
X(B,C) x Y(B,C) = 2 x 1 =
2
X(C,A) x Y(C,A) = 2 x 3 =
6
X(C,B) x Y(C,B) = 2 x 1 =
2
For the purpose of determinig the
single factors L, the evaluated links must not simply be
weighted by the number of outbound links on one page, but in
fact by the total of evaluated links on the page. Thereby, we
get the following weighting quotients Z(Ti) for the single
pages Ti:
Z(A) = X(A,B) x Y(A,B) + X(A,C) x
Y(A,C) = 4
Z(B) = X(B,A) x Y(B,A) + X(B,C) x
Y(B,C) = 8
Z(C) = X(C,A) x Y(C,A) + X(C,B) x
Y(C,B) = 8
The evaluation factors L(T1,T2)
for a link which is pointing from page T1 to page T2 are hence
given by
L(T1,T2) = X(T1,T2) x Y(T1,T2) /
Z(T1)
Their values regarding our
example are as follows:
L(A,B) = 0.75
L(A,C) = 0.25
L(B,A) = 0.75
L(B,C) = 0.25
L(C,A) = 0.75
L(C,B) = 0.25
At a damping factor d of 0.5, we
get the following equations for the calculation of PageRank
values:
PR(A) = 0.5 + 0.5 (0.75 PR(B) +
0.75 PR(C))
PR(B) = 0.5 + 0.5 (0.75 PR(A) +
0.25 PR(C))
PR(C) = 0.5 + 0.5 (0.25 PR(A) +
0.25 PR(B))
Solving these equations gives us
the follwing PageRank values for our example:
PR(A) = 819/693
PR(B) = 721/693
PR(C) = 539/693
First of all, we see that page A
has the highest PageRank of all three pages. This is caused by
page A receiving the relatively higher evaluated link from page
B as well as from page C.
Furthermore, we see that even by
the evaluation of single links, the sum of the PageRank values
of all pages equals 3 (2079/693) and thereby the total number
of pages. So, the PageRank values computed by our modified
PageRank algorithm can be used for the general ranking of web
pages by Google without any normalisation being
needed.
Different Evaluation of Links by Page
Specific Criteria
Besides the unequal evaluation of links
within a document, Lawrence Page mentions the possibility of
evaluating links according to criteria which are based upon the
linking page. At first glance, this does not seem necessary
since it is the main principle of PageRank to rank pages the
higher, the more high ranking pages link to them. But, at the
time of their scientific work on PageRank, Page and Brin have
already recognized that their algorithm is vulnerable to
artificial inflation of PageRank.
An artificial influence on
PageRank might be exerted by webmasters who generate a
multitude of web pages whose links distribute PageRank in a way
that single pages within that system receive a special
importance. Those pages can have a high PageRank without being
linked to from other pages with high PageRank. So, not only the
concept of PageRank is undermined, but also the search engine's
index is spammed with an innumerable amount of web pages which
were solely created to influence PageRank.
In his patent specifications for
PageRank, Lawrence Page presents the evaluation of links by the
distance between pages as a means to avoid the artificial
inflation of PageRank, because the bigger the distance between
two pages, the less likely has one webmaster control over both.
A criterium for the distance between two pages may be if they
are on the same domain or not. In this way, internal links
would be weighted less than external links. In the end, any
general measure of the distance between links can be used to
determine such a weighting. This comprehends if pages are on
the same server or not and also the geographical distance
between servers.
As another indicator for the
importance of a document, Lawrence Page mentions the
up-to-dateness of the documents which link to it. This argument
considers that the information on a page is less likely
outdated, the more pages which have been modified recently link
to it. In contrast, the original PageRank concept, just like
any method of measuring link popularity, favours older
documents which gained their inbound links in the course of
their existence and have at a higher probability been modified
less recently than new documents. Basically, recently modified
documents may be given a higher evaluation by weighting the
factor (1-d). In this way, both those recently modified
documents and the pages they link to receive a higher PageRank.
But, if a page has been modified recently, is not necessarily
an indicator for the importance of the information presented on
it. So, as suggested by Lawrence Page, it is advisable not to
favour recently modified pages but only their outbound
links.
Finally, Page mentions the
importance of the web location of a page as an indicator of the
importance of its outbound links. As an example for an
important web location he names the root page of a domain, but,
in the end, Google could exert influence on PageRank absolutely
arbitrarily.
To implement the evaluation of
the linking page into PageRank, the evaluation factor of the
modified algorithm must consist of several single factors. For
a link that points from page Ti to page A, it can be given as
follows:
L(Ti,A) = K(Ti,A) x K1(Ti) x ...
x Km(Ti)
where K(Ti,A) is the above
presented weighting of a single link within a page by its
visibility or position. Additionally, an evaluation of page Ti
by m criteria which are represented by the factors Kj(Ti) takes
place.
To implement the evaluation of
the linking pages, not only the algorithm but also the
proceedings of PageRank calculation have to be modified. This
shall be illustrated by an example.
We take a look at a web
consisting of three pages A, B and C, whereby page A links to
the pages B and C, page B links to page C and page C links to
page A. The outbound links of one page are evaluated equally,
so there is no weighting by visibilty or position. But now, the
pages are evaluated by one criterium. In this way, an inbound
link from page C shall be considered four times as important as
an inbound link from one of the other pages. After weighting by
the number of pages, we get the following evaluation
factors:
K(A) = 0.5
K(B) = 0.5
K(C) = 2
At a damping factor d of 0.5, the
equations for the computation of the PageRank values are given
by
PR(A) = 0.5 + 0.5 x 2
PR(C)
PR(B) = 0.5 + 0.5 x 0.5 x 0.5
PR(A)
PR(C) = 0.5 + 0.5 (0.5 PR(B) +
0.5 x 0.5 PR(A))
Solving the equations gives us
the follwing PageRank values:
PR(A) = 4/3
PR(B) = 2/3
PR(C) = 5/6
At the current modifications of
the PageRank algorithm, the accumulated PageRank of all pages
no longer equals the number of pages. The reason therefore is
that the weighting of the page evaluation by the number of
pages was not appropriate. To determine the proper weighting,
the web's linking structure would have to be anticipated, which
is not possible in case of the actual WWW. Therefore, the
PageRank calculated by an evaluation of linking pages has to be
normalized if there shall not be any unfounded effects on the
general ranking of pages by Google. Within the iterative
calculation, a normalization would have to take place after
each iteration to minimize unintentional
distortions.
In the case of a small web, the
evaluation of pages often causes severe distortions. In the
case of the actual WWW, these distortions should normally
equalise by the number of pages. Indeed, it is to be expected
that the evaluation of the distance between pages will cause
distortions on PageRank, since pages with many inbound links
surely tend to be linked to from different geographical
regions. But such effects can be anticipated by experience from
previous calculation periods, so that a normalisation would
only have to be marginal.
In either case, implementing
additional factors in PageRank is possible. Indeed, the
computation of PageRank values would take more
time.
ALGO SEO UK - Home
|