Theme-based
PageRank
For many years now, the topic or
theme-based homogeneity of websites has been dicussed as a
possible ranking criterion of search engines. There are
various theoretical approaches for the implementation of
themes in search engine algorithms which all have in common
that web pages are no longer ranked only based on their own
content, but also based on the content of other web pages.
For example, the content of all pages of one website may
take influence on the ranking of one single page of this
website. On the other hand, it is also conceivable that one
page's ranking is based on the content of those pages which
link to it or which it links to itself.
The potential implementation of a
theme-based ranking in the Google search engine is discussed
controversially. In search engine optimization forums and on
websites on this topic we can over and over again find advice
that inbound links from sites with a similar theme to our own
have a larger influence on PageRank than links from unrelated
sites. This hypothesis shall be discussed here. Therefore, we
first of all take a look at two relatively new approaches for
the integration of themes in the PageRank technique: on the one
hand the "intelligent surfer" by Matthew Richardson and Pedro
Domingos and on the other hand the Topic-Sensitive PageRank by
Taher Haveliwala. Subsequently, we take a look at the
possibility of using content analyses in order to compare the
text of web pages, which can be a basis for weighting links
within the PageRank technique.
The "Intelligent Surfer" by Richardson
and Domingos
Matthew Richardson and Pedro Domingos resort
to the Random Surfer Model in order to explain their approach
for the implementation of themes in the PageRank technique.
Instead of a surfer who follws links completely at random, they
suggest a more intelligent surfer who, on the one hand, only
follows links which are related to an original search query
and, on the other hand, also after "getting bored" only jumps
to a page which relates to the original query.
So, to Richardson and Domingos'
"intelligent surfer" only pages are relevant that contain the
search term of an initial query. But since the Random Surfer
Model does nothing but illustrate the PageRank technique, the
question is how an "intelligent" behaviour of the Random Surfer
influences PageRank. The answer is that for every term occuring
on the web a separate PageRank calculation has to be conducted
and each calculation is solely based on links between pages
which contain that term.
Computing PageRank this way
causes some problems. They especially appear for search terms
that do not occur so often on the web. To make it into the
PageRank calculations for a specific search term, that term has
not only to appear on someone's page, but also on the pages
that link to it. So, the search results would often be based on
small subsets of the web and may omit relevant sites. In
addition, using such small subsets of the web, the algorithms
are more vulnerable to spam by automatically generating
numerous pages.
Additionally, there are serious
problems regarding scalability. Richardson and Domingos
estimate the memory and computing time requirements for several
100,000 terms 100-200 times higher compared to the original
PageRank calculations. Regarding the large number of small
subsets of the web, these numbers appear to be
realistic.
The higher memory requirements
should not be so much of a problem because Richardson and
Domingos correctly state that the term specific PageRank values
constitute only a fraction of the data volume of Google's
inverse index. However, the computing time requirements are
indeed a large problem. If we assume just five hours for a
conventional PageRank calculation, then this would last about 3
weeks based on Richardson and Domingos' model, which makes it
unsuitable for actual employment.
Taher Haveliwala's Topic-Sensitive
PageRank
The approach of Taher Havilewala seems to be
more practical for actual employment. Just like Richardson and
Domingos, also Havilewala suggests the computation of different
PageRanks for different topics. However, the Topic-Sensitive
PageRank does not consist of hundreds of thousands of PageRanks
for different terms, but of a few PageRanks for different
topics. Topic-Sensitive PageRank is based on the link structure
of the whole web, whereby the topic sensitivity implies that
there is a different weighting for each topic.
The basic principle of
Haveliwala's approach has already been described in our section
on the "Yahoo-Bonus", where we have discussed the possibility
to assign a particular imporance to certain web pages. In the
words of the Random Surfer Model, this is realized by
increasing the probability for the Random Surfer jumping to a
page after "getting bored". Via links, this manual intervention
in the PageRank technique has an influence on the PageRank of
each page on the web. More precisely, we have reached taking
influence on PageRank by implementing another value E in the
PageRank algorithm:
PR(A) = E(A) (1-d) + d
(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Haveliwala's
Topic-Sensitive-PageRank goes one step further. Instead of
assigning a universally higher value to a website or a web
page, Haveliwala differentiates on the basis of different
topics. For each of these topics, he identifies other authority
pages. On the basis of this evaluation, different PageRanks are
calculated - each separately, but for the entire
web.
For his experiments on
Topic-Sensitive PageRank, Haveliwala has chosen the 16
top-level categories of the Open Directory Project both for the
identification of topics and for the intervention in PageRank.
More precisely, Haveliwala assigns a higher value E to the
pages of those ODP categories for which he calculates PageRank.
If, for example, he calculates the PageRank for the topic
health, all the ODP pages in the health category receive a
relatively higher value E and they pass this value in the form
of PageRank on to the pages which are linked from there. Of
course, this PageRank is passed on to other pages and, if we
assume that health-related websites tend to link more often to
other websites within that topic, pages on the topic health
generally receive a higher PageRank.
Haveliwala confirms the
incompleteness of choosing the Open Directory Project in order
to identify topics, which for example results in a high degree
of dependence on ODP editors and in a rather rough subdivision
into topics. But, as Haveliwala states, his method shows good
results and it can surely be improved without big
effort.
However, one crucial point in
Haveliwala's work on Topic-Sensitive-PageRank is the
identification of the user's preferences. Having a
Topic-Specific PageRank is useless as long as we do not know in
which topics an actual user is interested. In the end, search
results must be based on the PageRank that matches the user's
preferences best. The Topic-Sensitive PageRank can only be used
if these are known.
Indeed, Haveliwala does supply
some practicable approaches for the identification of user
preferences. He describes, for example, the search in context
by highlighting terms on a web page. In this way, the content
of that web page could be an indicator for waht the user is
looking for. At this point, we want to note the potential of
the Google Toolbar. The Toolbar submits data regarding search
terms and pages that a user has visited to Google. This data
can be used to create user profiles which can then be a basis
for the identification of the user's preferences. However, even
without using such techniques, it is imaginable that a user
simply chooses the topic he is interested in before he does a
query.
The
Weighting of Links Based on Content
Analyses
That it is possible to weight single links
within the PageRank technique has been shown on the previous
page. The thought behind weighting links based on content
analyses is to avoid the corrumption of PageRank. By weighting
links this way, it is theoretically possible to diminish the
influence of links between thematically unrelated page, which
have been set for the sole purpose of boosting PageRank of one
page. Indeed, it is questionable if it is possible to realize
such weighting based on content analyses.
The fundamentals of content
analyses are based on Gerard Salton's work in the 1960s and
1970s. In his vector space model of information retrieval,
documents are modeled as vectors which are built upon terms and
their weighting within the document. These term vectors allow
comparisons between the content of documents by, for instance,
calculating the cosine measure (the inner product) of the
vectors. In its basic form, the vector space model has some
weaknesses. For instance, often the assumption that if and in
how far the same words appear in two documents is an indicator
for their similarity is criticized. However, numerous
enhancements have been developed that solve most of the
problems of the vector space model.
One person who excelled at
publications which are based on Salton's vector space model is
Krishna Bharat. This is interesting because Bharat meanwhile is
a member of Google's staff and, particularly, because he is
deemded to be the developer of "Google News" (news.google.com).
Google News is a service that crawls news websites, evaluates
articles and then provides them categorized and grouped in
different subjects on the Google New website. According to
Google, all these procedures are completely automated.
Therefore, other criteria like, for example, the time when an
article is published, are taken into account, but if there is
no manual intervention, the clustering of articles is most
certainly only possible, if the contents of the articles are
actually compared to each other. The questions is: How can this
be realized?
In their publication on a term
vector database, Raymie Stata, Krishna Bharat and Farzin
Maghoul describe how the contents of web pages can be compared
based on term vectors and, particularly, they describe how some
of the problems with the vector space model can be solved.
Firstly, not all terms in documents are suitable for content
analsysis. Very frequent terms provide only little
discrimination across vectors and, so, the most frequent third
of all terms is eliminated from the database. Infrequent terms,
on the other hand, do not provide a good basis for measuring
similarity. Such terms are, for example, misspellings. They
appear only on few pages which are likely unrelated in terms of
their theme, but because they are so infrequent, the term
vectors of the pages appear to be closely related. Hence, also
the least frequent third of all terms is eliminated from the
database.
Even if only one third of all
terms is included in the term vectors, this selection is still
not very efficient. Stata, Bharat and Maghoul perform another
filtering, so that each term vector is based on a maximum of 50
terms. But these are not the 50 most frequent terms on a page.
They weight a term by deviding the number of times it appears
on a page by the number of times it appears on all pages, and
those 50 terms with the highest weight are included in the term
vector of a page. This selection actually allows a real
differentiation between the content of pages.
The methods described above are
standards for the vector space model. If, for example, the
inner product of two term vectors is rather high, the contents
of the according pages tend the contents of the according pages
tend to be similar. This may allow content comparisons in many
areas, but it is doubtful if it is a good basis for weighting
links within the PageRank technique. Most of all, synonyms and
terms that describe similar things can not be identified.
Indeed, there are algorithms for word stemming which work good
for the english language, but in other languages word stemming
is much more complicated. Different languages are a general
problem. Unless, for instance, brand names or loan words are
used, texts in different languages normally do not contain the
same terms. And if they do, these terms normally have a
completely different meaning, so that comparing content in
different languages is not possible. However, Stata, Bharat and
Maghoul provide a method of resolution for these
problems.
Stata, Bharat und Maghoul present
a concrete application for their Term Vector Database by
classifying pages thematically. Bharat has also published on
this issue together with Monika Henzinger, presently Google's
Research Director, and they called it "topic distillation".
Topic distillation is based on calculating so-called topic
vectors. Topic vectors are term vectors, but they do not only
include terms of one page but rather the terms of many pages
which are on the same topic. So, in order to create topic
vectors, they have to know a certain amount of web pages which
are on several pre-defined topics. To achieve this, they resort
to web directories.
For their application, Stata,
Bharat und Maghoul have crawled about 30,000 links within each
of the then 12 main categories of Yahoo to create topic vectors
which include about 10,000 terms each. Then, in order to
identify the topic of any other web page, they matched the
according term vector with all the topic vectors which were
created from the Yahoo crawl. The topic of a web page derived
from the topic vector which matched the term vector of the web
page best. That such a classification of web pages works can
again be observed by the means of Google News. Google News does
not only merge articles to one news topic, but also arranges
them to the categories World, U.S., Business, Sci/Tech, Sports,
Entertainment and Health. As long as this categorization is not
based on the structure of the website where the articles come
from (which is unlikely), the actual topic of an article has in
fact to be computed.
At the time he published on term
vectors, Krishna Bharat did not work on PageRank but rather on
Kleinberg's algorithm, so that he was more interested in
filtering off-topic links than in weighting links. But from
classifying pages to weighting links based on content
comparisons, there is only a small step. Instead of matching
the term vectors of two pages, it is much more efficient to
match the topics of two pages. We can, for instance, create a
"topic affinity vector" for each page based on the degree of
affinity of the page's term vector and all the topic vectors.
The better the topic affinity vectors of two pages match, the
more likely are they on the same topic and the higher should a
link between them be weighted.
Using topic vectors has one big
advantage over comparing term vectors directly: A topic vector
can include terms in different languages by being based on, for
instance, the links on different national Yahoo versions.
Deviant site structures of the national versions can most
certainly be adapted manually. Even better may be using the ODP
because the structure of the sub-categories of the "World"
category is based on the main OPD structure. In this way,
measuring topic similarities between pages in different
languages can be realized, so that a really useful weighting of
links based on text analyses appears to be possible.
Is there an Actual Implementation of Themes
in PageRank?
That both
the approach of Havelivala and the approach of Richardson and
Domingos are not utilized by Google is obvious. One would
notice it using Google. However, a weighting of links based on
text analyses would not be apparent immediately. It has been
shown that it is theoretically possible. But it is doubtful
that it is actually implemented.
We do not want to claim that we
have shown the only way of weighting links on the basis of text
analyses. Indeed, there are certainly dozens of others.
However, the approach that we provided here is based on
publications of important members of Google's staff and, thus,
we want to rest a critical evaluation on it.
Like always, when talking about
PageRank, there is the question if our approach is sufficienly
scalable. On the one hand, it causes additional memory
requirements. After all, Stata, Bharat and Maghoul describe the
system architecture of a term vector database which is
different from Google's inverse index, since it maps from page
ids to terms and, so, can hardly be integrated in the existing
architecture. At the actual size of Google's index, the
additional memory requirements should be several hundred GB to
a few TB. However, this should not be so much of a problem
since Google's index is most certainly several times bigger. In
fact, the time requirements for building the database and for
computing the weigtings appear to be the critical
part.
Building a term verctor database
should be approximately as time-consuming as building an
inverse index. Of course, many procecces can probably be used
for building both but if, for instance, the weighting of terms
in the term vectors has to differ from the weighting of terms
in the inverse index, the time requirements remain substantial.
If we assume that, like in our approach, content analyses are
based on computing the inner products of topic affinity vectors
which have to be calculated by matching term vectors and topic
vectors, this process should be approximately as time-consuming
as computing PageRank. Moreover, we have to consider that the
PageRank calculations themselves beome more complicated by
weighting links.
So, the additional time
requirements are definitely not negligible. This is why we have
to ask ourselves if weighting links based on text analyses is
useful at all. Links between thematically unrelated page, which
have been set for the sole purpose of boosting PageRank of one
page, may be annoying, but most certainly they are only a small
fraction of all links. Additionally, the web itself is
completely inhomogeneous. Google, Yahoo or the ODP do not owe
their high PageRank solely to links from other search engines
or directories. A huge part of the links on the web are simply
not set for the purpose of showing visitors ways to more
thematically related information. Indeed, the motivation for
placing links is manifold. Moreover, the problably most popular
websites are completely inhomogeneous in terms of theme. Think
about portals like Yahoo or news websites which contain
articles that cover almost any subject of life. A strong
weighting of links as it has been described here could
influence those website's PageRanks significantly.
If the PageRank technique shall
not become totally futile, a weighting of links can only take
place to a small extent. This, of course, raises the question
if the efforts it requires are justifiable. After all, there
are certainly other ways to eliminate spam which often comes to
the top of search results through thematically unrelated and
probably bought links.
ALGO SEO UK - Home
|