Theme-based PageRank
For many years now, the topic or theme-based
homogeneity of websites has been dicussed as a possible ranking criterion of search engines. There are various
theoretical approaches for the implementation of themes in search engine algorithms which all have in common
that web pages are no longer ranked only based on their own content, but also based on the content of other web
pages. For example, the content of all pages of one website may take influence on the ranking of one single page
of this website. On the other hand, it is also conceivable that one page's ranking is based on the content of
those pages which link to it or which it links to itself.
The potential implementation of a theme-based ranking in the Google search engine is
discussed controversially. In search engine optimization forums and on websites on this topic we can over and over
again find advice that inbound links from sites with a similar theme to our own have a larger influence on PageRank
than links from unrelated sites. This hypothesis shall be discussed here. Therefore, we first of all take a look at
two relatively new approaches for the integration of themes in the PageRank technique: on the one hand the
"intelligent surfer" by Matthew Richardson and Pedro Domingos and on the other hand the Topic-Sensitive PageRank by
Taher Haveliwala. Subsequently, we take a look at the possibility of using content analyses in order to compare the
text of web pages, which can be a basis for weighting links within the PageRank technique.
The "Intelligent Surfer" by Richardson and
Domingos
Matthew Richardson and Pedro Domingos resort
to the Random Surfer Model in order to explain their approach for the implementation of themes in the PageRank
technique. Instead of a surfer who follws links completely at random, they suggest a more intelligent surfer who,
on the one hand, only follows links which are related to an original search query and, on the other hand, also
after "getting bored" only jumps to a page which relates to the original query.
So, to Richardson and Domingos' "intelligent surfer" only pages are relevant that
contain the search term of an initial query. But since the Random Surfer Model does nothing but illustrate the
PageRank technique, the question is how an "intelligent" behaviour of the Random Surfer influences PageRank. The
answer is that for every term occuring on the web a separate PageRank calculation has to be conducted and each
calculation is solely based on links between pages which contain that term.
Computing PageRank this way causes some problems. They especially appear for search
terms that do not occur so often on the web. To make it into the PageRank calculations for a specific search term,
that term has not only to appear on someone's page, but also on the pages that link to it. So, the search results
would often be based on small subsets of the web and may omit relevant sites. In addition, using such small subsets
of the web, the algorithms are more vulnerable to spam by automatically generating numerous pages.
Additionally, there are serious problems regarding scalability. Richardson and
Domingos estimate the memory and computing time requirements for several 100,000 terms 100-200 times higher
compared to the original PageRank calculations. Regarding the large number of small subsets of the web, these
numbers appear to be realistic.
The higher memory requirements should not be so much of a problem because Richardson
and Domingos correctly state that the term specific PageRank values constitute only a fraction of the data volume
of Google's inverse index. However, the computing time requirements are indeed a large problem. If we assume just
five hours for a conventional PageRank calculation, then this would last about 3 weeks based on Richardson and
Domingos' model, which makes it unsuitable for actual employment.
Taher Haveliwala's Topic-Sensitive
PageRank
The approach of Taher Havilewala seems to be
more practical for actual employment. Just like Richardson and Domingos, also Havilewala suggests the computation
of different PageRanks for different topics. However, the Topic-Sensitive PageRank does not consist of hundreds of
thousands of PageRanks for different terms, but of a few PageRanks for different topics. Topic-Sensitive PageRank
is based on the link structure of the whole web, whereby the topic sensitivity implies that there is a different
weighting for each topic.
The basic principle of Haveliwala's approach has already been described in our
section on the "Yahoo-Bonus", where we have discussed the possibility to assign a particular imporance to certain
web pages. In the words of the Random Surfer Model, this is realized by increasing the probability for the Random
Surfer jumping to a page after "getting bored". Via links, this manual intervention in the PageRank technique has
an influence on the PageRank of each page on the web. More precisely, we have reached taking influence on PageRank
by implementing another value E in the PageRank algorithm:
PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Haveliwala's Topic-Sensitive-PageRank goes one step further. Instead of assigning a
universally higher value to a website or a web page, Haveliwala differentiates on the basis of different topics.
For each of these topics, he identifies other authority pages. On the basis of this evaluation, different PageRanks
are calculated - each separately, but for the entire web.
For his experiments on Topic-Sensitive PageRank, Haveliwala has chosen the 16
top-level categories of the Open Directory Project both for the identification of topics and for the intervention
in PageRank. More precisely, Haveliwala assigns a higher value E to the pages of those ODP categories for which he
calculates PageRank. If, for example, he calculates the PageRank for the topic health, all the ODP pages in the
health category receive a relatively higher value E and they pass this value in the form of PageRank on to the
pages which are linked from there. Of course, this PageRank is passed on to other pages and, if we assume that
health-related websites tend to link more often to other websites within that topic, pages on the topic health
generally receive a higher PageRank.
Haveliwala confirms the incompleteness of choosing the Open Directory Project in
order to identify topics, which for example results in a high degree of dependence on ODP editors and in a rather
rough subdivision into topics. But, as Haveliwala states, his method shows good results and it can surely be
improved without big effort.
However, one crucial point in Haveliwala's work on Topic-Sensitive-PageRank is the
identification of the user's preferences. Having a Topic-Specific PageRank is useless as long as we do not know in
which topics an actual user is interested. In the end, search results must be based on the PageRank that matches
the user's preferences best. The Topic-Sensitive PageRank can only be used if these are known.
Indeed, Haveliwala does supply some practicable approaches for the identification of
user preferences. He describes, for example, the search in context by highlighting terms on a web page. In this
way, the content of that web page could be an indicator for waht the user is looking for. At this point, we want to
note the potential of the Google Toolbar. The Toolbar submits data regarding search terms and pages that a user has
visited to Google. This data can be used to create user profiles which can then be a basis for the identification
of the user's preferences. However, even without using such techniques, it is imaginable that a user simply chooses
the topic he is interested in before he does a query.
The Weighting of Links Based on
Content Analyses
That it is possible to weight single links
within the PageRank technique has been shown on the previous page. The thought behind weighting links based on
content analyses is to avoid the corrumption of PageRank. By weighting links this way, it is theoretically possible
to diminish the influence of links between thematically unrelated page, which have been set for the sole purpose of
boosting PageRank of one page. Indeed, it is questionable if it is possible to realize such weighting based on
content analyses.
The fundamentals of content analyses are based on Gerard Salton's work in the 1960s
and 1970s. In his vector space model of information retrieval, documents are modeled as vectors which are built
upon terms and their weighting within the document. These term vectors allow comparisons between the content of
documents by, for instance, calculating the cosine measure (the inner product) of the vectors. In its basic form,
the vector space model has some weaknesses. For instance, often the assumption that if and in how far the same
words appear in two documents is an indicator for their similarity is criticized. However, numerous enhancements
have been developed that solve most of the problems of the vector space model.
One person who excelled at publications which are based on Salton's vector space
model is Krishna Bharat. This is interesting because Bharat meanwhile is a member of Google's staff and,
particularly, because he is deemded to be the developer of "Google News" (news.google.com). Google News is a
service that crawls news websites, evaluates articles and then provides them categorized and grouped in different
subjects on the Google New website. According to Google, all these procedures are completely automated. Therefore,
other criteria like, for example, the time when an article is published, are taken into account, but if there is no
manual intervention, the clustering of articles is most certainly only possible, if the contents of the articles
are actually compared to each other. The questions is: How can this be realized?
In their publication on a term vector database, Raymie Stata, Krishna Bharat and
Farzin Maghoul describe how the contents of web pages can be compared based on term vectors and, particularly, they
describe how some of the problems with the vector space model can be solved. Firstly, not all terms in documents
are suitable for content analsysis. Very frequent terms provide only little discrimination across vectors and, so,
the most frequent third of all terms is eliminated from the database. Infrequent terms, on the other hand, do not
provide a good basis for measuring similarity. Such terms are, for example, misspellings. They appear only on few
pages which are likely unrelated in terms of their theme, but because they are so infrequent, the term vectors of
the pages appear to be closely related. Hence, also the least frequent third of all terms is eliminated from the
database.
Even if only one third of all terms is included in the term vectors, this selection
is still not very efficient. Stata, Bharat and Maghoul perform another filtering, so that each term vector is based
on a maximum of 50 terms. But these are not the 50 most frequent terms on a page. They weight a term by deviding
the number of times it appears on a page by the number of times it appears on all pages, and those 50 terms with
the highest weight are included in the term vector of a page. This selection actually allows a real differentiation
between the content of pages.
The methods described above are standards for the vector space model. If, for
example, the inner product of two term vectors is rather high, the contents of the according pages tend the
contents of the according pages tend to be similar. This may allow content comparisons in many areas, but it is
doubtful if it is a good basis for weighting links within the PageRank technique. Most of all, synonyms and terms
that describe similar things can not be identified. Indeed, there are algorithms for word stemming which work good
for the english language, but in other languages word stemming is much more complicated. Different languages are a
general problem. Unless, for instance, brand names or loan words are used, texts in different languages normally do
not contain the same terms. And if they do, these terms normally have a completely different meaning, so that
comparing content in different languages is not possible. However, Stata, Bharat and Maghoul provide a method of
resolution for these problems.
Stata, Bharat und Maghoul present a concrete application for their Term Vector
Database by classifying pages thematically. Bharat has also published on this issue together with Monika Henzinger,
presently Google's Research Director, and they called it "topic distillation". Topic distillation is based on
calculating so-called topic vectors. Topic vectors are term vectors, but they do not only include terms of one page
but rather the terms of many pages which are on the same topic. So, in order to create topic vectors, they have to
know a certain amount of web pages which are on several pre-defined topics. To achieve this, they resort to web
directories.
For their application, Stata, Bharat und Maghoul have crawled about 30,000 links
within each of the then 12 main categories of Yahoo to create topic vectors which include about 10,000 terms each.
Then, in order to identify the topic of any other web page, they matched the according term vector with all the
topic vectors which were created from the Yahoo crawl. The topic of a web page derived from the topic vector which
matched the term vector of the web page best. That such a classification of web pages works can again be observed
by the means of Google News. Google News does not only merge articles to one news topic, but also arranges them to
the categories World, U.S., Business, Sci/Tech, Sports, Entertainment and Health. As long as this categorization is
not based on the structure of the website where the articles come from (which is unlikely), the actual topic of an
article has in fact to be computed.
At the time he published on term vectors, Krishna Bharat did not work on PageRank but
rather on Kleinberg's algorithm, so that he was more interested in filtering off-topic links than in weighting
links. But from classifying pages to weighting links based on content comparisons, there is only a small step.
Instead of matching the term vectors of two pages, it is much more efficient to match the topics of two pages. We
can, for instance, create a "topic affinity vector" for each page based on the degree of affinity of the page's
term vector and all the topic vectors. The better the topic affinity vectors of two pages match, the more likely
are they on the same topic and the higher should a link between them be weighted.
Using topic vectors has one big advantage over comparing term vectors directly: A
topic vector can include terms in different languages by being based on, for instance, the links on different
national Yahoo versions. Deviant site structures of the national versions can most certainly be adapted manually.
Even better may be using the ODP because the structure of the sub-categories of the "World" category is based on
the main OPD structure. In this way, measuring topic similarities between pages in different languages can be
realized, so that a really useful weighting of links based on text analyses appears to be possible.
Is there an Actual Implementation of Themes
in PageRank?
That both the approach of Havelivala and the approach of
Richardson and Domingos are not utilized by Google is obvious. One would notice it using Google. However, a
weighting of links based on text analyses would not be apparent immediately. It has been shown that it is
theoretically possible. But it is doubtful that it is actually implemented.
We do not want to claim that we have shown the only way of weighting links on the
basis of text analyses. Indeed, there are certainly dozens of others. However, the approach that we provided here
is based on publications of important members of Google's staff and, thus, we want to rest a critical evaluation on
it.
Like always, when talking about PageRank, there is the question if our approach is
sufficienly scalable. On the one hand, it causes additional memory requirements. After all, Stata, Bharat and
Maghoul describe the system architecture of a term vector database which is different from Google's inverse index,
since it maps from page ids to terms and, so, can hardly be integrated in the existing architecture. At the actual
size of Google's index, the additional memory requirements should be several hundred GB to a few TB. However, this
should not be so much of a problem since Google's index is most certainly several times bigger. In fact, the time
requirements for building the database and for computing the weigtings appear to be the critical part.
Building a term verctor database should be approximately as time-consuming as
building an inverse index. Of course, many procecces can probably be used for building both but if, for instance,
the weighting of terms in the term vectors has to differ from the weighting of terms in the inverse index, the time
requirements remain substantial. If we assume that, like in our approach, content analyses are based on computing
the inner products of topic affinity vectors which have to be calculated by matching term vectors and topic
vectors, this process should be approximately as time-consuming as computing PageRank. Moreover, we have to
consider that the PageRank calculations themselves beome more complicated by weighting links.
So, the additional time requirements are definitely not negligible. This is why we
have to ask ourselves if weighting links based on text analyses is useful at all. Links between thematically
unrelated page, which have been set for the sole purpose of boosting PageRank of one page, may be annoying, but
most certainly they are only a small fraction of all links. Additionally, the web itself is completely
inhomogeneous. Google, Yahoo or the ODP do not owe their high PageRank solely to links from other search engines or
directories. A huge part of the links on the web are simply not set for the purpose of showing visitors ways to
more thematically related information. Indeed, the motivation for placing links is manifold. Moreover, the
problably most popular websites are completely inhomogeneous in terms of theme. Think about portals like Yahoo or
news websites which contain articles that cover almost any subject of life. A strong weighting of links as it has
been described here could influence those website's PageRanks significantly.
If the PageRank technique shall not become totally futile, a weighting of links can
only take place to a small extent. This, of course, raises the question if the efforts it requires are justifiable.
After all, there are certainly other ways to eliminate spam which often comes to the top of search results through
thematically unrelated and probably bought links.
ALGO SEO UK - Home
|