Web spam: a survey with vision for the archivist

Jácint  Szabó

Web spam: a survey with vision for the archivist

2008

While Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, so far Web spam filtering technologies are rarely used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the state of the art in Web spam filtering illustrated by the recent Web spam challenge data sets and techniques and describe the filtering solution for archives envisioned in the LiWA—Living Web Archives project....Read more

Web Spam: a Survey with Vision for the Archivist * András A. Benczúr Dávid Siklósi Jácint Szabó István Bíró Zsolt Fekete Miklós Kurucz Attila Pereszlényi Simon Rácz Adrienn Szabó Data Mining and Web search Research Group, Informatics Laboratory Computer and Automation Research Institute of the Hungarian Academy of Sciences {benczur, sdavid, jacint, ibiro, zsfekete, mkurucz, peresz, sracz, aszabo}@ilab.sztaki.hu ABSTRACT While Web archive quality is endangered by Web spam, a side ef- fect of the high commercial value of top-ranked search-engine re- sults, so far Web spam ﬁltering technologies are rarely used by Web archivists. In this paper we make the ﬁrst attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We sur- vey the state of the art in Web spam ﬁltering illustrated by the recent Web spam challenge data sets and techniques and describe the ﬁl- tering solution for archives envisioned in the LiWA—Living Web Archives project. Categories and Subject Descriptors H.3 [Information Systems]: Information Storage and Retrieval; I.7.5 [Computing Methodologies]: Document Capture—Docu- ment analysis; I.2.7 [Computing Methodologies]: Artiﬁcial In- telligence—Natural Language Processing General Terms Web Archival, Information Retrieval Keywords Web spam, Text Analysis, Link Analysis, Feature Selection, Docu- ment Classiﬁcation 1. INTRODUCTION The ability to identify and prevent spam is a top-priority issue for the search-engine industry [31] but less studied by Web archivists. The apparent lack of a widespread dissemination of Web spam ﬁl- tering methods in the archival community is surprising in view of the fact that, under different measurement and estimates, roughly 10% of the Web sites and 20% of the individual HTML pages con- stitute spam. The above ﬁgures directly translate to 10–20% waste of archive resources in storage, processing and bandwidth. In this paper we survey existing Web spam ﬁltering technol- ogy and describe the mission of the EU FP7 LiWA—Living Web Archives project to deploy ﬁltering in Internet archives. As part of the LiWA project our objective is to reduce the amount of fake con- tent the archive will have to deal with. The envisioned toolkit will * This work was supported by the EU FP7 project LiWA—Living Web Archives, the eScience Regional Knowledge Centre, Hungary and by grant OTKA NK 72845. IWAW’08 International Web Archiving Workshop This work is licenced under an Attribution-NonCommercial-NoDerivs 2.0 France Creative Commons Licence. help prioritize crawls by automatically detecting content of value and exclude artiﬁcially generated manipulative and useless content. In our opinion Web spam affects all archival institutions unless the archive target is very restricted and controlled. Bulk crawls of entire domains deﬁnitely encounter Web spam. However even if an archive applies selective crawling, community content is likely involved and it is in particular sensible to the so-called comment spam: responses, posts or tags not related to the topic containing link to a target site or advertisement. This form of spam appears whenever there is no restriction for users putting their own content such as blogs [36], bookmarking systems [35] and even YouTube [7]. Spam ﬁltering is essential in Web archives even if we acknowl- edge the difﬁculty of deﬁning the boundary between Web spam and honest search engine optimization. Archives may have to tolerate more spam compared to search engines in order not to loose some content misclassiﬁed as spam that the users may want to retrieve later. Also they might want to have some representative spam ei- ther to preserve an accurate image of the web or to provide a spam corpus for researchers. In any case, we believe that the quality of an archive with completely no spam ﬁltering policy in use will greatly be deteriorated and signiﬁcant amount of resources will be wasted as the effect of Web spam. Spam classiﬁcation and page-quality assessment is a difﬁcult is- sue for search engines; for archival systems it is even more chal- lenging as they lack information about usage patterns (e.g., click proﬁles) at capture time. We survey methods that ﬁt best the needs of an archive that are capable of ﬁltering spam during the crawl process or in a bootstrap sequence of crawls. Our methods com- bine classiﬁcators based on terms over the page and on features built from content, linkage and site structure. We also show some methods based on external information from major search engines [4] that may at least in part be possible to obtain. In addition to individual solutions for speciﬁc archives, LiWA services intend to provide collaboration tools to share known spam hosts and features across participating archival institutions. A com- mon interface to a central knowledge base will be built in which archive operators may label sites or pages as spam based on own experience or suggested by the spam classiﬁcator applied to the local archives. The purpose of the planned LiWA web spam as- sessment interface is twofold: • It aids the Archive operator in selecting and blacklisting spam sites, possibly in conjunction with an active learning environ- ment where human assistance is asked for example in case of contradicting outcome by the classiﬁer ensemble; • It provides a collaboration tool for the Archives with a pos- sible centralized knowledge base through which the Archive

operators are able to share their labels, comments and obser- vations as well as start discussion on the behavior of certain questionable hosts. Web spam ﬁltering know-how became widespread with the suc- cess of the Adversarial Information Retrieval Workshops since 2005 that host the Web Spam Challenges since 2007. Our mission is to disseminate this know-how and adapt to the special needs for the archival institutions with particular emphasis on periodic recrawls and the time evolution of spam such as the disappearance of quality sites that become parking domains used for spamming purposes or spam, once blacklisted, reappearing under a new domain. In order to tie the bonds between the two communities we intend to provide time-aware Web spam benchmark data sets for future Web Spam Challenges. The rest of this paper is organized as follows. In Section 2 we summarize the types of Web spam. Section 3 describes the ﬁltering techniques applied so far over test data. Finally in Section 4 we state our mission and the intended architecture for the archive spam ﬁltering service. This last section contains our plans for compiling time-aware Web spam benchmark collections. 2. WEB SPAM TYPOLOGY The ﬁrst mention of Web spam, termed spamdexing as a combi- nation of words spam and (search engine) indexing, appears proba- bly in a 1996 news article [18] as part of the early Web era discus- sions on the spreading porn content [16]. Due to the large and ever increasing ﬁnancial gains resulting from high search engine ratings, it is no wonder that a signiﬁcant amount of human and machine resources are devoted to artiﬁcially inﬂating the rankings of certain web pages. Amit Singhal, princi- pal scientist of Google Inc. estimated that the search engine spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all com- mercially viable queries [41]. Web spam appears in increasingly sophisticated forms that manipulate content as well as linkage with examples such as • Copied content, "honey pots" that draw attention but link to unrelated, spam targets; • Garbage content, stuffed with popular or monetizable query terms and phrases such as university degrees, online casinos, bad credit status or adult content; • Link farms, a large number of strongly interlinked pages across several domains. The notion of Web spam is less clean compared to email spam since honest and “dark” means of search engine optimization are sometimes hard to distinguish. In [29] a content is called spam only if it is completely irrelevant for the given search terms. In another deﬁnition [39] search engine spam consists of features that main- tainers would not add to their sites if search engines didn’t exist. Link spam however could act even without the presence of a search engine by misleading users to visit certain pages, in particular to hijack payed click trafﬁc in misused afﬁliate programs. Not surprisingly the more restrictive notion of spam is favored by the search engine optimization (SEO) community. Depending on the actual query that the page is optimized for, one may refer to techniques to attain the high search engine ranking as boosting if the page content is otherwise relevant to the query. While giving a sound deﬁnition of Web spam is beyond the scope of this paper, we emphasize that our techniques should eventually be justiﬁed by improved user satisfaction with the archive content. Higher quality Figure 1: Portion of a machine generated content spam page. can hence be achieved by means of “de-boosting”, i.e. allocating more resources to sites of importance where the maintainer has low budget on SEO. Regardless of the deﬁnition used, the Web spammer toolkit con- sists of a clearly identiﬁable set of manipulation techniques that has not changed much recently. The Web Spam Taxonomy of Gyöngyi et al. [29] distinguishes content (term) and link spamming along with techniques of hiding, cloaking and removing traces by e.g. obfuscated redirection. The taxonomy remains valid since its pub- lication in 2006. New areas that opened since then include book- mark, comment and post spam that became widespread with the explosion of the social media. Various top-level or otherwise selected domains may have dif- ferent spamming behavior; Ntoulas et al. [37] give an invaluable comparison that show major differences among national domains and languages of the page. For the .de domain their ﬁndings agree with 16.5% of all pages being spam [6] while for the .uk domain together with Becchetti et al. [3] they report approximately 6%; the latter measurement also reports 16% of sites as spam over .uk. Unfortunately, most of the earlier results consider proprietary crawls and spam data sets. Currently the only reliable Web spam data sets are the Web Spam Challenge .uk crawls, however by comparing the ﬁndings on different top level domains we observe very similar spammer behavior so that we may accept ﬁndings on the Web Spam Challenge data conclusive for all domains. The frac- tion of spam over these data sets are shown in Table 1. It is important to keep in mind that Web spam is an expensive operation requiring the registration and operation of diverse domain names and IP ranges. Unlike for email spam where the target is to reach the user mailbox, Web spammers also compete against each other in obtaining high search engine rankings for their target pages. Web spamming is hence a professional business with the purpose of ﬁnancial gains. In the rest of this section we brieﬂy review techniques for con- tent spam (Section 2.1), link spam (Section 2.2) and hiding (Sec- tion 2.3) as deﬁned in [29]. Finally in Section 2.4 we collect re- cent results on spam in blogs, bookmarking and content sharing services. 2.1 Content spam The ﬁrst generation of search engines relied mostly on the classic vector space model of information retrieval. Thus web spam pio- neers manipulated the content of web pages by stufﬁng it with key- words repeated several times. A large amount of machine gener- ated spam pages such as the one in Fig. 1 are still present in today’s Web. These pages can be characterized as outliers through statis-

∗ Web Spam: a Survey with Vision for the Archivist András A. Benczúr Miklós Kurucz Dávid Siklósi Jácint Szabó István Bíró Zsolt Fekete Attila Pereszlényi Simon Rácz Adrienn Szabó Data Mining and Web search Research Group, Informatics Laboratory Computer and Automation Research Institute of the Hungarian Academy of Sciences {benczur, sdavid, jacint, ibiro, zsfekete, mkurucz, peresz, sracz, aszabo}@ilab.sztaki.hu ABSTRACT While Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, so far Web spam filtering technologies are rarely used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the state of the art in Web spam filtering illustrated by the recent Web spam challenge data sets and techniques and describe the filtering solution for archives envisioned in the LiWA—Living Web Archives project. Categories and Subject Descriptors H.3 [Information Systems]: Information Storage and Retrieval; I.7.5 [Computing Methodologies]: Document Capture—Document analysis; I.2.7 [Computing Methodologies]: Artificial Intelligence—Natural Language Processing General Terms Web Archival, Information Retrieval Keywords Web spam, Text Analysis, Link Analysis, Feature Selection, Document Classification 1. INTRODUCTION The ability to identify and prevent spam is a top-priority issue for the search-engine industry [31] but less studied by Web archivists. The apparent lack of a widespread dissemination of Web spam filtering methods in the archival community is surprising in view of the fact that, under different measurement and estimates, roughly 10% of the Web sites and 20% of the individual HTML pages constitute spam. The above figures directly translate to 10–20% waste of archive resources in storage, processing and bandwidth. In this paper we survey existing Web spam filtering technology and describe the mission of the EU FP7 LiWA—Living Web Archives project to deploy filtering in Internet archives. As part of the LiWA project our objective is to reduce the amount of fake content the archive will have to deal with. The envisioned toolkit will help prioritize crawls by automatically detecting content of value and exclude artificially generated manipulative and useless content. In our opinion Web spam affects all archival institutions unless the archive target is very restricted and controlled. Bulk crawls of entire domains definitely encounter Web spam. However even if an archive applies selective crawling, community content is likely involved and it is in particular sensible to the so-called comment spam: responses, posts or tags not related to the topic containing link to a target site or advertisement. This form of spam appears whenever there is no restriction for users putting their own content such as blogs [36], bookmarking systems [35] and even YouTube [7]. Spam filtering is essential in Web archives even if we acknowledge the difficulty of defining the boundary between Web spam and honest search engine optimization. Archives may have to tolerate more spam compared to search engines in order not to loose some content misclassified as spam that the users may want to retrieve later. Also they might want to have some representative spam either to preserve an accurate image of the web or to provide a spam corpus for researchers. In any case, we believe that the quality of an archive with completely no spam filtering policy in use will greatly be deteriorated and significant amount of resources will be wasted as the effect of Web spam. Spam classification and page-quality assessment is a difficult issue for search engines; for archival systems it is even more challenging as they lack information about usage patterns (e.g., click profiles) at capture time. We survey methods that fit best the needs of an archive that are capable of filtering spam during the crawl process or in a bootstrap sequence of crawls. Our methods combine classificators based on terms over the page and on features built from content, linkage and site structure. We also show some methods based on external information from major search engines [4] that may at least in part be possible to obtain. In addition to individual solutions for specific archives, LiWA services intend to provide collaboration tools to share known spam hosts and features across participating archival institutions. A common interface to a central knowledge base will be built in which archive operators may label sites or pages as spam based on own experience or suggested by the spam classificator applied to the local archives. The purpose of the planned LiWA web spam assessment interface is twofold: ∗This work was supported by the EU FP7 project LiWA—Living Web Archives, the eScience Regional Knowledge Centre, Hungary and by grant OTKA NK 72845. • It aids the Archive operator in selecting and blacklisting spam sites, possibly in conjunction with an active learning environment where human assistance is asked for example in case of contradicting outcome by the classifier ensemble; IWAW’08 International Web Archiving Workshop This work is licenced under an Attribution-NonCommercial-NoDerivs 2.0 France Creative Commons Licence. • It provides a collaboration tool for the Archives with a possible centralized knowledge base through which the Archive operators are able to share their labels, comments and observations as well as start discussion on the behavior of certain questionable hosts. Web spam filtering know-how became widespread with the success of the Adversarial Information Retrieval Workshops since 2005 that host the Web Spam Challenges since 2007. Our mission is to disseminate this know-how and adapt to the special needs for the archival institutions with particular emphasis on periodic recrawls and the time evolution of spam such as the disappearance of quality sites that become parking domains used for spamming purposes or spam, once blacklisted, reappearing under a new domain. In order to tie the bonds between the two communities we intend to provide time-aware Web spam benchmark data sets for future Web Spam Challenges. The rest of this paper is organized as follows. In Section 2 we summarize the types of Web spam. Section 3 describes the filtering techniques applied so far over test data. Finally in Section 4 we state our mission and the intended architecture for the archive spam filtering service. This last section contains our plans for compiling time-aware Web spam benchmark collections. 2. WEB SPAM TYPOLOGY The first mention of Web spam, termed spamdexing as a combination of words spam and (search engine) indexing, appears probably in a 1996 news article [18] as part of the early Web era discussions on the spreading porn content [16]. Due to the large and ever increasing financial gains resulting from high search engine ratings, it is no wonder that a significant amount of human and machine resources are devoted to artificially inflating the rankings of certain web pages. Amit Singhal, principal scientist of Google Inc. estimated that the search engine spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries [41]. Web spam appears in increasingly sophisticated forms that manipulate content as well as linkage with examples such as • Copied content, "honey pots" that draw attention but link to unrelated, spam targets; • Garbage content, stuffed with popular or monetizable query terms and phrases such as university degrees, online casinos, bad credit status or adult content; • Link farms, a large number of strongly interlinked pages across several domains. The notion of Web spam is less clean compared to email spam since honest and “dark” means of search engine optimization are sometimes hard to distinguish. In [29] a content is called spam only if it is completely irrelevant for the given search terms. In another definition [39] search engine spam consists of features that maintainers would not add to their sites if search engines didn’t exist. Link spam however could act even without the presence of a search engine by misleading users to visit certain pages, in particular to hijack payed click traffic in misused affiliate programs. Not surprisingly the more restrictive notion of spam is favored by the search engine optimization (SEO) community. Depending on the actual query that the page is optimized for, one may refer to techniques to attain the high search engine ranking as boosting if the page content is otherwise relevant to the query. While giving a sound definition of Web spam is beyond the scope of this paper, we emphasize that our techniques should eventually be justified by improved user satisfaction with the archive content. Higher quality Figure 1: Portion of a machine generated content spam page. can hence be achieved by means of “de-boosting”, i.e. allocating more resources to sites of importance where the maintainer has low budget on SEO. Regardless of the definition used, the Web spammer toolkit consists of a clearly identifiable set of manipulation techniques that has not changed much recently. The Web Spam Taxonomy of Gyöngyi et al. [29] distinguishes content (term) and link spamming along with techniques of hiding, cloaking and removing traces by e.g. obfuscated redirection. The taxonomy remains valid since its publication in 2006. New areas that opened since then include bookmark, comment and post spam that became widespread with the explosion of the social media. Various top-level or otherwise selected domains may have different spamming behavior; Ntoulas et al. [37] give an invaluable comparison that show major differences among national domains and languages of the page. For the .de domain their findings agree with 16.5% of all pages being spam [6] while for the .uk domain together with Becchetti et al. [3] they report approximately 6%; the latter measurement also reports 16% of sites as spam over .uk. Unfortunately, most of the earlier results consider proprietary crawls and spam data sets. Currently the only reliable Web spam data sets are the Web Spam Challenge .uk crawls, however by comparing the findings on different top level domains we observe very similar spammer behavior so that we may accept findings on the Web Spam Challenge data conclusive for all domains. The fraction of spam over these data sets are shown in Table 1. It is important to keep in mind that Web spam is an expensive operation requiring the registration and operation of diverse domain names and IP ranges. Unlike for email spam where the target is to reach the user mailbox, Web spammers also compete against each other in obtaining high search engine rankings for their target pages. Web spamming is hence a professional business with the purpose of financial gains. In the rest of this section we briefly review techniques for content spam (Section 2.1), link spam (Section 2.2) and hiding (Section 2.3) as defined in [29]. Finally in Section 2.4 we collect recent results on spam in blogs, bookmarking and content sharing services. 2.1 Content spam The first generation of search engines relied mostly on the classic vector space model of information retrieval. Thus web spam pioneers manipulated the content of web pages by stuffing it with keywords repeated several times. A large amount of machine generated spam pages such as the one in Fig. 1 are still present in today’s Web. These pages can be characterized as outliers through statis- Similar misuses of eBay or the German Scout24.de affiliate program is also common practice [6]. It is realized in [14, 4] that spam is characterized by its success in a search engine that does not deploy spam filtering over popular or monetizable queries. Lists of such queries can be obtained from search engine query logs or via AdWords, Google’s flagship pay-per-click advertising product (http://adwords.google.com). 2.2 Figure 2: Top hit list of a major search engine for the query “download freemp3 digitalcamera microsoft linux” occupied by keyword stuffed pages. Link spam Following Google’s success all major search engines quickly incorporated link analysis algorithms such as HITS [33] and PageRank [38] into their ranking schemes. The birth of the highly successful PageRank algorithm [38] was indeed partially motivated by the easy spammability of the simple in-degree count. Unfortunately PageRank (together with probably all known link based ranking schemes) are prone to spam. Spammers build socalled link farms, large collections of tightly interconnected Web sites over diverse domains that eventually all point to the targeted page. The rank of the target will be large regardless of the ranking method due to the large number of links and the tightly connected structure. An example of a well-known link farm in operation for several years now is the 411Web page collection; the content of these sites is likely not spam (indeed they are not excluded from Google) but form a strongly optimized subgraph that illustrates the operation of a link farm well. 2.3 Cloaking and hiding The article of [29] list a few methods that confuse users including term hiding (background color text) and redirection; some of these techniques can still be found by inspecting the HTML code within the page source. Detecting redirection may already require certain expertise as quite a number of so-called doorway spam pages use obfuscated JavaScript code to redirect to their target. These pages deploy the idea that a Web crawler has limited resources to execute scripts. A very simple example, tabulated for better readability, is seen below: <SCRIPT language=javascript> var1=100;var3=200;var2=var1 + var3; var4=var1;var5=v if(var2==var5) document.location="http://umlander.info/ mega/fre </SCRIPT> Figure 3: A page with no content other than Google ads. tical analysis [24] targeting the templatic nature: their term distribution, entropy or compressibility distinguishes them from normal content. Large number of phrases appearing in other web pages as well are also characterize spam [25]. Sites exhibiting excessive phrase reuse are either template driven or spam, employing the so called stitching technique. Ntoulas et al. [37] describe content spamming characteristics including overly large number of words either in the entire page or in the title or anchor text, as well as the fraction of page drawn from popular words and the fraction of most popular words that appear in the page. As noticed by Gyöngyi and Garcia-Molina [28], most spammers act for financial gain. Spam target pages are thus stuffed with a large number of keywords that are either of high advertisement value or highly spammed, including misspelled popular words such as “googel” or “accomodation” as seen among the top hits of a major search engine in Fig. 2. A page full of Google ads and maybe even no other content at all is also a typical spammer technique to misuse Google AdSense for financial gains [4] as seen in Fig. 3. An HTTP-specific misuse is providing different content for human browsers and search engine robots. This so-called cloaking technique is hard to catch in a fixed crawl snapshot and may undermine the coherence of an archive. Cloaking is very hard to detect; the only method is described by Chellapilla and Chickering [17] who aid their cloaking detection method by using the most frequent words from the MSN query log and highest revenue generating words from the MSN advertisement log. In theory cloaking could be detected by comparing crawls with different user agent strings and IP addresses of the robots. Spammers however tackle robot behavior, collect and share crawler IP addresses and hence very effectively distinguish robots from human surfers. Spam may also circumvent a post-processing Web spam filter since the crawler may believe the spam site to be honest based on outdated information. The widespread use of parking domains for spamming purpose illustrates this phenomenon. Spammers purchase sites that terminate their operation and fill them with spam. For some additional time these sites appear with their previous content both in the search engine index and also in the input for the spam classificator. The crawler will meanwhile fetch the new content believed to be honest, follows its links and prioritizes its normal spam undecided total labeled total hosts WEBSPAM-UK2006 8,123 2,113 426 10,662 10,662 WEBSPAM-UK2007 5,709 344 376 6,429 114,529 Table 1: The number of hosts in the Web Spam Challenge data sets. Figure 5: A guestbook filled with comment spam. processes in favor for the spammer’s target. The HTML code excerpt in Fig. 4 shows the use of a parking domain for spamming, in combination with hiding content from human users by using stylesheets. 2.4 Spam in social media New types of spam become widespread with the explosion of social media. Blog comment spam [36] consists of adding comments such as “nice page” and a URL pointing to a link farm as in Fig. 5. Since a search engine may not want to blacklist the entire blog, the target URL may receive trusted inlinks in the search engine’s ranking. Similar forms of spam appear in bookmarking systems [35]. With a slightly different purpose, video comment spam appears in YouTube [7] consisting of commercials or unrelated low-quality posts added as comment to a large number of popular posts. 3. FILTERING: THE STATE OF THE ART As Web spammers manipulate several aspects of content as well as linkage [29], effective spam hunting must combine a variety of content [24, 37, 25] and link [30, 23, 43, 6, 5, 40, 44] based methods. The current LiWA solution is based on the lessons learned from the Web Spam Challenges [13]. As it has turned out, the feature set described in [14] and the bag of words representation of the site content [1] give a very strong baseline with only minor improvements achieved by the Challenge participants. We use the combination, listed in the observed order of their strength, of the following classificators: SVM over tf.idf; an augmented set of the statistical spam features of [14]; graph stacking [20]; text classification by latent Dirichlet allocation [8] as well as by compression [19]. Except from very early results such as “nepotistic” link detection [21] dating back to year 2000, roots of spam filtering methods appear in trust propagation in P2P systems [22, 32]. Trust and distrust propagation methods [30, 40, 23, 44, 5] remain the most important methods of link spam filtering. The methods of this section are illustrated over the Web Spam Challenge data sets. The Web Spam Challenge was first organized in 2007 over the WEBSPAM-UK2006 data set; the last challenge, over the WEBSPAM-UK2007 set was held in conjunction with the Adversarial Information Retrieval (AIRWeb) workshop in 2008 [13]. The Challenge data set sizes and label distributions are shown in Table 1. One may observe that the new data set has less labels but much more sites and hence preferred as it provides a more comprehensive collection of the .uk domain. Both the UK-WEBSPAM dataset and the LIWA results operate on the unit of a site or host name, rather than individual pages or site subsections. This operation is desirable for all applications except for comment spam filtering not just for efficiency. Even if pages of the same site could be reliably split, there is often strong relationship between the operators of the subsites. Based on this fact a classificator will very likely learn the rule to classify all related hosts (even if just the IP address matches) spam if it is convinced that at least one subsite contains spam. 3.1 Content spam filtering Among the early content spam papers, Fetterly et al. [24] demonstrated that a sizable portion of machine generated spam pages can be identified through statistical analysis. Ntoulas et al. [37] introduce a number of content based spam features including number of words in page, title, anchor as well as the fraction of page drawn from popular words and the fraction of most popular words that appear in the page. Spam hunters use a variety of additional content based features [12, 25] to detect web spam; a recent measurement of their combination appears in [14] who also provide these methods as a public feature set for the Web Spam Challenges. Spam can also be classified purely based on the terms used. Perhaps the strongest SVM based content classification is described in [1]. In addition to the public Web Spam Challenge features we used the following features as well: the number of document formats (.pdf etc), the existence and value of robots.txt and robots meta; the existence and average of server last modified dates; finally the distance and personalized PageRank from DMOZ sites. We classified by decision trees. In the LiWA solution content classification quality is improved by adding classifiers based on latent Dirichlet allocation and text compression. In the multi-corpus LDA technique [8] we create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and as non-spam. In this way collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. We also deploy text compression, a method first used when email spam detection methods applied to Web spam were presented at the Web Spam Challenge 2007 [19]. Similar to [19] we use the method of [12] that compresses spam and nonspam separately; features are defined based on how well the document in question compresses with spam and nonspam, respectively. Finally we augmented the public challenge features [14] by two features suggested in [4]: the Online Commercial Intention (OCI) <div style="position:absolute; top:20px; width:600px; height:90px; overflow:hidden;"> <font size=-1>atangledweb.co.uk currently offline<br> atangledweb.co.uk back soon<br></font><br><br> <a href="http://www.atangledweb.co.uk"><font size=-1>atangledweb.co.uk</font></a><br><br><br> Soundbridge HomeMusic WiFi Media Play<a class=l href="http://www.atangledweb.co.uk/index01.html">-</a>>... SanDisk Sansa e250 - 2GB MP3 Player -<a class=l href="http://www.atangledweb.co.uk/index02.html">-</a>>... AIGO F820+ 1GB Beach inspired MP3 Pla<a class=l href="http://www.atangledweb.co.uk/index03.html">-</a>>... Targus I-Pod Mini Sound Enhancer<a class=l href="http://www.atangledweb.co.uk/index04.html">-</a>>... Sony NWA806FP.CE7 4GB video WALKMAN <a class=l href="http://www.atangledweb.co.uk/index05.html">-</a>>... Figure 4: The use of a parking domain to impute spam pages into a Web crawl. value assigned to an URL in a Microsoft adCenter Labs Demonstration as well as measures of how well a page fits to the most competitive queries. Here query competitivity is measured by Google AdWords. This method is similar to those of Castillo et al. [14] who measure the popularity (frequency) of words via an in-house query log. In Google AdWords, advertisers bid on keywords and their ads are displayed as sponsored links alongside the organic search results. The AdWords Keyword Tool1 , which is also available as the API call getKeywordsFromSite(), recommends keywords for a site in the form of a tuple (group, volume, competition, phrase). Volume shows the relative amount of users searching for that keyword on Google on a scale 1–5 and advertiser competition shows the relative amount of advertisers bidding on that keyword on the same scale. In addition, for a query word or phrase, we can obtain the following information: estimated average cost per click CPC; the estimated ad positions, the average position in which the ad may show, expressed in ranges between an upper and lower value. Based on these estimates we define the page cost of a document by summing up the CPC value of each (known) word occurrence in it and then we average the page costs over each host. 3.2 Link spam filtering Recently several results have appeared that apply rank propagation to extend initial trust or distrust judgments over a small set of seed pages or sites to the entire web, such as trust [30, 44], distrust [40, 23] propagation in the neighborhood or their combination [43] as well as graph based similarity measures [5]. These methods are either based on propagating trust forward or distrust backwards along the hyperlinks based on the idea that honest pages predominantly point to honest ones, or, stated the other way, spam pages are backlinked only by spam pages. Trust and distrust propagation in trust networks originates in Guha et al. [27]; Wu et al. [43] is the first to show its applicability for Web spam classification. Trust and distrust propagation are in fact forms of semi-supervised learning surveyed by Zhu [45], a methodology to exploit unlabeled instances in supervised classification. Stacked graphical learning introduced by Kou and Cohen [34] is a simple implementation that outperforms the computationally expensive variants [34, 14]. Graph stacking, a methodology used with success for Web spam detection by e.g. [14] is performed under the classifier combination framework as follows. First the base classifiers are built and combined that give prediction p(u) for all the unlabeled nodes u. Next for each node v we construct new features based on the predicted p(u) of its neighbors and the weight of the connection between u and v as described in [20] and classify them by a decision tree. Finally classifier combination is applied to the augmented set of classification results; this procedure is repeated in two iterations as 1 https://adwords.google.com/select/ KeywordToolExternal Table 2: F and AUC measures for the UK2007-WEBSPAM data set with different sets of features used along the baseline classifiers. F AUC Public content [14] 0.249 0.777 Public link [14] 0.196 0.723 Sonar 0.224 0.698 host level stacked, 1 iteration 0.308 0.814 host level stacked, 2 iterations 0.333 0.812 combined stacked, 1 iteration 0.335 0.818 combined stacked, 2 iterations 0.316 0.826 suggested by [14, 20]. In prior results, stacked graphical learning considered edges between the units of the classification, thus the information on the internal linkage as well as the location of an in or outlink within the site structure is lost for the classification process. We used stacked graphical features based on the “Connectivity Sonar” of Amitay et al. [2]. These include the distribution of in and outlinks labeled spam within the site; the average level of spam in and outlinks; the top and leaf level link spamicity. 3.3 Combination and results In the LiWA Spam Classificator Ensemble we split features into related sets and for each we use the best fitting classifier. These classifiers are then combined by random forest, a method that, in our crossvalidation experiment, outperformed logistic regression suggested by [19]. We used the classifier implementations of the machine learning toolkit Weka [42]. Our results for Web site classification, in terms of F and AUC measures over the Challenge test data, are shown in Table 2. For both measures the higher the value, the better the quality. While F measures the quality of a binary classification, AUC tests the quality of the order of the predicted spamicity and hence in general believed to characterize the behavior better. For the definition see e.g. [14]. The Web Spam Challenge 2008 best result [26] achieved an AUC of 0.84 by also using ensemble undersampling [15]. The computational resources for the filtering procedure are moderate. Content features are generated by reading the collection once with the exception of measuring the fit to popular or monetizable queries. For this last step we used and expensive step of building a complete search engine index, but a less costly approximation is described in [14]. For link features a typically only a host graph has to be built, which is very small even for large batch crawls. Training the classifier for a few 100,000 sites can be completed within a day on a single CPU on a commodity machine with 4-16GB RAM; here costs strongly depend on the classifier implementation. Given the trained classificator, a new site can be classified even at crawl time if the crawler is able to compute the required feature set for the new site encountered. 4. VISION FOR WEB ARCHIVES While no single Web archive will likely have spam filtering resources comparable to a major search engine, our envisioned method facilitates the collaboration and knowledge sharing between specialized archives. Hence the first main foreseen achievement is summarized as follows. Archive operators will be able to unite their manual efforts, in particular for spam that span across domain boundaries. To illustrate2 , assume that an archive targets the .uk domain. The crawl encounters site www.discountchildrensclothes.co.uk that contains redirection to the .com domain that further redirects to .it. These .it sites were already flagged spam by another partner, hence their knowledge can be incorporated into the current spam filtering procedure. As the above example illustrates, the collaboration between Web archives in different countries is useful and the LiWA developments are planned to aid the international web archiving community in building and maintaining a world wide data set of web spam. Since most features, and in particular link features are language independent, a global collection will help all archives regardless of their target to level domain. The need for manual labeling is the single most important blocker of high-quality spam filtering. In addition to label sharing, the envisioned solution will also act in coordinating the labeling efforts in an active learning environment: Manual assessment will be supported by a target selection method that proposes sites of a target domain ambiguously classified based on existing common knowledge. Compared to Web spammers, email spammers are in advantage since they may test their message content over a variety of open source or freely available filters. Given unrestricted access to the methods and known labeled sites, all a Web spammer needs is a crawl in the neighborhood of the targeted site to test the efficiency of passing the filters. The filtering mechanism will remain closed in order to prevent spammers from testing their method over a public service, a known and very effective practice of email spammers against open source email servers. In the bulk of this section we discuss the procedures for the spam assessment interface and finally an application planned for the purposes of a future Web spam challenge designed to test the needs of the Web archives. The assessment procedure is summarized in Table 3. The procedure starts by crawling that is best performed in a bootstrapping fashion as indicated by the loop between lines 1–7. The crawl (line 2) and feature generation (line 3) uses local resources; however even in the first step we may use, via the LiWA collaboration service: • Known labels of other institutions who previously visited (part of) the domain; 2 Site names made up for illustration based on real examples that however changed since visited during the Web Spam Challenge 2008 crawl. Table 3: The assessment procedure of a Web Archive operator 1: repeat 2: Crawl by using possible information from prior crawls or partner institutions 3: Process the crawl and generate features by local resources 4: Classify hosts in the crawl by the model obtained from the central service 5: Assess by manual labeling in an active learning framework 6: Revise the labels 7: until satisfied with the filtering quality • The classification model compiled by the shared knowledge of all participating institutions. The first three steps can also be performed in a bootstrapping manner, first obtaining a shallow crawl and build the classifier in line 4. This classifier can then be used even at the time of the next crawl to classify a yet unseen site • purely by its (already visible) in-linkage; • by downloading and analyzing a first few sample pages from the site. If now the archive operator is dissatisfied with the current filtering results and suspects domain specific spammer activities, the manual assessment phase may begin in line 5. The assessment interface described in detail in Section 4.1 presents those sites to the user whose label is expected to improve the most over the classification accuracy. The system hence naturally implements an active learning framework. This methodology is also capable of assisting in determining filtering performance by providing the operator with ambiguous hosts together with the classifier outcome explained in human comprehensible terms as much as possible. The revision of existing labels (line 6) is necessary both in case several operators, possibly sitting in different institutions, disagree in the label of the same site or if time evolution of the site requires the revision of an earlier label. As another fictitious example reminiscent of real hosts, we show the possible time evolution of the www.corsiex.co.uk site: 1. The site appears with real content and gathers some in-links. 2. The owner gives up the domain but it still remains in search engine caches; archives also aware of the earlier state and slowly adapt to the change. 3. At this time a SEO company buys the “parking” domain and fills with Google ads, redirects or links to other hosts in their spam farm. 4. Major search engines realize this change, erase the old site from their cache and blacklist the site. 5. Spammers realize they are blacklisted and give up their operation over this domain. The domain becomes empty with dead links possibly still existing from the initial era of useful content over the site. 6. The original or a new owner re-registers the domain for a purpose probably similar to the initial one. The site gathers content and linkage while still blacklisted. Archives require a twofold caution: they may build a false classification model if features are generated from an incorrectly timed snapshot in addition to losing quality content from the archive due to the blacklist in effect. As the above example illustrates, the revision phase is crucial in an archive. To aid the procedure, a table will be presented with explanations provided by the individual classifiers and the human assessors. Examples may include • Whois information, with links to other sites of the owner. • probability of belonging to certain automatically generated spam topic such as bad credit history loans or no school needed university degrees; In a particular first implementation we fill this interface with the 12 crawl snapshots of the .uk domain gathered by the UbiCrawler between June 2006 and May 2007 [10]. In the “Links and pages” tab we show 12 bits for the presence of the page while the access bar in the bottom of the assessment page shows “jun, jul, . . . , may” and “now”, color coded for availability and navigation. In a future possible extension, the above interface together with a centralized service may act as a knowledge base also for crawler traps, crawling strategies for various content management technologies and other issues related to Web host archival behavior. • human explanations such as “banner.html is clearly spam” or “normal though has inlinks from a SEO site”. 4.2 • most characteristic link or content features that yield high spam score in certain classifiers; • list of known spam in or out-links; The table may be sorted according to the level of disagreement between different assessors, thus also providing a discussion forum between the experts even across different institutions. The entire cycle can be used in an iteration cycle to gradually improve the filtering quality as well as to maintain it as the content of the archived domain evolves. The cycle could terminate in line 7 when the spam filtering quality is satisfactory; however termination of the assessment step is unlikely due to the time evolution of the Web. 4.1 The assessment interface mockup Next we describe the mockup of the assessment page modeled by the Web Spam Challenge 2007 volunteer interface [13]. The right side is for browsing in a tabbed fashion. In order to integrate the temporal dimension of an archive, the available crawl times are shown (called access bar). Upon clicking, the page which appears is the one with crawl date the closest to the crawl date of the linking page. The selected version of the linked page can be either cached at some partner archive or the current version downloaded from the web. We use Firefox extension techniques similar to Zotero to note and organize information without messing about with rendering, frames and redirection. The possibility to select between a stored and the currently available pages also helps in detecting cloaking (see Section 2.3). The right side also contains in and outlinks as well as list or sample pages of the site. By clicking on an in or outlink, we may obtain all possible information in all the subwindows from the central service. The upper part of the left side is to do the assessment. Button “NEXT” links to the next site to be assessed in the active learning framework and “BACK” to the review page. When “NEXT” or “BACK” is pushed, the assigned label is saved. Before saving a “spam” or “borderline” label, a popup window appears requesting for explanation and spam type. Spam type can be: general, link or content and the appropriate types should be ticked. The ticked types appear as part of the explanation. Although not shown on the figure, a text field is available for commenting the site. The explanations and comments appear in the review page. The lower left part of the assessment page contains four windows in a tabbed fashion. • The labels already assigned to this site (number of the four possible types each) with comments if any. • Various spam classifier scores (see Section 3), and an LDAbased content model [8]. • Various site attributes selected by the classification model as most appropriate for deciding the label of the site. Time aware Web Spam Collection generation The proposed interface is also planned to serve the volunteers of a future Web spam challenge, an event organized by the research community. Here volunteer assessors will use the interface to label a centrally generated random sample of certain gold standard Web spam data set such as the recent UK2007-WEBSPAM [13]. In order to take the temporal change of the corpus into account, we will compile a new crawl of the .uk domain around the labeled sites of the UK2007-WEBSPAM data set. Note that a careful selection procedure is required since the Internet Archive crawl of the .uk domain consists currently of 2M sites, an amount that exceeds the capacity of the possible Challenge participants and organizers. In the future two tasks are possible: 1. New site classification. Assessors label sites that are not present in the 12 original .uk snapshots; only UK2007-WEBSPAM labels are available for training. 2. Time evolution feature generation. In a sequence of periodic recrawls generate time-aware spam features as well as perceive changes in the behavior of certain sites. The data sets required for the first task is relative easy to compile and planned in a possible near future. For the second task more crawls are needed together with a specific scope that limits the volume of the data to be processed by participants. Conclusion We have reported the work conducted by the LiWA FP7 project for preventing, detecting and eliminating web spam, i.e., fake pages that try to mislead web users. We have analyzed and discussed Web spam filtering in relation with web archives by enumerating known types of Web spam together with the state-of-the-art filtering technologies. We have envisioned an architecture to facilitate the cooperation between web archives in different domains and countries to effectively fight against spam. The methods presented and illustrated over the 100,000 page UK2007-WEBSPAM snapshot of the .uk domain promise effective filtering even at crawl time. Acknowledgment To Sebastiano Vigna, Paolo Boldi and Massimo Santini for providing us with the UbiCrawler crawls [9, 10]. In addition to them, also to Illaria Borodino, Carlos Castillo and Debora Donato for discussions on the WEBSPAM-UK data sets [11] and ideas on a possible new Web Spam Challenge based on periodic recrawls. And last but not least to Julien Masanés for discussions on the Spam filtering requirements in a Web archive. Figure 6: The mockup for the assessment interface 5. REFERENCES [1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008. [2] E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The Connectivity Sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HT), pages 38–47, Nottingham, United Kingdom, 2003. [3] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2006. [4] A. A. Benczúr, I. Bíró, K. Csalogány, and T. Sarlós. Web spam detection via commercial intent analysis. In Proceedings of the 3th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with WWW2007, 2007. [5] A. A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with SIGIR2006, 2006. [6] A. A. Benczúr, K. Csalogány, T. Sarlós, and M. Uher. SpamRank – Fully automatic link spam detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), [7] [8] [9] [10] [11] [12] [13] [14] held in conjunction with WWW2005, 2005. To appear in Information Retrieval. F. Benevenuto, T. Rodrigues, V. Almeida, J. Almeida, C. Zhang, and K. Ross. Identifying video spammers in online social networks. In AIRWeb ’08: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web. ACM Press, 2008. I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet Allocation in Web Spam Filtering. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008. P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):721–726, 2004. P. Boldi, M. Santini, and S. Vigna. A Large Time Aware Web Graph. SIGIR Forum, 42, 2008. I. Bordino, P. Boldi, D. Donato, M. Santini, and S. Vigna. Temporal evolution of the uk web, 2008. A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673–2698, 2006. C. Castillo, K. Chellapilla, and L. Denoyer. Web spam challenge 2008. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008. C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] the web topology. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430, 2007. N. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004. C. Chekuri, M. H. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In Proceedings of the 6th International World Wide Web Conference (WWW), San Jose, USA, 1997. K. Chellapilla and D. M. Chickering. Improving cloaking detection using search query popularity and monetizability. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pages 17–24, Seattle, WA, August 2006. E. Convey. Porn sneaks way back on web. The Boston Herald, May 1996. G. Cormack. Content-based Web Spam Detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2007. K. Csalogány, A. Benczúr, D. Siklósi, and L. Lukács. Semi-Supervised Learning: A Comparative Study for Web Spam and Telephone User Churn. In Graph Labeling Workshop in conjunction with ECML/PKDD 2007, 2007. B. D. Davison. Recognizing nepotistic links on the web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, pages 23–28, Austin, TX, 2000. J. Douceur. The Sybil Attack. In Proceedings of the first International Peer To Peer Systems Workshop (IPTPS), volume 2429 of Lecture Notes in Computer Science, pages 251–260, Cambridge, MA, USA, January 2002. Springer. I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proceedings of the 16th European Conference on Machine Learning (ECML), volume 3720 of Lecture Notes in Artificial Intelligence, pages 233–243, Porto, Portugal, 2005. D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics – Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France, 2004. D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005. G. Geng, X. Jin, and C. Wang. CASIA at WSC2008. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008. R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In Proceedings of the 13th International World Wide Web Conference (WWW), pages 403–412, 2004. Z. Gyöngyi and H. Garcia-Molina. Spam: It’s not just for inboxes anymore. IEEE Computer Magazine, 38(10):28–34, October 2005. Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005. View publication stats [30] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada, 2004. [31] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11–22, 2002. [32] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina. The EigenTrust algorithm for reputation management in P2P networks. In Proceedings of the 12th International World Wide Web Conference (WWW), pages 640–651, New York, NY, USA, 2003. ACM Press. [33] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. [34] Z. Kou and W. W. Cohen. Stacked graphical models for efficient inference in markov random fields. In SDM 07, 2007. [35] B. Krause, A. Hotho, and G. Stumme. The anti-social tagger - detecting spam in social bookmarking systems. In Proc. of the Fourth International Workshop on Adversarial Information Retrieval on the Web, 2008. [36] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005. [37] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83–92, Edinburgh, Scotland, 2006. [38] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford University, 1998. [39] A. Perkins. White paper: The classification of search engine spam, September 2001. Online at http://www.silverdisc.co.uk/articles/spam-classification/ (visited June 27th, 2005). [40] PR10.info. BadRank as the opposite of PageRank, 2004. http://en.pr10.info/pagerank0-badrank/ (visited June 27th, 2005). [41] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004. [42] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005. [43] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Workshop on Models of Trust for the Web, Edinburgh, Scotland, 2006. [44] B. Wu, V. Goel, and B. D. Davison. Topical TrustRank: Using topicality to combat web spam. In Proceedings of the 15th International World Wide Web Conference (WWW), Edinburgh, Scotland, 2006. [45] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.

Log In

Web spam: a survey with vision for the archivist