Archive for the 'Spam' Category

Crawling Blogs

Thursday, October 9th, 2008

During a period in which my blog was updated only once, this is how Feedburner saw bot traffic.

[Image: blogcrawler]

Note that crawling blogs is an interesting problem:

  • Recency is critical
  • Ping servers are available, albeit with incomplete coverage
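To make the two points above concrete, here is a minimal sketch of how a crawler might combine pings with fallback polling. The class name `CrawlQueue`, its methods, and the one-hour default interval are all hypothetical, chosen only for illustration; this is not how any particular crawler works. A ping makes a blog due immediately (recency), while every blog is also re-polled on a fixed interval because ping coverage is incomplete.

```python
import heapq


class CrawlQueue:
    """Hypothetical scheduler: pinged blogs jump the queue, the rest are polled."""

    def __init__(self, poll_interval=3600.0):
        self.poll_interval = poll_interval  # assumed fallback polling period, in seconds
        self._heap = []   # entries are (due_time, url); may contain stale entries
        self._due = {}    # url -> the only due_time that is currently valid

    def _schedule(self, url, due):
        self._due[url] = due
        heapq.heappush(self._heap, (due, url))

    def add(self, url, now):
        """Register a blog; with no ping, it is polled after poll_interval."""
        self._schedule(url, now + self.poll_interval)

    def ping(self, url, now):
        """A ping server reported an update: make the blog due right away."""
        self._schedule(url, now)

    def pop_due(self, now):
        """Return the next URL that is due at time `now`, or None."""
        while self._heap and self._heap[0][0] <= now:
            due, url = heapq.heappop(self._heap)
            if self._due.get(url) == due:  # skip entries superseded by a later schedule
                self._schedule(url, now + self.poll_interval)
                return url
        return None


queue = CrawlQueue(poll_interval=100)
queue.add("a.example/feed", now=0)
queue.add("b.example/feed", now=0)
queue.ping("b.example/feed", now=10)
first = queue.pop_due(now=10)  # the pinged blog is fetched first
```

Stale heap entries are skipped lazily rather than removed, which keeps `ping` cheap at the cost of a slightly larger heap.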

Crawling blogs is also highly resource intensive:

  • Network latency
  • Disk access/write latency

How do your numbers look?

Scott Huffman on Search Evaluation at Google

Tuesday, September 16th, 2008

Search continues to present many interesting problems: some focus on new parameters, others on rewiring existing ones, and a few on shielding these parameters from adversaries (e.g. spammers). In a blog post, Scott Huffman shares how Google evaluates improvements.

..but we are constantly evaluating everything, which can include:
- proposed improvements to segmentation of Chinese queries
- new approaches to fight spam
- techniques for improving how we handle compound Swedish words
- changes to how we handle links and anchortext
- and everything in between

Though spam and the web graph feature prominently, it is interesting to note how often "internationalization" appears in these evaluation examples, reflecting Google’s overall push in this direction. Evaluation relies on click-through improvements and statistically sound relevance metrics.

Evaluation metrics, I think, are one of those areas where academia can greatly influence search. This is one of the more theory-centric problems in search, not limited by the lack of large information retrieval data sets. Indeed, the recently concluded SIGIR conference featured many papers in this direction.

Score Standardization for Inter-Collection Comparison of Retrieval Systems [PDF]
W. Webber, A. Moffat and J. Zobel  (University of Melbourne)

The Good and the Bad System: Does the Test Collection Predict Users’ Effectiveness? [PDF]
A. Al-Maskari, M. Sanderson and P. Clough  (University of Sheffield)

Retrieval Sensitivity Under Training Using Different Measures [blog]
B. He, C. Macdonald and I. Ounis  (University of Glasgow)

Evaluation Over Thousands of Queries [PDF]
B. Carterette, V. Pavlu, E. Kanoulas, J. Allan, and J. A. Aslam  (University of Massachusetts Amherst/Northeastern University)

Novelty and Diversity in Information Retrieval Evaluation [PDF]
C. Clarke, M. Kolla, G. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon  (University of Waterloo)

Relevance Assessment: Are Judges Exchangeable and Does it Matter [PDF]
P. Bailey, N. Craswell, I. Soboroff, P. Thomas, A. de Vries and E. Yilmaz  (NIST/Northeastern University/Microsoft/CWI/CSIRO ICT Centre)

Intuition-Supporting Visualization of User’s Performance Based on Explicit Negative Higher-Order Relevance [link]
H. Keskustalo, K. Jarvelin, A. Pirkola and J. Kekalainen  (University of Tampere)

Elsewhere, comments on the article:

Seo Dialect:

The rest of the points are things we’ve been hearing from Google for a long time. We know they’re progressing on universal and personalization search efforts, all in their famous intent to create the best user experience.

Webtribution:

Anyone remotely involved in SEO or digital marketing should always take advantage of any information / insight Google opens to the public.

SearchEngineCaffe:

One of my biggest issues with TREC and similar environments is the single focus on relevance … for example, a spam post that is relevant to a topic would be acceptable, even if you would never want to read it in real life. It’s time we move beyond the basics and find ways to tackle the more challenging retrieval quality aspects…

There is also a discussion at WebmasterWorld.

For readers interested in the overall problem of IR evaluation, the paper "IR evaluation methods for retrieving highly relevant documents" by Kalervo Järvelin and Jaana Kekäläinen offers an excellent introduction.
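As a small illustration of graded-relevance evaluation, below is a sketch of discounted cumulated gain (DCG), a measure from this line of work. The gain values, the log base of 2, and the function names are my own assumptions for the example; normalizing by the ideal ordering is a commonly used convention, not something taken from the posts quoted here.

```python
import math


def dcg(gains, base=2):
    """Discounted cumulated gain: ranks below `base` are not discounted,
    later ranks are divided by log_base(rank)."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        if rank < base:
            total += gain
        else:
            total += gain / math.log(rank, base)
    return total


def ndcg(gains, base=2):
    """DCG divided by the DCG of the same gains in ideal (descending) order."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments (0 = not relevant, 3 = highly relevant)
# for the top six results of one query, in ranked order.
judged = [3, 2, 3, 0, 1, 2]
score = dcg(judged)
normalized = ndcg(judged)
```

Because highly relevant documents carry larger gains and late ranks are discounted, a ranking that buries its best document scores lower than one that leads with it, which binary precision alone would not show.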

40% Japanese Blogs are Spam

Sunday, August 3rd, 2008

Adam Richards (MutantFrog) points to a report from CNET that 40% of blogs hosted on the popular platform "Nifty" are spam.

Japanese web portal Nifty has announced findings that a full 40% of Japanese blogs are set up as nothing but ad platforms to suck up clicks and affiliate bonuses.

.. A Nifty-affiliated research body randomly sampled 100,000 blog entries per month using the filter between October 2007 and February 2008. Over the five-month period it was determined that 40% of domestic blogs are spam blogs.

A translation of the article reveals that the same technique used to identify spam in their samples will be used by the Japanese blog analytics services BuzzPulse and BuzzSeeQer. Note that when we last checked, the English blogosphere had a comparable, if not greater, number of splogs. Our study, however, looked at newly created blogs, whereas the Nifty study sampled all currently hosted blogs.

Note: If the researchers behind the Nifty spam filter are reading this, I’d be interested in following up.