Thursday, June 23, 2011

Duplicate Test


If there's one issue that causes more contention, heartache, and consulting time than any other (at least, recently), it's duplicate content. This scourge of the modern search engine has its origins in the fairly benign realm of standard licensing and the occasional act of plagiarism. Over the last five years, however, spammers in desperate need of content began the now much-reviled practice of scraping content from legitimate sources, scrambling the words (through any number of complex processes), and re-purposing the text on their own pages in the hopes of attracting long-tail searches and serving contextual ads (among other nefarious purposes).

Thus, today, we're faced with a world of "duplicate content issues" and "duplicate content penalties." Luckily, my trusty illustrated Googlebot and I are here to help eliminate some of the confusion. But, before we get to the pretty pictures, we need some definitions:

  • Unique Content - written by humans, completely different from any other combination of letters, symbols, or words on the web, and clearly not manipulated through computer text-processing algorithms (like those crazy Markov-chain-employing spam tools); a rough sketch of how such similarity can be measured appears after this list.
  • Snippets - small chunks of content like quotes that are copied and re-used; these are almost never problematic for search engines, especially when included in a larger document with plenty of unique content.
  • Duplicate Content Issues - I typically use this phrase to refer to duplicate content that is not in danger of getting a website penalized, but is simply a copy of an existing page that forces the search engines to choose which version to display in the index.
  • Duplicate Content Penalty - When I refer to "penalties," I'm specifically talking about things the search engines do that are worse than simply removing a page from their index.
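
To make the line between unique content and duplicate content a bit more concrete, here's a minimal sketch of near-duplicate detection using word shingles and Jaccard similarity. This is a textbook technique, not Google's actual method (which has never been published in detail), and the sample text and any thresholds you'd pick are purely illustrative:

```python
# Minimal sketch of near-duplicate detection via word shingles and
# Jaccard similarity. Not Google's real pipeline; just the classic
# textbook approach to measuring text overlap.

def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "the quick brown fox jumps over the lazy dog near the river bank"
page_b = "the quick brown fox jumps over the lazy dog beside the river"

similarity = jaccard(shingles(page_a), shingles(page_b))
print(f"similarity: {similarity:.2f}")
# A high score suggests duplicate content; a low score with only a few
# shared shingles is what an ordinary quoted snippet looks like.
```

Note how this captures the "snippets" definition above: a short quote embedded in a long, otherwise-original document shares only a handful of shingles with its source, so its similarity score stays low.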

Now, let's look at the process for Google as it finds duplicate content on the web. In the examples below, I'm making a few assumptions:

  1. The page with text is assumed to be a page containing duplicate content (not just a snippet, despite the illustration).
  2. Each page of duplicate content is presumed to be on a separate domain.
  3. The steps below have been simplified to make the process as clear as possible. This is almost certainly not exactly how Google operates, but it conveys the effect nicely; a toy model of that effect follows this list.
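
Before the illustrations, here's a toy model of the effect the assumptions above describe: several URLs carrying the same text, with only one version surviving into the index. Everything here is invented for illustration; in particular, the "keep the first URL seen" tie-breaker is a placeholder for the many ranking signals a real engine would weigh when choosing which copy to display:

```python
# Toy model of the "duplicate content issue" effect: duplicate pages
# are filtered down to one indexed version, not penalized.

def build_index(pages):
    """pages: list of (url, text). Returns {fingerprint: chosen_url}."""
    index = {}
    for url, text in pages:
        fingerprint = " ".join(text.lower().split())  # crude normalization
        # Only the first page with a given fingerprint makes the index;
        # later duplicates are simply left out of the results.
        index.setdefault(fingerprint, url)
    return index

pages = [
    ("http://example-a.com/article", "An original piece of writing."),
    ("http://example-b.com/copy",    "An original piece of writing."),
    ("http://example-c.com/unique",  "Something nobody else has said."),
]

for fingerprint, url in build_index(pages).items():
    print(url)  # one URL per distinct piece of content
```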