Any content that appears in more than one place on the internet is considered duplicate content. So if you find the same content present on two or more websites, treat it as duplicate.
What Types of Content Count as Duplicate Content?
There are different types of duplicate content, and not all of it is created deliberately. Some duplication is the result of certain technical aspects of a website.
Boilerplate Content
Boilerplate content is content that appears on multiple pages of the same website. For example, the homepage of most websites consists of three main elements: the header, the footer, and the sidebar or navigation bar. In addition, some websites show recent posts on their homepages. When the Google bot crawls such a site, it may find a new blog post in more than one place on the website, so the post becomes duplicate content.
Copied Content/Scraped Content
Copying content from a site without the permission of the owner is known as copied content. Content scraping is extracting information from a website using automated software. There's still much confusion about content scraping, and Google practices a form of it as well by showing content as featured snippets. However, since the Panda update, any type of scraping activity is liable to be penalized.
Content Curation
Content curation is taking information from the web and writing a piece of content using the stats and facts gathered from those sources. Google doesn't consider this spam as long as you rewrite the content in your own words or credit the original source.
Content Syndication
Content syndication is the practice of pushing content to third-party sites as snippets, links, or full content pieces. Sites that syndicate content allow it to be republished on multiple sites, which means several copies of a syndicated post exist on the web. Sites like HuffingtonPost and Medium allow content syndication.
Does Duplicate Content Affect SEO?
For search engines like Google and Bing, duplicate content creates two problems. First, the engine must decide which version of the content to treat as the original and rank for search queries. Second, it must decide whether to consolidate link metrics like trust, authority, and link equity on one page or distribute them among the multiple versions.
When a site contains duplicate content, site owners can suffer ranking and traffic losses. This happens mainly because search engines, unsure which of several versions of the same content to show, typically surface only one of them, diluting the visibility of each duplicate.
Duplicate content also affects link equity, because other sites linking to the content must choose one of its versions. This splits inbound links across the duplicates. Since inbound links are a ranking factor, this can hurt the online visibility of the content on every website where it exists. The net result is that none of the copies ranks well in the SERP.
What Causes Duplicate Content?
Duplicate content can happen due to many reasons, the main one being technical. Let us take a look at the common causes below:
Misunderstanding the Concept of a URL
In the CMS database that powers a website, there’s probably only a single article, but the website’s software may allow the same article in the database to be retrieved through more than one URL. For the CMS, the article is identified by a unique ID in the database, but for search engines, the URL acts as an identifier. Hence, with multiple versions of the same content present in different URLs, the issue of duplicate content arises.
Session IDs
Session IDs are used to track visitors on a site and allow them to store items in their wishlist or shopping cart. To do that, you need to give each user an individual session: a brief history of the activities that the visitor performs on your site. The most common way to store session IDs is in cookies. However, search engine crawlers generally don't accept cookies, so some systems fall back to putting the session ID in the URL. Every internal link on the website then gets that session ID appended to its URL. Because each session ID is unique, every session creates a new set of URLs, resulting in duplicate content.
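To see why session IDs multiply URLs, here is a minimal sketch in Python. The parameter name `sid` and the URLs are assumptions for illustration; the point is that stripping the session parameter collapses every visitor's links back to one canonical URL:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def strip_session_id(url, param="sid"):
    """Remove a session-ID query parameter (name assumed to be 'sid')
    so per-visitor links collapse to a single canonical URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))

# Two visitors, two session IDs -- to a crawler, two distinct URLs
# for the very same page:
a = "https://www.example.com/cart?sid=abc123"
b = "https://www.example.com/cart?sid=xyz789"
print(a == b)                                        # False
print(strip_session_id(a) == strip_session_id(b))    # True
```

In practice, the fix is applied on the server (or via canonical tags) rather than in a script like this, but the normalization idea is the same.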
URL Parameters Used for Tracking & Sorting
Another technical cause of duplicate content is the use of URL parameters that do not change the content of a page. For example, http://www.example.com/keyword-x/ and http://www.example.com/keyword-x/?source=rss are two different URLs to a search engine. The latter makes it easier for you to track which source your visitors came from, but to the search engine it creates a duplicate of the page.
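A small Python sketch makes the crawler's view concrete. The tracking parameter names below (`source` and the `utm_*` family) are assumptions; sites use many others:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Assumed tracking-only parameters that never change page content:
TRACKING = {"source", "utm_source", "utm_medium", "utm_campaign"}

def drop_tracking(url):
    """Strip tracking-only parameters, keeping ones that alter content."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(drop_tracking("http://www.example.com/keyword-x/?source=rss"))
# -> http://www.example.com/keyword-x/
print(drop_tracking("http://www.example.com/keyword-x/?page=2"))
# -> http://www.example.com/keyword-x/?page=2  (content-changing param kept)
```

The same collapsing is usually expressed declaratively with a rel="canonical" tag pointing at the parameter-free URL, rather than in application code.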
Scrapers & Content Syndication
Sometimes, websites use content from another site without mentioning the source. In that case, search engines become unsure about which version to consider original and show in the search results. This type of content scraping can affect both sites: the one scraping the content and the one it was scraped from.
Order of Parameters
A CMS doesn't always use clean URLs; it may build them from a category and an ID, such as /?id=1&cat=2. On some systems, entering /?cat=2&id=1 instead of /?id=1&cat=2 will show you the same page, but to a search engine these are two entirely different URLs. If your site serves the same content at different URLs like this, you should set a canonical URL for them rather than blocking crawling.
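The parameter-order problem can be neutralized by sorting the query string into a fixed order, as in this minimal Python sketch (the `/?id=1&cat=2` URLs are the article's own example):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def canonical_query(url):
    """Sort query parameters so equivalent URLs compare equal."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit(parts._replace(query=query))

print(canonical_query("/?id=1&cat=2"))                               # /?cat=2&id=1
print(canonical_query("/?cat=2&id=1") == canonical_query("/?id=1&cat=2"))  # True
```

A CMS that always emits links in one canonical parameter order, and 301-redirects or canonicalizes the rest, avoids the duplicate entirely.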
Comment Pagination
A CMS like WordPress offers the option to paginate comments. This leads to the article's content being duplicated across the main article URL and each of its comment pages.
WWW vs. Non-WWW
This is one of the most prevalent causes of duplicate content across a website. When your content is accessible at both the www and non-www versions of your domain, the search engine will consider it duplicate content. The same problem arises with HTTP and HTTPS versions of a page.
Is There a Penalty for Duplicate Content on a Website?
Duplicate content is different from copied content when it comes down to context. While copying content is done consciously, duplicate content may arise from technical faults, as mentioned above. Google's John Mueller has stated that the search engine doesn't penalize a site for duplicate content, but if you have millions of such pages on your site, you're courting risk.
Google rewards websites with high-quality original content. If you try to manipulate existing content by republishing it on your site, altering a few sentences, or swapping in a few new keywords, it still adds no value for users. The safest way to protect your SEO rankings as a website owner is to avoid copying content from other sites and to avoid repeating content across your own website.
How Much Duplicate Content is Acceptable?
According to Matt Cutts, 25% to 30% of the web consists of duplicate content. In his view, Google doesn't consider duplicate content spam, and it won't lead your site to be penalized unless it is intended to manipulate the search results. The real risk of duplicate content is that even if your site published it first, other websites that copied the content may show up instead in the results for related search queries.
To prevent someone from using a copied version of your content, you can file a request for removal under the Digital Millennium Copyright Act. While Google tries to find the original source of the content to show up in the search results, blocking access to duplicate content pieces might hinder the search engine’s ability to crawl all the versions and filter the best results.
How to Deal With Duplicate Content: Google Recommended Solutions
Here are some practical ways to tackle content duplication on the web:
Use 301 Redirects
If your site has been restructured, use 301 redirects in your .htaccess file to redirect users, Google bots, and other spiders. This signals to the search engine which URL to prioritize over the others.
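As a sketch of what such .htaccess rules can look like (Apache with mod_rewrite assumed; all domains and paths are placeholders), the following also collapses the www/non-www and HTTP/HTTPS variants mentioned earlier into one preferred URL:

```apache
# Hypothetical .htaccess sketch -- adapt hosts and paths to your site.
RewriteEngine On

# Force HTTPS and the www host, collapsing four URL variants into one:
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]

# After a restructure, map an old path permanently to its new home:
Redirect 301 /old-page/ https://www.example.com/new-page/
```

The R=301 flag marks the redirect as permanent, which tells search engines to transfer the old URL's signals to the new one.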
Be Consistent & Use Top Level Domains
Try keeping your internal linking as consistent as possible. To help Google offer the most appropriate version of a piece of content, using top-level domains is highly recommended to handle country-specific content.
Syndicate Carefully
If you syndicate your content to other sites, Google will always show the version it thinks is most appropriate for users, which may not be the version you prefer. It helps if every site that syndicates your content links back to the original article. You can also ask sites using your syndicated content to add a noindex meta tag so that search engines like Google don't index their copy.
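On the syndicating site, either of these head elements does the job (a sketch; the URL is a placeholder for the original article):

```html
<!-- Option 1: keep the syndicated copy out of the index entirely: -->
<meta name="robots" content="noindex">

<!-- Option 2: point search engines at the original as the canonical version: -->
<link rel="canonical" href="https://www.example.com/original-article/">
```

Option 1 removes the copy from search results; option 2 lets it remain published while consolidating ranking signals on the original.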
Minimize Boilerplate Repetition
If you place a long copyright notice at the bottom of every piece you publish, replace it with a short summary that links to a page containing the full details. You can also use the Parameter Handling tool to tell Google specifically how to treat your URL parameters.
Avoid Publishing Stubs
Users don't like landing on blank pages with no content on them. This wastes their time and hurts the user experience, which Google considers very important. So don't publish pages on your website without content. If you must publish such pages, prevent them from being indexed with the noindex meta tag.
Understand Your CMS
Get familiar with your Content Management System and understand how content is published on your site. Blogs and forums often show the same content in more than one format. For example, a new blog post may appear on the homepage of a website and also on its category page.
Minimize Content Similarity
If you have two or more pages that are similar, consider making each piece of content unique by adding valuable material, or merge them into one page wherever possible.
Duplicate content is widespread on the web, so keep an eye on your website to catch duplication issues early. If content is copied from your site to another, you can always take legal action under the Digital Millennium Copyright Act. You will notice a real difference in your website's ranking and performance just by resolving duplicate content issues. So don't take the risk: focus on developing quality content for your website.