Martin Splitt Explains How Google Selects Canonical Pages
- 5 November, 2020
- Jason Ferry
- SEO Services
Martin Splitt recently told providers of SEO services how Google identifies duplicate web pages and how it decides which page in a group of duplicates is treated as canonical and shown in the search engine results pages (SERPs). This information gave many small business SEO service providers important insights into how the Google algorithm handles canonicalisation.
In a podcast, Splitt explained that there are more than 20 different signals, each weighted, that are used to detect the canonical page. He also went into detail about why machine learning is used to adjust the weights.
Splitt first described how websites are crawled and how documents are indexed. He then went into detail about how Google detects duplicate pages and chooses a canonical page among them.
He said that Google collects the signals first, then detects the duplicate pages and clusters them together. It then finds a leader page for each cluster. To make this possible, it reduces the content of each page to a checksum, or hash, and compares it with the checksums of other pages.
Comparing checksums makes the task of clustering duplicate pages much faster and easier than comparing thousands of words of text.
Google reduces content to a checksum because it does not want to spend too much time and too many resources scanning the whole text. So, it calculates several kinds of checksums from the textual content of a page before comparing them with the checksums of other pages.
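The idea of reducing a page to a checksum can be illustrated with a short Python sketch. This is not Google's implementation, which the podcast does not describe in detail. The function name `content_checksum`, the choice of SHA-256 and the whitespace and case normalisation are all assumptions made for the example.

```python
import hashlib
import re


def content_checksum(text: str) -> str:
    """Reduce page text to a fixed-length fingerprint (hypothetical helper)."""
    # Collapse whitespace and lower-case the text so that trivial
    # formatting differences do not change the checksum.
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    # SHA-256 is an assumption; any stable hash would illustrate the point.
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


page_a = "Canonical pages   explained.\nGoogle picks one leader."
page_b = "canonical pages explained. google picks one leader."
page_c = "A completely different article."

# Comparing two short hex strings stands in for comparing the full texts.
is_duplicate = content_checksum(page_a) == content_checksum(page_b)
is_different = content_checksum(page_a) != content_checksum(page_c)
```

A plain hash like this only catches pages whose normalised text is identical. Pages that differ by even one word produce unrelated checksums, so near-duplicates need a different kind of fingerprint.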
When it comes to exact duplicates and near-duplicates, Splitt says Google’s algorithms can catch both. Some of them try to detect the boilerplate on a page and remove it before the comparison is made. The algorithms then check whether the checksums are identical or fairly similar to each other, and bring the matching pages together in a duplicate cluster.
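As a rough illustration of near-duplicate clustering, the sketch below strips known boilerplate lines, turns the remaining text into overlapping word triples (shingles) and groups pages whose shingle sets are similar enough. Shingling with Jaccard similarity is a swapped-in textbook technique, not the method Splitt described. The `BOILERPLATE` set, the `0.6` threshold, the function names and the example URLs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical boilerplate lines (navigation, footer) to ignore.
BOILERPLATE = {"home | about | contact", "copyright example ltd"}


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping word n-grams, ignoring boilerplate lines."""
    lines = [line.strip().lower() for line in text.splitlines()]
    words = " ".join(l for l in lines if l and l not in BOILERPLATE).split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Share of shingles the two pages have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages: Dict[str, str], threshold: float = 0.6) -> List[List[str]]:
    """Greedily put each page into the first cluster whose first page it resembles."""
    fingerprints = {url: shingles(text) for url, text in pages.items()}
    clusters: List[List[str]] = []
    for url in pages:
        for cluster in clusters:
            if jaccard(fingerprints[url], fingerprints[cluster[0]]) >= threshold:
                cluster.append(url)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([url])
    return clusters


pages = {
    "https://example.com/a": "Home | About | Contact\n"
    "Google groups duplicate pages into clusters before it picks a canonical page",
    "http://example.com/a?ref=1": "Google groups duplicate pages into clusters "
    "before it picks a canonical page today\nCopyright Example Ltd",
    "https://example.com/b": "A recipe for lentil soup with cumin and fresh coriander leaves",
}
clusters = cluster_pages(pages)
```

Here the first two URLs end up in the same cluster even though their text is not identical, while the unrelated third page forms a cluster of its own.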
Once all the duplicates form one big cluster, Google selects only one document to display in the SERP.
Providers of SEO services may wonder why Google avoids showing duplicate web pages in the SERP. The reason is that users dislike seeing the same content repeated across many search results. Moreover, keeping only one page per cluster saves storage space in the index.
The hardest part is choosing the leader of the cluster, which is why Google uses more than twenty signals to select which web page to show as canonical from the group of duplicates.
These signals are factors that help determine which page among the duplicates is the best one to show in the SERP. For instance, one signal is the webpage content. Another is PageRank: the higher a page's PageRank, the more likely it is to be chosen.
Each signal has its own weight. Rather than setting these weights by hand, Google uses machine learning to calculate and adjust them, which keeps them more accurate than manual tuning would.
As for redirects, they are usually given a heavier weight than http/https URL signals. Splitt explains that a redirect must carry more weight than the http/https signal because users will eventually end up at the redirect target. For the same reason, Google does not include the redirect source in the SERP.
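The weighting described above can be sketched as a simple weighted sum over the signals of each page in a cluster. Everything specific here is assumed for illustration: the signal names, the numbers in `WEIGHTS` and the `Candidate` class are hypothetical, and the weights are hand-written rather than learned by a machine learning model as Google's are. The only property taken from the article is that the redirect signal outweighs the http/https signal.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical weights; the redirect signal is deliberately heavier than https.
WEIGHTS: Dict[str, float] = {
    "is_redirect_target": 5.0,
    "rel_canonical": 3.0,
    "pagerank": 2.0,
    "is_https": 1.0,
}


@dataclass
class Candidate:
    url: str
    signals: Dict[str, float]  # signal name -> value, for example 0 or 1


def score(candidate: Candidate) -> float:
    """Weighted sum of the candidate's signals; unknown signals count as zero."""
    return sum(WEIGHTS.get(name, 0.0) * value
               for name, value in candidate.signals.items())


def pick_canonical(cluster: List[Candidate]) -> Candidate:
    """Return the highest-scoring page in a cluster of duplicates."""
    return max(cluster, key=score)


cluster = [
    # The page other URLs redirect to, served over plain http.
    Candidate("http://example.com/page",
              {"is_redirect_target": 1, "is_https": 0, "pagerank": 0.4}),
    # The redirect source, served over https.
    Candidate("https://example.com/old-page",
              {"is_redirect_target": 0, "is_https": 1, "pagerank": 0.5}),
]
leader = pick_canonical(cluster)
```

With these assumed weights the redirect target wins even though the other URL uses https and has a slightly higher PageRank value, which mirrors the ordering Splitt describes.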
Canonical links are essential for businesses and small business SEO services because they indicate which URL should be shown to users in the SERP. Moreover, search engines do not like duplicate content, and canonical tags help them identify which page should be ranked or shown to users.
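A canonical tag is a `link` element with `rel="canonical"` in the head of a page. As a minimal sketch for checking which URL a page declares as canonical, the example below reads that element with Python's standard `html.parser` module. The class name `CanonicalFinder`, the helper `find_canonical` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated values, so split before checking.
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the declared canonical URL, or None if the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


sample = (
    '<html><head><title>Blue widgets</title>'
    '<link rel="canonical" href="https://example.com/widgets/blue">'
    '</head><body>Blue widgets</body></html>'
)
declared = find_canonical(sample)
```

Running a check like this across a site's duplicate URLs is one way to confirm that they all point to the same preferred page.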
Here at Position1SEO, we make sure that your website is filled with high-quality content that is both authoritative and compelling. If you choose to work with us, you can be assured of unique content that engages your users and effectively promotes your products and services.
Work with our SEO professionals today! Send us an email at office@position1seo.co.uk or call us on 0141 846 0114.