Canonical annotation is very useful for dealing with duplicate content. However, until June 2011, when Google started supporting rel="canonical" HTTP headers, there wasn't really a way to apply canonical annotation to documents that have no HTML <head> section (such as PDF files).
This feature is still not widely used, largely because it is more complex to implement. Here is how it can be implemented, as described by this post.
HTTP Headers Using PHP (Text/HTML Types):
The rel="canonical" HTTP header can easily be added to any text/html page generated by PHP using the header() function. Combining the syntax shown in Google's documentation with PHP accomplishes this.
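A minimal sketch of that call is below. The helper name canonical_link_header() and the example.com URL are placeholders introduced for illustration, not part of any library:

```php
<?php
// Hypothetical helper: build the Link header value for a preferred URL.
function canonical_link_header($url)
{
    return 'Link: <' . $url . '>; rel="canonical"';
}

// Must run before any output (HTML, whitespace, echo) is sent to the client.
header(canonical_link_header('http://www.example.com/page.html'));
```

The helper only keeps the header syntax in one place; calling header() directly with the same string works just as well.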
Calling header() before any HTML is output appends a Link rel="canonical" HTTP header to the response headers before they are sent.
This mimics the rel="canonical" link tag. The only difference is that the preferred canonical URI is sent in an HTTP header rather than in a <link> tag. Traditionally the <link> tag has been by far the most popular choice of implementation. This function, however, will be used in the advanced section of this article.
HTTP Headers Using .htaccess (non-Text/HTML Types):
The HTTP header can be modified relatively easily using .htaccess for all content types, such as PDF files. This solution works well for sites with a relatively small number of files that need the header. In this example I'm showcasing an application/pdf content type.
This snippet of code, when implemented, adds an HTTP header to the PDF file pointing to an HTML page with the URL of /page.html.
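A minimal sketch of such a rule, assuming Apache with mod_headers enabled. The filename white-paper.pdf and the example.com domain are placeholders:

```apache
# Applies only to the file named white-paper.pdf
<Files "white-paper.pdf">
    # Sends: Link: <http://www.example.com/page.html>; rel="canonical"
    Header add Link "<http://www.example.com/page.html>; rel=\"canonical\""
</Files>
```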
The filename argument should be a filename or a wildcard string, where ? matches any single character and * matches any sequence of characters.
Regular expressions can also be used, with the addition of the ~ character.
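As an illustration of the regular-expression form, the sketch below assumes several versioned copies of the same PDF (white-paper.pdf, white-paper-v2.pdf, and so on) that should all point at one HTML page. The filenames and URL are placeholders:

```apache
# "~" switches <Files> to regular-expression matching
<Files ~ "^white-paper(-v[0-9]+)?\.pdf$">
    Header add Link "<http://www.example.com/page.html>; rel=\"canonical\""
</Files>
```

Apache's documentation notes that <FilesMatch> is the preferred way to write the same thing.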
Advanced Dynamic HTTP Header Implementation (non-Text/HTML Types):
The dynamic implementation of the HTTP header for application/pdf content types that I'm about to show you is more advanced and requires some knowledge of .htaccess and PHP, although I will provide the examples.
The first step is to create a PHP file to control the output of the PDF. This can most easily be done by rewriting a URL.
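A sketch of one way to write that rewrite, assuming mod_rewrite is enabled and the .htaccess file sits in the document root. The query parameter name file is an assumption chosen for this example:

```apache
RewriteEngine On
# Route every request ending in .pdf to pdf.php, passing the path
# (without the extension) as ?file=...
RewriteRule ^(.+)\.pdf$ /pdf.php?file=$1 [L,NC,QSA]
```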
This RewriteRule, when added to an .htaccess file, simply allows us to control the PDF file from a php file named “pdf.php”. Anytime a user or search engine tries to access a URL that contains the file extension “.pdf”, the pdf.php file will be referenced for instructions on how to display the file. This gives us the ability to perform conditional logic to add a canonical HTTP header.
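A hedged sketch of what pdf.php could look like, assuming the requested path arrives without its extension in a file query parameter (an assumption of this example). The helper resolve_pdf() is hypothetical and exists only to keep requests inside the document root:

```php
<?php
// Hypothetical helper: map a requested name to a real PDF inside $root,
// or return null if it does not exist or escapes the root directory.
function resolve_pdf($root, $name)
{
    $rootReal = realpath($root);
    $path = realpath($root . '/' . $name . '.pdf');
    if ($rootReal === false || $path === false || !is_file($path)) {
        return null;
    }
    $prefix = rtrim($rootReal, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    // Reject paths such as "../secret" that resolve outside the root.
    return strpos($path, $prefix) === 0 ? $path : null;
}

$name = isset($_GET['file']) ? $_GET['file'] : '';
$path = resolve_pdf($_SERVER['DOCUMENT_ROOT'], $name);

if ($path !== null) {
    // Conditional logic goes here: decide whether this PDF needs a
    // canonical header and which URL it should point to, e.g.
    // header('Link: <http://www.example.com/page.html>; rel="canonical"');

    header('Content-Type: application/pdf');
    header('Content-Length: ' . filesize($path));
    readfile($path);
} else {
    // No such PDF: answer with a 404 status.
    http_response_code(404);
}
```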
This snippet of code, when added to pdf.php, checks whether the requested PDF file exists. If it does, we can perform logic to add the canonical HTTP header; otherwise, we return a 404.
As you can see, the conditional logic has been commented out and does not currently do anything. This is where I leave it in your hands to create your own logic based on your needs.
For example, you may have a CSV or TXT file you would like to read from. That file could contain a list of PDF files and the corresponding URLs that the HTTP header should point to. You may also have tables or columns in a database that you need to query to find the preferred URL for each PDF. There are many possibilities.
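As one possible illustration of the CSV approach, the sketch below assumes a two-column file (PDF name, preferred URL). The file name canonicals.csv, the helper lookup_canonical(), and the sample values are all hypothetical:

```php
<?php
// Hypothetical helper: return the preferred URL listed for $pdfName in a
// two-column CSV (column 1: PDF name, column 2: canonical URL), or null.
function lookup_canonical($csvPath, $pdfName)
{
    $handle = @fopen($csvPath, 'r');
    if ($handle === false) {
        return null;
    }
    $url = null;
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        // Blank or malformed rows have fewer than two columns; skip them.
        if (count($row) >= 2 && $row[0] === $pdfName) {
            $url = trim($row[1]);
            break;
        }
    }
    fclose($handle);
    return $url;
}

// Example use inside the conditional block of pdf.php:
$canonical = lookup_canonical(__DIR__ . '/canonicals.csv', 'white-paper.pdf');
if ($canonical !== null) {
    header('Link: <' . $canonical . '>; rel="canonical"');
}
```

PDFs that are not listed in the file are simply served without a canonical header.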
It is also important to note the two additional headers added for the content type and content length. These are necessary for proper output. If we do not set the content type to application/pdf, PHP falls back to its default of text/html and the file will be treated as an HTML document.