Dynamically-generated .PDF files, instead of normal pages, indexed by and ranking in Google

fugu

Hi,

I have come across a tough problem. I am working on an online-store website that lets visitors view product details in .PDF format (by the way, the website is built on the Joomla CMS). Now when I search for my site's name in Google, the SERP displays my .PDF files in the first couple of positions (shown in the normal .PDF result format: [PDF]...), and I cannot find the normal pages on SERP #1 unless I search for the full site domain. I really don't want this! Could you please tell me how to diagnose the problem and solve it?

I can remove the component (Virtuemart) that is in charge of generating the .PDF files. Right now I am trying to redirect all the .PDF pages ranking in Google to a 404 page and remove the functionality. I also plan to regenerate my site's sitemap and submit it to Google. Will that work for me? I would really appreciate any help with this problem. Thanks very much.

Sincerely

SEOmoz Pro Member

TheEspresseo

Recently discovered this:

Indicate the canonical version of a URL by responding with the Link rel="canonical" HTTP header. Adding rel="canonical" to the head section of a page is useful for HTML content, but it can't be used for PDFs and other file types indexed by Google Web Search. In these cases you can indicate a canonical URL by responding with the Link rel="canonical" HTTP header, like this (note that to use this option, you'll need to be able to configure your server).

Link: <http://www.example.com/downloads/white-paper.pdf>; rel="canonical"

Google currently supports these link header elements for Web Search only.

-http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
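For this thread's case, the header would presumably name the HTML product page as the canonical version of each generated PDF. Below is a minimal Python sketch of building that header value. The function name `canonical_link_header` and the example product URL are hypothetical, and how the header actually gets attached to the PDF response depends on your server or Joomla setup, which is not covered here.

```python
from typing import Tuple
from urllib.parse import urlsplit


def canonical_link_header(canonical_url: str) -> Tuple[str, str]:
    """Build a (name, value) pair for a Link rel="canonical" response header."""
    parts = urlsplit(canonical_url)
    # This sketch only accepts absolute http(s) URLs.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("canonical URL must be an absolute http(s) URL")
    return ("Link", '<{}>; rel="canonical"'.format(canonical_url))


# Hypothetical HTML product page that the generated PDF duplicates.
name, value = canonical_link_header("http://www.example.com/products/widget.html")
print("{}: {}".format(name, value))
```

The returned pair would be added to the response that serves the PDF, not to the HTML page itself.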

TheEspresseo

I would consider either excluding the PDFs from the index with your robots.txt in conjunction with resubmitting your sitemap (which you're all over), or placing a text link at the bottom of each PDF pointing back to the HTML version of that page (which, all things being equal, should cause the HTML version of the page to rank instead). I am not sure about serving 404 headers to Google instead of the PDFs that are currently in the index. Why not 301 to the HTML version of each PDF? Obviously that can't be a permanent solution, as you will eventually want to restore the functionality to users, right? But it will tell Googlebot that the content of each PDF is to be found from here on out at the URL containing the HTML version. This is a case where it would be handy to serve one thing to the bots and another to the human viewers, but I am afraid that doing so could get you into trouble.
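The 301 idea can be sketched roughly as follows, in Python rather than anything Joomla-specific. The paths in `PDF_TO_HTML` and the names `resolve` and `app` are made up for illustration; a real site would need to derive the mapping from its own Virtuemart URL scheme, or do the same thing with server rewrite rules.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from generated PDF paths to their HTML product pages.
PDF_TO_HTML: Dict[str, str] = {
    "/products/widget.pdf": "/products/widget.html",
}


def resolve(path: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the status line and headers to send for a requested path."""
    target = PDF_TO_HTML.get(path)
    if target is None:
        # Unknown path: plain 404, no redirect.
        return ("404 Not Found", [("Content-Type", "text/plain")])
    # Known PDF: permanent redirect to the HTML version of the page.
    return ("301 Moved Permanently", [("Location", target)])


def app(environ, start_response):
    """Minimal WSGI wrapper around resolve()."""
    status, headers = resolve(environ.get("PATH_INFO", ""))
    start_response(status, headers)
    return [status.encode("ascii")]
```

Every visitor, bot or human, gets the same response here, so this avoids serving one thing to Googlebot and another to users.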

I am interested in your case though—let us know what, if anything besides the 404s and sitemap resubmittal, you end up trying and what happens with it. I'm also curious to know what other mozzers suggest.
