Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Will an XML sitemap override a robots.txt
-
I have a client that has a robots.txt file that is blocking an entire subdomain, entirely by accident. Their original solution, not realizing the robots.txt error, was to submit an xml sitemap to get their pages indexed.
I did not think this tactic would work, as the robots.txt would take precedent over the xmls sitemap. But it worked... I have no explanation as to how or why.
Does anyone have an answer to this? or any experience with a website that has had a clear Disallow: / for months , that somehow has pages in the index?
-
The robots file will avoid google to show further information on the disallowed pages but it doesn't prevent indexation.
They're still indexed (that's why you're seeing them) but with no meta desc nor text taken from the page because google wasn't allowed to retrieve more information.
If you want them to start showing info, you'll jsut need to remove that rule from the robots.txt and soon you'll start seeing those pages information showing, but if you want them out of the index you can use GWT to remove them from the index after you've included in each page the noindex meta tag which is the only command which will prevent indexation.
-
I assumed the same thing, but I performed a site command search while they were prospects, and they had 1 result present with the explanation of "A description for this result is not available because of this site's robots.txt – learn more"
They uploaded an xml sitemap before I could tell them to remove the robots.txt. and 1 week later, the entire site is now in the index.
I have used the robots.txt to properly block websites, it usually takes 2-3 for all results to drop out the index, so I don't know how that could explain it either.
-
I agree, the only way I could think this would work would be if the robotx.txt file was on the root domain. I agree, check Webmaster tools, they will tell you under the sitemaps section about "Error: URL was blocked by robots.txt).
One thing to remember is that robots.txt is technically a suggestion to ask search engines not to crawl your site. They can choose to ignore it, though personally I don't know of any cases in which this happenned.
-
An XML sitemap shouldn't override robots.txt. If you have Google Webmaster Tools setup, you will see warnings on the sitemaps page that pages being blocked by robots are being submitted.
Now, robots.txt does not prevent indexation, just crawling. So if the pages were indexed before they implemented robots.txt, they may continue to be indexed. Google will also display just the URL for pages that it's discovered, but can't crawl because of robots.txt.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How does changing sitemaps affect SEO
Hi all, I have a question regarding changing the size of my sitemaps. Currently I generate sitemaps in batches of 50k. A situation has come up where I need to change that size to 15k in order to be crawled by one of our licensed services. I haven't been able to find any documentation on whether or not changing the size of my sitemaps(but not the pages included in them) will affect my rankings negatively or my SEO efforts in general. If anyone has any insights or has experienced this with their site please let me know!
Technical SEO | | Jason-Reid0 -
Robots.txt & meta noindex--site still shows up on Google Search
I have set up my robots.txt like this: User-agent: *
Technical SEO | | RoxBrock
Disallow: / and I have this meta tag in my on a Wordpress site, set up with SEO Yoast name="robots" content="noindex,follow"/> I did "Fetch as Google" on my Google Search Console My website is still showing up in the search results and it says this: "A description for this result is not available because of this site's robots.txt" This site has not shown up for years and now it is ranking above my site that I want to rank for this keyword. How do I get Google to ignore this site? This seems really weird and I'm confused how a site with little content, that has not been updated for years can rank higher than a site that is constantly updated and improved.1 -
Robots.txt to disallow /index.php/ path
Hi SEOmoz, I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?. I don't use that extension, but would it cause me any problems from an SEO perspective? How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
Technical SEO | | Mikkehl0 -
Oh no googlebot can not access my robots.txt file
I just receive a n error message from google webmaster Wonder it was something to do with Yoast plugin. Could somebody help me with troubleshooting this? Here's original message Over the last 24 hours, Googlebot encountered 189 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%. Recommended action If the site error rate is 100%: Using a web browser, attempt to access http://www.soobumimphotography.com//robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to googlebot. If your robots.txt is a static page, verify that your web service has proper permissions to access the file. If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure. If the site error rate is less than 100%: Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors. The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website. After you think you've fixed the problem, use Fetch as Google to fetch http://www.soobumimphotography.com//robots.txt to verify that Googlebot can properly access your site.
Technical SEO | | BistosAmerica0 -
HTML Sitemap Pagination?
Im creating an a to z type directory of internal pages within a site of mine however there are cases where there are over 500 links within the pages. I intend to use pagination (rel=next/prev) to avoid too many links on the page but am worried about indexation issues. should I be worried?"
Technical SEO | | DMGoo0 -
XML Sitemap without PHP
Is it possible to generate an XML sitemap for a site without PHP? If so, how?
Technical SEO | | jeffreytrull11 -
Invisible robots.txt?
So here's a weird one... Client comes to me for some simple changes, turns out there are some major issues with the site, one of which is that none of the correct content pages are showing up in Google, just ancillary (outdated) ones. Looks like an issue because even the main homepage isn't showing up with a "site:domain.com" So, I add to Webmaster Tools and, after an hour or so, I get the red bar of doom, "robots.txt is blocking important pages." I check it out in Webmasters and, sure enough, it's a "User agent: * Disallow /" ACK! But wait... there's no robots.txt to be found on the server. I can go to domain.com/robots.txt and see it but nothing via FTP. I upload a new one and, thankfully, that is now showing but I've never seen that before. Question is: can a robots.txt file be stored in a way that can't be seen? Thanks!
Technical SEO | | joshcanhelp0 -
Should I set up a disallow in the robots.txt for catalog search results?
When the crawl diagnostics came back for my site its showing around 3,000 pages of duplicate content. Almost all of them are of the catalog search results page. I also did a site search on Google and they have most of the results pages in their index too. I think I should just disallow the bots in the /catalogsearch/ sub folder, but I'm not sure if this will have any negative effect?
Technical SEO | | JordanJudson0