Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

GET, POST, and safely surfacing more of the web

Tuesday, November 01, 2011 at 3:53 PM

Webmaster Level: Intermediate to Advanced

As the web evolves, Google’s crawling and indexing capabilities also need to progress. We improved our indexing of Flash, built a more robust infrastructure called Caffeine, and we even started crawling forms where it makes sense. Now, especially with the growing popularity of JavaScript and, with it, AJAX, we’re finding more web pages requiring POST requests -- either for the entire content of the page or because the pages are missing information and/or look completely broken without the resources returned from POST. For Google Search this is less than ideal, because when we’re not properly discovering and indexing content, searchers may not have access to the most comprehensive and relevant results.

We generally advise to use GET for fetching resources a page needs, and this is by far our preferred method of crawling. We’ve started experiments to rewrite POST requests to GET, and while this remains a valid strategy in some cases, often the contents returned by a web server for GET vs. POST are completely different. Additionally, there are legitimate reasons to use POST (e.g., you can attach more data to a POST request than a GET). So, while GET requests remain far more common, to surface more content on the web, Googlebot may now perform POST requests when we believe it’s safe and appropriate.

We take precautions to avoid performing any task on a site that could result in executing an unintended user action. Our POSTs are primarily for crawling resources that a page requests automatically, mimicking what a typical user would see when they open the URL in their browser. This will evolve over time as we find better heuristics, but that’s our current approach.

Let’s run through a few POSTs request scenarios that demonstrate how we’re improving our crawling and indexing to evolve with the web.

Examples of Googlebot’s POST requests
  • Crawling a page via a POST redirect
  • <html>
      <body onload="document.foo.submit();">
        <form name="foo" action="request.php" method="post">
          <input type="hidden" name="bar" value="234"/>
        </form>
      </body>
    </html>


  • Crawling a resource via a POST XMLHttpRequest
    In this step-by-step example, we improve both the indexing of a page and its Instant Preview by following the automatic XMLHttpRequest generated as the page renders.

    1. Google crawls the URL, yummy-sundae.html.
    2. Google begins indexing yummy-sundae.html and, as a part of this process, decides to attempt to render the page to better understand its content and/or generate the Instant Preview.
    3. During the render, yummy-sundae.html automatically sends an XMLHttpRequest for a resource, hot-fudge-info.html, using the POST method.
      <html>
        <head>
          <title>Yummy Sundae</title>
          <script src="jquery.js"></script>
        </head>
        <body>
          This page is about a yummy sundae.
          <div id="content"></div>
          <script type="text/javascript">
            $(document).ready(function() {
              $.post('hot-fudge-info.html', function(data)
                {$('#content').html(data);});
            });
          </script>
        </body>
      </html>
    4. The URL requested through POST, hot-fudge-info.html, along with its data payload, is added to Googlebot’s crawl queue.
    5. Googlebot performs a POST request to crawl hot-fudge-info.html.
    6. Google now has an accurate representation of yummy-sundae.html for Instant Previews. In certain cases, we may also incorporate the contents of hot-fudge-info.html into yummy-sundae.html.
    7. Google completes the indexing of yummy-sundae.html.
    8. User searches for [hot fudge sundae].
    9. Google’s algorithms can now better determine how yummy-sundae.html is relevant for this query, and we can properly display a snapshot of the page for Instant Previews.
Improving your site’s crawlability and indexability
General advice for creating crawlable sites is found in our Help Center. For webmasters who want to help Google crawl and index their content and/or generate the Instant Preview, here are a few simple reminders:
  • Prefer GET for fetching resources, unless there’s a specific reason to use POST.

  • Verify that we're allowed to crawl the resources needed to render your page. In the example above, if hot-fudge-info.html is disallowed by robots.txt, Googlebot won't fetch it. More subtly, if the JavaScript code that issues the XMLHttpRequest is located in an external .js file disallowed by robots.txt, we won't see the connection between yummy-sundae.html and hot-fudge-info.html, so even if the latter is not disallowed itself, that may not help us much. We've seen even more complicated chains of dependencies in the wild. To help Google better understand your site it's almost always better to allow Googlebot to crawl all resources.

    You can test whether resources are blocked through Webmaster Tools “Labs -> Instant Previews.”

  • Make sure to return the same content to Googlebot as is returned to users’ web browsers. Cloaking (sending different content to Googlebot than to users) is a violation of our Webmaster Guidelines because, among other things, it may cause us to provide a searcher with an irrelevant result -- the content the user views in their browser may be a complete mismatch from what we crawled and indexed. We’ve seen numerous POST-request examples where a webmaster non-maliciously cloaked (which is still a violation), and their cloaking -- on even the smallest of changes -- then caused JavaScript errors that prevented accurate indexing and completely defeated their reason for cloaking in the first place. Summarizing, if you want your site to be search-friendly, cloaking is an all-around sticky situation that’s best to avoid.

    To verify that you're not accidentally cloaking, you can use Instant Previews within Webmaster Tools, or try setting the User-Agent string in your browser to something like:

    Mozilla/5.0 (compatible; Googlebot/2.1;
      +http://www.google.com/bot.html)

    Your site shouldn't look any different after such a change. If you see a blank page, a JavaScript error, or if parts of the page are missing or different, that means that something's wrong.

  • Remember to include important content (i.e., the content you’d like indexed) as text, visible directly on the page and without requiring user-action to display. Most search engines are text-based and generally work best with text-based content. We’re always improving our ability to crawl and index content published in a variety of ways, but it remains a good practice to use text for important information.
Controlling your content
If you’d like to prevent content from being crawled or indexed for Google Web Search, traditional robots.txt directives remain the best method. To prevent the Instant Preview for your page(s), please see our Instant Previews FAQ which describes the “Google Web Preview” User-Agent and the nosnippet meta tag.

Moving forward
We’ll continue striving to increase the comprehensiveness of our index so searchers can find more relevant information. And we expect our crawling and indexing capability to improve and evolve over time, just like the web itself. Please let us know if you have questions or concerns.

Raising awareness of cross-domain URL selections

Monday, October 31, 2011 at 9:30 AM

Webmaster level: Advanced

A piece of content can often be reached via several URLs, not all of which may be on the same domain. A common example we’ve talked about over the years is having the same content available on more than one URL, an issue known as duplicate content. When we discover a group of pages with duplicate content, Google uses algorithms to select one representative URL for that content. A group of pages may contain URLs from the same site or from different sites. When the representative URL is selected from a group with different sites the selection is called a cross-domain URL selection. To take a simple example, if the group of URLs contains one URL from a.com and one URL from b.com and our algorithms select the URL from b.com, the a.com URL may no longer be shown in our search results and may see a drop in search-referred traffic.

Webmasters can greatly influence our algorithms’ selections using one of the currently supported mechanisms to indicate the preferred URL, for example using rel="canonical" elements or 301 redirects. In most cases, the decisions our algorithms make in this regard correctly reflect the webmaster’s intent. However, in some rare cases we’ve also found many webmasters are confused as to why it has happened and what they can do if they believe the selection is incorrect.

To be transparent about cross-domain URL selection decisions, we’re launching new Webmaster Tools messages that will attempt to notify webmasters when our algorithms select an external URL instead of one from their website. The details about how these messages work are in our Help Center article about the topic, and in this blog post we’ll discuss the different scenarios in which you may see a cross-domain URL selection and what you can do to fix any selections you believe are incorrect.

Common causes of cross-domain URL selection

There are many scenarios that can lead our algorithms to select URLs across domains.

In most cases, our algorithms select a URL based on signals that the webmaster implemented to influence the decision. For example, a webmaster following our guidelines and best practices for moving websites is effectively signalling that the URLs on their new website are the ones they prefer for Google to select. If you’re moving your website and see these new messages in Webmaster Tools, you can take that as confirmation that our algorithms have noticed.

However, we regularly see webmasters ask questions when our algorithms select a URL they did not want selected. When your website is involved in a cross-domain selection, and you believe the selection is incorrect (i.e. not your intention), there are several strategies to improve the situation. Here are some of the common causes of unexpected cross-domain URL selections that we’ve seen, and how to fix them:

  1. Duplicate content, including multi-regional websites: We regularly see webmasters use substantially the same content in the same language on multiple domains, sometimes inadvertently and sometimes to geotarget the content. For example, it’s common to see a webmaster set up the same English language website on both example.com and example.net, or a German language website hosted on a.de, a.at, and a.ch.

    Depending on your website and your users, you can use one of the currently-supported canonicalization techniques to signal to our algorithms which URLs you wish selected. Please see the following articles about this topic:

  2. Configuration mistakes: Certain types of misconfigurations can lead our algorithms to make an incorrect decision. Examples of misconfiguration scenarios include:
    1. Incorrect canonicalization: Incorrect usage of canonicalization techniques pointing to URLs on an external website can lead our algorithms to select the external URLs to show in our search results. We’ve seen this happen with misconfigured content management systems (CMS) or CMS plugins installed by the webmaster.

      To fix this kind of situation, find how your website is incorrectly indicating the canonical URL preference (e.g. through incorrect usage of a rel="canonical" element or a 301 redirect) and fix that.

    2. Misconfigured servers: Sometimes we see hosting misconfigurations where content from site a.com is returned for URLs on b.com. A similar case occurs when two unrelated web servers return identical soft 404 pages that we may fail to detect as error pages. In both situations we may assume the same content is being returned from two different sites and our algorithms may incorrectly select the a.com URL as the canonical of the b.com URL.

      You will need to investigate which part of your website’s serving infrastructure is misconfigured. For example, your server may be returning HTTP 200 (success) status codes for error pages, or your server might be confusing requests across different domains hosted on it. Once you find the root cause of the issue, work with your server admins to correct the configuration.

  3. Malicious website attacks: Some attacks on websites introduce code that can cause undesired canonicalization. For example, the malicious code might cause the website to return an HTTP 301 redirect or insert a cross-domain rel="canonical" link element into the HTML <head> or HTTP header, usually pointing to an external URL hosting malicious content. In these cases our algorithms may select the malicious or spammy URL instead of the URL on the compromised website.

    In this situation, please follow our guidance on cleaning your site and submit a reconsideration request when done. To identify cloaked attacks, you can use the Fetch as Googlebot function in Webmaster Tools to see your page’s content as Googlebot sees it.

In rare situations, our algorithms may select a URL from an external site that is hosting your content without your permission. If you believe that another site is duplicating your content in violation of copyright law, you may contact the site’s host to request removal. In addition, you can request that Google remove the infringing page from our search results by filing a request under the Digital Millennium Copyright Act.

And as always, if you need help in identifying the cause of an incorrect decision or how to fix it, you can see our Help Center article about this topic and ask in our Webmaster Help Forum.

Accessing search query data for your sites

Tuesday, October 18, 2011 at 11:18 AM

Webmaster level: All

SSL encryption on the web has been growing by leaps and bounds. As part of our commitment to provide a more secure online experience, today we announced that SSL Search on https://www.google.com will become the default experience for signed in users on google.com. This change will be rolling out over the next few weeks.

What is the impact of this change for webmasters? Today, a web site accessed through organic search results on http://www.google.com (non-SSL) can see both that the user came from google.com and their search query. (Technically speaking, the user’s browser passes this information via the HTTP referrer field.) However, for organic search results on SSL search, a web site will only know that the user came from google.com.

Webmasters can still access a wealth of search query data for their sites via Webmaster Tools. For sites which have been added and verified in Webmaster Tools, webmasters can do the following:
  • View the top 1000 daily search queries and top 1000 daily landing pages for the past 30 days.
  • View the impressions, clicks, clickthrough rate (CTR), and average position in search results for each query, and compare this to the previous 30 day period.
  • Download this data in CSV format.
In addition, users of Google Analytics’ Search Engine Optimization reports have access to the same search query data available in Webmaster Tools and can take advantage of its rich reporting capabilities.

We will continue to look into further improvements to how search query data is surfaced through Webmaster Tools. If you have questions, feedback or suggestions, please let us know through the Webmaster Tools Help Forum.

Posted by Anthony Chavez, Product Manager

Create and manage Custom Search Engines from within Webmaster Tools

Wednesday, October 12, 2011 at 2:20 AM

Webmaster level: All

Custom Search Engines (CSEs) enable you to create Google-powered customized search experiences for your sites. You can search over one or more sites, customize the look and feel to match your site, and even make money with AdSense for Search. Now it’s even easier to get started directly from Webmaster Tools.

If you’ve never created a CSE, just click on the “Custom Search” link in the Labs section and we’ll automatically create a default CSE that searches just your site. You can do some basic configuring or immediately get the code snippet to add your new CSE to your site. You can always continue on to the full CSE control panel for more advanced settings.

Once you’ve created your CSE (or if you already had one), clicking the “Custom Search” link in Labs will allow you to manage your CSEs without leaving Webmaster Tools.

We hope these new features make it easier for you to help users search your site. If you have any questions, please post them in our Webmaster Help Forum or the Custom Search Help Forum.

Webmaster forums' Top Contributors rock

Wednesday, October 05, 2011 at 10:27 AM

Webmaster level: All

The TC Summit was a blast! As we wrote in our announcement post, we recently invited more than 250 Top Contributors from all over the world to California to thank them for being so awesome and to give them the opportunity to meet some of our forum guides, engineers and product managers in person.

Our colleagues Adrianne and Brenna already published a recap post on the Official Google Blog. As for us, the search folks at Google, there's not much left to say except that we enjoyed the event and meeting Top Contributors in real life, many of them for the first time. We got the feeling you guys had a great time, too. Let’s quote a few of the folks who make a huge difference on a daily basis:

Sasch Mayer on Google+ (Webmaster TC in English):

"For a number of reasons this event does hold a special place for me, and always will. It's not because I was one of comparatively few people to be invited for a Jolly at the ‘Plex, but because this trip offered the world's TCs a unique opportunity to finally meet each other in person."

Herbert Sulzer, a.k.a. Luzie on Google+ (Webmaster TC in English, German and Spanish):

“Hehehe! Fun, fun fun, this was all fun :D Huhhh”

Aygul Zagidullina on Google+ (Web Search TC in English):

“It was a truly fantastic, amazing, and unforgettable experience meeting so many other TCs across product forums and having the chance to talk to and hear from so many Googlers across so many products!”

Of course we did receive lots of constructive feedback, too. Transparency and communication were on top of the list, and we're looking into increasing our outreach efforts via Webmaster Tools, so stay tuned! By the way, if you haven’t done so yet, please remember to use the forwarding option in the Webmaster Tools Message Center to get the messages straight to your email inbox. In the meantime please keep an eye on our Webmaster Central Blog, and of course keep on contributing to discussions in the Google Webmaster Forum.

On behalf of all Google guides who participated in the 2011 Summit we want to thank you. You guys rock! :)


That’s right, TCs & Google Guides came from all over the world to convene in California.


TCs & Google Guides from Webmaster Central and Search forums after one of the sessions.


After a day packed with presentations and breakout sessions...


...we did what we actually came for...


...enjoyed a party, celebrated and had a great time together.

Webmaster Tools Search Queries data is now available in Google Analytics

Tuesday, October 04, 2011 at 12:42 PM

Webmaster level: All

Earlier this year we announced a limited pilot for Search Engine Optimization reports in Google Analytics, based on Search queries data from Webmaster Tools. Thanks to valuable feedback from our pilot users, we’ve made several improvements and are pleased to announce that the following reports are now publicly available in the Traffic Sources section of Google Analytics.
  • Queries: impressions, clicks, position, and CTR info for the top 1,000 daily queries
  • Landing Pages: impressions, clicks, position, and CTR info for the top 1,000 daily landing pages
  • Geographical Summary: impressions, clicks, and CTR by country
All of these Search Engine Optimization reports offer Google Analytics’ advanced filtering and visualization capabilities for deeper data analysis. With the secondary dimensions, you can view your site’s data in ways that aren’t available in Webmaster Tools.


To enable these Search Engine Optimization reports for a web property, you must be both a Webmaster Tools verified site owner and a Google Analytics administrator of that Property. Once enabled, administrators can choose which profiles can see these reports.

If you have feedback or suggestions, please let us know in the Webmaster Help Forum.

Work smarter, not harder, with site health

Thursday, September 29, 2011 at 2:26 PM

Webmaster level: All

We consistently hear from webmasters that they have to prioritize their time. Some manage dozens or hundreds of clients’ sites; others run their own business and may only have an hour to spend on website maintenance in between managing finances and inventory. To help you prioritize your efforts, Webmaster Tools is introducing the idea of “site health,” and we’ve redesigned the Webmaster Tools home page to highlight your sites with health problems. This should allow you to easily see what needs your attention the most, without having to click through all of the reports in Webmaster Tools for every site you manage.

Here’s what the new home page looks like:


You can see that sites with health problems are shown at the top of the list. (If you prefer, you can always switch back to listing your sites alphabetically.) To see the specific issues we detected on a site, click the site health icon or the “Check site health” link next to that site:


This new home page is currently only available if you have 100 or fewer sites in your Webmaster Tools account (either verified or unverified). We’re working on making it available to all accounts in the future. If you have more than 100 sites, you can see site health information at the top of the Dashboard for each of your sites.

Right now we include three issues in your site’s health check:

  1. Have we detected malware on the site?
  2. Have any important pages been removed via our URL removal tool?
  3. Are any of your important pages blocked from crawling in robots.txt?

You can click on any of these items to get more details about what we detected on your site. If the site health icon and the “Check site health” link don’t appear next to a site, it means that we didn’t detect any of these issues on that site (congratulations!).

A word about “important pages:” as you know, you can get a comprehensive list of all URLs that have been removed by going to Site configuration > Crawler access > Remove URL; and you can see all the URLs that we couldn’t crawl because of robots.txt by going to Diagnostics > Crawl errors > Restricted by robots.txt. But since webmasters often block or remove content on purpose, we only wanted to indicate a potential site health issue if we think you may have blocked or removed a page you didn’t mean to, which is why we’re focusing on “important pages.” Right now we’re looking at the number of clicks pages get (which you can see in Your site on the web > Search queries) to determine importance, and we may incorporate other factors in the future as our site health checks evolve.

Obviously these three issues—malware, removed URLs, and blocked URLs—aren’t the only things that can make a website “unhealthy;” in the future we’re hoping to expand the checks we use to determine a site’s health, and of course there’s no substitute for your own good judgment and knowledge of what’s going on with your site. But we hope that these changes make it easier for you to quickly spot major problems with your sites without having to dig down into all the data and reports.

After you’ve resolved any site health issues we’ve flagged, it will usually take several days for the warning to disappear from your Webmaster Tools account, since we have to recrawl the site, see the changes you’ve made, and then process that information through our Web Search and Webmaster Tools pipelines. If you continue to see a site health warning for that site after a week or so, the issue may not have been resolved. Feel free to ask for help tracking it down in our Webmaster Help Forum... and let us know what you think!