Erich Zirnhelt
ServiceNow Employee

In my last post I shared some of the hidden metadata we applied to our articles to feed the crawlers, those bots sent by the internet search engines to consume and subsequently index our content. Today I will explain how a knowledge manager can fine-tune how the search engines handle the crawled content. Search engines are incredibly smart today, and many of you (if you even have your content crawled by internet search engines) may never need to take these added steps. For us, however, a lot of our content closely resembles its neighbors, and to make sure our customers have the best experience, we want the search engines to be as effective as possible.

 


 

Allowing crawlers to crawl

 

The first part of this puzzle is ensuring that the crawlers are able to access the content. This comes in two steps: (1) opening the doors to them, and (2) inviting them in.

  1. Opening the doors entails making the permission changes on the server side: specifically, enabling the 'guest' user and changing the permissions on the pages and objects you need to be accessible.
  2. Inviting them in is as simple as updating your robots.txt file. The ServiceNow default looks like this:

User-agent: *
Disallow: /

How you need to format the file differs by search engine, and some respect the Allow statement. Google (from which we get over 95% of our search traffic) respects the more specific rule over the more general one, and order doesn't matter. For example, here is ours today:

User-agent: *
Allow: /kb_view
Allow: /kb_view.do
Allow: /knowledgeSiteMap.do
Allow: /sys_attachment
Allow: /sys_attachment.do
Allow: /*.png
Allow: /*.pngx
Allow: /*.jpg
Allow: /*.jpgx
Allow: /*.jpeg
Allow: /*.jpegx
Allow: /*.gif
Allow: /*.gifx
Disallow: /

sitemap: https://hi.service-now.com/knowledgeSiteMap.do

 

You can see that the Disallow is less specific than the Allow lines. (Interestingly, Bing behaves differently enough that our Disallow is overpowering our Allow lines, and our KB content is not being crawled. We're going to have to replace the Disallow wildcard with discrete Disallow lines to fix this.)
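
For Bing, that will likely mean dropping the blanket Disallow: / and listing the paths we don't want crawled one by one. A rough sketch of that refactor (the Disallow paths below are illustrative placeholders, not our actual list):

User-agent: *
Allow: /kb_view.do
Allow: /knowledgeSiteMap.do
Allow: /sys_attachment.do
Disallow: /login.do
Disallow: /nav_to.do
Disallow: /logout.do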

It's worth noting that robots.txt will not override your system permissions, nor does it guarantee the crawler won't become aware of a page some other way. However, it ensures that when a well-behaved crawler comes across a link you don't want it to follow, it won't attempt it at all.

 

Directing the crawlers towards the right content to crawl

 

You may have noticed the sitemap line in the robots.txt file above. This is an easy way to tell the crawlers where to find a comprehensive listing of the pages that need to be crawled. So, as part of this release, we created a real-time sitemap that lists each crawlable page like this:

<url>
  <loc>...</loc>
  <lastmod>2014-06-19</lastmod>
  <changefreq>monthly</changefreq>
  <expires>2014-07-26</expires>
</url>

Beyond providing the location of the page, the crawler is also told how often the page is typically updated, the date it was last modified, and (optionally) the date we expect the page to expire.
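
For reference, each of those <url> entries sits inside the standard <urlset> wrapper defined by the sitemap protocol, so a complete (if tiny) sitemap file would look roughly like the sketch below. The article URL is only a placeholder, not one of our real pages:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://hi.service-now.com/kb_view.do?sysparm_article=KB0000001</loc>
    <lastmod>2014-06-19</lastmod>
    <changefreq>monthly</changefreq>
    <expires>2014-07-26</expires>
  </url>
</urlset>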

 

Working with the Search Engines' Tools

 

Once you've made your content crawler-accessible and invited the crawlers in, you can monitor your progress and even fine-tune the crawling and search behavior using the search engines' webmaster tools. (We've set ourselves up with both Google and Bing, but the links below mostly come from Google.)

 

First, you'll need to verify your site so the search engine can tie your user ID to it; this proves that you are truly responsible for the site content. We chose the meta tag approach, and even set up a system property so we can easily change the value through our test cycles or down the road.
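
If you go the same route, the verification tag is just a one-line <meta> element in the page <head>. Google and Bing each issue their own token; the content values below are placeholders:

<meta name="google-site-verification" content="YOUR_GOOGLE_TOKEN" />
<meta name="msvalidate.01" content="YOUR_BING_TOKEN" />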

 

Once your site is verified, you gain access to the webmaster tools (Google, Bing), where you can monitor crawler progress, monitor indexer progress (yes, they are different), identify structural or crawler errors, and provide special handling instructions for URL parameters. Some of the steps we took were:

  1. specified our sitemap, the same one we list in the robots.txt file
  2. told the crawler to ignore some of the session-based URL parameters, so that it understands the page content is the same with or without certain parameters specified (illustrated just after this list)
  3. looked at the structural health of our content
  4. reviewed the crawler, indexer and search stats daily... and obsessively...
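
To illustrate item 2: the article number and the extra parameter below are made up, but the goal is for the crawler to treat URLs like these as one page rather than as duplicates:

https://hi.service-now.com/kb_view.do?sysparm_article=KB0000001
https://hi.service-now.com/kb_view.do?sysparm_article=KB0000001&sysparm_search=crawler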

 

Next Steps

 

As I mentioned, Bing is not crawling our site yet, so we have a story in our backlog to refactor the robots.txt file. Outside of that, we know these are early days, and search engines change their behaviors regularly. We will learn and adjust as appropriate.

 

Have any of you exposed your content to the internet? What did you learn, and what tools did you use?
