Designing a search engine friendly website

Ash Nallawalla, 21 February 2009, 12:52 PM

Ash Nallawalla explains how to design a search-engine-friendly (SEF) website.


Some companies unwittingly make it difficult for a search engine spider to crawl their website. Others do it deliberately, blocking certain parts of the site for good reasons. Let’s see how to get it right.

Website Architecture

From a search engine crawler’s perspective there are two broad types of website architecture:
  • Static (or flat) pages, which can be edited individually.
  • Dynamic pages, which are rendered on the fly from templates and a content database.
No matter how fast a web server can serve a dynamic page to a spider, it can present a static page a little faster, since the static page has already been assembled. A common trick is to make a dynamic page look like a static page. Although this does not render the page any faster, the URL is shorter and this enables more pages to be crawled.
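
As a rough sketch of that trick, the Python function below maps a static-looking URL back to the dynamic URL that actually produces the page. The URL pattern, the script name catalog.php, and the category parameter are all hypothetical; a real site would usually do this in the web server's rewrite rules.

```python
import re
from typing import Optional

# Hypothetical pattern: /products/<category>.html is the URL shown to spiders.
STATIC_PATTERN = re.compile(r"^/products/([a-z0-9-]+)\.html$")

def to_dynamic(path: str) -> Optional[str]:
    """Translate a static-looking path into the (assumed) dynamic URL."""
    match = STATIC_PATTERN.match(path)
    if match is None:
        return None  # not one of our rewritten URLs
    return "/catalog.php?category=" + match.group(1)

print(to_dynamic("/products/digital-cameras.html"))
```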

Internal Links

You often see the same navigation menu on every page of a website. On a very small site this means that each page is linked to every other page.

The home page has as many internal links to it as does the least important page. This results in every page being assigned the same importance from a Google PageRank (PR) viewpoint.

So, arrange your internal links in a logical structure that resembles an inverted pyramid, and identify sub-pyramids based on themes, e.g. film cameras and digital cameras.
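
To illustrate the point, here is a minimal sketch of a simplified, textbook-style PageRank calculation. It is not Google's actual algorithm; the damping factor of 0.85, the page names, and both link structures are assumptions made for the example. It compares a site where every page links to every other page with a themed pyramid.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to. Every page must
    have at least one outgoing link and every target must be a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Same menu on every page: each page links to every other page.
mesh = {
    "home": ["film", "digital", "contact"],
    "film": ["home", "digital", "contact"],
    "digital": ["home", "film", "contact"],
    "contact": ["home", "film", "digital"],
}

# Themed pyramid: home -> theme pages -> detail pages, with links back up.
pyramid = {
    "home": ["film", "digital"],
    "film": ["home", "film-slr"],
    "digital": ["home", "digital-slr"],
    "film-slr": ["home", "film"],
    "digital-slr": ["home", "digital"],
}

mesh_rank = pagerank(mesh)
pyramid_rank = pagerank(pyramid)
for page, score in sorted(pyramid_rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In the fully meshed site every page ends up with the same score, whereas in the pyramid the home page scores highest, followed by the theme pages and then the detail pages.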

File and Directory Naming

Web developers sometimes use underscores as word separators in file names, e.g. file_name.html. The problem with this is that Google treats a hyphen (-) as a space but an underscore as a literal character, so “file_name” is seen as one string of text and not the two words “file name”. Separated words have some benefit when the URL appears as the anchor text of a link, so hyphens indirectly help with ranking. You should use hyphens in file and directory names where possible. There are indications that some search engines will allow underscores as separators, but it is safer to use hyphens.
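
If you have many files to rename, a small helper can do the conversion. The sketch below is one possible approach; the function name hyphenate_filename is hypothetical and not part of any tool mentioned in this article.

```python
import re

def hyphenate_filename(name: str) -> str:
    """Replace runs of underscores or spaces in a file name with one hyphen."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    stem = re.sub(r"[_\s]+", "-", stem.strip()).strip("-")
    return f"{stem}.{ext}" if dot else stem

print(hyphenate_filename("file_name.html"))
```

Remember to redirect the old URLs to the new ones after renaming, otherwise existing links to the underscore versions will break.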

Links to the Home Page

We frequently see internal links to the home page coded as follows:
<a href="/index.html">Home</a>
This means the home page has external links pointing to:
<a href="http://www.companyname.com/">Company Name</a>
whereas internal links are pointing to:
<a href="/index.html">Home</a>
which is, effectively, the same as:
<a href="http://www.companyname.com/index.html">Home</a>
We now have the home page referred to as two distinct URLs – with and without the “index.html”. It is best to use internal links pointing to the root of the home directory with a slash:
<a href="/">Home</a>
However, other internal links should use the full domain name if there is any risk that someone will scrape (copy) your pages. This way, those other links will take people to your site unless the scraper edits these links.
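
The home page clean-up described above can be sketched in a few lines of Python. The list of default document names is an assumption; adjust it to whatever your web server serves for the root directory.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default documents that a server may return for "/".
DEFAULT_DOCS = ("index.html", "index.htm", "index.php")

def canonical_home(url: str) -> str:
    """Rewrite links to the root default document so they point to "/"."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.lstrip("/") in DEFAULT_DOCS and path.count("/") == 1:
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(canonical_home("http://www.companyname.com/index.html"))
print(canonical_home("/index.html"))
```

Links to other pages, such as /products/index.html, are left untouched by this sketch.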

JavaScript Links

Client-side JavaScript (JS) refers to code that runs in the visitor’s browser, not on the web server. Since spiders do not use a browser to view a site, they cannot execute JS, which often means that they cannot follow navigation menus that have been coded with JS. Consequently, they cannot see the rest of the site beyond the home page. Only Google claims to be able to follow JS links.

Many people like the effects created by JS, e.g. fly-out menus, and will insist on it, so use a textual navigation bar at the bottom of a page to get around the problem, or at least a single textual link to a sitemap.
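
To check whether a page offers a spider any plain HTML links, you could run something like the following sketch over its source. The class name LinkAudit and the sample markup are hypothetical; the rule that an empty, "#" or "javascript:" href counts as script-only is an assumption.

```python
from html.parser import HTMLParser

class LinkAudit(HTMLParser):
    """Sort anchor tags into plain HTML links and script-only links."""

    def __init__(self):
        super().__init__()
        self.crawlable = []
        self.script_only = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = (dict(attrs).get("href") or "").strip()
        if not href or href == "#" or href.lower().startswith("javascript:"):
            self.script_only.append(href)
        else:
            self.crawlable.append(href)

def audit_links(html: str) -> LinkAudit:
    audit = LinkAudit()
    audit.feed(html)
    return audit

sample = ('<a href="javascript:openMenu()">Products</a>'
          '<a href="/sitemap.html">Site map</a>')
result = audit_links(sample)
print(result.crawlable, result.script_only)
```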

Managing Spiders/Robots

Robots Meta Tag

The <HEAD> portion of a page should contain all Meta tags. If you want to try to control the behaviour of a spider, you can do it at page level with the robots Meta tag. There is no need to tell a spider to Index and Follow, as this is its default behaviour. Doing so wastes bytes, so remove the tag from your code if you have it.

Examples:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
Only one (or none) of the above should appear on a page.

Note that Nofollow means that the links on such a page do not pass a vote for PR purposes.

If you want to tell well-behaved robots (includes Google, Yahoo!, Live, and Ask) to stay away, then use:
<meta name="robots" content="none">
That is the opposite of:
<meta name="robots" content="all">
Do not use the “all” tag, as this is the robots’ default behaviour and the tag only makes the page heavier.
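
The sketch below shows how a well-behaved robot might interpret these tags, using Python's standard html.parser module. The class and function names are hypothetical, and real spiders may handle edge cases differently.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Read the robots Meta tag; index and follow are the defaults."""

    def __init__(self):
        super().__init__()
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() != "robots":
            return
        for token in (attrs.get("content") or "").lower().split(","):
            token = token.strip()
            if token in ("noindex", "none"):
                self.index = False
            if token in ("nofollow", "none"):
                self.follow = False

def robots_policy(html: str):
    """Return (may_index, may_follow) for a page's HTML source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow

print(robots_policy('<meta name="robots" content="noindex,follow">'))
```

A page with no robots Meta tag at all comes back as index and follow, which is why the explicit "index,follow" and "all" tags are unnecessary.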

Robots.txt File

The Robots Exclusion Protocol enables website administrators to indicate to visiting robots which parts of their site should not be explored. Only one such file is needed, and it must be in the root directory, e.g. www.example.com/robots.txt. It is a good idea to have even an empty file with this name; otherwise the web server log gets a “not found” error entry each time a robot requests the nonexistent file.

You can take advantage of Sitemap Auto-discovery whereby you mention the location of your XML sitemap in this file and all compliant search engines (Google, Live, Yahoo! and Ask) will come and find it without needing to be pinged.
You can use a wildcard to disallow all user agents from certain directories.
Example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
Note: It is easy to fake the user-agent text, so you could, for example, make your browser appear to be Googlebot. Hence this is not a guaranteed exclusion method. You can read the contents of any website that has such a file, e.g. http://www.google.com/robots.txt, so it is better not to name your private directories in an obvious manner, such as “secret” or “private”. Better still, such directories should be password-protected.

To exclude all spiders from the whole site (Don’t do this if you want to optimise the site!):
User-agent: *
Disallow: /
To exclude a specific spider from the whole site:
User-agent: Nastybot
Disallow: /
Nasty spiders, such as ones that harvest email addresses for spam, do not look at the robots.txt file, so there is no easy protection from malicious spiders.
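
You can test your rules before publishing them with Python's standard urllib.robotparser module. The sketch below loads the first example above from a string; the agent name ExampleBot and the page paths are hypothetical, and the site_maps() call needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)

print(allowed("/products.html"))
print(allowed("/private/report.html"))
print(parser.site_maps())
```

This only tells you what a compliant robot would do; as noted above, malicious spiders ignore the file altogether.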

Page Coding

We won’t tell you which website platform to use, but we can discuss the approach you should take in coding pages.

General Principles

In general, the code should be as lean as possible (maximise the use of CSS and external JavaScript files), with keyphrase-laden text as high as possible in the code (as seen by a spider). Navigation code should be as low as possible and should not use JavaScript (otherwise there should be an alternative way to show HTML links). Links should be clean HTML (not escaped through a CGI script or JavaScript). Repetitive links to no-content pages, e.g. to a Disclaimer, should be marked as rel="nofollow", except for one clean link. The home page should contain a link to the site map, which should contain textual links to important pages. On large sites it is not practical to link to all pages; that is the job of the sitemap.xml file.

Use JavaScript/AJAX Wisely

The “weight” of a page should be kept down by moving JavaScript to an external file. When multiplied over many pages on a site, this results in faster spider crawls.
 
The following line replaces several lines that were previously in the <head> section of a page:
<script language="JavaScript" src="includes/myscript.js" type="text/javascript"></script>
AJAX is a great technology for reducing page reloads, but is another kind of JavaScript, so be mindful that the content may not be seen by the crawlers.

Cascading Style Sheets (CSS)

Keep style definitions in an external style sheet, as this reduces the weight of a page and speeds up access for users:
<link href="includes/style.css" rel="stylesheet" type="text/css">
Ordinarily, an H1 tag creates a large, ugly heading that might not go with the style of your page, so you can use CSS to restyle it.
Check out this site (www.tanfa.co.uk/css/layouts/) for some good CSS layouts.
 

Source Ordered Content

Source Ordered Content (SOC), also known as Source Ordered HTML and often achieved with CSS absolute positioning, involves placing valuable text high in the source code and less valuable content, such as navigation elements, lower. This also ensures that the snippet of the page shown in a search engine results page (SERP) is meaningful, not “Home, About, Products” etc. Sometimes the effort required to achieve SOC may outweigh the benefit, for example with complex Content Management System templates that are not available to the SEO.

Images

All visible pictures (not spacers) should have alternate text for people with visual difficulties, as they use text readers that “speak” the text on the web page. Often called “alt tags” (strictly, the alt attribute of the image tag), the code looks like this:
<img src="/images/pos.gif" width="22" height="4" align="absmiddle" border="0"
alt="Positive page rank value">
Keep alternate text to 80 characters at most, and do not try to stuff a lot of keywords into it.
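
A quick way to find images that break these guidelines is to scan the page source. The sketch below is one possible check; the class name AltAudit is hypothetical, and it assumes that an image with an empty alt attribute (alt="") is a deliberate spacer and therefore acceptable.

```python
from html.parser import HTMLParser

MAX_ALT = 80  # maximum length suggested in this article

class AltAudit(HTMLParser):
    """Collect images whose alt text is missing or too long."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src") or ""
        alt = attrs.get("alt")
        if alt is None:
            self.problems.append((src, "missing alt attribute"))
        elif len(alt) > MAX_ALT:
            self.problems.append((src, "alt text too long"))

def audit_images(html: str):
    audit = AltAudit()
    audit.feed(html)
    return audit.problems

page = ('<img src="/images/pos.gif" alt="Positive page rank value">'
        '<img src="/images/camera.jpg">'
        '<img src="/images/spacer.gif" alt="">')
print(audit_images(page))
```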


HTML Page Elements

The following is not an HTML tutorial but some comments on how certain elements affect SEO.

Doctype

The first line in the code must be the correct DOCTYPE, depending on the version of HTML you have chosen. The most common error is when people use Frames but do not use the Frameset Doctype.

HTML

The <html></html> pair encloses the rest of the page and tells the browser that the page has been encoded as HTML (useful because the Doctype is missing on many sites). You can also specify the language of the page here, e.g.
Australian English = <html lang="en-au">
German = <html lang="de">

Head

The <head></head> section should contain only essential Meta Tags and important elements such as:
  • <title>APCmag - Home</title> - this is the most important single element on your web page. Place your most valuable keyphrase in it.
  • <meta name="Description" content="APC - tech news, reviews, how to and computer help. Microsoft Windows XP, Vista, Linux, Mac, dual-boot tutorials"> - The content should be compelling words that encourage the viewer to click, but should also be keyword-rich.
  • <meta name="Keywords" content="XP, Windows, Linux, Mac, Vista, OS X, iPhone, APC, Australian Personal Computer, Australia, PC, hardware, dual-boot, dualboot, tech news">
  • <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  • <link href="APCStyles.css" rel="stylesheet" type="text/css"> – This should point to a cascading style sheet (CSS) that is placed in another file.
  • <meta name="robots" content="noodp"> (optional) – Use this to display a snippet from your page in Google/Live/Yahoo! search results, instead of the ODP (dmoz.org) description, which might not be what you want.
  • <meta name="robots" content="noydir"> (optional) – Use this to display a snippet from your page in Yahoo! search results, instead of the Yahoo! Directory description, which might be out of date.
  • It is fine to insert some proprietary meta tags if you know you are getting SEO benefit – for example some directories or geographic listings need you to place their tag in your Head area. Don’t waste this space with meta tags such as Copyright, Author, Distribution, etc. Government sites use DC (Dublin Core) meta tags, which are not recognised by the search engines. Unless you are forced to use them (e.g. by departmental rules), do not use them.

Body

The <body></body> pair delineates the body of the page, namely most of the visible part of the page. This would contain images, headings, text, and structural code. Do not confuse semantics with style – for example, do not mark a whole page with an H3 (heading) tag because it makes the font larger. The tags should describe the purpose of the content text, e.g. <H2> = A second-level heading.

H1

There should be only one <h1></h1> pair per page. This is the main topic of the page. H tags have been abused by SEOs; therefore, their impact is not as great as it once was.

H2 through H5

An H2 is a sub-topic of an H1. There is no limit to the number of H2s, H3s etc that you can have.
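
The heading rules above are easy to verify automatically. The following sketch lists the heading levels found in a page and checks that there is exactly one H1; the class and function names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class HeadingOutline(HTMLParser):
    """Record the level of every h1-h6 tag in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))

def heading_levels(html: str):
    outline = HeadingOutline()
    outline.feed(html)
    return outline.levels

def has_single_h1(html: str) -> bool:
    return heading_levels(html).count(1) == 1

page = "<h1>Cameras</h1><h2>Film</h2><h2>Digital</h2><h3>Compact</h3>"
print(heading_levels(page), has_single_h1(page))
```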

Img

The image tag is an opportunity to place a keyword in the Alternate text (for blind people), but only if the image is a link. You can use keywords for other images but they have little, if any, SEO value. You should not place keywords in spacer images.

A

An anchor tag is used for linking. When linking internally, use absolute URLs, that is, include the domain name with the URL.
Example: <a href="http://www.example.com/products.php">Products</a>, not
<a href="/products.php">Products</a>.

There are many other tags used within the Body of a page but they do not have SEO significance.