Robots.txt File, Sitemap.xml and the XML schema

What the hell are these bots or robots?  

The robots exclusion protocol, or simply robots.txt, is a standard used by most websites to communicate with web crawlers and other web robots. Yes, robots will be taking over the world, yeah sort of. Ever heard of HAL 9000? The standard specifies how to tell a web robot which areas of the website should not be processed or scanned. But even though you might ask the bot to stay out of that security folder, that doesn't mean a human can't try to break in. Keep that in mind: robots.txt exposes your admin paths in a plain text format that anyone can read. At any rate, here's what a robots.txt file might look like for a CMS like WordPress. You can also grab a decent cheap or free SEO plugin if you're in a rush and don't want to hand-code it.

 

User-agent: *
Allow: /wp-content/uploads/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Sitemap: http://www.mybadasswebsite.com/sitemap.xml
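If you want to sanity-check rules like these without waiting for a crawler to come around, Python's standard library ships a parser for the robots exclusion protocol. A minimal sketch, assuming the example rules above; the domain is the made-up one from that snippet and the uploaded file name is just a placeholder.

```python
# Check robots.txt rules locally with Python's built-in parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /cgi-bin/
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Uploads are allowed; the admin area is not.
allowed = parser.can_fetch("*", "http://www.mybadasswebsite.com/wp-content/uploads/logo.png")
blocked = parser.can_fetch("*", "http://www.mybadasswebsite.com/wp-admin/")
print(allowed, blocked)  # prints: True False
```

Well-behaved crawlers do essentially this check before fetching each URL; remember, though, that nothing forces a rogue bot (or a nosy human) to honor it.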

As you can see, Disallow means no, and Allow lets the web crawlers do their thing. Moving on to the sitemap (or site map): it's a list of the pages of a website, accessible to crawlers or users. It can be a planning document used during web design, or a web page that lists the pages of a site, typically organized hierarchically. Basically, it's a directory in a simple XML format that tells web crawlers where to look and what to scan. I'm not the best expert in this field of SEO, but I know Google doesn't play around when you ignore these rules in their search engine. The code below is an example sitemap.xml file that you might place in the main folder of your website directory.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-30</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2005-03-21</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
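If you'd rather generate the file than hand-edit it, the structure is simple enough to build with Python's standard XML tools. A minimal sketch, reusing the placeholder URLs and dates from the example above; a real script would pull the page list from your site.

```python
# Build a minimal sitemap.xml with Python's standard library.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # serialize without a namespace prefix

# (loc, lastmod, changefreq, priority) for each page -- placeholder data
pages = [
    ("http://www.example.com/", "2005-01-30", "monthly", "1.0"),
    ("http://www.example.com/about.html", "2005-03-21", "monthly", "0.8"),
]

urlset = ET.Element(f"{{{NS}}}urlset")
for loc, lastmod, changefreq, priority in pages:
    url = ET.SubElement(urlset, f"{{{NS}}}url")
    ET.SubElement(url, f"{{{NS}}}loc").text = loc
    ET.SubElement(url, f"{{{NS}}}lastmod").text = lastmod
    ET.SubElement(url, f"{{{NS}}}changefreq").text = changefreq
    ET.SubElement(url, f"{{{NS}}}priority").text = priority

# Serialize with an XML declaration, ready to write to sitemap.xml
sitemap = ET.tostring(urlset, encoding="unicode", xml_declaration=True)
print(sitemap)
```

This produces the same shape of document as the hand-written example, and you'd save the output as sitemap.xml in your site's root folder.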

There are a lot more rules and meaning behind the location, change frequency and priority tags that are beyond my scope, but you can find more info online from more dedicated web developers. As you can see, it's probably easier to use a plugin for WordPress. One of my favorite automatic sitemap generators is the XML & Google News Sitemaps plugin. Once you get that installed, it's a lot easier to keep things updated as you post. The plugin creates a robots.txt file as well. If you choose to do it manually, you can, but it sucks.

Last but not least, it's a good idea to add some basic metadata to the head section of your HTML5 website so the robots can follow you. All this stuff is boring, but it gets your site indexed and gets you started on search engine optimization.

<meta name="robots" content="index, follow">
<meta name="googlebot" content="index, follow">