Peter's Blog

Redefining the Impossible

Items filed under mod_rewrite


mod_rewrite is a module for the Apache web server that allows web requests to be rewritten according to rules listed in a .htaccess file.


Filed under: htaccess mod_rewrite


Found a problem searching Drupal sites that use 'clean urls'. It even happens on the main Drupal site. Just do a search for 'die/die' and you get the following error:

Not Found
The requested URL /search/node/die/die was not found on this server.

Apache/1.3.33 Server at drupal.org Port 80

Clean urls requires a mod_rewrite hack to turn

http://www.drupal.org/search/node/die/die

into

http://www.drupal.org/index.php?q=search/node/die/die

but this is not working properly. The second form works if you enter it directly in a browser and the search is carried out, so it appears that mod_rewrite is not munging the clean URL correctly.
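For reference, the translation that clean URLs relies on can be sketched in a few lines of Python. This is only an illustration of the mapping, not Drupal's or Apache's code, and the function name clean_to_query is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_to_query(url):
    """Map a clean URL onto the index.php?q=... form (illustrative only)."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")       # e.g. 'search/node/die/die'
    query = "q=" + path
    if parts.query:                     # keep any existing query string
        query += "&" + parts.query
    return urlunsplit((parts.scheme, parts.netloc, "/index.php", query, ""))


print(clean_to_query("http://www.drupal.org/search/node/die/die"))
```

Running it on the failing search URL prints the second form shown above, which is the one that works in a browser.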


Filed under: apache drupal mod_rewrite


My server is being hit by attempts to submit trackback spam, which is particularly annoying as I don't have trackback. By default Drupal formats up a full web page with a fancy 'page not found' line for a 404 error (page not found). To save server time and bandwidth, I've put this at the top of my .htaccess file:

ErrorDocument 403 /fail.html

Fail.html is a minimal html file containing just the string '403 error'. Should be little enough load for the server:

<head>
<title>Error</title>
</head>
<body>
Error 403
</body>

This is added to the mod_rewrite rules:

#
# Reject any attempt to submit trackback spam
#
RewriteRule ^(.*)trackback(.*)$ - [F]

Any URL with 'trackback' in it is rejected with the minimal 403 error.
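A quick way to see which requests the rule would reject is to try the same pattern in Python. This is a rough sketch: it assumes Python's re module behaves like Apache's regex engine for a pattern this simple, and the sample paths are invented for illustration.

```python
import re

# Same pattern as the RewriteRule above. In .htaccess the leading slash is
# stripped before matching, so the sample paths below have none.
TRACKBACK = re.compile(r"^(.*)trackback(.*)$")


def is_rejected(path):
    """True if the rule would answer this path with 403 Forbidden."""
    return TRACKBACK.search(path) is not None


# Hypothetical request paths, not taken from real logs.
for path in ["trackback/123", "node/45/trackback", "node/45", "blog/feed/1"]:
    print(path, "->", "403" if is_rejected(path) else "passed on")
```

Note that the match is case-sensitive, so a request for 'Trackback' would only be caught if the rule also carried the [NC] flag.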


Filed under: drupal htaccess mod_rewrite


This is a python script to test out .htaccess mod_rewrite rules to block referrer spam. I just hate the idea of these parasites sucking my bandwidth.

#
# Test .htaccess
#
import urllib.error
import urllib.request

#
# Site to test
#
strSite = "http://www.petersblog.org"

#
# Bad referrers: should fail
#
strBadReferrers = [
    "http://www.blah.info",
    "http://blah.info",
    "http://any.blah.info",
    "http://www.blah.info/",
    "http://www.blah.info/this/should/still/fail"
]

#
# Good referrers: should pass
#
strGoodReferrers = [
    "http://www.google.com",
    "http://www.google.com/search?q=tecrep-inc.net",
    strSite,                    # allow internal referrer in
    strSite + "/node/123",
    ""                          # no referrer
]

def TestReferrer(strReferrer):
    "Test whether a referrer is allowed in: True if so"
    try:
        request = urllib.request.Request(strSite)
        if strReferrer != "":
            # The HTTP header really is spelt 'Referer'
            request.add_header("Referer", strReferrer)
        opener = urllib.request.build_opener()
        opener.open(request).read()
        return True
    except urllib.error.HTTPError:
        # Any HTTP error status (e.g. 403) counts as being refused
        return False

#
# Test bad referrers.
#
for strReferrer in strBadReferrers:
    if TestReferrer(strReferrer):
        print("Failed: allowed %s in" % strReferrer)
    else:
        print("Passed: didn't allow %s in" % strReferrer)

#
# Test good referrers.
#
for strReferrer in strGoodReferrers:
    if TestReferrer(strReferrer):
        print("Passed: allowed %s in" % strReferrer)
    else:
        print("Failed: didn't allow %s in" % strReferrer)

I find the following format for mod_rewrite referrer blocking to be effective.

RewriteCond %{HTTP_REFERER} ^http://[^/]*blah\.net($|/.*$) [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://[^/]*blah\.com($|/.*$) [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://[^/]*blahblah\.org($|/.*$) [NC]
RewriteRule ^.* - [F]

Any comment spam that happens to get through is a good source of sites to block. Note that one rule such as:

RewriteCond %{HTTP_REFERER} ^http://[^/]*blah\.com($|/.*$) [NC]
RewriteRule ^.* - [F]

will catch all the following permutations:

* http://www.blah.com/
* http://www.foo.blah.com/
* http://www.foo.bar.blah.com/
* http://www.foo.bar.blah.com/still/not/allowed
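The pattern can also be checked offline before it goes anywhere near the server. The sketch below is a Python approximation of the RewriteCond, assuming re.IGNORECASE stands in for [NC]; the dot is escaped so that it only matches a literal '.', and the extra 'should pass' samples are made up for illustration.

```python
import re

# Python equivalent of the RewriteCond pattern; re.IGNORECASE plays the
# role of [NC]. The dot is escaped so it only matches a literal '.'.
BLOCKED = re.compile(r"^http://[^/]*blah\.com($|/.*$)", re.IGNORECASE)


def is_blocked(referrer):
    """True if a request carrying this referrer would be refused."""
    return BLOCKED.search(referrer) is not None


should_block = [
    "http://www.blah.com/",
    "http://www.foo.blah.com/",
    "http://www.foo.bar.blah.com/",
    "http://www.foo.bar.blah.com/still/not/allowed",
]
# Hypothetical referrers that ought to get through.
should_pass = [
    "http://www.google.com/search?q=blah.com",
    "http://blah.com.example.org/",
    "",
]

for referrer in should_block + should_pass:
    print("blocked" if is_blocked(referrer) else "allowed", repr(referrer))
```

Because [^/]* cannot cross a slash, the domain has to appear in the host part of the referrer, so a search query that merely mentions the domain is not caught.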

Note: I've seen referrer spelt 'referer' a lot, and it is spelt this way in the .htaccess rules (the HTTP header itself is named Referer), but Google's define: assures me I'm spelling it right. 'Referer' sounds to me more like a smoker of certain narcotic substances.



My logs show attempts by GoogleBot et al to access a robots.txt file so I've decided not to disappoint them any more and have provided them with this:

User-agent: *
Disallow:

This is telling them to scan away, I'm open.
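To double-check how a well-behaved crawler would read those two lines, Python's standard urllib.robotparser module can parse them directly. This is a small sketch; the helper name allowed and the sample URL are only for illustration, and real crawlers may interpret robots.txt slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt contents shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Ask the parsed rules whether this user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("Googlebot", "http://www.petersblog.org/node/123"))
```

An empty Disallow line excludes nothing, so the check should come back true for any agent and any path.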

Googling for robots.txt, it is interesting to see that the fifth entry is for http://www.whitehouse.gov/robots.txt, i.e. the White House's robots.txt file. It seems to do a lot of disallowing. I hope this isn't going to be interpreted as political commentary.

I didn't need a mod_rewrite rule to get this working, it Just Worked.


Filed under: mod_rewrite


I noticed in my Drupal logs that google is looking for a file called rss.xml on my site:

09/10/2004 - 13:23  404 error: 'rss.xml' not found	Anonymous

I am eager to keep google happy but how to create such a file? I had a brainwave and added a mod_rewrite rule to my .htaccess file:

RewriteRule ^rss\.xml$ blog/feed/1

So any attempts to access rss.xml trigger the link that my RSS buttons point to. Note that I have 'Clean Urls' turned on.
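How much a pattern like this catches depends on whether it is anchored and whether the dot is escaped. The comparison below is a Python sketch, assuming the re module is close enough to Apache's regex engine for patterns this simple; the sample paths are invented.

```python
import re

LOOSE = re.compile(r"rss.xml")        # unanchored, '.' matches any character
STRICT = re.compile(r"^rss\.xml$")    # anchored, '.' matches only a dot


def matches(pattern, path):
    """True if the pattern would trigger the rewrite for this path."""
    return pattern.search(path) is not None


# Hypothetical request paths (no leading slash, as seen in .htaccess).
for path in ["rss.xml", "rssAxml", "files/rss.xml.bak", "blog/feed/1"]:
    print(path, "loose:", matches(LOOSE, path), "strict:", matches(STRICT, path))
```

Both forms catch a plain request for rss.xml; the anchored form simply avoids rewriting anything else that happens to contain those characters.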

Come back google, I'm waiting for you!