mod_rewrite is a module for the Apache web server that allows web requests to be rewritten according to rules listed in a .htaccess file.
Items filed under mod_rewrite
Found a problem searching Drupal sites that use 'clean urls'. It even happens on the main Drupal site. Just do a search for 'die/die' and you get the following error:
Not Found
The requested URL /search/node/die/die was not found on this server.
Apache/1.3.33 Server at drupal.org Port 80
Clean urls requires a mod_rewrite hack to turn
http://www.drupal.org/search/node/die/die
into
http://www.drupal.org/index.php?q=search/node/die/die
but this is not working properly. If you enter the second form directly in a browser, the search runs, so the search itself works; it is mod_rewrite that does not seem to munge the clean URL correctly.
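The mapping that clean URLs rely on can be sketched in Python. This only illustrates the substitution and is not Drupal's actual rule set: the function name `rewrite_clean_url` is hypothetical, and the rule shape (capture the whole path and pass it as `q`) is an assumption based on the two URLs above.

```python
import re
from urllib.parse import urlsplit


def rewrite_clean_url(url):
    """Map a clean URL to its index.php?q=... form (illustrative only)."""
    parts = urlsplit(url)
    # In .htaccess context the path is matched without its leading slash.
    path = parts.path.lstrip("/")
    # Assumed rule shape: capture the whole path and hand it over as q.
    rewritten = re.sub(r"^(.*)$", r"index.php?q=\1", path, count=1)
    return f"{parts.scheme}://{parts.netloc}/{rewritten}"


print(rewrite_clean_url("http://www.drupal.org/search/node/die/die"))
```

If the sketch produces the second URL but the server still answers 404, that suggests the rewrite is not being applied to this path at all, rather than being applied wrongly.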
Filed under: apache drupal mod_rewrite
My server is being hit by attempts to submit trackback spam, which is particularly annoying as I don't have trackback. By default Drupal formats up a full web page with a fancy 'page not found' line for a 404 error. To save server time and bandwidth, I reject these requests with a 403 instead, and I've put this at the top of my .htaccess file:
ErrorDocument 403 /fail.html
fail.html is a minimal HTML file containing just the string 'Error 403', which should be little enough load for the server:
<head>
<title>Error</title>
</head>
<body>
Error 403
</body>
This is added to the mod_rewrite rules:
#
# Reject any attempt to submit trackback spam
#
RewriteRule ^(.*)trackback(.*)$ - [F]
Any URL with 'trackback' in it is rejected with the minimal 403 error.
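As a rough check of which paths the rule catches, the same pattern can be tried out in Python. This is a sketch, not a substitute for testing against Apache: the helper `is_forbidden` and the sample paths are made up for illustration.

```python
import re

# Same pattern as the RewriteRule. In .htaccess context the path is
# matched without its leading slash, and without [NC] it is case-sensitive.
TRACKBACK = re.compile(r"^(.*)trackback(.*)$")


def is_forbidden(path):
    """Return True if the rule would answer this path with a 403."""
    return TRACKBACK.search(path) is not None


# Hypothetical sample paths.
for path in ["trackback/42", "node/123/trackback", "node/123", "blog/Trackback"]:
    print(path, "->", "403" if is_forbidden(path) else "passed on")
```

Note that the match is case-sensitive, so a request for 'Trackback' would slip past unless the rule also carried the [NC] flag.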
Filed under: drupal htaccess mod_rewrite
This is a Python script to test .htaccess mod_rewrite rules that block referrer spam. I just hate the idea of these parasites sucking my bandwidth.
#
# Test .htaccess
#
import urllib.error
import urllib.request

#
# Site to test
#
strSite = "http://www.petersblog.org"

#
# Bad referrers: should fail
#
strBadReferrers = [
    "http://www.blah.info",
    "http://blah.info",
    "http://any.blah.info",
    "http://www.blah.info/",
    "http://www.blah.info/this/should/still/fail"
]

#
# Good referrers: should pass
#
strGoodReferrers = [
    "http://www.google.com",
    "http://www.google.com/search?q=tecrep-inc.net",
    strSite,  # allow internal referrer in
    strSite + "/node/123",
    ""  # no referrer
]

def TestReferrer(strReferrer):
    "Test whether a referrer is allowed in: True if so"
    try:
        request = urllib.request.Request(strSite)
        if strReferrer != "":
            # The HTTP header name really is spelt 'referer'
            request.add_header("referer", strReferrer)
        opener = urllib.request.build_opener()
        opener.open(request).read()
        return True
    except urllib.error.HTTPError:
        # A 403 (or any other HTTP error status) means we were rejected
        return False

#
# Test bad referrers.
#
for strReferrer in strBadReferrers:
    if TestReferrer(strReferrer):
        print("Failed: allowed %s in" % strReferrer)
    else:
        print("Passed: didn't allow %s in" % strReferrer)

#
# Test good referrers.
#
for strReferrer in strGoodReferrers:
    if TestReferrer(strReferrer):
        print("Passed: allowed %s in" % strReferrer)
    else:
        print("Failed: didn't allow %s in" % strReferrer)
I find the following format for mod_rewrite referrer blocking to be effective.
RewriteCond %{HTTP_REFERER} ^http://[^/]*blah\.net($|/.*$) [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://[^/]*blah\.com($|/.*$) [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://[^/]*blahblah\.org($|/.*$) [NC]
RewriteRule ^.* - [F]
Any comment spam that happens to get through is a good source of sites to block. Note that one rule such as:
RewriteCond %{HTTP_REFERER} ^http://[^/]*blah\.com($|/.*$) [NC]
RewriteRule ^.* - [F]
will catch all the following permutations:
* http://www.blah.com/
* http://www.foo.blah.com/
* http://www.foo.bar.blah.com/
* http://www.foo.bar.blah.com/still/not/allowed
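The permutations above can be checked offline with Python's `re` module before touching the live server. This is a sketch under a few assumptions: `re.IGNORECASE` stands in for [NC], the dot is escaped so it only matches a literal '.', and the helper name `is_blocked` and the extra sample referrers are my own.

```python
import re

# Pattern from the RewriteCond, with the dot escaped.
# re.IGNORECASE plays the role of the [NC] flag.
BLOCKED = re.compile(r"^http://[^/]*blah\.com($|/.*$)", re.IGNORECASE)


def is_blocked(referrer):
    """Return True if the condition would match this referrer."""
    return BLOCKED.search(referrer) is not None


should_block = [
    "http://www.blah.com/",
    "http://www.foo.blah.com/",
    "http://www.foo.bar.blah.com/",
    "http://www.foo.bar.blah.com/still/not/allowed",
]
# Hypothetical referrers that should be let through.
should_allow = [
    "http://www.google.com/search?q=blah.com",
    "",
]

for referrer in should_block + should_allow:
    print(repr(referrer), "->", "blocked" if is_blocked(referrer) else "allowed")
```

One side effect worth knowing about: because `[^/]*` matches any run of host characters, a domain that merely ends in the blocked name (say, a hypothetical notblah.com) is caught as well.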
Note: I've seen referrer spelt 'referer' a lot, and it is spelt that way in the .htaccess rules because the HTTP header itself is named 'Referer', but Google's define: search assures me I'm spelling it right. 'Referer' sounds to me more like a smoker of certain narcotic substances.
Filed under: google htaccess mod_rewrite python
My logs show attempts by GoogleBot et al to access a robots.txt file so I've decided not to disappoint them any more and have provided them with this:
User-agent: *
Disallow:
This tells them to scan away: I'm open.
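To double-check that an empty Disallow line really means 'everything is allowed', the file can be fed to Python's standard `urllib.robotparser`. This is just a sketch: the helper `may_crawl` and the sample URL are made up for illustration, and other crawlers may of course interpret the file with their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from above.
ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent, url):
    """Return True if the parsed robots.txt lets this agent fetch the URL."""
    return parser.can_fetch(agent, url)


# Hypothetical URL on the site.
print(may_crawl("Googlebot", "http://www.petersblog.org/node/123"))
```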
Googling for robots.txt, it is interesting to see that the fifth entry is http://www.whitehouse.gov/robots.txt, i.e. the White House's robots.txt file. It seems to do a lot of disallowing. I hope this isn't going to be interpreted as political commentary.
I didn't need a mod_rewrite rule to get this working; it Just Worked.
Filed under: mod_rewrite
I noticed in my Drupal logs that Google is looking for a file called rss.xml on my site:
09/10/2004 - 13:23 404 error: 'rss.xml' not found Anonymous
I am eager to keep Google happy, but how do I create such a file? I had a brainwave and added a mod_rewrite rule to my .htaccess file:
RewriteRule rss.xml blog/feed/1
So any attempts to access rss.xml trigger the link that my RSS buttons point to. Note that I have 'Clean Urls' turned on.
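One thing worth checking with a rule like this is how broadly the pattern matches, since the pattern is unanchored and its dot matches any character. The sketch below compares the pattern as written with an anchored, escaped variant; the anchored form and the sample paths are my own assumptions, not something taken from the .htaccess file.

```python
import re

# The pattern as written: unanchored, and '.' matches any character.
AS_WRITTEN = re.compile(r"rss.xml")
# Hypothetical tighter variant: whole path only, literal dot.
ANCHORED = re.compile(r"^rss\.xml$")


def matches(pattern, path):
    """Return True if the pattern matches anywhere in the path."""
    return pattern.search(path) is not None


# Hypothetical sample paths.
for path in ["rss.xml", "files/rss.xml.bak", "rssxxml"]:
    print(path, matches(AS_WRITTEN, path), matches(ANCHORED, path))
```

On a small site the loose pattern is unlikely to bite, but the anchored form only fires for a request for exactly rss.xml.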
Come back, Google, I'm waiting for you!

