Peter's Blog

Redefining the Impossible

Items filed under apache


Nginx? Think 'Engine X' (NOT en-jinx like I keep thinking). It's a web server, one that is very popular in Russia (the first country into space). It can be thought of as a modern, lean, mean web server that does away with the bloat of Apache. It is very good for use on VPSs, where memory can be at a premium.

I tried it on my new Slicehost VPS and it was up and running in about ten minutes, proxying to my mongrel cluster (which didn't need rebooting when I swapped web servers) like a good'un. The configuration files are very clear and the ubuntu installation is laid out much like the apache configuration that I am familiar with. It is possible to add a new virtual site in a minute or two by adding a configuration file as simple as:

server {
    listen          80;
    server_name     www.domain1.com;
    access_log      logs/domain1.access.log main;

    location / {
        index index.html;
        root  /var/www/domain1.com/htdocs;
    }
}

This is pretty much all you want to state: the name of the site, a path within the site and where it maps to on the file system. For the mongrel proxying I used a recipe I found: the nginx wiki has many such recipes. It reminds me of cherrypy and how most of the recipes on its wiki were incomplete or broken, something that ultimately contributed to me abandoning turbogears, django and python for ruby on rails.
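Since a site boils down to those few facts, a file like the one above is easy to generate. Here is a minimal Python sketch, assuming the same layout as my example; `render_server_block` and its parameters are names I have made up for illustration, not anything nginx provides:

```python
def render_server_block(server_name, root, log_name, index="index.html", port=80):
    """Return the text of an nginx server block (hypothetical helper, not an nginx API)."""
    lines = [
        "server {",
        f"    listen          {port};",
        f"    server_name     {server_name};",
        f"    access_log      logs/{log_name}.access.log main;",
        "",
        "    location / {",
        f"        index {index};",
        f"        root  {root};",
        "    }",
        "}",
    ]
    return "\n".join(lines) + "\n"


# Reproduces the example site from above
print(render_server_block("www.domain1.com", "/var/www/domain1.com/htdocs", "domain1"))
```

The output would still need saving wherever your nginx installation keeps its site files, which varies between distributions.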

I was encouraged to see that Nginx can support drupal quite easily as its url rewriting rules are capable of pleasing drupal's 'clean urls' mode. Lighttpd can support drupal if you do some php hacking.

Nginx doesn't directly support cgi (as a security feature!?!) but it can support fcgi by talking to an fcgi process via a proxy. I haven't tried this yet but it would be the route I would take to support php (if I wanted drupal on my nice box). I would also dearly like phpmyadmin to be available (hidden from the outside world, only accessible through an ssh tunnel of course). My one hope is that such an fcgi php process can support more than one application (can never be sure with these things).

I read more about lighttpd and found mumblings of bugginess in proxying mongrel clusters and some security concerns. I read nothing but good about Nginx.

n.b. unlike lighttpd

/etc/init.d/nginx restart

actually restarted the server (not having to expressly kill the old processes).


Filed under: apache lighttpd nginx


I've got a nice Rails development setup going now. Aptana is a very nice IDE, very powerful, very rich. The Rails development aspect is most useful in being able to run applications in development mode on my PC. I use the mysql server on the deployment server through an ssh tunnel rather than install a database on the PC.

I have created a subversion repository on my VPSlink server. I have installed the subclipse plugin on Aptana and checked out my application onto the work PC. I work on the application in Aptana, polish it and when it is ready I commit the changes, again through Aptana. Then on my server I deploy the code by getting the new version out of subversion. I don't use a simple rsync to deploy and I haven't got into capistrano (the rails deployment tool): going via svn works for me. I might consider putting up something like trac (or a rails based alternative) to give me a development wiki and web browsing of the repository.

I've been thinking about backups and the current plan is to back up the SVN repositories and mysql dumps from my VPSLink server to my site5 server and vice versa. The VPSLink stuff is more in need of backing up since I am the administrator of that one; the site5 account includes daily backups. I was tempted by rsync.net, a nice, simple, flexible and cheap remote backup option, but since I already have two accounts with ssh access and 35G of space between them I don't think I need to spend more money. Rsync.net looks appealing but the main thing it is lacking for professional purposes would be Windows file permissions, otherwise I might consider it for backing up the servers at work: being able to recover all one's files is good, not having to spend a week fixing the spaghetti mess of windows access permissions is better.
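The backup plan could be scripted along these lines. This is only a sketch: the paths, the remote host and the function names are invented for illustration, and it just builds the shell command lines (using svnadmin dump, mysqldump and rsync over ssh) rather than running anything.

```python
import shlex


def svn_dump_command(repo_path, dump_file):
    """Shell line dumping one subversion repository to a file."""
    return f"{shlex.join(['svnadmin', 'dump', repo_path])} > {shlex.quote(dump_file)}"


def mysql_dump_command(dump_file):
    """Shell line dumping every database (credentials assumed to come from ~/.my.cnf)."""
    return f"{shlex.join(['mysqldump', '--all-databases'])} > {shlex.quote(dump_file)}"


def rsync_command(local_dir, remote_target):
    """Shell line copying the contents of the backup directory to the other server over ssh."""
    return shlex.join(["rsync", "-az", "-e", "ssh", local_dir.rstrip("/") + "/", remote_target])


def backup_plan(repo_paths, backup_dir, remote_target):
    """Return the command lines in the order they would be run."""
    plan = []
    for repo_path in repo_paths:
        name = repo_path.rstrip("/").split("/")[-1]
        plan.append(svn_dump_command(repo_path, f"{backup_dir}/{name}.svndump"))
    plan.append(mysql_dump_command(f"{backup_dir}/mysql-all.sql"))
    plan.append(rsync_command(backup_dir, remote_target))
    return plan


# Hypothetical paths and host, purely for illustration
for line in backup_plan(["/srv/svn/myapp"], "/srv/backup", "me@other.example.org:backups/vps/"):
    print(line)
```

Keeping the plan as plain strings means it can be eyeballed before being dropped into a cron job on each server.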

One major change is that I have reverted from Lighttpd back to Apache on my VPSLink server. The main reason for this was that Lighttpd seems to have some limitations in terms of supporting things like drupal's urls. Lighttpd would be good for simple setups but getting multiple legacy php and rails applications set up is just as troublesome as with Apache so I've gone back to the devil I know.

I've had Apache fail to start with memory errors a few times. This could well be the 500M limit of the VPSLink coming into play. VPSLink is cheap because there is no swap, 500M is my absolute limit. If I start getting memory problems I will have to consider the options:

  • I can upgrade the VPSLink to 1G memory but for similar money I could get a cheap dedicated server
  • I could get a different VPS account, one that had burst memory (it's the bursts that kill VPSLink) but I do like the performance of VPSLink.
  • I could fiddle about tuning the number of apache processes and suchlike but life is too short for that one
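For the record, the tuning in the last option mostly comes down to one division: how many apache processes fit in whatever memory is left once everything else has had its share. A back-of-envelope sketch in Python; the figures are made-up assumptions, not measurements, and the function name is just for illustration:

```python
def max_apache_processes(total_mb, reserved_mb, per_process_mb):
    """Whole number of apache processes that fit in the memory left over."""
    available_mb = total_mb - reserved_mb
    if available_mb <= 0 or per_process_mb <= 0:
        return 0
    return available_mb // per_process_mb


# Assumed figures: 500M VPS, 200M kept back for mysql and friends,
# 25M per apache process
print(max_apache_processes(500, 200, 25))  # prints 12
```

The result is the sort of number that would go into MaxClients for the prefork MPM, with no swap to catch any overshoot.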

This blog is still on the site5 account. The server is being upgraded soon; maybe this will resolve the loading issues it has whenever I try doing some development on it (e.g. run 'top' and shriek in horror).


Add a comment

Extracts from awstats logs, bandwidth used by various visitors to this site:

crawler.bloglines.com   353.02 MB
MSIECrawler             567.41 MB
Inktomi Slurp           135.79 MB
Googlebot                59.86 MB

Apparently MSIECrawler is IE sucking the entire contents of the site. A couple of people seem to have done this, what is the point? Is this site that interesting? Are the spam blogs (copies of legitimate blogs full of links to p0ker sites) using IE for their scraping technology? My attitude to the spammers turns from annoyance to pity.

Bloglines is getting a bit carried away, 353M just downloading RSS feeds.

Inktomi Slurp is still slurping and not returning any visitors from search results.

Googlebot drives 95% of my traffic so 59M is acceptable.

Here is my latest crack at apache log file analysis in python:

#
# Apache log file analysis.
#
import re
import datetime

#
# Regular expression for parsing apache log file (combined log format).
#
oLogRE = re.compile( r'''([\d.]+).*\s+                # host
                  [^\s]+\s+                        # ident (usually -)
                  [^\s]+\s+                        # user (usually -)
                  \[(.*?)\]\s+                     # when
                  "(.*?)\s+(.*)\s+(.*)"\s+         # method, path, protocol
                  (\d+)\s+                         # status code
                  ([^\s]+)\s+                      # size in bytes, or -
                  "(.*?)"\s+                       # Referrer
                  "(.*?)"                          # Agent
''', re.VERBOSE)

# Indexes into the tuple of fields returned for each hit
LOG_Who = 0
LOG_When = 1
LOG_How = 2
LOG_What = 3
LOG_Protocol = 4
LOG_Error = 5
LOG_Size = 6
LOG_Referrer = 7
LOG_Agent = 8

def ScanFile( strFile):
    """
    Scan apache log file and yield the fields of each hit.
    """
    # Open the file we were given rather than a hard-coded path
    with open( strFile, errors='replace') as oFile:
        for strLine in oFile:
            oMatch = oLogRE.match( strLine)
            if oMatch:
                yield oMatch.groups()
            else:
                print( 'Reject: %s' % strLine)

def GatherBy( oHits, nField):
    """
    Gather hits from list of hits into a dictionary keyed
    by unique values of a specific field.
    """
    oDict = {}

    for oHit in oHits:
        oKey = oHit[nField]
        if oKey in oDict:
            oDict[oKey].append( oHit)
        else:
            oDict[oKey] = [oHit]

    return oDict

def FilterBy( oHits, nField, strFilter):
    """
    Filter hits from list of hits, keeping those where a specific
    field matches a regular expression.
    """
    oRE = re.compile( strFilter)

    for oHit in oHits:
        if oRE.search( oHit[nField]):
            yield oHit

def FilterByDate( oHits, oStartDate, oEndDate = None):
    """
    Filter hits >= Start Date and < End Date (default: tomorrow)
    """
    if oEndDate is None:
        oEndDate = datetime.date.today() + datetime.timedelta( 1)

    oRE = re.compile( r'(\d+)/(\w+)/(\d+).*')

    for oHit in oHits:
        strDate = oHit[LOG_When]
        strDay, strMonth, strYear = oRE.match( strDate).groups()

        nDay = int( strDay)
        nMonth = ['Jan', 'Feb', 'Mar',
                  'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep',
                  'Oct', 'Nov', 'Dec'].index( strMonth) + 1
        nYear = int( strYear)

        oDate = datetime.date( nYear, nMonth, nDay)

        if oDate >= oStartDate and oDate < oEndDate:
            yield oHit

def AnalyseBy( oHits, nField, bJustSummary = False):
    """
    Print hits by unique values of a specific field
    and generate counts and bytes for each unique value.
    """
    oDict = GatherBy( oHits, nField)

    oKeys = sorted( oDict)

    nGrandTotalCounts = 0
    nGrandTotalBytes = 0

    for oKey in oKeys:
        nCount = len( oDict[oKey])

        nTotal = 0

        for oHit in oDict[oKey]:
            strSize = oHit[LOG_Size]
            # Size is logged as '-' when no bytes were sent
            if strSize != '-':
                nTotal += int(strSize)

        if not bJustSummary:
            print( oKey, nCount, nTotal)

        nGrandTotalCounts += nCount
        nGrandTotalBytes += nTotal

    print( "Unique items: %d, Total Hits: %d, Total Bytes: %d" % (len(oKeys),
                                                                 nGrandTotalCounts,
                                                                 nGrandTotalBytes))

oStartDate = datetime.date.today() - datetime.timedelta( 8 ) # week yesterday
oEndDate = datetime.date.today() - datetime.timedelta( 1 )  # yesterday

oAllHits = list( FilterByDate( ScanFile( 'c:\\Desktop\\access.log'),
                               oStartDate, oEndDate))
oAllHits.extend( list( FilterByDate( ScanFile( 'c:\\Desktop\\access.log.1'),
                                     oStartDate, oEndDate)))

print( "User Agents")
AnalyseBy( oAllHits, LOG_Agent, True)

print( "All Hosts (hence all usage)")
AnalyseBy( oAllHits, LOG_Who, True)

print( "Hits from bloglines")
#
# Determine different bloglines feeds and analyse each one
#
for strFeed, oFeedHits in GatherBy( FilterBy( oAllHits,
                                              LOG_Agent,
                                              'Bloglines'),
                                    LOG_What).items():
    #
    # Now analyse by agent:  agent includes number of subscribers
    # so we see subscribers per feed.
    print( "Bloglines feed %s" % strFeed)
    AnalyseBy( oFeedHits, LOG_Agent)

print( "Hits from MSIECrawler by host")
AnalyseBy( FilterBy( oAllHits, LOG_Agent, 'MSIECrawler'), LOG_Who)

print( "Hits from Inktomi/yahoo slurp")
AnalyseBy( FilterBy( oAllHits, LOG_Agent, 'Slurp'), LOG_Agent)

Example output for week ending yesterday. Notes:

  • 659,628,604 bytes served in a week!
  • Slurp took 45,786,926 bytes
  • MSIECrawler user took 49,980,154 bytes
  • I've got more bloglines subscribers than I thought.
  • Just how many rss feed urls does drupal provide?

User Agents
Unique items: 497, Total Hits: 76052, Total Bytes: 659628604

All Hosts (hence all usage)
Unique items: 3301, Total Hits: 76052, Total Bytes: 659628604

Hits from bloglines
Bloglines feed /blog/1/feed
Bloglines/3.0-rho (http://www.bloglines.com; 1 subscriber) 252 13730570
Bloglines/3.0-rho (http://www.bloglines.com; 3 subscribers) 256 13947306
Bloglines/3.0-rho (http://www.bloglines.com; 5 subscribers) 252 13730570
Bloglines/3.0-rho (http://www.bloglines.com; 7 subscribers) 256 13947306
Unique items: 4, Total Hits: 1016, Total Bytes: 55355752
Bloglines feed /blog/feed
Bloglines/3.0-rho (http://www.bloglines.com; 1 subscriber) 242 13189698
Unique items: 1, Total Hits: 242, Total Bytes: 13189698
Bloglines feed /atom/feed
Bloglines/3.0-rho (http://www.bloglines.com; 3 subscribers) 256 1449434
Unique items: 1, Total Hits: 256, Total Bytes: 1449434
Bloglines feed /tags/18/feed
Bloglines/3.0-rho (http://www.bloglines.com; 1 subscriber) 256 12427520
Unique items: 1, Total Hits: 256, Total Bytes: 12427520
Bloglines feed /blog/feed/1
Bloglines/3.0-rho (http://www.bloglines.com; 5 subscribers) 252 0
Unique items: 1, Total Hits: 252, Total Bytes: 0
Bloglines feed /taxonomy/term/5/0/feed
Bloglines/3.0-rho (http://www.bloglines.com; 1 subscriber) 242 3832796
Unique items: 1, Total Hits: 242, Total Bytes: 3832796
Bloglines feed /node/feed
Bloglines/3.0-rho (http://www.bloglines.com; 3 subscribers) 256 13959338
Unique items: 1, Total Hits: 256, Total Bytes: 13959338
Bloglines feed /rss.xml
Bloglines/3.0-rho (http://www.bloglines.com; 3 subscribers) 256 0
Bloglines/3.0-rho (http://www.bloglines.com; 7 subscribers) 256 0
Unique items: 2, Total Hits: 512, Total Bytes: 0
Bloglines feed /tags/3/feed
Bloglines/3.0-rho (http://www.bloglines.com; 1 subscriber) 250 12999000
Unique items: 1, Total Hits: 250, Total Bytes: 12999000

Hits from MSIECrawler by host
81.159.46.223 1784 49980154
Unique items: 1, Total Hits: 1784, Total Bytes: 49980154

Hits from Inktomi/yahoo slurp
Mozilla/5.0 (compatible; Yahoo! Slurp China;) 58 800642
Mozilla/5.0 (compatible; Yahoo! Slurp;) 2066 44986284
Unique items: 2, Total Hits: 2124, Total Bytes: 45786926
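
Since the Bloglines agent string carries the subscriber count, the number can be pulled out with a small regular expression instead of being read by eye. A sketch, assuming the agent format seen in the output above; `subscriber_count` is just an illustrative name:

```python
import re

# Matches "1 subscriber)" as well as "7 subscribers)"
_SUBSCRIBERS_RE = re.compile(r'(\d+)\s+subscribers?\)')


def subscriber_count(agent):
    """Return the subscriber count in a Bloglines agent string, or None if absent."""
    match = _SUBSCRIBERS_RE.search(agent)
    return int(match.group(1)) if match else None


print(subscriber_count('Bloglines/3.0-rho (http://www.bloglines.com; 7 subscribers)'))
```

Fed with the LOG_Agent field of each hit, this would let the script total subscribers per feed rather than printing one line per agent string.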

4 Comments

Playing with trac I had to set up apache login authentication to set up access permissions. This is good, I now know how to password protect personal areas of the site (not that personal).

I've used auth-digest as it's supposed to be more secure than basic authentication. It may have problems with some versions of internet explorer: no, let's rephrase that, it is better at keeping the proles out. Here is how I did it for my debian system:

  • Enable the Digest Authentication module in apache2:
    sudo a2enmod auth_digest
    sudo apache2ctl restart
    
  • Create a digest file:
    mkdir /somewhere/to/keep/it
    htdigest -c /somewhere/to/keep/it/auth.htdigest Area51 me
    
    where me is my user id. You will be prompted for a password for the user.
  • Edit site configuration file: in the case of my trac url, I've protected it thusly:
    ScriptAlias /trac /usr/share/trac/cgi-bin/trac.cgi
    <Location "/trac">
        AuthType Digest
        AuthName "Area51"
        AuthDigestDomain /var/www/Trac http://www.somewhere.org/Trac
        AuthDigestFile /somewhere/to/keep/it/auth.htdigest
        Require valid-user
        SetEnv TRAC_ENV "/var/www/Trac"
    </Location>
    

Now I have to log in to get into www.somewhere.org/trac.
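
For the curious, each line that htdigest writes has the form user:realm:hash, where the hash is the MD5 of "user:realm:password". The realm has to match the AuthName in the config, which is why Area51 appears in both places. A small Python sketch of the same calculation; the function name and the password are made up for illustration:

```python
import hashlib


def htdigest_line(user, realm, password):
    """Build one line of an htdigest file: user:realm:MD5(user:realm:password)."""
    digest = hashlib.md5(f"{user}:{realm}:{password}".encode("utf-8")).hexdigest()
    return f"{user}:{realm}:{digest}"


# 'secret' is a placeholder password
print(htdigest_line("me", "Area51", "secret"))
```

This is only to show what is in the file; for real use I would stick with the htdigest command, which prompts for the password rather than leaving it in a script.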


Filed under: apache debian ubuntu


My embryo Django blog got to the stage where I wanted to theme it. I started off by giving it the theme from this site, which requires serving of various graphic files as well as the .css file. This means setting apache up for serving static files as well as dynamic pages. Django can serve static files but it is better to use apache as that is what apache is good at. The django advice is to use a different server for the static files, something lean and mean, but I'd prefer to use the one server. I get about 500 page loads/day, so performance is not totally paramount to me.

I achieved this using the following apache virtualhost setup (in /etc/apache2/sites-available/mysite for debian and apache2).

<VirtualHost *>
    ServerName sitename.org
    ServerAlias sitename.org *.sitename.org
    ServerAdmin webmaster@localhost

    DocumentRoot /var/www/sitename.org
    <Directory />
        Options -Indexes FollowSymLinks MultiViews
        AllowOverride None
        Order allow,deny
        allow from all
    </Directory>

    #
    # Most stuff is handled by django
    #
    <Location "/">
        SetHandler python-program
        PythonHandler django.core.handlers.modpython
        PythonPath "['/usr/local/lib/Django'] + sys.path"
        SetEnv DJANGO_SETTINGS_MODULE django_local.settings
    </Location>

    #
    # Theme files are static and not done by django
    #
    <Location "/theme">
        SetHandler none
    </Location>

    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn

    ErrorLog /var/log/apache2/sitename.org/error.log

    CustomLog /var/log/apache2/sitename.org/access.log combined
    ServerSignature On

</VirtualHost>

Here I use the Location keyword to tell mod_python to handle all accesses to the site. Then I override this for the subdirectory /theme where I set the handler to none, causing static content to be served.
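
The override works because apache processes Location sections in the order they appear in the file, with later matching sections overriding earlier ones. Here is a simplified Python model of that, assuming plain prefix matching only (no regular expression sections, no Directory or Files sections); the function names are mine:

```python
def location_matches(location, path):
    """True if a Location prefix covers the request path (simplified model)."""
    if location == "/":
        return True
    prefix = location.rstrip("/")
    # /theme covers /theme and /theme/x.css but not /themes
    return path == prefix or path.startswith(prefix + "/")


def resolve_handler(sections, path):
    """Walk the sections in file order; the last matching one wins."""
    handler = None
    for location, section_handler in sections:
        if location_matches(location, path):
            handler = section_handler
    return handler


# The two sections from the virtualhost above, in file order
SECTIONS = [("/", "python-program"), ("/theme", "none")]

print(resolve_handler(SECTIONS, "/blog/1"))
print(resolve_handler(SECTIONS, "/theme/style.css"))
```

Swap the two sections round in this model and "/" wins for every path, which is why the /theme section needs to come second.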


Filed under: apache django python


I've discovered the 'Alias' keyword in apache2 config files. This keyword allows a pretty free hand at redirecting urls to directories and files on the hard disk. Consider this extract from the site config file (in /etc/apache2/sites-available/intranet):

<VirtualHost *>
    ServerName intranet
    ServerAdmin webmaster@localhost

    DocumentRoot /var/www/intranet
    <Directory />
        Options FollowSymLinks
        AllowOverride All
    </Directory>
    <Directory /var/www/intranet>
        # pcw No directory listings
        # Options Indexes FollowSymLinks MultiViews
        Options -Indexes FollowSymLinks MultiViews
        AllowOverride All
        Order allow,deny
        allow from all
    </Directory>

    Alias /bugzilla "/var/www/bugzilla/"
    <Directory "/var/www/bugzilla/">
        Options ExecCGI -Indexes MultiViews FollowSymLinks
        AllowOverride None
        Order deny,allow
        Deny from all
        Allow from all
    </Directory>

This is telling apache that the site is called 'intranet' and is normally served up from the directory /var/www/intranet. However, there is a subdirectory called 'bugzilla' that is addressed as http://intranet/bugzilla but is served up from /var/www/bugzilla rather than /var/www/intranet/bugzilla.
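
The mapping can be pictured as a small lookup: aliased prefixes are swapped for their target directory and everything else falls through to the DocumentRoot. A rough Python model of that idea, using the paths from the config above; it is a simplification of what apache really does and the names are my own:

```python
import posixpath

DOCUMENT_ROOT = "/var/www/intranet"
ALIASES = {"/bugzilla": "/var/www/bugzilla"}


def url_to_file(url_path):
    """Map a URL path to a file system path (simplified model of Alias)."""
    for prefix, target in ALIASES.items():
        # Only match on whole path segments: /bugzilla and /bugzilla/...
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return posixpath.normpath(target + url_path[len(prefix):])
    return posixpath.normpath(DOCUMENT_ROOT + url_path)


print(url_to_file("/bugzilla/index.cgi"))
print(url_to_file("/node/1"))
```

A real server also has to guard against '..' in the path and apply the Directory permissions, which this sketch ignores.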

Why would I want to do this? Because /var/www/intranet is a drupal setup stored in subversion and I don't want to put the bugzilla stuff in subversion or fiddle around telling subversion to ignore it. It keeps each feature of the domain cleanly separated.


Filed under: apache bugzilla subversion


After fixing the postfix installation all was looking good until I realised the web server running Drupal was only showing the home page: clicking on other pages kept giving the home page. I looked in /var/log/apache/error.log and found I was getting this error on each click:

/usr/sbin/apache: relocation error: /usr/lib/php4/20020429/mysql.so:
undefined symbol: php_sprintf

A nasty one. Some googling and forum browsing gave me the clue to the solution: I had an unholy mix of Apache 1.3 and Apache 2.0 installed on the box, apache 1.3 was running and finding php compiled for Apache 2.0 (or something like that).

The solution was:

  • run rcconf and disable apache (1.3) and enable apache2
  • stop apache 1.3
  • install mod_php4 for apache2
  • enable php in /etc/apache2/apache2.conf
  • enable mysql.so and gd.so in /etc/php4/apache2/php.ini
  • start apache2

and sanity was restored (if an intranet can be described as that).


Filed under: apache linux mysql php ubuntu

4 Comments

I needed to set up another drupal site on my ubuntu linode. I had a domain name and I wanted to make it an independent site. I decided to keep it separate from my existing site by putting in a fresh Drupal 4.6.1 installation and not to use Drupal's virtual server facility.

I knew Apache2 supported virtual hosting and I decided to use that. I tried creating a new virtual host by creating a file in /etc/apache2/sites-available as follows:

<VirtualHost *>
        ServerName www.site2.com
        ServerAlias site2.com
        ServerAdmin webmaster@localhost

        DocumentRoot /var/www/site2
        <Directory /var/www/site2/>
                Options Indexes FollowSymLinks MultiViews
                # pcw AllowOverride None
                AllowOverride All
                Order allow,deny
                allow from all
                # This directive allows us to have apache2's default start page
                # in /apache2-default/, but still have / go to the right place
                # Commented out for Ubuntu
                #RedirectMatch ^/$ /apache2-default/
        </Directory>

        ErrorLog /var/log/apache2/site2/error.log

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.
        LogLevel warn

        CustomLog /var/log/apache2/site2/access.log combined
        ServerSignature On

</VirtualHost>

where site2 is the name of the new site. Note that I created /var/log/apache2/site2 so that the site would get its own access logs.

I used the command

a2ensite site2

to enable the site. I restarted apache2 and, bang, both this site and the new site showed the new site: I had broken this site.

After faffing around and googling, I tried a simple experiment. I removed the symbolic link to site2 from /etc/apache2/sites-enabled created by a2ensite and I just appended the above file to /etc/apache2/sites-available/default. I restarted apache2 and this worked, I had two sites. This is probably not the right way to do it but it works and any time I spend fixing it will bring this site down which bothers me so I'll leave it as it is unless I come across the correct way to do it.

Update: On my oneandone server running debian this is working fine as a separate file, enabled with a2ensite:

<VirtualHost *>
    ServerName petersblog.org
    ServerAlias petersblog.org *.petersblog.org
    ServerAdmin webmaster@localhost

    DocumentRoot /var/www/petersblog.org
    <Directory />
        Options FollowSymLinks
        AllowOverride All
    </Directory>
    <Directory /var/www/petersblog.org>
        # pcw No directory listings
        # Options Indexes FollowSymLinks MultiViews
        Options -Indexes FollowSymLinks MultiViews
        AllowOverride All
        Order allow,deny
        allow from all
    </Directory>

    ErrorLog /var/log/apache2/petersblog.org/error.log

    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn

    CustomLog /var/log/apache2/petersblog.org/access.log combined
    ServerSignature On

</VirtualHost>

I have four sites set up like this and all are working.
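
As I understand it from the apache documentation, name-based virtual hosting compares the Host header of each request with the ServerName and ServerAlias of each virtual host and, if nothing matches, falls back to the first virtual host listed. That fallback is the sort of thing that might make every name show the same site. A simplified Python model; the list mixes sites from my two servers purely for illustration and the function name is my own:

```python
from fnmatch import fnmatchcase

# Illustrative list only: the order stands in for the order apache reads the files
VHOSTS = [
    {"name": "petersblog.org",
     "aliases": ["petersblog.org", "*.petersblog.org"],
     "root": "/var/www/petersblog.org"},
    {"name": "www.site2.com",
     "aliases": ["site2.com"],
     "root": "/var/www/site2"},
]


def select_vhost(host_header, vhosts):
    """Pick the virtual host for a Host header; default to the first one listed."""
    host = host_header.split(":")[0].lower()  # drop any :port suffix
    for vhost in vhosts:
        names = [vhost["name"]] + vhost["aliases"]
        if any(fnmatchcase(host, name.lower()) for name in names):
            return vhost
    return vhosts[0]


print(select_vhost("www.petersblog.org", VHOSTS)["root"])
print(select_vhost("site2.com:80", VHOSTS)["root"])
```

The model leaves out the IP address and port matching that apache does before it ever looks at names.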


Filed under: apache debian drupal ubuntu

8 Comments

Apache is a web server.


Filed under: apache


Found a problem searching Drupal sites that use 'clean urls'. It even happens on the main Drupal site. Just do a search for 'die/die' and you get the following error:

Not Found
The requested URL /search/node/die/die was not found on this server.

Apache/1.3.33 Server at drupal.org Port 80

Clean urls require a mod_rewrite hack to turn

http://www.drupal.org/search/node/die/die

into

http://www.drupal.org/index.php?q=search/node/die/die

but this is not working properly. If you try the second form above in a browser it works and the search is done, so it seems that mod_rewrite is not munging the first form correctly.
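
The intended mapping is simple enough to write down. Here is a Python sketch of the substitution only, assuming the whole path simply moves into the q parameter; it is not Drupal's actual rewrite rule, and `clean_url_to_query` is a made-up name:

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url_to_query(url):
    """Turn a clean url into the index.php?q=... form (sketch of the intended mapping)."""
    parts = urlsplit(url)
    query = "q=" + parts.path.lstrip("/")
    if parts.query:
        # Keep any query string that was already there
        query += "&" + parts.query
    return urlunsplit((parts.scheme, parts.netloc, "/index.php", query, parts.fragment))


print(clean_url_to_query("http://www.drupal.org/search/node/die/die"))
```

A path like search/node/die/die goes through this untouched, so whatever trips up the search presumably happens before the rewrite gets a look in.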


Filed under: apache drupal mod_rewrite
