Peter's Blog

Redefining the Impossible

Python Log Dumps


My Site5 hosting service lets me download access logs, which I find endlessly fascinating. The netadmin administration tool offers AWStats, which shows incredibly detailed statistics, but they are slightly skewed because they include my own visits.

So I wrote a Python script to parse the log and dump out anything interesting. It filters out the IP addresses I am likely to connect from. It is crude in that the log file name is hard-wired. Note that the log file I download is gzipped, but that is no problem for Python, which can read it directly.
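
As a rough illustration of those two ideas (reading straight from the gzipped file and skipping my own addresses), here is a minimal sketch in Python 3 syntax. The function name `visitor_lines`, the `own_addresses` parameter, and the `latin-1` decoding are my assumptions for the example, not part of the script below; the two addresses are the placeholder ones the script itself uses.

```python
import gzip

# Placeholder addresses, the same ones the full script filters on.
MY_ADDRESSES = {"76.54.32.10", "12.3.45.67"}

def visitor_lines(path, own_addresses=MY_ADDRESSES):
    """Yield lines from a gzipped access log, skipping hits from own_addresses."""
    # Mode "rt" decompresses and decodes to text while reading, so the
    # log never has to be unpacked on disk first.
    with gzip.open(path, "rt", encoding="latin-1") as log:
        for line in log:
            # The client address is the first whitespace-separated field.
            fields = line.split(None, 1)
            if fields and fields[0] not in own_addresses:
                yield line.rstrip("\n")
```

Passing the path in as an argument (for instance from `sys.argv[1]`) would also get rid of the hard-wired file name.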

The script dumps out:

  • suspicious looking attempts to hack in (extremely long strings etc)
  • a list of various user agents and the IP addresses they are coming from
  • a list of referrer strings

Things I find interesting in the dumps:

  • There are 171 different types of user agents listed. Most claim to be Mozilla-type browsers, which is probably rarely true, but even so there are a lot of things crawling around out there. Someone out there is using Lynx. Hi there.
  • I get at least one known spam email address harvester visiting (DTS Agent). Be warned. This particular one does not really bother to hide itself.
  • Referrers from drupal.org seem to arrive from random pages on that site. I think folk are browsing around, see something from me in the 'Drupal Talk' block and come here for a read. Drupal generates a misleading referrer string.
  • The referrer strings from Google give the search terms. I get a number of people looking for r-s-y-n-c w-i-n-2-k (obscured to hide from Google), and when I do that search this post comes in at #7 with its enticing title. Moral: give postings enticing titles.
  • Yahoo Slurp crawls the site about as much as Google but gave me one referral compared to 81 from Google.
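
Pulling the search terms out of a referrer string can be sketched like this, in Python 3 syntax. The helper name `search_terms` and the `keys` parameter are hypothetical, and I am assuming the search phrase travels in a `q` query parameter, as it does in the Google referrers I see; other engines may use a different parameter name.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms(referrer, keys=("q",)):
    """Return the search phrase carried in a referrer URL, or None."""
    # parse_qs decodes "+" and %xx escapes and maps each name to a list of values.
    query = parse_qs(urlsplit(referrer).query)
    for key in keys:
        if key in query:
            return query[key][0]
    return None

# Example with a made-up referrer string:
print(search_terms("http://www.google.com/search?q=python+log+dump&hl=en"))
```

A referrer of "-" (no referrer sent) or a plain page URL has no such parameter, so the helper returns None for those.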

These statistics are for a 7 day period.

import gzip
import re

#
# Open log file. Crude but effective. Reads directly from gzipped log file.
#
oFile = gzip.GzipFile( 'C:\\Tmp\\accesslog-bisiand.me.uk-9-28-2004.gz')

def Sorted( oArray):
    "Return sorted array"
    oTmp = oArray[:]
    oTmp.sort()
    return oTmp

#
# Scan through the log file.
# Use regular expression to split the entries up.
#
# Pattern is thus:
#
# 56.98.204.40 - - [09/Sep/2004:03:50:01 -0400] "GET / HTTP/1.0" 200 643 "-"
#   "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3"
#
# The pattern is split into two adjacent raw strings, which Python joins
# into one; a single string literal cannot span a line break.
#
oRE = re.compile( r'(\d+\.\d+\.\d+\.\d+).*(\[.*\])\s+"(GET|POST|HEAD|SEARCH|PUT)\s+([^"]+)'
                  r'"\s+([\d-]+)\s+([\d-]+)\s+"([^"]+)"\s+"([^"]+)"')

#
# Here build map of IP addresses to the log file entries.
#
oHits = {}

#
# Build map of unique referrers and how many folk they sent my way.
#
oReferrers = {}

#
# Go through the file.
#
for strLine in oFile.readlines():
    print strLine[:-1]
    oMatch = oRE.search( strLine)
    if oMatch:
        #
        # These methods seem to be used by hackers trying to break in.
        #
        if oMatch.group(3) in ("PUT", "SEARCH"):
            print strLine
            continue

        #
        # Get the IP address.
        #
        strIP = oMatch.group(1)

        #
        # Ignore the entry if it is me.
        #
        if strIP in ('76.54.32.10', '12.3.45.67'):
            continue

        #
        # Get interesting fields from log file.
        #
        strAccess = oMatch.group(3) + ' ' + oMatch.group(4)
        strReferrer = oMatch.group(7)
        strAgent = oMatch.group(8)

        #
        # Build up hit map.
        #
        if oHits.has_key( strIP):
            oHits[strIP].append( (strAccess, strReferrer, strAgent))
        else:
            oHits[strIP] = [( strAccess, strReferrer, strAgent)]

        #
        # Build up referrer map.
        #
        if oReferrers.has_key( strReferrer):
            oReferrers[strReferrer] += 1
        else:
            oReferrers[strReferrer] = 1
    else:
        #
        # Did not match the regular expression. Just dump the line.
        #
        print "Miss:" + strLine

#
# Determine which user agents originate from which IP.
#
strAgents = {}

for strIP in Sorted(oHits.keys()):
    oHit = oHits[strIP]
    if strAgents.has_key(oHit[0][2]):
        strAgents[oHit[0][2]].append( strIP)
    else:
        strAgents[oHit[0][2]] = [strIP]

#
# Display the unique User Agents and the IPs using them.
# This shows things like googlebot.
#
for strAgent in Sorted( strAgents.keys()):
    strIPs = strAgents[strAgent]
    print strAgent
    for strIP in strIPs:
        print "   %s %d" % (strIP.ljust( 15), len( oHits[strIP]))

#
# How did they get here? Show each referrer and its count.
#
for strReferrer in Sorted( oReferrers.keys()):
    if strReferrer.find( '209.59.159.21') >= 0:
        continue
    if strReferrer.find( 'bisiand.me.uk') >= 0:
        continue
    if len(strReferrer) < 60:
        print strReferrer.ljust( 60) + str(oReferrers[strReferrer])
    else:
        print strReferrer + "\n" + (' ' * 60) + str(oReferrers[strReferrer])
