My Site5 hosting service allows me to download access logs, which I find endlessly fascinating. The netadmin administration tool offers AWStats, which shows incredibly detailed statistics, but the figures are slightly skewed because they include my own visits.
So I wrote a Python script to parse the log and dump out anything interesting. It filters out IP addresses I am likely to connect from. It is crude in that the log file name is hard-wired. Note that the log file I download is gzipped, but that is no problem for Python.
This dumps out:
- suspicious-looking attempts to hack in (extremely long strings, etc.)
- a list of various user agents and the IP addresses they are coming from
- a list of referrer strings
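As a rough illustration of the first item, the check below flags a request when it uses one of the methods the script treats as hostile (PUT, SEARCH) or when the requested path is unusually long. The function name `looks_suspicious` and the 512-character cut-off are assumptions made for this sketch, not values taken from the script. It is written for Python 3.

```python
# Hypothetical helper: flag requests that look like break-in attempts.
# MAX_PATH_LENGTH is an assumed threshold, chosen only for illustration.
MAX_PATH_LENGTH = 512
HOSTILE_METHODS = ("PUT", "SEARCH")


def looks_suspicious(method, path):
    """Return True for hostile methods or extremely long request paths."""
    if method in HOSTILE_METHODS:
        return True
    return len(path) > MAX_PATH_LENGTH


if __name__ == "__main__":
    print(looks_suspicious("GET", "/"))
    print(looks_suspicious("SEARCH", "/" + "A" * 3000))
```

The threshold would need tuning against real traffic, since some legitimate query strings are long.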
Things I find interesting in the dumps:
- There are 171 different types of user agents listed. Most claim to be Mozilla-type browsers, which is probably rarely true, but even so, there are a lot of things crawling around out there. Someone out there is using Lynx. Hi there.
- I get at least one known spam email address harvester visiting (DTS Agent). Be warned. This particular one does not really bother to hide itself.
- Referrers from drupal.org seem to arrive from random pages on that site. I think folk are browsing around, see something from me in the 'Drupal Talk' block and come here for a read. Drupal generates a misleading referrer string.
- The referrer strings from Google give the search terms. I get a number of people looking for r-s-y-n-c w-i-n-2-k (obscured to hide from Google), and when I do that search this post comes in at #7 with its enticing title. Moral: give postings enticing titles.
- Yahoo Slurp crawls the site about as much as Google but gave me one referral compared to 81 from Google.
These statistics are for a 7-day period.
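The search-term observation can be reproduced with the standard library. The sketch below, written for Python 3, pulls the query out of a referrer URL. The function name `search_terms` and the sample URL are made up for illustration, and it assumes the search engine passes the query in a `q` parameter.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms(referrer):
    """Return the search phrase from a referrer URL, or None if absent."""
    query = urlsplit(referrer).query
    values = parse_qs(query).get("q")
    return values[0] if values else None


if __name__ == "__main__":
    # Made-up referrer, for illustration only.
    print(search_terms("http://www.google.com/search?q=python+log+parser&hl=en"))
```

A referrer of "-" (no referrer recorded) simply yields None, so the function can be applied to every entry without special-casing.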
import gzip
import re
#
# Open log file. Crude but effective. Reads directly from gzipped log file.
# Text mode gives us strings rather than bytes.
#
oFile = gzip.open('C:\\Tmp\\accesslog-bisiand.me.uk-9-28-2004.gz', 'rt', errors='replace')
def Sorted(oArray):
    "Return sorted copy of a sequence"
    oTmp = list(oArray)
    oTmp.sort()
    return oTmp
#
# Scan through the log file.
# Use regular expression to split the entries up.
#
# Pattern is thus:
#
# 56.98.204.40 - - [09/Sep/2004:03:50:01 -0400] "GET / HTTP/1.0" 200 643 "-"
#   "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3"
#
# Groups: 1 IP, 2 timestamp, 3 method, 4 request, 5 status, 6 size,
#         7 referrer, 8 user agent.
#
oRE = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*(\[.*\])\s+"(GET|POST|HEAD|SEARCH|PUT)\s+([^"]+)"'
    r'\s+([\d-]+)\s+([\d-]+)\s+"([^"]+)"\s+"([^"]+)"')
#
# Here build map of IP addresses to the log file entries.
#
oHits = {}
#
# Build map of unique referrers and how many folk they sent my way.
#
oReferrers = {}
#
# Go through file.
#
for strLine in oFile:
    print(strLine[:-1])
    oMatch = oRE.search(strLine)
    if oMatch:
        #
        # These things seem to be used by hackers trying to break in.
        #
        if oMatch.group(3) in ("PUT", "SEARCH"):
            print(strLine)
            continue
        #
        # Get the IP address.
        #
        strIP = oMatch.group(1)
        #
        # Ignore the entry if it is me.
        #
        if strIP in ('76.54.32.10', '12.3.45.67'):
            continue
        #
        # Get interesting fields from log file.
        #
        strAccess = oMatch.group(3) + oMatch.group(4)
        strReferrer = oMatch.group(7)
        strAgent = oMatch.group(8)
        #
        # Build up hit map.
        #
        if strIP in oHits:
            oHits[strIP].append((strAccess, strReferrer, strAgent))
        else:
            oHits[strIP] = [(strAccess, strReferrer, strAgent)]
        #
        # Build up referrer map.
        #
        if strReferrer in oReferrers:
            oReferrers[strReferrer] += 1
        else:
            oReferrers[strReferrer] = 1
    else:
        #
        # Did not match the regular expression. Just dump the line.
        #
        print("Miss:" + strLine)
#
# Determine which user agents originate from which IP.
# Only the agent of the first hit from each IP is considered.
#
strAgents = {}
for strIP in Sorted(oHits.keys()):
    oHit = oHits[strIP]
    if oHit[0][2] in strAgents:
        strAgents[oHit[0][2]].append(strIP)
    else:
        strAgents[oHit[0][2]] = [strIP]
#
# Display the unique User Agents and the IPs using them.
# This shows things like googlebot.
#
for strAgent in Sorted(strAgents.keys()):
    strIPs = strAgents[strAgent]
    print(strAgent)
    for strIP in strIPs:
        print(" %s %d" % (strIP.ljust(15), len(oHits[strIP])))
#
# How did they get here? Show the referrer name.
# Skip referrals from my own site.
#
for strReferrer in Sorted(oReferrers.keys()):
    if strReferrer.find('209.59.159.21') >= 0:
        continue
    if strReferrer.find('bisiand.me.uk') >= 0:
        continue
    if len(strReferrer) < 60:
        print(strReferrer.ljust(60) + str(oReferrers[strReferrer]))
    else:
        print(strReferrer + "\n" + (' ' * 60) + str(oReferrers[strReferrer]))
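To see how the regular expression carves up an entry, the Python 3 sketch below applies the same pattern to the sample line quoted in the script's comments and attaches a name to each captured group. The `parse_entry` function and the field names are labels invented here for readability. They are not part of the script.

```python
import re

LOG_PATTERN = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*(\[.*\])\s+"(GET|POST|HEAD|SEARCH|PUT)\s+([^"]+)"'
    r'\s+([\d-]+)\s+([\d-]+)\s+"([^"]+)"\s+"([^"]+)"')

# Assumed labels for the eight groups, in order of capture.
FIELDS = ("ip", "timestamp", "method", "request", "status", "size",
          "referrer", "agent")


def parse_entry(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


# The sample entry from the script's comments, joined back into one line.
SAMPLE = ('56.98.204.40 - - [09/Sep/2004:03:50:01 -0400] "GET / HTTP/1.0" '
          '200 643 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; '
          'rv:1.7) Gecko/20040803 Firefox/0.9.3"')

if __name__ == "__main__":
    for name, value in parse_entry(SAMPLE).items():
        print(name.ljust(10), value)
```

Lines that use a method outside the listed five, or that lack the quoted referrer and agent fields, return None, which corresponds to the "Miss:" branch of the script.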

