Peter's Blog

Redefining the Impossible

Downloading Podcasts in Python


I like to listen to the Daily Source Code podcast while rowing. I've always liked talk radio and this is like talk radio with F words.

The podcast files are about 20 MB each and take a few minutes to download over my 750k broadband connection. I wanted my Ubuntu box to download them automatically so they would be ready for me (as was the original vision of podcasting).

I decided to knock up a Python script to do it for me, as I didn't want a GUI tool and the only other script I found was written in Ruby :sick:. The script downloads the RSS feed from the podcasting site and looks for mp3 files. Any that it finds, it downloads. It remembers which files it has downloaded, so you can listen to them and delete them and they won't be downloaded again. It doesn't play the files; that's done by Totem.

Other podcasts can be added easily enough.

   1  #
   2  # Download podcasts
   3  #
   4  import xml.parsers.expat
   5  import re
   6  import os
   7  import traceback
   8  import sys
   9  
  10  class FeedParser:
  11    # 3 handler functions
  12    def __init__( self):
  13      self.oElementStack = []
  14      self.bItem = False
  15      self.oItem = None
  16  
  17    def Parse( self, strFeed, strRETitle, strTargetDir):
  18      #
  19      # Parse feed, given url and regular expression describing podcast title.
  20      #
  21      self.oRETitle = re.compile( strRETitle)
  22      self.strTargetDir = strTargetDir
  23  
  24      #
  25      # Open database to remember what files have been dealt with
  26      #
  27      try:
  28        self.oDB = open( strTargetDir + '.pypodder.db').read().split( '\n')
  29      except:
  30        self.oDB = []
  31  
  32      p = xml.parsers.expat.ParserCreate()
  33  
  34      p.StartElementHandler = self.start_element
  35      p.EndElementHandler = self.end_element
  36      p.CharacterDataHandler = self.char_data
  37  
  38      #
  39      # Read feed using wget as it is robust.
  40      #
  41      strRSS = os.popen4( 'wget -q -O - "%s"' % strFeed)[1].read()
  42  #    print strRSS
  43      p.Parse( strRSS)
  44  
  45    def start_element(self, name, attrs):
  46      #
  47      # Put element on element stack along with empty data array
  48      #
  49      self.oElementStack.append( [name, []])
  50  
  51      #
  52      # If this is the start of an item then reset the item contents
  53      #
  54      if name == 'item':
  55        self.bItem = True
  56        self.oItem = {}
  57      elif name == 'enclosure':
  58        #
  59        # If element is an enclosure then get the url
  60        #
  61        strUrl = attrs.get( 'url')
  62        if strUrl:
  63          if self.bItem:
  64            self.oItem['enclosure']=strUrl
  65  
  66    def end_element(self, name):
  67      #
  68      # Pop complete element from the element stack
  69      #
  70      strElement, strData = self.oElementStack.pop()
  71  
  72      #
  73      # Check for sillies
  74      #
  75      if strElement != name:
  76        raise ValueError( "Element mismatch: %s != %s" % (name, strElement))
  77  
  78      if strElement != 'item':
  79        #
  80        # Get data associated with element and store in item
  81        #
  82        if self.bItem:
  83          strData = "".join( strData).strip()
  84  
  85          self.oItem[strElement] = strData
  86      else:
  87        #
  88        # Item is complete.
  89        # See if item title matches the re provided
  90        #
  91        if self.oRETitle.match( self.oItem.get( 'title', '').encode()):
  92          #
  93          # Try to get url of mp3 file
  94          #
  95          strUrl = self.oItem.get( 'enclosure')
  96          if not strUrl:
  97            #
  98            # No enclosure, try the 'link' field.
  99            #
 100            strUrl = self.oItem.get( 'link', '').encode()
 101  
 102          if strUrl and strUrl[-4:].lower() == '.mp3':
 103            #
 104            # See if item has a guid
 105            #
 106            strGuid = self.oItem.get( 'guid').encode()
 107            if not strGuid:
 108              #
 109              # If no guid then use the link url as a guid
 110              #
 111              strGuid = strUrl
 112  
 113            #
 114            # See if guid has already been processed in the database
 115            #
 116            if not strGuid in self.oDB:
 117              #
 118              # try to download the file
 119              # Use wget as a more robust way to download big mp3 files
 120              #
 121              os.chdir( self.strTargetDir)
 122              strResults = os.popen4( 'wget -q "%s"' % strUrl)[1].read()
 123  
 124              strFileName = self.strTargetDir + os.path.basename( strUrl)
 125              print 'Downloaded file %s' % strFileName
 126              print strResults
 127  
 128              #
 129              # Remember that the file has been processed, don't download it again.
 130              #
 131              self.oDB.append( strGuid)
 132              open( self.strTargetDir + '.pypodder.db', 'wt').write( "\n".join( self.oDB))
 133  
 134        self.oItem = None
 135        self.bItem = False
 136  
 137    def char_data(self, data):
 138      #
 139      # Append data to element.
 140      #
 141      self.oElementStack[-1][1].append( data)
 142  
 143  FeedParser().Parse( "http://radio.weblogs.com/0001014/categories/dailySourceCode/rss.xml",
 144                         "Daily Source Code for.*", "/home/peter/DailySourceCode/")

I've set up cron to do this for me at 5:11pm every day, just before I get home from work for a row before eating (I don't recommend half an hour of rowing on a full stomach).

crontab -e

11 17 * * * /usr/bin/python /home/peter/pypodder.py

Update: I have altered the script above. There are three main changes:

  • It now uses wget to do the downloading, as it is more robust than urllib2, which had a tendency to time out.
  • It is now using the proper Daily Source Code RSS feed, rather than Adam Curry's Weblog as the latter sometimes got the file names wrong.
  • The history of what has been downloaded is now a simple text file, making it easy to delete lines if necessary.
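As a rough illustration of the text-file history described in the last point, here is a minimal sketch in Python 3 (the script above is Python 2). The helper names `load_history` and `remember` are made up for this example, and the URLs and temporary directory are placeholders; only the file name `.pypodder.db` and the one-entry-per-line layout come from the script.

```python
import os
import tempfile

def load_history(path):
    """Return the list of GUIDs already processed (empty if there is no file yet)."""
    try:
        with open(path) as f:
            # One GUID per line; skip blank lines left by hand editing.
            return [line for line in f.read().split("\n") if line]
    except IOError:
        return []

def remember(path, history, guid):
    """Add guid to the history and rewrite the file, one GUID per line."""
    if guid not in history:
        history.append(guid)
    with open(path, "w") as f:
        f.write("\n".join(history))

# Usage with placeholder values (hypothetical URLs, throwaway directory).
db_path = os.path.join(tempfile.mkdtemp(), ".pypodder.db")
history = load_history(db_path)
remember(db_path, history, "http://example.com/show1.mp3")
remember(db_path, history, "http://example.com/show2.mp3")
print(load_history(db_path))
```

Because the file is plain text, deleting a line by hand is enough to make the script fetch that episode again on the next run.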

Filed under: mp3 python rss ubuntu

William Says:

over 3 years ago

This is a really useful script, thank you very much.

However there is a bug which prevented me from using it with another feed (The Dawn and Drew Show).

The problem is with the way the script handles enclosure tags. Once it finds one with a url in start_element, it adds it to oItem. It then immediately goes into end_element (as enclosure tags end themselves - they have no content between tags). Now it takes the name 'enclosure' and an empty string as data from oElementStack and proceeds to overwrite the entry in oItem for enclosure with its empty string.

This is not a problem for the Daily Source Code, as its link element does contain a link to the mp3 file - but for the Dawn and Drew Show that link goes to that episode's web page.

This could be fixed by writing the enclosure url to oElementStack in start_element and letting end_element move it across to oItem. But I don't know whether the RSS spec allows text to be enclosed within 'enclosure' tags; if it does, that text would be appended to the url and stuff it up.

So I fixed it by changing line 82 (near the beginning of end_element) from: if self.bItem:

to: if self.bItem and (strElement != 'enclosure'):

which just prevents the program from overwriting the enclosure tag.
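The overwrite can be reproduced in isolation. The sketch below is written in Python 3 rather than the Python 2 of the original script, and is only an approximation of its handler logic: the function `parse_enclosures`, its `skip_enclosure_text` flag, and the sample feed are all invented for the demonstration.

```python
import xml.parsers.expat

def parse_enclosures(rss_text, skip_enclosure_text):
    """Collect one dict per <item>, imitating the script's three handlers."""
    items = []
    stack = []               # one [name, [text chunks]] entry per open element
    state = {"item": None}   # dict for the item being read, or None

    def start(name, attrs):
        stack.append([name, []])
        if name == "item":
            state["item"] = {}
        elif name == "enclosure" and state["item"] is not None:
            url = attrs.get("url")
            if url:
                state["item"]["enclosure"] = url

    def end(name):
        elem, chunks = stack.pop()
        if elem == "item":
            items.append(state["item"])
            state["item"] = None
        elif state["item"] is not None:
            if skip_enclosure_text and elem == "enclosure":
                return  # the fix: keep the url stored by start()
            state["item"][elem] = "".join(chunks).strip()

    def chars(data):
        stack[-1][1].append(data)

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.CharacterDataHandler = chars
    p.Parse(rss_text, True)
    return items

# Made-up feed: the link points at a web page, the enclosure at the mp3.
SAMPLE = """<rss><channel><item>
<title>Episode 1</title>
<link>http://example.com/episode1.html</link>
<enclosure url="http://example.com/episode1.mp3" length="1" type="audio/mpeg"/>
</item></channel></rss>"""

before = parse_enclosures(SAMPLE, False)[0]["enclosure"]  # empty string
after = parse_enclosures(SAMPLE, True)[0]["enclosure"]    # the mp3 url
print(repr(before), repr(after))
```

Without the check, the enclosure entry ends up as an empty string and the script falls back to the link field, which in this sample is not an mp3.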

Hope that helps, and thanks again for posting the code,

William

William Says:

over 3 years ago

OK, another feed - another bug. This time I tried the LUGRadio feed. Unlike the other two, it doesn't have guid tags. So on line 106: strGuid = self.oItem.get( 'guid').encode() there is no guid member of oItem, so Python objects that a 'null object' (None) does not have an encode() method.

My workaround is to replace lines 106 and 107 with:

    if self.oItem.get( 'guid'):
      strGuid = self.oItem.get( 'guid').encode()
    else:

which seems to work.
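The same fallback can be written as a small standalone function. This is a Python 3 sketch, not code from the script; the name `pick_guid` and the sample values are hypothetical.

```python
def pick_guid(item, url):
    """Return the item's guid, or fall back to the download url if there is none."""
    guid = item.get("guid")   # None when the feed has no guid tag
    return guid if guid else url

with_guid = pick_guid({"guid": "abc-123"}, "http://example.com/a.mp3")
without_guid = pick_guid({}, "http://example.com/a.mp3")
print(with_guid, without_guid)
```

Checking the value before calling a method on it avoids the error on feeds without guid tags, and an empty guid is treated the same way as a missing one.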

Another problem I noticed isn't a bug as such, but if you used the script as it is you'll have missed the DSC on Monday June 6th, because your regular expression "Daily Source Code for.*" didn't match the title on that day: Daily Source Code June 6th 2005 #189. It's Adam's fault really for missing out the ' for', but in feeds like this there isn't really a need to match the titles (the script matches .mp3 files within enclosures - if it's found one, regardless of the title, chances are you want it).

So I've replaced the regular expressions with .* to catch everything, but if Adam forgets the title one day (or I subscribe to another feed that doesn't bother with them) the program will crash on line 91, where the regex is matched against the title: if self.oRETitle.match( self.oItem.get( 'title', '').encode()):
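The mismatch is easy to check with the two patterns side by side. In this Python 3 sketch the strict pattern is the one passed to the script, the second title is the one quoted above, and everything else (the looser pattern and the first title) is invented for comparison.

```python
import re

# Pattern taken from the last line of the script.
strict = re.compile("Daily Source Code for.*")
# Hypothetical looser pattern that makes the " for" optional.
loose = re.compile("Daily Source Code( for)?.*")

titles = [
    "Daily Source Code for June 3rd 2005",   # made-up title in the usual style
    "Daily Source Code June 6th 2005 #189",  # title quoted in the comment
]

for title in titles:
    # Show whether each pattern accepts the title.
    print(title, bool(strict.match(title)), bool(loose.match(title)))
```

The strict pattern rejects the June 6th title because re.match needs the literal " for" after "Daily Source Code", while the looser one accepts both.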

encode() is applied to something that would be null if there were no title.

I think I'll go through the script and remove all of the title checking, as I really don't think it's necessary. The LUGRadio feed, for example, has the names of the shows as its titles - with no common starter text - so I have to use the catch-all ".*" anyway.

Hope this will help you improve the script,

William

Peter Says:

over 3 years ago

William,

Thanks for your feedback on my script. I only ever use it on the DSC, I've never tried any other podcasts as the DSC fills all the time I have available for listening.

If you want to post your updated version here or post a link to it then feel free.

Peter

Peter Says:

over 3 years ago

I have rewritten this now, the new version is here. Hopefully this addresses William's comments.

Peter
