I like to listen to the Daily Source Code podcast while rowing. I've always liked talk radio and this is like talk radio with F words.
The podcast files are about 20 MB each and take a few minutes to download on my 750k broadband connection. I wanted my Ubuntu box to download them automatically so they would be ready for me (as was the original vision of podcasting).
I decided to knock up a Python script to do it for me, as I didn't want a GUI tool and the only other script I found was written in Ruby :sick:. The script downloads the RSS feed from the podcasting site and looks for mp3 files. Any that it finds, it downloads. It remembers which files it has downloaded, so you can listen to them and delete them and they won't be downloaded again. It doesn't play the files; that's done by Totem.
Other podcasts can be added easily enough.
  1 #
  2 # Download podcasts
  3 #
  4 import xml.parsers.expat
  5 import re
  6 import os
  7 import traceback
  8 import sys
  9
 10 class FeedParser:
 11     # 3 handler functions
 12     def __init__( self):
 13         self.oElementStack = []
 14         self.bItem = False
 15         self.oItem = None
 16
 17     def Parse( self, strFeed, strRETitle, strTargetDir):
 18         #
 19         # Parse feed, given url and regular expression describing podcast title.
 20         #
 21         self.oRETitle = re.compile( strRETitle)
 22         self.strTargetDir = strTargetDir
 23
 24         #
 25         # Open database to remember what files have been dealt with
 26         #
 27         try:
 28             self.oDB = open( strTargetDir + '.pypodder.db').read().split( '\n')
 29         except IOError:
 30             self.oDB = []
 31
 32         p = xml.parsers.expat.ParserCreate()
 33
 34         p.StartElementHandler = self.start_element
 35         p.EndElementHandler = self.end_element
 36         p.CharacterDataHandler = self.char_data
 37
 38         #
 39         # Read feed using wget as it is robust.
 40         #
 41         strRSS = os.popen4( 'wget -q -O - "%s"' % strFeed)[1].read()
 42         # print strRSS
 43         p.Parse( strRSS)
 44
 45     def start_element(self, name, attrs):
 46         #
 47         # Put element on element stack along with empty data array
 48         #
 49         self.oElementStack.append( [name, []])
 50
 51         #
 52         # If this is the start of an item then reset the item contents
 53         #
 54         if name == 'item':
 55             self.bItem = True
 56             self.oItem = {}
 57         elif name == 'enclosure':
 58             #
 59             # If element is an enclosure then get the url
 60             #
 61             strUrl = attrs.get( 'url')
 62             if strUrl:
 63                 if self.bItem:
 64                     self.oItem['enclosure']=strUrl
 65
 66     def end_element(self, name):
 67         #
 68         # Pop complete element from the element stack
 69         #
 70         strElement, strData = self.oElementStack.pop()
 71
 72         #
 73         # Check for sillies
 74         #
 75         if strElement != name:
 76             raise ValueError( "Element mismatch: %s != %s" % (name, strElement))
 77
 78         if strElement != 'item':
 79             #
 80             # Get data associated with element and store in item
 81             #
 82             if self.bItem:
 83                 strData = "".join( strData).strip()
 84
 85                 self.oItem[strElement] = strData
 86         else:
 87             #
 88             # Item is complete.
 89             # See if item title matches the re provided
 90             #
 91             if self.oRETitle.match( self.oItem.get( 'title', '').encode()):
 92                 #
 93                 # Try to get url of mp3 file
 94                 #
 95                 strUrl = self.oItem.get( 'enclosure')
 96                 if not strUrl:
 97                     #
 98                     # No enclosure, try the 'link' field.
 99                     #
100                     strUrl = self.oItem.get( 'link', '').encode()
101
102                 if strUrl and strUrl[-4:].lower() == '.mp3':
103                     #
104                     # See if item has a guid
105                     #
106                     strGuid = self.oItem.get( 'guid', '').encode()
107                     if not strGuid:
108                         #
109                         # If no guid then use the link url as a guid
110                         #
111                         strGuid = strUrl
112
113                     #
114                     # See if guid has already been processed in the database
115                     #
116                     if not strGuid in self.oDB:
117                         #
118                         # try to download the file
119                         # Use wget as a more robust way to download big mp3 files
120                         #
121                         os.chdir( self.strTargetDir)
122                         strResults = os.popen4( 'wget -q "%s"' % strUrl)[1].read()
123
124                         strFileName = self.strTargetDir + os.path.basename( strUrl)
125                         print 'Downloaded file %s' % strFileName
126                         print strResults
127
128                         #
129                         # Remember that the file has been processed, don't download it again.
130                         #
131                         self.oDB.append( strGuid)
132                         open( self.strTargetDir + '.pypodder.db', 'wt').write( "\n".join( self.oDB))
133
134             self.oItem = None
135             self.bItem = False
136
137     def char_data(self, data):
138         #
139         # Append data to element.
140         #
141         self.oElementStack[-1][1].append( data)
142
143 FeedParser().Parse( "http://radio.weblogs.com/0001014/categories/dailySourceCode/rss.xml",
144     "Daily Source Code for.*", "/home/peter/DailySourceCode/")
I've set up cron to do this for me at 5:11pm every day, just before I get home from work for a row before eating (I don't recommend a half hour rowing with a full stomach).
crontab -e
11 17 * * * /usr/bin/python /home/peter/pypodder.py
Update: I have altered the script above. There are three main changes:
- It now uses wget to do the downloading, as it is more robust than urllib2, which had a tendency to time out.
- It is now using the proper Daily Source Code RSS feed, rather than Adam Curry's Weblog as the latter sometimes got the file names wrong.
- The history of what has been downloaded is now a simple text file, making it easy to delete lines if necessary.
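The text-file history from the last point can be sketched roughly as follows. This is a small Python 3 illustration, not the code from the script above (which is Python 2); the helper names `load_history` and `remember` are made up for the example, and it assumes one guid per line.

```python
def load_history(path):
    """Return the guids already processed, one per line of the file."""
    try:
        with open(path) as f:
            # Skip empty lines so a trailing newline does not become a guid.
            return [line for line in f.read().split("\n") if line]
    except IOError:
        # No history file yet: nothing has been downloaded.
        return []


def remember(path, history, guid):
    """Add guid to the in-memory history and rewrite the whole file."""
    if guid not in history:
        history.append(guid)
    with open(path, "w") as f:
        f.write("\n".join(history))
```

Because the file is just one guid per line, forcing a re-download is a matter of deleting that line in a text editor.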


This is a really useful script, thank you very much.
However there is a bug which prevented me from using it with another feed (The Dawn and Drew Show).
The problem is with the way the script handles enclosure tags. Once it finds one with a url in start_element, it adds it to oItem. It then immediately goes into end_element (as enclosure tags end themselves; they have no content between tags). Now it takes the name 'enclosure' and an empty string as data from oElementStack and proceeds to overwrite the entry in oItem for enclosure with its empty string.
This is not a problem for the Daily Source Code, as its link element does contain a link to the mp3 file, but for the Dawn and Drew Show that link goes to that episode's web page.
This could be fixed by writing the enclosure url to oElementStack in start_element and letting end_element move it across to oItem. However, I don't know whether the RSS spec allows text to be enclosed within 'enclosure' tags; if it does, that text would be appended to the url and stuff it up.
So I fixed it by changing line 82 (near the beginning of end_element) from:
    if self.bItem:
to:
    if self.bItem and (strElement != 'enclosure'):
which just prevents the program from overwriting the enclosure entry in oItem.
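To see the overwrite and the guard side by side, here is a cut-down sketch in Python 3. It is not the original script: the class `ItemCollector`, the function `collect`, the `guard_enclosure` flag and the example.com feed are all invented for the illustration, and only the item bookkeeping is kept.

```python
import xml.parsers.expat


class ItemCollector:
    """Cut-down handlers that keep only the item bookkeeping."""

    def __init__(self, guard_enclosure):
        self.guard_enclosure = guard_enclosure  # True applies the one-line fix
        self.stack = []
        self.in_item = False
        self.item = None
        self.items = []

    def start_element(self, name, attrs):
        self.stack.append([name, []])
        if name == "item":
            self.in_item = True
            self.item = {}
        elif name == "enclosure" and self.in_item and attrs.get("url"):
            self.item["enclosure"] = attrs["url"]

    def end_element(self, name):
        element, data = self.stack.pop()
        if element == "item":
            self.items.append(self.item)
            self.in_item = False
            self.item = None
        elif self.in_item and not (self.guard_enclosure and element == "enclosure"):
            # Without the guard, the empty text of <enclosure/> replaces the url.
            self.item[element] = "".join(data).strip()

    def char_data(self, data):
        self.stack[-1][1].append(data)


def collect(rss, guard_enclosure):
    """Parse an RSS string and return a list of item dictionaries."""
    collector = ItemCollector(guard_enclosure)
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = collector.start_element
    parser.EndElementHandler = collector.end_element
    parser.CharacterDataHandler = collector.char_data
    parser.Parse(rss, True)
    return collector.items


# Made-up feed: the link points at a web page, only the enclosure has the mp3.
RSS = """<rss><channel><item>
<title>Episode 1</title>
<link>http://example.com/episode1.html</link>
<enclosure url="http://example.com/episode1.mp3" length="1" type="audio/mpeg"/>
</item></channel></rss>"""

print(repr(collect(RSS, guard_enclosure=False)[0]["enclosure"]))  # ''
print(repr(collect(RSS, guard_enclosure=True)[0]["enclosure"]))   # 'http://example.com/episode1.mp3'
```

With the guard off, the enclosure entry ends up as an empty string, so the script falls back to the link field, which is the behaviour described above.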
Hope that helps, and thanks again for posting the code,
William