2010-02-03 A Simple RSS Generator For My Sphinx Based Blog

This entry jumps into the middle of a conversation I haven’t had yet, about my software philosophy, why I use Python, and why I chose the tools I did to build this web site. This article is about the blog portion of my site, and a simple (and rather stupid) RSS feed generator I wrote for it.

The blog itself is part of my Sphinx-generated web site. Sphinx is a great package, which I use because it is (a) written in Python and (b) designed around one of the easiest-to-type markup systems available, reStructuredText. reST makes it easy to write the blog entries using my favorite editor, vi.

Turning a series of reST files into a blog consists of two steps. The first one Sphinx handles out of the box: the toctree directive has a :glob: option that will build the index from a sorted list of filenames. So by arranging my blog posts in an appropriate directory structure and making sure the filenames will lexically sort into chronological order, I get the basic blog structure and its index pages with almost no extra work. (The “almost” is because I do have to create new stub index.rst files in each new year and month subdirectory).
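The sorting property this relies on can be checked with a minimal sketch. The filenames below are hypothetical, and the zero-padded date prefix is an assumed naming convention, not necessarily the one used on this site:

```python
# Hypothetical post filenames; the date prefix is an assumed convention.
names = [
    "2010-02-10-second-post.rst",
    "2010-02-03-first-post.rst",
    "2010-02-9-unpadded.rst",
]

# Plain string sorting compares character by character, so zero-padded
# dates come out in chronological order.
padded = sorted(names[:2])
print(padded)

# An unpadded day ("9") sorts after "10", which breaks the ordering.
mixed = sorted(names)
print(mixed)
```

The same reasoning applies to the year and month directory names, which is why a month directory would need to be named 02 rather than 2.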

However, in order to be an operational blog in the blogosphere, the blog needs an RSS feed, and that Sphinx does not do out of the box. So I wrote a little script:

#!/usr/bin/python

import sys
import os
import time
import cgi
import email.utils

MAXPOSTS = 20
BLOGURL = "http://www.bitdance.com/blog"

def fmtdate(secs):
    return email.utils.formatdate(secs, usegmt=True)

rssheader = """\
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
  <title>Dancing With the Bits</title>
  <link>{0}</link>
  <description>The Occasional Musings of R. David Murray</description>
  <language>en-us</language>
""".format(BLOGURL)

srcdir, destpath = sys.argv[1:]
outfile = open(destpath, 'w')
outfile.write(rssheader)
outfile.write('  <pubDate>{0}</pubDate>\n'.format(fmtdate(time.time())))

posts = 0
for year in sorted(os.listdir(srcdir), reverse=True):
    yeardir = os.path.join(srcdir, year)
    if not os.path.isdir(yeardir): continue
    for month in sorted(os.listdir(yeardir), reverse=True):
        monthdir = os.path.join(yeardir, month)
        if not os.path.isdir(monthdir): continue
        for post in sorted(os.listdir(monthdir), reverse=True):
            linkname, ext = os.path.splitext(post)
            if ext!='.rst' or linkname=='index': continue
            postfn = os.path.join(monthdir, post)
            posts += 1
            outfile.write('  <item>\n')
            url = '{0}/{1}/{2}/{3}'.format(BLOGURL, year, month, linkname)
            outfile.write('    <link>{0}</link>\n'.format(url))
            outfile.write('    <guid>{0}</guid>\n'.format(url))
            with open(postfn) as postfile:
                lines = iter(postfile)
                line = lastline = ''
                while not line.startswith('========='):
                    lastline = line
                    line = next(lines)
                outfile.write('    <title>{0}</title>\n'.format(
                                 cgi.escape(lastline.strip())))
                outfile.write('    <description>\n')
                line = next(lines)
                while not line.strip():
                    line = next(lines)
                while line.strip():
                    outfile.write('      '+cgi.escape(line))
                    # Default of '' ends the loop cleanly when the post
                    # stops without a trailing blank line.
                    line = next(lines, '')
            outfile.write('\n      (more)\n    </description>\n')
            modtime = os.stat(postfn).st_mtime
            outfile.write('    <pubDate>{0}</pubDate>\n'.format(
                            fmtdate(modtime)))
            outfile.write('  </item>\n')
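            if posts >= MAXPOSTS: break
        # A bare break only leaves the innermost loop, so re-check
        # the limit on the way out of the month and year loops.
        if posts >= MAXPOSTS: break
    if posts >= MAXPOSTS: break
outfile.write('</channel>\n')
outfile.write('</rss>\n')
outfile.close()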

As you can see, the script simply emits the appropriate XML, walking the date-structured tree of directories in reverse chronological order and making <item> entries for each post. The only tricky bit is parsing the blog entry to get the title and description. I’ve adopted the convention of using ======= lines for the blog entry titles, so I pick up the line above that as the title. Then I skip any blank lines between the title marker and the first paragraph, and then read non-blank lines until the next paragraph break to use as the description for the entry.
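The title and description parsing described above can be isolated into a small function. This is only a sketch under the same convention (a title line followed by a ========= underline); the function name and sample text are made up for illustration, and it uses html.escape(), the Python 3 stand-in for cgi.escape():

```python
import html
import io


def title_and_summary(lines):
    """Return (title, summary) from an iterable of reST source lines.

    Assumes the title is the line directly above the first line that
    starts with '=========' and the summary is the first paragraph
    after it. Both are returned HTML-escaped.
    """
    lines = iter(lines)
    line = lastline = ""
    # Scan down to the underline; the line before it is the title.
    while not line.startswith("========="):
        lastline = line
        line = next(lines)  # raises StopIteration if there is no title
    title = html.escape(lastline.strip())

    # Skip blank lines after the underline, then collect one paragraph.
    line = next(lines, "")
    while line and not line.strip():
        line = next(lines, "")
    summary = []
    while line.strip():
        summary.append(html.escape(line.strip()))
        line = next(lines, "")
    return title, " ".join(summary)


sample = io.StringIO(
    "Tips & Tricks\n"
    "=============\n"
    "\n"
    "First paragraph,\n"
    "with <markup>.\n"
    "\n"
    "Second paragraph.\n"
)
result = title_and_summary(sample)
print(result)
```

Unlike the script, this version joins the paragraph's lines with spaces instead of preserving the line breaks. A post whose title has both an overline and an underline would yield an empty title here, just as it would in the script.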

One of the things I find gratifying about this script is that the batteries that make it simple are included in the stdlib: cgi.escape() to do the HTML escaping of the title and description entries, and email.utils.formatdate() to correctly format the RFC 822 dates.
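For readers on a current Python, here is a short sketch of the same two helpers. Note that cgi.escape() was removed in Python 3.8, so html.escape() is used in its place; the input strings are arbitrary examples:

```python
import email.utils
import html

# Epoch second 0 formatted as an RFC 822 style date in GMT.
stamp = email.utils.formatdate(0, usegmt=True)
print(stamp)  # Thu, 01 Jan 1970 00:00:00 GMT

# html.escape() is the Python 3 replacement for cgi.escape().
# quote=False matches the old default of leaving quote characters alone.
escaped = html.escape("Bits & <bytes>", quote=False)
print(escaped)  # Bits &amp; &lt;bytes&gt;
```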

To use it, I added the following lines to my Sphinx project Makefile:

rss:
    rssgen source/blog ${PRODDIR}/blog/rss.xml

and added rss as a dependency of the publish target. So now when I’m satisfied that my web site edits look good and I type ‘make publish’ to publish the site, it also publishes the new blog entries into the RSS feed. (I’ll explain my Sphinx setup for publishing my web site in a later blog post.)

This setup works well for me, at least for now. A more sophisticated approach would be to hook into Sphinx itself, through Sphinx’s extension mechanism, to generate the RSS. The advantages would be not needing to parse the blog entry file myself (and therefore getting a more accurate and bulletproof parse), being able to automatically add items marked ‘updated’ for any blog posts that I edit after initially publishing them, and being able to pull the blog URL out of the Sphinx configuration file. But that’s a project for some other day.