Quick ’n Dirty RSS with XSLT
These days, a lot of sites syndicate their content. Thanks to RSS, new items show up in your aggregator shortly after they are published.
The Daily Python-URL from Secret Labs is a great resource for staying up-to-date on what is happening in the Python universe. But sadly, it is not available as an RSS feed.
Let’s do something about that.
Cleaning up
First download the page to disk to have something to work with:
curl http://www.pythonware.com/daily/ > daily.html
The Daily Python-URL page does not validate, so the downloaded file cannot be used as the XML source document. Use HTML Tidy to convert it to well-formed and valid XHTML:
tidy -asxml -wrap 160 < daily.html > daily.xhtml
Let’s try to parse the output with an XML parser just to be sure:
xmllint daily.xhtml
If this results in parsing errors, you will need to get a more recent version of Tidy. The version on our Debian server (release date: 1st March 2002) did not output well-formed XML. The latest release works fine.
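The same well-formedness check can be sketched in Python, which is handy if xmllint is not installed. This is only an illustration: the helper name is_well_formed and the two sample strings are made up for the example, and the standard library parser does not load the XHTML DTD, so named entities such as &nbsp; in real Tidy output may be rejected even though xmllint accepts them.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical samples: the second has an unclosed <br>, as tag-soup HTML often does.
good = "<html><body><p>Hello<br/></p></body></html>"
bad = "<html><body><p>Hello<br></p></body></html>"

print(is_well_formed(good))    # (True, None)
print(is_well_formed(bad)[0])  # False
```

For the real file, read daily.xhtml and pass its contents to the helper.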
The XSLT stylesheet
A regular XSLT stylesheet contains templates that are applied to (parts of) the source document. Inside these templates, data from the source document is inserted in the output document using xsl:value-of elements. It is also possible to apply other templates using xsl:apply-templates.
Elements from the source document are selected using XPath expressions, which work like regular filesystem paths with some extensions. For example, a paragraph element with the class attribute set to 'newsitem' can be selected by putting [@class='newsitem'] after the element name in the path.
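To see the attribute predicate at work outside XSLT, here is a small sketch using Python's xml.etree.ElementTree, which supports a subset of XPath. The sample markup is a made-up miniature of the tidied page, and the xhtml prefix is mapped to the same XHTML namespace that the stylesheet further down declares.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the tidied page; the real markup is nested more deeply.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p class="intro">Welcome</p>
    <p class="newsitem">First item</p>
    <p class="newsitem">Second item</p>
  </body>
</html>"""

# Prefix-to-namespace mapping; the prefix name itself is arbitrary.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(SAMPLE)
# ".//" searches all descendants; the predicate keeps only class="newsitem".
items = root.findall(".//xhtml:p[@class='newsitem']", NS)
texts = [p.text for p in items]
print(texts)  # ['First item', 'Second item']
```

Without the namespace mapping the search finds nothing, because Tidy puts every element of its XHTML output in the XHTML namespace.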
Daily Python-URL uses nested tables for layout, so the actual items are rather deep in the document tree. It is usually a good idea to use a helper stylesheet (xpathview.xslt in the command below) to create an XPath view of the tree:
xsltproc xpathview.xslt daily.xhtml > daily_xpath.xhtml
The resulting XPath view makes it a lot easier to find the path for a specific element.
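If no such helper stylesheet is at hand, a rough stand-in can be sketched in Python. This is not the xpathview.xslt stylesheet, just an illustration of the idea: the function name element_paths and the sample markup are invented for the example, and the printed paths omit namespace prefixes, so they are meant for reading rather than for pasting into a stylesheet.

```python
import xml.etree.ElementTree as ET


def element_paths(element, parent_path=""):
    """Yield a readable path for element and for each of its descendants."""
    # Drop the "{namespace}" part ElementTree puts in front of tag names.
    name = element.tag.split("}")[-1]
    css_class = element.get("class")
    step = f"{name}[@class='{css_class}']" if css_class else name
    path = f"{parent_path}/{step}"
    yield path
    for child in element:
        yield from element_paths(child, path)


# Hypothetical table-based layout, much shallower than the real page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body><table><tr><td><p class="newsitem">An item</p></td></tr></table></body>
</html>"""

paths = list(element_paths(ET.fromstring(SAMPLE)))
for path in paths:
    print(path)
```

The last line printed is /html/body/table/tr/td/p[@class='newsitem'], which shows at a glance how deep the news item sits.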
The following stylesheet defines two templates. The first matches the root of the document and produces the outer RSS elements. The second matches each news item paragraph and produces an RSS item element for it.
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    version="1.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <rss version="2.0">
      <channel>
        <title>
          <xsl:value-of select="xhtml:html/xhtml:head/xhtml:title"/>
        </title>
        <link>http://www.pythonware.com/daily/</link>
        <description>
          <xsl:value-of select="//xhtml:div[@class='pad']/xhtml:p"/>
        </description>
        <xsl:apply-templates select="//xhtml:p[@class='newsitem']"/>
      </channel>
    </rss>
  </xsl:template>
  <xsl:template match="//xhtml:p[@class='newsitem']">
    <item>
      <title>
        <xsl:value-of select="xhtml:a[@href]"/>
      </title>
      <link>
        <xsl:value-of select="xhtml:a[@href]/@href"/>
      </link>
      <description>
        <xsl:value-of select="xhtml:i"/>
      </description>
    </item>
  </xsl:template>
</xsl:stylesheet>
match="/": The first template matches the root of the source document tree.
select="xhtml:html/xhtml:head/xhtml:title": Gets the text of the title element inside the HTML head.
select="//xhtml:div[@class='pad']/xhtml:p": The paragraph with the page description can be found inside the div element with the class attribute set to 'pad'.
xsl:apply-templates select="//xhtml:p[@class='newsitem']": All news items are inside paragraphs with the class attribute set to 'newsitem'. This XSLT element applies the second template to each of them.
match="//xhtml:p[@class='newsitem']": The second template is applied to each of the matching news items.
select="xhtml:a[@href]": Selects the text inside an anchor element only if the anchor element has an href attribute. This is needed because there is also an empty anchor element inside each news item paragraph (they look like <a name="105768601144453612">).
select="xhtml:a[@href]/@href": The last part of this path selects the href attribute value instead of the text inside the element.
select="xhtml:i": Selects the text inside the i element.
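The paths starting with / are absolute: they are evaluated from the root of the document. The paths in the second template are relative: they are evaluated against the node the template is currently processing, which here is each newsitem paragraph.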
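The difference between searching the whole document and selecting relative to one paragraph can also be sketched in Python. This mirrors the logic of the second template but is not a replacement for the stylesheet: the function name extract_items, the sample link and the sample summary text are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Hypothetical news item: an empty named anchor, a real link and a summary.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p class="newsitem"><a name="105768601144453612"></a><a href="http://example.org/post">Example post</a> <i>A short summary.</i></p>
  </body>
</html>"""


def extract_items(root):
    """Return one dict per news item paragraph found anywhere under root."""
    items = []
    # Searched through the whole document, like the select in the first template.
    for para in root.findall(".//xhtml:p[@class='newsitem']", NS):
        # Relative to the paragraph, like the paths in the second template.
        link = para.find("xhtml:a[@href]", NS)
        summary = para.find("xhtml:i", NS)
        items.append({
            "title": "".join(link.itertext()) if link is not None else "",
            "link": link.get("href") if link is not None else "",
            "description": "".join(summary.itertext()) if summary is not None else "",
        })
    return items


items = extract_items(ET.fromstring(SAMPLE))
print(items)
```

The [@href] predicate skips the empty named anchor, so the title and link come from the second a element.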
Use an XSLT processor like xsltproc to apply the stylesheet to the source document:
xsltproc dailyrss.xslt daily.xhtml > daily.rss
And yes, it validates.
Pipe it around
All the above steps can be combined into a single command (the <( ... ) process substitution needs a shell that supports it, such as bash):
xsltproc dailyrss.xslt <(curl http://www.pythonware.com/daily/ | tidy -asxml -wrap 160) > daily.rss
You should not just put this command in an hourly cron job. It downloads the complete page on every run, whether or not anything has changed, and that is not proper behavior for an aggregator or any other HTTP client.
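A more polite client remembers the validators from its previous fetch and sends them along, so the server can answer 304 Not Modified instead of resending the page. The sketch below uses Python's urllib to show the idea. The function names are made up for the example, and it assumes the server sends an ETag or Last-Modified header, which has not been checked for this site.

```python
import urllib.error
import urllib.request

FEED_SOURCE = "http://www.pythonware.com/daily/"


def build_request(url, etag=None, last_modified=None):
    """Build a GET request carrying validators saved from the previous fetch."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = build_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Nothing new: keep the validators we already have.
            return None, etag, last_modified
        raise


# Building the request does not touch the network; the ETag value is invented.
request = build_request(FEED_SOURCE, etag='"abc123"')
print(request.header_items())
```

A scheduled job would store the returned validators between runs and regenerate the feed only when the body is not None.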
If you like, you are free to subscribe to this RSS feed. It is currently only updated once a day and will be removed when the Daily Python-URL gets a feed of its own.