demongin.org - Pygmentize 2: a More Resilient Syntax Highlighter for Django

Pygmentize 2: a More Resilient Syntax Highlighter for Django

In which I revise my old pygmentize.py script, making it less likely to die badly.


Sunday, 2010-04-04 | Django, On the Internet, Programming

"The most difficult thing in the world is to know how to do a thing and to watch someone else doing it wrong, without commenting."

T. H. White

The snippet that I originally lifted from DjangoSnippets.org left a few things to be desired, as I eventually came to realize.

Back in July, I wrote this post about implementing some basic syntax highlighting in a quick-and-dirty style. Over time, it became clear that the original pygmentize.py file had a few serious problems:

  1. It would raise an exception and return nothing (except for an Apache 500 error) if a post's content couldn't be handled by BeautifulSoup
  2. It wasn't escaping HTML (or Apache confs or XML or anything in angle-bracket-type syntax), which led to a lot of broken-looking formatting
  3. The use of the markdown module, while necessary for syntax highlighting, would occasionally break up a pre tag, inserting p and other markup tags into the middle of a block

It's taken a while, but I've made some pretty substantial revisions to pygmentize.py. The improvements include the following:
  1. A complete "back-off" routine, whereby fatal exceptions in the pygmentize routine result in code blocks being replaced with escaped pre blocks: this is crude, duck-punching code, but it keeps 500 errors down (and that's always good)
  2. Logic that causes any text that makes it to the filter looking like this:
    <html><b>This is text</b></html>
    ...to come out of the filter like this:
    <pre class="html">&lt;html&gt;&lt;b&gt;This is text&lt;/b&gt;&lt;/html&gt;</pre>
    This lets HTML-like code retain its whitespace and lets you style it however you want with your own CSS.
  3. Poorly-formed HTML/XML/whatever, so long as it is submitted to the filter within a code tag with a class="html" attribute, will make it through the filter and to the page.
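
For the curious, the back-off idea described in the list above can be sketched roughly like this. To be clear, this is not the actual pygmentize.py: it is a minimal illustration that assumes code arrives in code tags with a class attribute naming the language, finds those tags with a regular expression rather than BeautifulSoup, and uses names of my own invention (escaped_pre, pygments_highlight, pygmentize_with_backoff). The Pygments import is deferred so that a missing or failing Pygments also lands in the fallback.

```python
import html
import re

# Assumption for this sketch: code is submitted as <code class="LANG">...</code>.
CODE_RE = re.compile(
    r'<code class="(?P<lang>[\w-]+)">(?P<body>.*?)</code>', re.DOTALL
)


def escaped_pre(lang, source):
    """Fallback: wrap the raw source in an escaped pre block."""
    return '<pre class="%s">%s</pre>' % (lang, html.escape(source, quote=False))


def pygments_highlight(source, lang):
    """Highlight with Pygments; raises if Pygments or the lexer is unavailable."""
    from pygments import highlight
    from pygments.formatters import HtmlFormatter
    from pygments.lexers import get_lexer_by_name

    return highlight(source, get_lexer_by_name(lang), HtmlFormatter())


def pygmentize_with_backoff(text, highlighter=pygments_highlight):
    """Replace each code block with highlighted HTML, or an escaped pre on failure."""

    def replace(match):
        lang, body = match.group("lang"), match.group("body")
        try:
            return highlighter(body, lang)
        except Exception:
            # Crude on purpose: any failure degrades to plain escaped text
            # instead of bubbling up as a 500 error.
            return escaped_pre(lang, body)

    return CODE_RE.sub(replace, text)


def always_fails(source, lang):
    """Stand-in highlighter that simulates a fatal error."""
    raise ValueError("simulated highlighter failure")


sample = '<code class="html"><html><b>This is text</b></html></code>'
print(pygmentize_with_backoff(sample, always_fails))
# prints: <pre class="html">&lt;html&gt;&lt;b&gt;This is text&lt;/b&gt;&lt;/html&gt;</pre>
```

Catching every exception is a blunt instrument, but it matches the goal here: a post with an ugly code block is better than a post that doesn't render at all.
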
I would like, at this point, to be able to say that the thing works so well and so reliably that I can display it in a codeblock on this page, but I cannot: the actual code contains too much of the code that the code looks for when it processes code, and thus would render poorly on the website here.

If you want to see it, however, you can check it out here.