demongin.org - Windows Encoding and Python: Django Survival Basics

Dealing with encoding problems that originate in Windows: linebreaks, ascii/unicode and so on.


Friday, 2011-04-01 | Careerism, Django, Programming

Occasionally, one is compelled to stick his head out of his Unix bombshelter and interact with the rest of the Internet.

Unfortunately, since Microsoft has reduced most of the Internet to a radioactive hell-scape of random standards-compliance, broken protocols and obfuscated spec, surviving the occasional voyage out into the Waste can be a bit of a trick: learning to work around Windows encodings can be a painful process (that occasionally Breaks the Website) and the more tools you have at your disposal, the less likely you are to find yourself scrolling through Stack Overflow.

Out of the box, Django has some great user-friendliness and convenience tools for working with Windows. Occasionally, however, you're working on a level so low that you haven't got ready access to them (e.g. when writing templatetags). What follows are some simple tricks for saving yourself valuable time (and sanity) when trying to tidy up after Redmond.

Converting Line Breaks

If you find yourself in a life situation where you are frequently working with files that were created on or arrived via Windows, you have probably spent a good amount of time in linebreak conversion/escape hell. Here are two tricks for saving sanity:
  1. When reading files into strings, force them to have unix-style linebreaks:
    # "rU" turns on universal newlines; .read() gives you the string, not the file object
    raw = open("/path/to/windows_file.txt", "rU").read()
    
  2. When iterating over files, use the with syntax if you can:
    with open(f, "rU") as raw_file:
        raw = u"".join(line.strip() for line in raw_file)
    
    This will get you the cleanest possible string, i.e. one that starts Unix-y, has the whitespace at both ends of every line stripped and is built by a generator expression instead of a throwaway list. Note that the lines are joined with nothing between them; use line.rstrip() and u"\n".join(...) if you need to keep the line structure.
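
If the text is already sitting in a string (or you are on Python 3, where the "U" flag was removed in 3.11 and text mode translates linebreaks on its own), the same cleanup can be sketched like this. The names normalize_newlines and read_unix_text and the utf-8 default are made up for illustration, not anything Django or the standard library provides:

```python
import io


def normalize_newlines(text):
    """Convert CRLF (Windows) and bare CR line endings to LF."""
    # Handle CRLF first so it does not turn into two linebreaks
    return text.replace("\r\n", "\n").replace("\r", "\n")


def read_unix_text(path, encoding="utf-8"):
    """Read a text file and return its contents with LF-only linebreaks."""
    # newline=None asks io.open for universal-newline translation on read;
    # the utf-8 default is an assumption, pass the file's real encoding
    with io.open(path, "r", encoding=encoding, newline=None) as fh:
        return fh.read()


print(repr(normalize_newlines("one\r\ntwo\rthree\n")))
```

io.open is also available on Python 2.6 and later, where it hands back unicode rather than a byte string.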

Encoding Problems

If you work with Python for even a short while, you're highly likely to learn more about unicode (particularly how it varies from and clashes with ascii) than any sane person could ever reasonably want to learn.

For those times that you want to force a string through, preferring mangled text to an Exception, here are some tips for getting aggressive with encoding problems:

  1. This is the wild man:
    s = s.encode('ascii','replace')
    
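    Use that on a unicode string when you want an ascii string no matter what. This will convert every non-ascii character to a question mark: YHBW. (On a Python 2 byte string, decode it first with the encoding it actually arrived in; otherwise the implicit ascii decode raises before encode gets a chance to replace anything.)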
  2. This is a more mellow/customizable version of the above. It replaces the "smart" punctuation characters from Windows (curly quotes, en and em dashes) that a.) pop up frequently and b.) tend to bring things to a violent halt:
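    # Map Windows "smart" punctuation to plain ascii equivalents
    punctuation = {
        u'\u2018': "'",  # left single quote
        u'\u2019': "'",  # right single quote / apostrophe
        u'\u2013': "-",  # en dash
        u'\u2014': "-",  # em dash
        u'\u201c': '"',  # left double quote
        u'\u201d': '"',  # right double quote
    }
    for src, dest in punctuation.items():
        s = s.replace(src, dest)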
    
  3. Finally, here's a freebie. It isn't really Microsoft-related, but I end up using it in most of my Django projects at some point. If you're looking to "sanitize" a string (any string, not just one that Microsoft has defiled) by programmatically removing punctuation from it, here's a little function that relies on one of my personal favorite bits of the built-in string module, string.punctuation:
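    import string
    
    def zero_punctuation(s):
        """
        Rip all punctuation from a string, return it. Do a little duck-punching/strong typing too.
        """
        # On Python 2, str() raises UnicodeEncodeError for unicode input with non-ascii characters
        s = str(s)
        # Spell out ampersands before they get stripped with the rest of string.punctuation
        s = s.replace("&", " and ")
        for c in string.punctuation:
            s = s.replace(c, "")
        return s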
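
The first two tips chain together nicely: swap the punctuation you know about, then let encode flatten whatever is left. Here is a rough sketch of that combination. The helper name force_ascii is made up, and it assumes you hand it text (unicode on Python 2, str on Python 3) rather than raw bytes:

```python
# Same mapping as in tip 2: Windows "smart" punctuation to plain ascii
PUNCTUATION = {
    u"\u2018": u"'",
    u"\u2019": u"'",
    u"\u2013": u"-",
    u"\u2014": u"-",
    u"\u201c": u'"',
    u"\u201d": u'"',
}


def force_ascii(text):
    """Hypothetical helper: return an ascii-only version of text."""
    # Mellow pass first, so known characters keep a readable stand-in
    for src, dest in PUNCTUATION.items():
        text = text.replace(src, dest)
    # Wild-man pass: anything still outside ascii becomes "?"
    # encode() returns bytes, so decode back to hand the caller text
    return text.encode("ascii", "replace").decode("ascii")


print(force_ascii(u"\u201cIt\u2019s caf\u00e9\u201d"))
```

Doing the replacements before the encode matters: run them the other way round and every curly quote has already turned into a question mark.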