demongin.org - Selenium and BeautifulSoup: Dealing with Python Encoding Errors

An encoding problem encountered during an HTML scrape reveals a weird sys bug/feature.


Tuesday, 2010-02-02 | Careerism, Programming, Testing

[S]ocializing doesn’t scale.

Clive Thompson

So, I was working on using BeautifulSoup and Selenium to do a little bit of automatic screen-scraping, when I encountered the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 14: ordinal not in range(128)
Bogus.

What I was trying to do was find an anchor tag by its text, grab its parent (i.e. the tag's attributes, e.g. altid, etc.), and pass that to a Selenium click() to navigate to another page. Here's what the element looks like:
Show all 20 sites »
And here's the BeautifulSoup code I was using to find it:
    # Raw string, so the \d escape reaches the regex engine untouched.
    p = re.compile(r"Show all \d.*? sites")
    s = soup.find("a", text=p)
    if s is not None:
        print s.findParents()
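For what it's worth, the pattern is easy to sanity-check on its own, without BeautifulSoup in the picture. This is just a sketch: only the first sample string comes from the page, and the other two are made up for illustration.

```python
import re

# Same pattern as in the scrape, written as a raw string.
pattern = re.compile(r"Show all \d.*? sites")

# Only the first sample is the real link text; the rest are hypothetical.
samples = [
    u"Show all 20 sites \xbb",  # \xbb is the trailing double-angle character
    u"Show all 5 sites",        # a single digit is enough for \d
    u"Show all sites",          # no digit at all, so no match
]

# search() returns a match object or None; bool() turns that into True/False.
matches = [bool(pattern.search(text)) for text in samples]
print(matches)
```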
But finding it wasn't the problem. Once I found it, my program was barfing trying to display that findParents() output:
Traceback (most recent call last):
  File "./iqa_scrape.py", line 142, in <module>
    print s.findParents()
UnicodeEncodeError: 'ascii' codec can't encode character \xbb in position 161: ordinal not in range(128)
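You don't need BeautifulSoup to get this exception, by the way. Here's a minimal sketch that forces the same ASCII encoding step by hand. The helper name try_ascii is made up for the example, and the string is just the link text from above rather than the full findParents() output, so the reported position differs from the traceback.

```python
def try_ascii(text):
    """Return None if text encodes cleanly as ASCII, else the UnicodeEncodeError."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


link_text = u"Show all 20 sites \xbb"  # \xbb is the character ASCII can't represent

err = try_ascii(link_text)
# The exception object records which codec failed, where, and why.
print(err.encoding)
print(err.start)
print(err.reason)
```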

It was pretty clear right away that I could just encode each Soup object explicitly wherever it needed it, but I wanted something more global: something out in main, you know? So I Googled around a little bit and learned that the sys module has a function that lets you change Python's default encoding for the execution of a single script.
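The per-object approach would look roughly like this: encode each piece of text explicitly right before it gets written out. A sketch only; to_utf8 is a hypothetical helper, not part of BeautifulSoup or the original script.

```python
# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")


def to_utf8(value):
    """Encode text to UTF-8 bytes; pass anything else through unchanged."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


encoded = to_utf8(u"Show all 20 sites \xbb")
# The one non-ASCII character becomes the two bytes 0xC2 0xBB.
print(repr(encoded))
```

That works, but it has to be repeated everywhere text leaves the program, which is why a global switch in sys looked more appealing.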

The only problem was that when I tried to call that sys function, I got a traceback. "sys.setdefaultencoding()," I thought to myself, "must not be a part of Python 2.6."

It turns out, however, that it is, but there's a bug (or perhaps a feature) that prevents the function from being available the first time you import the sys module. Observe:

toconnell@esme:~$ python
Python 2.6.4 (r264:75706, Nov  2 2009, 14:38:03) 
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
Crazy, huh?
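If the same script might also end up running under Python 3 (where the default encoding is already UTF-8 and setdefaultencoding no longer exists), the reload trick can be wrapped in a version check. This is a sketch under that assumption; force_utf8_default is a made-up name.

```python
import sys


def force_utf8_default():
    """Make UTF-8 the default encoding on Python 2; do nothing on Python 3."""
    if sys.version_info[0] == 2:
        # Python 2 only: reload() is a builtin there and brings back
        # sys.setdefaultencoding, which is missing after the first import.
        reload(sys)  # noqa: F821 (this name does not exist on Python 3)
        sys.setdefaultencoding("utf-8")
    return sys.getdefaultencoding()


current = force_utf8_default()
print(current)
```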

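It's apparently deliberate: according to the Python 2 docs, setdefaultencoding() is only meant for the site module (and sitecustomize), and site removes it from sys's namespace during startup. That's why it's missing after a plain import, and why reload(sys) brings it back. So, getting back to my script, I ended up using the following code to hand off the dynamically generated id attribute of the anchor tag to my Selenium test: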
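    p = re.compile(r"Show all \d.*? sites")
    s = soup.find("a", text=p)
    if s is not None:
        # reload() restores sys.setdefaultencoding, which is missing after the first import
        reload(sys)
        sys.setdefaultencoding('utf-8')
        # findParents()[0] is the nearest enclosing tag (here, the anchor)
        anchor_tag_id = s.findParents()[0]['id']
        html, sel_instance = sel.see_all_clients(sel_instance, anchor_tag_id)
        soup = BeautifulSoup("".join(html))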
Woo!