version seven. http://demongin.org
Selenium and BeautifulSoup: Dealing with Python Encoding Errors
An encoding problem encountered during an HTML scrape reveals a weird sys bug/feature.
Tuesday, 2010-02-02 | Careerism, Programming, Testing
"[S]ocializing doesn’t scale."
Clive Thompson
So, I was working on using BeautifulSoup and Selenium to do a little bit of automatic screen-scraping, when I encountered the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 14: ordinal not in range(128)
What I was attempting to do was search for an anchor tag by its text, get its parent (i.e. the anchor tag itself, with its attributes, e.g. id), and use that in a Selenium click() to navigate to another page. Here's what the element looks like:
Show all 20 sites »
And here's the BeautifulSoup code I was using to find it:
    p = re.compile(r"Show all \d.*? sites")
    s = soup.find("a", text=p)
    if s is not None:
        print s.findParents()
    Traceback (most recent call last):
      File "./iqa_scrape.py", line 142, in <module>
        print s.findParents()
    UnicodeEncodeError: 'ascii' codec can't encode character \xbb in position 161: ordinal not in range(128)
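To see the failure in isolation, here is a minimal sketch that needs neither Selenium nor BeautifulSoup. It assumes only that the link text ends in the » character (U+00BB) named in the error; the `link_text` value and the `describe_encode` helper are made up for illustration. Printing under Python 2 ends up doing roughly the same ASCII encode implicitly.

```python
import re

# Hypothetical stand-in for the anchor's text; \xbb is the » character.
link_text = u"Show all 20 sites \xbb"

# Same pattern as the scrape, written as a raw string.
pattern = re.compile(r"Show all \d.*? sites")
match = pattern.search(link_text)


def describe_encode(text, codec):
    """Encode text with codec, or describe why the encode failed."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return "%s codec failed at position %d" % (exc.encoding, exc.start)


ascii_result = describe_encode(link_text, "ascii")   # the failing case
utf8_result = describe_encode(link_text, "utf-8")    # the working case
```

The pattern matches fine; it is only the encode step with the ascii codec that trips over the trailing ».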
It was immediately pretty clear that I could just change encodings as necessary for those Soup objects that would require it, but I wanted something more global: something out in main, you know? So I Googled around a little bit and learned that the sys module has a function that would let me change Python's default encoding for the execution of a single script.
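For comparison, the per-object route might look something like the sketch below: encode explicitly at the point where text leaves the program. Both helper names (`to_output_bytes`, `to_ascii_html`) are hypothetical and not part of BeautifulSoup or Selenium.

```python
def to_output_bytes(text, codec="utf-8"):
    """Encode text explicitly instead of relying on the default codec."""
    return text.encode(codec)


def to_ascii_html(text):
    """Keep the output ASCII-only: non-ASCII characters become numeric entities."""
    return text.encode("ascii", "xmlcharrefreplace")


# Hypothetical stand-in for the anchor's text; \xbb is the » character.
link_text = u"Show all 20 sites \xbb"

utf8_bytes = to_output_bytes(link_text)
ascii_bytes = to_ascii_html(link_text)
```

Either result can be printed or written to a file without touching the interpreter-wide default, at the cost of remembering to do it everywhere text is emitted.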
The only problem was that when I tried to call it, I got a traceback. "sys.setdefaultencoding()," I thought to myself, "must not be a part of Python 2.6."
It turns out, however, that it is, but there's a bug (or perhaps a feature) that prevents the function from being available the first time you import the sys module. Observe:
    toconnell@esme:~$ python
    Python 2.6.4 (r264:75706, Nov 2 2009, 14:38:03)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> sys.setdefaultencoding('utf-8')
    Traceback (most recent call last):
    AttributeError: 'module' object has no attribute 'setdefaultencoding'
    >>> reload(sys)
    >>> sys.setdefaultencoding('utf-8')
    >>> sys.getdefaultencoding()
    'utf-8'
For whatever reason, you don't get the setdefaultencoding function the first time you import sys. So, getting back to my script, I ended up using the following code to hand off the dynamically generated id attribute of the anchor tag to my Selenium test:
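    p = re.compile(r"Show all \d.*? sites")
    s = soup.find("a", text=p)
    if s is not None:
        # reload(sys) makes setdefaultencoding available again
        reload(sys)
        sys.setdefaultencoding('utf-8')
        # findParents()[0] is the immediate parent of the matched text, here the anchor tag
        anchor_tag_id = s.findParents()[0]['id']
        html, sel_instance = sel.see_all_clients(sel_instance, anchor_tag_id)
        soup = BeautifulSoup("".join(html))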
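As for why the reload is needed at all: Python 2's site module deletes sys.setdefaultencoding during interpreter startup, and reloading sys re-creates it. The sketch below wraps the trick in a hypothetical `ensure_utf8_default` helper with a version guard, on the assumption that the script might someday run under Python 3, which has no setdefaultencoding and already defaults to UTF-8.

```python
import sys


def ensure_utf8_default():
    """Return the default encoding, forcing UTF-8 on Python 2 only."""
    if sys.version_info[0] == 2:
        # Python 2: site.py removes sys.setdefaultencoding at startup;
        # reloading the module brings the attribute back.
        reload(sys)  # builtin in Python 2; this branch never runs on Python 3
        sys.setdefaultencoding("utf-8")
    # Python 3 needs no change: its default encoding is already UTF-8.
    return sys.getdefaultencoding()


default_encoding = ensure_utf8_default()
```

Calling it once near the top of main keeps the global change in one place, though it still affects every implicit conversion in the process, so the explicit per-object encode remains the narrower option.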
