python – Alain M. Lafon

How to become a proficient Python programmer

Alain M. Lafon — Sun, 12 Jun 2011 13:15:41 +0000

Spoiler: This post is primarily gonna be an excerpt of my bookmarks collection. That’s because more intelligent men than me have already written great articles on the topic of how to become a great Python programmer.

I will focus on four primary topics: Functional programming, performance, testing and code guidelines. When those four aspects merge in one programmer, he or she will gain greatness no matter what.

Functional programming

Writing code in an imperative style has become the de facto standard. Imperative programs consist of statements that describe changes of state. While this can be a performant way of coding, it often isn’t (for example because of the complexity it brings) – and it is probably not the most intuitive way when compared with declarative programming.
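
To make the difference a bit more tangible, here is a minimal sketch (Python 3; the order data is made up for illustration) that computes the same total once by mutating state and once by composing filter, map and reduce:

```python
from functools import reduce

# Made-up sample data: (item, price) pairs.
orders = [("book", 12.5), ("pen", 1.2), ("lamp", 30.0), ("ink", 4.3)]

# Imperative: a running total is mutated step by step.
total_imperative = 0.0
for name, price in orders:
    if price > 2:
        total_imperative += price

# Functional: describe the result as a composition of map, filter and reduce.
prices = map(lambda order: order[1], orders)
total_functional = reduce(lambda a, b: a + b,
                          filter(lambda p: p > 2, prices),
                          0.0)

print(total_imperative, total_functional)
```

Both versions add up the same prices in the same order; the second one simply never reassigns a variable.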

If you don’t know what I’m talking about, that’s great. Here are some starter articles to get your mind running. But beware, it’s a little like the red pill – once you have tasted functional programming, you won’t want to go back.

Performance

There’s so much talk going on about how inefficient these ‘scripting languages’ (Python, Ruby, …) are, that it’s easy to forget that very often it’s the algorithm chosen by the programmer that leads to horrible runtime behaviour.
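
As a small illustration of how much the chosen algorithm matters, the following sketch (Python 3, with made-up input lists) computes the same result twice: once with list membership tests, once with a set. The function names are my own and not taken from any of the linked articles.

```python
import timeit

def common_items_quadratic(a, b):
    # List membership scans b for every element of a: O(len(a) * len(b)).
    return [x for x in a if x in b]

def common_items_with_set(a, b):
    # Set membership is a hash lookup: roughly O(len(a) + len(b)).
    b_set = set(b)
    return [x for x in a if x in b_set]

a = list(range(0, 2000, 2))
b = list(range(0, 2000, 3))

slow = timeit.timeit(lambda: common_items_quadratic(a, b), number=5)
fast = timeit.timeit(lambda: common_items_with_set(a, b), number=5)
print("list lookup: %.4fs, set lookup: %.4fs" % (slow, fast))
```

The exact timings depend on the machine, but the gap grows quickly with the size of the input, no matter which language the loop is written in.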

These articles are a great place to get a feel for the ins and outs of Python’s runtime behaviour, so you can get your high-performing application written in a language that is concise and fun to write. And if your manager asks about Python’s performance, don’t forget to mention that the second largest search engine in the world runs on Python – namely YouTube (see Python quotes).

Testing

Testing is probably one of the most misjudged topics in computer science these days. Some programmers really got it and emphasize TDD (test driven development) and its successor BDD (behaviour driven development) wherever possible. Others simply don’t feel it yet and think it’s a waste of time. Well, I’m gonna be that guy and tell you: If you haven’t started out on TDD/BDD yet, you have missed out greatly!
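
For anyone who has never tried it, a test-first cycle can be sketched in a few lines with the standard unittest module (Python 3). The slugify() helper is purely hypothetical; in real TDD the two tests would be written first and the function afterwards.

```python
import unittest

def slugify(title):
    """Turn a post title into a URL slug (hypothetical helper)."""
    return "-".join(title.lower().split())

class SlugifyTest(unittest.TestCase):
    def test_lowercases_and_joins_words(self):
        self.assertEqual(slugify("VIM as Python IDE"), "vim-as-python-ide")

    def test_collapses_whitespace(self):
        self.assertEqual(slugify("  Juno   on Solaris  "), "juno-on-solaris")

# Run the suite programmatically so the snippet also works when pasted
# into an interactive session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```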

It’s not about introducing a technology to replace that release management automaton in your company that mindlessly clicks through the application once in a while, it is about giving you a tool to deeply understand your own problem domain – to really conquer, manipulate and twist it the way you want and need it to be. If you haven’t yet, give it a shot. These articles will give you some impulses:

Code guidelines

Not all code is created equal. Some can be read and changed by any great programmer out there. But some can only be read and only sometimes changed by the original author – and that maybe only a couple of hours after he or she wrote it. Why is that? Because of missing test coverage (see above) and the lack of proper usage of coding guidelines.

These articles establish an absolute minimum to adhere to. When you follow these, you will write more concise and beautiful code. As a side effect it will be more readable and adaptable by you or anyone else.

Now go ahead and spread the word. Start with the guy sitting right next to you. Maybe you can go to the next hackathon or code dojo and start becoming proficient programmers together!

All the best on your journey.

If you liked this article, please feel free to re-tweet it and let others know.

You should follow me on twitter here

BeautifulSoup vs. lxml benchmark

Alain M. Lafon — Mon, 16 Aug 2010 11:17:00 +0000

Previously, I’ve been using BeautifulSoup whenever I had to parse HTML (for example in my dictionary pDict). But this time I’m working on a larger scale project which involves quite a lot of HTML parsing – and BeautifulSoup disappointed me performance wise. In fact, the project wouldn’t be possible using it. Well, it would be – if I subscribed to half of Amazon EC2(;

Since the project is in stealth mode right now, I can’t say which pages I am referring to, but let me give you these facts:

~170kb HTML code
W3C validation shows about 1300 errors and 2600 warnings per page

Considering this many errors and warnings, I previously thought the job had to be done using BeautifulSoup, because it is known to have a very error resistant parser. In fact, BeautifulSoup doesn’t parse the HTML directly, but splits the tags into tag soup by applying regular expressions around them. Contrary to popular belief, this seems to make BeautifulSoup very resilient towards bad code.

However, BeautifulSoup doesn’t perform well on the described files. The task: I need to parse 20 links of a particular class off the page. I put the relevant code in a separate method and profiled it using cProfile:

cProfile.runctx("self.parse_with_beautifulsoup(html_data)", globals(), locals())

def parse_with_beautifulsoup(self, html_data):
  soup = BeautifulSoup.BeautifulSoup(html_data)
  links_res = soup.findAll("a", attrs={"class":"detailsViewLink"})
  links = [link["href"] for link in links_res]

Parsing 20 pages, this takes 167s on my small Debian VPS. That’s 8s+ per page. Incredibly long. Thinking of how BeautifulSoup parses, it’s understandable however. The overhead of creating tag-soup and parsing via RegExp leads to a whopping 302’000 method calls for just these four lines of code. I repeat: 302’000 method calls for four lines of code.

Hence, I tried lxml. The corresponding code is:

root = lxml.html.fromstring(html_data)
links_lxml_res = root.cssselect("a.detailsViewLink")
links_lxml = [link.get("href") for link in links_lxml_res]
links_lxml = list(set(links_lxml))

On the 20 pages, this takes only 2.4s. That’s only 0.12s per page. lxml needed only 180 method calls for the job. It runs 70x faster than BeautifulSoup and creates 1600x fewer calls.
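
If you want to collect such numbers yourself, the call count and the total time can be read from the profiler programmatically. The following is only a sketch in Python 3: extract_links() is a regex stand-in workload of my own, not either of the two parsers compared here, and the HTML is generated on the spot.

```python
import cProfile
import pstats
import re

def extract_links(html):
    # Stand-in workload (an assumption), not BeautifulSoup or lxml.
    return re.findall(r'href="([^"]+)"', html)

# Generated sample input with 20 links.
html_data = "".join('<a href="/item/%d">x</a>' % i for i in range(20))

profiler = cProfile.Profile()
links = profiler.runcall(extract_links, html_data)

stats = pstats.Stats(profiler)
print("function calls: %d, seconds: %.4f" % (stats.total_calls, stats.total_tt))
```

Swapping extract_links() for the real parsing method gives the figures to compare.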

When you do a graph of these numbers, the performance difference looks ridiculous. Well, let’s have some fun(;

[Chart: lxml vs BeautifulSoup performance]

Considering lxml supports xpath as well, I’m permanently switching my default HTML parsing library.

Note: Ian Bicking wrote a wonderful summary in 2008 on the performance of several Python HTML parsers which led me to lxml and to this article.

Update (08/17/2010): I planned on implementing my results on Google AppEngine. “Unfortunately” lxml relies heavily on C-code (that’s where the speed comes from^^). AppEngine is a pure Python environment. It will never run modules written in C.

Serving images dynamically with CherryPy (on Google AppEngine)

Alain M. Lafon — Fri, 13 Aug 2010 09:22:34 +0000

Google AppEngine (GAE) is great for hosting Python (or Java) web applications. They offer 1.3 million hits per day and 1 GB of up- and downstream traffic per day for free. Considering that you will get access to Google infrastructure that lets you crawl the web as fast as Google does itself, choosing GAE is a no-brainer for applications doing a lot of web-crawling, screen scraping or web-indexing. You can even set up cron jobs to get your job done periodically.

I won’t elaborate on how to get an account, download the SDK and get started, because Google hosts great tutorials for these itself. If you are already familiar with Python web development this will get you started in a matter of minutes.

I personally chose not to use the Google webapp framework, because I’m quite familiar with CherryPy. I fell in love with it, because it feels very sleek – very Zen-like. This comes as no surprise, because it was a deliberate design decision as can be read in The Zen of CherryPy.

Getting started with CherryPy on GAE is no trouble, either. GAE supports any Python framework that is WSGI-compliant. Those include Django, CherryPy, web.py and Pylons. Google doesn’t host these frameworks themselves, so all you have to do is copy the whole framework into your GAE project to get the import to work. That’s it. The same goes for any 3rd party module. Need BeautifulSoup? Just copy the py-file to your project. A piece of cake.

Now, if you want to serve images dynamically, you don’t have to store them on harddisk to link to them. Just save them in the Google Datastore and serve whenever needed.

Using the following snippet you will be able to dynamically serve images with URLs like this:
http://application/handler_name/index/[0-9]*

import cherrypy
from cherrypy import expose
import wsgiref.handlers
import DynamicImage

class Root:
  @expose
  def index(self):
    return ""

class GetImage():
  """ GetImage provides a handler for dynamic images """

  def __init__(self):
    """
      Mockup for getting some images. Datastore or live
      scraping could be done here
    """
    # Note: DynamicImage is just a mockup.
    # There is no such module.
    dynamic_image = DynamicImage.DynamicImage()
    self.pictures = dynamic_image.getImages()

  @expose
  def index(self, num=None):
    """
      Provides the handler for urls:
        application/handler_name/index/[0-9]*
    """
    return self._convert_to_image(self.pictures[0][int(num)])

  def _convert_to_image(self, picture):
    cherrypy.response.headers['Content-Type'] = "image/jpeg"
    return picture

# Root() doesn't do anything here. It normally serves your index page.
root = Root()

# Generate route http://app/img/
root.img = GetImage()

# Start CherryPy app in wsgi mode
app = cherrypy.tree.mount(root, "/")
wsgiref.handlers.CGIHandler().run(app)

One last note: Processes running longer than 15-30s will be cut off by GAE with the DeadlineExceededError exception. You can catch this exception and try to divide your workload into smaller pieces.
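
Dividing the workload can be as simple as slicing the list of things to do into batches and handling one batch per request. A minimal sketch in Python 3; the URLs are placeholders, and how the batches get scheduled (cron, task queue, ...) is left open on purpose:

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder workload: seven pages to fetch, three per request.
urls = ["http://example.com/page/%d" % i for i in range(7)]
batches = list(chunked(urls, 3))

for number, batch in enumerate(batches):
    print("batch %d: %d urls" % (number, len(batch)))
```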

C# 4.0’s dynamicity

Alain M. Lafon — Mon, 10 May 2010 21:23:21 +0000

I just found an article ranking highly on Hacker News (my favourite read) concerning the release of C# 4.0 – you can find it on blogs.msdn.com. On this blog Microsoft claims a couple of highly sophisticated new features. Being the spoiled guy I am, they just seem natural to me. Since they started their article with the words “The dynamic keyword is a key feature of this release.”, let me demonstrate the new features in a really dynamically typed language where they have been around for quite some time: Python.

Dynamic

C#

                        dynamic contact = new ExpandoObject();
                        contact.Name = "Patrick Hines";
                        contact.Phone = "206-555-0144";

Python

                        class contact: None
                        contact.Name = "Patrick Hines"
                        contact.Phone = "206-555-0144"

Optional (or Default) Parameters

C#

                        public static void SomeMethod(int optional = 0) { }
                        SomeMethod(); // 0 is used in the method.
                        SomeMethod(10);

Python

                        def some_method(optional = 0): pass
                        some_method()
                        some_method(10)

Named Arguments

C#

                        var sample = new List();
                        sample.InsertRange(collection: new List(), index: 0);
                        sample.InsertRange(index: 0, collection: new List());

Python

                        def foo(bar, foobar): None

                        foo(bar='asdf', foobar=12)
                        foo(foobar=12, bar='asdf')

I honestly know that a comparison of a statically typed language on the CLR and an interpreted dynamic language doesn’t account for too much. But since Microsoft is making a fuss about dynamic being the keyword of the new release, I felt the urge to drop this note.

The Python code was tested with v2.5 – that’s the oldest installation I’ve got. However, it’s old enough, because .NET didn’t even have decent IPC back then (i.e. Named pipes were added in .NET 3.5).

Well, that’s my rant for the night – I’m going back to teaching myself a real programmer’s language (Clojure) – as you should, too(;


Python’s binascii – hexlify() and unhexlify() – What the heck?

Alain M. Lafon — Tue, 08 Dec 2009 23:55:11 +0000

Today, a dear friend of mine came up to me and asked about the Python module binascii – particularly about the methods hexlify() and unhexlify(). Since he asked for it, I’m going to share my answer publicly with you.

First of all, I’m defining the used nomenclature:

ASCII characters are being written in single quotes
decimal numbers are of the type Long with an L suffix
hex values have an x prefix

First, let me quote the documentation:

binascii.b2a_hex(data)

binascii.hexlify(data)

Return the hexadecimal representation of the binary data. Every byte of data is converted into the corresponding 2-digit hex representation. The resulting string is therefore twice as long as the length of data.

binascii.a2b_hex(hexstr)

binascii.unhexlify(hexstr)

Return the binary data represented by the hexadecimal string hexstr. This function is the inverse of b2a_hex(). hexstr must contain an even number of hexadecimal digits (which can be upper or lower case), otherwise a TypeError is raised.

I’ll begin with hexlify(). As the documentation states, this method takes binary data and returns it as a string of hex pairs, two hex digits for every byte.

The ASCII character ‘A’ has 65L as numerical representation. To verify this in Python:

>>> long(ord('A'))
65L

You might ask “Why is this even relevant to understand binascii?” Well, we don’t know anything about how ord() does its job. But with binascii we can re-calculate manually and verify.

>>> binascii.hexlify('A')
'41'

Now we know that an ‘A’ – interpreted as binary data and shown in hex – corresponds to ’41’. But wait, ’41’ is a string and not a hex value! That’s no big deal: hexlify() represents its result as a string.

To stay with the example, let’s convert 41 into a decimal number and check if it equals 65L.

>>> long('41', 16)
65L

Tada! It seems that ‘A’ = 41 = 65L.
You might have known that already, but please, stay with me a minute longer.

To make it look a little more complex:

>>> binascii.hexlify('A') == "%X" % long('41', 16)
True

Be aware that

>>> "%X" % n

converts a decimal number into its hex representation.

——

binascii.unhexlify() naturally does the same thing as hexlify(), but in reverse. It takes a string of hex pairs and turns it back into the bytes they represent.

I’ll start off with an example:

	>>> binascii.unhexlify('41')
	'A'

	>>> binascii.unhexlify("%X" % ord('A'))
	'A'

Here, the numerical representation 65 is taken from the ASCII character ‘A’

	>>> ord('A')
	65

converted into hex 41

	>>> "%X" % ord('A')
	'41'

and handed to unhexlify(), which turns the hex pair back into the single byte ‘A’.
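
The sessions in this post were recorded on Python 2. For readers on Python 3, here is the same round trip as a short sketch; the only real difference is that both functions work on bytes rather than str, and that long() is gone:

```python
import binascii

data = b"A"
hex_repr = binascii.hexlify(data)        # bytes -> hex digits, two per byte
restored = binascii.unhexlify(hex_repr)  # hex digits -> the original bytes

# The hex digits read as a base-16 number give the code point again.
as_number = int(hex_repr.decode("ascii"), 16)

print(hex_repr, as_number, restored)
```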

And now the conclusion – why might all of this be useful?
Right now, I can think of at least four use cases:

cryptography
data-transformation (e.g. Base64 for MIME/e-mail attachments)
security (deciphering binary readings off a network, pattern matching, …)
textual representation of escape sequences

Taking up the last example, I’ll show you how to visualize the Bell escape sequence (you know, that thing that keeps beeping in your terminal).
Taken from the ASCII table, the numerical representation of the Bell is 7. Programmers might know it better as \a.

	>>> '\7' == '\a'
	True

Presuming you read such a character in some kind of binary data – for example from a socket

	>>> foo = '\7'

and you want to visualize this data

	>>> print foo

you will not get any results – at least none visible. You might hear the Bell sound if you’re not on a silent terminal.

Now, finally – binascii to the rescue:

	>>> binascii.hexlify('\7')
	'07'

Voilà, the dubious string is decrypted.
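
The same trick scales to longer readings. As a sketch in Python 3, a tiny helper (hex_pairs() is a name I made up, as is the payload) can render any chunk of bytes as readable hex pairs, control characters included:

```python
import binascii

def hex_pairs(data):
    """Render bytes as space-separated two-digit hex pairs."""
    digits = binascii.hexlify(data).decode("ascii")
    return " ".join(digits[i:i + 2] for i in range(0, len(digits), 2))

# Made-up payload: bell, two letters, carriage return, line feed.
received = b"\x07OK\r\n"
print(hex_pairs(received))
```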

Disable Mail-Forwarding for Lotus Notes programmatically

Alain M. Lafon — Mon, 29 Jun 2009 19:53:40 +0000

Lotus Notes has a nifty feature to lull managers into ~~false~~ safety: for volatile/unsafe e-mails (or users), it lets you disable printing, forwarding and copying to the clipboard. This can be done using rules, on the SMTP server and on a per e-mail basis. When writing to somebody you really don’t trust with some information (but whose inability to spread the word otherwise – by copy/pasting, for example – you do trust), writing a mail would look like this:

Now, if your victim wants to forward your mail, Lotus Notes would respond with a little pop-up:

This certainly looks like a magical and proprietary feature, doesn’t it? Let’s look at the source of such a “mail” (aka memo in Notes’ language) – you will have to forward it to another mail client though, because memos can’t be displayed in source:

...
Subject: Testnachricht
MIME-Version: 1.0
Sensitivity: Private
X-Mailer: Lotus Notes Release 6.5.5  CCH1 March 07, 2006
...

As you can see, there is a proprietary meta-flag Sensitivity: Private. It can be reproduced with any decent mail user agent or programmatically. What follows is a little Python code snippet that just does the trick:

import smtplib
from email.message import Message
msg = Message()
msg.set_payload("Testmessage Body")
msg["Subject"] = "Testmessage from Python"
msg["From"] = "preek@dispatched.ch"
msg["To"] = "somebody@somewhere.com"
msg["Sensitivity"] = "Private"
smtp = smtplib.SMTP("localhost")
smtp.sendmail("preek@dispatched.ch", "somebody@somewhere.com", msg.as_string())
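
The receiving side is just as unspectacular. As a sketch in Python 3 (nothing is sent here; the message is only serialized and parsed again), this is how any client could check for the flag, and whether it honours it is entirely up to that client:

```python
from email import message_from_string
from email.message import Message

msg = Message()
msg.set_payload("Testmessage Body")
msg["Subject"] = "Testmessage from Python"
msg["Sensitivity"] = "Private"

# Round-trip through the wire format, as a receiving client would see it.
parsed = message_from_string(msg.as_string())
is_private = parsed.get("Sensitivity", "").lower() == "private"

print("Sensitivity header says private:", is_private)
```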

But please, don’t use this information unless you absolutely have to. Lotus Notes.. *brr*.

Enjoy(;


VIM as Python IDE

Alain M. Lafon — Sat, 23 May 2009 23:04:59 +0000

Finding the perfect IDE for Python isn’t an easy feat. There are a great many to choose from, but even though some of them offer really nifty features, I can’t help but feel attracted to VIM anyway. I feel that no IDE accomplishes the task of giving the comfort of complete power over the code – something is always missing. This is why I always come back to using IDLE and VIM. Those two seem to be the best companions when doing some quick and agile hacking – but when it comes to managing bigger and longer term projects, this combo needs some tweaking. But when it’s done, VIM will be a powerful IDE for Python – including code completion (with pydoc display), graphical debugging, task management and a project view.

This is where we are going:

So, these are my thoughts on a VIM setup for coding (Python).

Modern GUI VIM implementations like GVIM or MacVIM give the user the opportunity to organize their open files in tabs. This might look convenient, but to me it is rather bad practice, because a second tab will not be in the same buffer scope as the first one, which takes away from future interaction options between the two. Using MiniBufExplorer, however, gives the user tabs (not only in the GUI, but also on the command line) and leaves the classic buffer interaction intact.

Being able to neatly work on multiple files, the user still misses the potential his favourite IDE gives him in visualizing classes, functions and variables. Luckily there are quite a few plugins around to accomplish this task just as well. My favourite one would be TagList. TagList uses Exuberant Ctags for actually generating the tags(note: it really relies on this specific version of ctags – preinstalled implementations on UNIX systems won’t work).

A lot of coders have the habit of using TODO or FIXME statements in their code. Other IDEs often rely on having good third party project management software, but not VIM. There are great plugins like Tasklist reminding the programmer of those lines of code. Tasklist even implements custom lists – to me that’s an incredible productivity gain.

In these times, the programmer knows his or her programming language more or less by interactively finding out what it can do. Therefore code completion(sometimes also called IntelliSense*ugh*) is a major feature. I have heard many people saying that this is where VIM fails – but luckily they are plain wrong(; In V7, VIM introduced omni completion – given it is configured to recognize Python (if not, this feature is only a plugin away) Ctrl+x Ctrl+o opens a drop down dialog like any other IDE – even the whole Pydoc gets to be displayed in a split window.

Probably the most wanted feature (besides code completion) is debugging graphically. VimPDB is a plugin that lets you do just that. I acknowledge it is no complete substitute for a full-fledged graphical debugger, but I honour the thought that having to rely on a debugger (often) is a hint of bad design.

—

From the eye-candy to the implementation. Don’t worry, it’s no sorcery.

First of all, make sure you have VIM version 7.x installed, compiled with Python support. To check for the second, enter :python print "hello, world" into VIM. If you see an error message like “E319: Sorry, the command is not available in this version”, then it’s time to get a new one. If you’re on a Mac, just install MacVIM (there’s also a binary for the console in /Applications/MacVim.app/Contents/MacOS/). If you’re on Windows, GVIM will suffice (for versions != 2.4 search for the right plugin). If you’re on any other machine, you will probably know how to compile your very own VIM with Python support.

Second, check if you have a plugin directory. In Unix it would typically be located in $HOME/.vim/plugin, in Windows in the Program Files directory. If it doesn’t exist, create it.

Now, let’s start with the MiniBufExplorer. Get it and copy it into your plugin directory. To start it automatically when needed and be able to use it with keyboard and mouse commands, append these lines in your vimrc configuration:

let g:miniBufExplMapWindowNavVim = 1
let g:miniBufExplMapWindowNavArrows = 1
let g:miniBufExplMapCTabSwitchBufs = 1
let g:miniBufExplModSelTarget = 1

For a project view, get TagList and Exuberant Ctags. To install Ctags, unpack it, go into the directory and do a compile/install via:

./configure && sudo make install

Ctags will then be installed in /usr/local/bin. When using a Windows machine, I recommend Cygwin with GCC and Make; it’ll work just fine. If you don’t want to tamper with your original ctags installation, you can propagate the location to VIM by appending the following line to vimrc:

let Tlist_Ctags_Cmd='/usr/local/bin/ctags'

To install TagList, just drop it into VIMs plugin directory. You will now be able to use the project view by typing the command :TlistToggle.

Tasklist is a simple plugin, too. Copying it into the plugin directory will suffice. I like to have shortcuts and have added

map T :TaskList<CR>
map P :TlistToggle<CR>

to vimrc. Pressing T will then open the TaskList if there are any tasks to process. q quits the TaskList again.

VimPDB is a plugin, as well. Install as before and see the readme for documentation. If it doesn’t work out of the box, watch for the known issues.

To enable code(omni) completion, add this line to your vimrc:

autocmd FileType python set omnifunc=pythoncomplete#Complete

If it doesn’t work then, you’ll need this plugin.

My last two recommendations are setting these lines to comply with PEP 8 (Python’s style guide) and to have decent eye candy:

set expandtab
set textwidth=79
set tabstop=8
set softtabstop=4
set shiftwidth=4
set autoindent
:syntax on

There are certainly a lot more flags to help productivity, but those will probably be more user specific.

Have fun coding Python while not being bound to a specific IDE, but having all the benefits of VIM bundled with a few helping hands. Enjoy, everyone.


Juno on Solaris 10

Alain M. Lafon — Mon, 18 May 2009 13:23:30 +0000

Juno is an incredibly lightweight web framework. Using Python as backend, it fulfills my every need for just about every small application I want to deploy on the web. It has no need for big runtimes on the server, no great many files to configure and most importantly: there’s no coding overhead – the programmer defines only the distinctively wanted features.
However, installing Juno on Solaris 10 isn’t quite as easy as described in Juno’s documentation. Solaris ships with Python 2.4, but Juno depends on Jinja2 (a templating engine) which itself depends on Python 2.5+. Even installing Blastwave’s or Sunfreeware’s version won’t help. But that’s no biggie since compiling your own Python is incredibly easy.

Get, compile and install Python (I have used version 2.5.4)
- http://www.python.org/download/releases/
- unpack
- make sure you have a recent version of GCC installed
- ./configure && make && make install
- as a result Python will be installed in /usr/local

Get, compile and install Setuptools
- http://pypi.python.org/pypi/setuptools
- unpack
- python setup.py install

Get, compile and install pysqlite
- http://oss.itsystementwicklung.de/trac/pysqlite/wiki/WikiStart#Downloads
- unpack
- add line “library_dirs=/usr/local/lib” to pysqlite-x.y.z/setup.cfg
- globally export your library paths:
- LD_LIBRARY_PATH=/opt/csw/lib/:/usr/lib/:/lib/:/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH
- python setup.py install

easy_install sqlalchemy

easy_install jinja2

Get, compile and install Juno
- http://brianreily.com/project/juno
- python setup.py install

Enjoy.

Webscraping with Python and BeautifulSoup

Alain M. Lafon — Sun, 15 Mar 2009 10:05:08 +0000

Recently my life has been a hype; partly due to my upcoming Python addiction. There’s simply no way around it; so I had better confess it in public. I’m in love with Python. It’s not only mature, business-proof and performant, but also sleek and just so much fun to write. It’s as if I were in Star Trek and only had to tell the computer what I wanted; never minding how the job actually is done. Even my favourite comic artist (besides Scott Adams, of course..) picked up on it; so my feelings have to be honest.

In this short tutorial, I’m going to show you how to scrape a website with the 3rd party html-parsing module BeautifulSoup in a practical example. We will search the wonderful translation engine dict.cc, which holds the key to over 700k translations from English to German and vice versa. Note that BeautifulSoup is licensed just like Python while dict.cc allows for external searching.

First off, place BeautifulSoup.py in your modules directory. Alternatively, if you just want to do a quick test, put it in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE (for quick prototyping like we are about to do, I highly recommend a combination of IDLE and VIM) and begin coding. In this tutorial we won’t be doing any design; we won’t even encapsulate in a class. How to do that, later on, is up to your needs.

What we will do:

go to dict.cc
enter a search word into the webform
submit the form
read the result
parse the html code
save all translations
print them

You can either read the needed code on the fly or download it.
Now let’s begin the magic. Those are our needed imports.

import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup

urllib and urllib2 are both modules offering the possibility to read data from various URLs; they will be needed to open the connection and retrieve the website. BeautifulSoup is, as mentioned, a html parser.

Since we are going to fetch our data from a website, we have to behave like a browser. That’s why we will need to fake a user agent. For our program, I chose to push the web statistics a little in favour of Firefox and Solaris.

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }

Now let’s take a look at the code of dict.cc. We need to know how the webform is constructed if we want to query it.

...
[form markup not preserved; only this fragment of the input tag survives:]
        style="padding:2px;width:340px" value="">
...

The relevant parts are action, method and the name inside the input tag. The action is the webapplication that will get called when the form is submitted. The method shows us how we need to encode the data for the form while the name is our query variable.

values = {'s' : sys.argv[1] }
data = urllib.urlencode(values)
request = urllib2.Request("http://www.dict.cc/", data, headers)
response = urllib2.urlopen(request)

Here the data gets URL-encoded and packed into the request for the webform. Note that urllib2 sends a POST request whenever a data argument is passed. Notice that values is a dictionary, which makes handling more complex forms a charm. Then the form gets submitted by urlopen() – i.e. we virtually pressed the “Search”-button.
See how easy it is? These are only a couple lines of code, but we already have searched on dict.cc for a completely arbitrary word from the commandline. The response has also been retrieved. All that is left, is to extract the relevant information.
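
The code in this tutorial is Python 2. For readers on Python 3, where urllib and urllib2 were merged, building the same request could look roughly like the sketch below. Nothing is sent; the search word is hard-coded instead of taken from sys.argv, and the second request is my own addition to show the query-string variant.

```python
from urllib.parse import urlencode
from urllib.request import Request

values = {"s": "kinky"}
headers = {"User-Agent": "Mozilla/5 (Solaris 10) Gecko"}

# Passing data makes urllib issue a POST with a form-encoded body.
post_request = Request("http://www.dict.cc/", urlencode(values).encode("ascii"), headers)

# Without data, the parameters travel in the query string of a GET.
get_request = Request("http://www.dict.cc/?" + urlencode(values), headers=headers)

print(post_request.get_method(), post_request.data)
print(get_request.get_method(), get_request.full_url)
```

Calling urllib.request.urlopen() on either object would then submit it, just like urllib2.urlopen() above.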

the_page = response.read()
pool = BeautifulSoup(the_page)

The response is read and saved as regular html code. This code could now be analyzed via regular string.find() or re.findall() methods, but this implies hard-coding in reference to a lot of the underlying logic of the page. Besides, it would require a lot of reverse engineering of the positional parameters, setting up several potentially recursive methods. This would ultimately produce ugly (i.e. not very pythonic) code. Lucky for us, there already is a full-fledged html parser which allows us to ask just about any generic question. Let’s take a look at the resulting html code, first. If you are not yet familiar with the tool that can be seen in the screenshot: I’m using Firefox with the Firebug addon. This one is very helpful if you ever need to debug a website.

Let me show an excerpt of the code.

..
[table markup not preserved; the only surviving cell text is "web"]
..

The results are displayed in a table. The two interesting columns share the class td7nl. The most efficient way would seem to just sweep all the data from inside the cells of these two columns. Fortunately for us, BeautifulSoup implemented just that feature.

results = pool.findAll('td', attrs={'class' : 'td7nl'})
source = ''
translations = []

for result in results:
    word = ''
    for tmp in result.findAll(text=True):
        word = word + " " + unicode(tmp).encode("utf-8")
    if source == '':
        source = word
    else:
        translations.append((source, word))

for translation in translations:
    print "%s => %s" % (translation[0], translation[1])

results will be a BeautifulSoup.ResultSet, which behaves like a list. Each member is the html code of one cell of the class td7nl, and you can access each element by index as you would expect. result.findAll(text=True) will return each embedded textual element of the cell. All we have to do is merge the different pieces together.
source and word are temporary variables used to build up one translation per iteration. Each translation will be saved as a pair (a tuple) inside the translations list.
Finally we iterate over the found translations and write them to the screen.
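
If BeautifulSoup is not at hand, the same idea (collect the text of all cells with one class, then pair them up) can be sketched with the standard library of Python 3. This is not what the tutorial’s code does internally; CellCollector and the sample markup are assumptions of mine, loosely modelled on a two-column result table.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []      # finished cell texts
        self._parts = None   # fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.cells.append(" ".join(p for p in self._parts if p))
            self._parts = None

# Hypothetical markup, not the real dict.cc page.
SAMPLE = """
<table>
  <tr><td class="td7nl"><a>kinky</a> {adj}</td><td class="td7nl"><a>kraus</a> [Haar]</td></tr>
  <tr><td class="td7nl"><a>curly</a> {adj}</td><td class="td7nl"><a>lockig</a></td></tr>
</table>
"""

collector = CellCollector("td7nl")
collector.feed(SAMPLE)

# Cells alternate between the two columns, so pair them up two by two.
pairs = list(zip(collector.cells[0::2], collector.cells[1::2]))
for source, target in pairs:
    print("%s => %s" % (source, target))
```

It takes noticeably more code than the BeautifulSoup version and does not cope with nested tables or broken markup, which is a fair argument for using a real parser library.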

$ python webscraping_demo.py kinky
 kinky   {adj} =>  9 kraus   [Haar]  
 kinky   {adj} =>  nappy   {adj}   [Am.]
 kinky   {adj} =>  6 kraus   [Haar]  
 kinky   {adj} =>  crinkly   {adj}
 kinky   {adj} =>  kraus  
 kinky   {adj} =>  curly   {adj}
 kinky   {adj} =>  kraus  
 kinky   {adj} =>  frizzily   {adv}

In a regular application those results would need a little lexing, of course. The most important thing, however, is that we just wrote a translation wrapper onto a webapplication – in only 28 lines of code. Did I mention that I’m in love with Python?

All that is left is for me to recommend the BeautifulSoup documentation. What we did here really didn’t cover what this module is capable of.

I wish you all the best.