Archive for March 2009


Webscraping with Python and BeautifulSoup

March 15th, 2009 — 11:05 am

Recently my life has been all hype, partly due to my growing Python addiction. There’s simply no way around it, so I had better confess it in public: I’m in love with Python. It’s not only mature, business-proof and performant, but also sleek and just so much fun to write. It’s as if I were in Star Trek and only had to tell the computer what I wanted, never minding how the job actually gets done. Even my favourite comic artist (besides Scott Adams, of course..) picked up on it, so my feelings have to be genuine.

In this short tutorial, I’m going to show you how to scrape a website with the third-party html-parsing module BeautifulSoup, using a practical example. We will search the wonderful translation engine dict.cc, which holds the key to over 700k translations from English to German and vice versa. Note that BeautifulSoup is licensed just like Python, while dict.cc allows for external searching.

First off, place BeautifulSoup.py in your modules directory. Alternatively, if you just want to do a quick test, put it in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE (for quick prototyping like we are about to do, I highly recommend a combination of IDLE and VIM) and begin coding. In this tutorial we won’t be doing any design; we won’t even encapsulate the code in a class. How to do that later on is up to your needs.

What we will do:

  1. go to dict.cc
  2. enter a search word into the webform
  3. submit the form
  4. read the result
  5. parse the html code
  6. save all translations
  7. print them

You can either read the needed code on the fly or download it.
Now let’s begin the magic. These are the imports we need.

import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup

urllib and urllib2 are both modules offering the possibility to read data from various URLs; they will be needed to open the connection and retrieve the website. sys gives us access to the command-line arguments. BeautifulSoup is, as mentioned, an html parser.

Since we are going to fetch our data from a website, we have to behave like a browser. That’s why we will need to fake a user agent. For our program, I chose to push the web statistics a little in favour of Firefox and Solaris.

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }

Now let’s take a look at the code of dict.cc. We need to know how the webform is constructed if we want to query it.

...
<form style="margin:0px" action="http://www.dict.cc/" method="get">
  <table>
    <tr>
      <td>
        <input id="sinp" maxlength="100" name="s" size="25" type="text"
               style="padding:2px;width:340px" value="">
      ...</td>
    </tr>
  </table>
</form>
...

The relevant parts are action, method and the name inside the input tag. The action is the webapplication that will get called when the form is submitted. The method shows us how we need to encode the data for the form while the name is our query variable.
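
If you would rather not read the form by eye, the same three pieces of information can be pulled out programmatically. What follows is only a sketch, written for Python 3 and using the standard library's html.parser instead of BeautifulSoup. The class name FormInspector and the shortened form snippet are made up for illustration.

```python
from html.parser import HTMLParser

# Shortened copy of the search form (illustrative only).
FORM_HTML = '''
<form style="margin:0px" action="http://www.dict.cc/" method="get">
  <input id="sinp" maxlength="100" name="s" size="25" type="text" value="">
</form>
'''


class FormInspector(HTMLParser):
    """Hypothetical helper: remembers action, method and input names of a form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.input_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
            # HTML forms default to GET when no method is given.
            self.method = attrs.get("method", "get").lower()
        elif tag == "input" and "name" in attrs:
            self.input_names.append(attrs["name"])


inspector = FormInspector()
inspector.feed(FORM_HTML)
print(inspector.action, inspector.method, inspector.input_names)
```

For the form shown here this reports the action URL, the method get and the single input name s, which is all we need to build the query ourselves.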

values = {'s' : sys.argv[1] }   # search word from the command line
data = urllib.urlencode(values)
# The form uses method="get", so the encoded data belongs in the URL.
# Passing it as the second argument would turn this into a POST request.
request = urllib2.Request("http://www.dict.cc/?" + data, None, headers)
response = urllib2.urlopen(request)

Here the data gets encoded and attached to a GET request, just as the webform would do it. Notice that values is a dictionary, which makes handling more complex forms a charm. Then the form gets submitted by urlopen(), i.e. we virtually pressed the “Search” button.
See how easy it is? These are only a couple of lines of code, but we have already searched dict.cc for a completely arbitrary word from the command line. The response has also been retrieved. All that is left is to extract the relevant information.
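
Whether urllib ends up sending a GET or a POST depends only on whether a data argument is handed to the Request. The following sketch shows the difference without sending anything over the network. It is written for Python 3, where urllib.parse and urllib.request take over from the urllib and urllib2 modules used in this post. The search word "web" is just an example value.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://www.dict.cc/"
headers = {"User-Agent": "Mozilla/5 (Solaris 10) Gecko"}
query = urlencode({"s": "web"})  # example search word, encoded as 's=web'

# GET: the encoded form data is appended to the URL as a query string.
get_request = Request(BASE_URL + "?" + query, headers=headers)

# POST: the encoded form data travels in the request body (bytes in Python 3).
post_request = Request(BASE_URL, data=query.encode("ascii"), headers=headers)

# Nothing is sent here; we only inspect what would be sent.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

Since the form on dict.cc declares method="get", the first variant is the one that mirrors what a browser does when the form is submitted.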

the_page = response.read()
pool = BeautifulSoup(the_page)

The response is read and saved as regular html code. This code could now be analyzed via regular string.find() or re.findall() calls, but that implies hard-coding a lot of the underlying logic of the page. Besides, it would require a lot of reverse engineering of the positional parameters and setting up several potentially recursive methods. This would ultimately produce ugly (i.e. not very pythonic) code. Lucky for us, there already is a full-fledged html parser which allows us to ask just about any generic question. Let’s take a look at the resulting html code first. If you are not yet familiar with the tool that can be seen in the screenshot: I’m using Firefox with the Firebug addon. It is very helpful if you ever need to debug a website.

[Screenshot: dict.cc // search for "web"]

Let me show an excerpt of the code.

<table>
  ...
  <td class="td7nl" style="background-color: rgb(233, 233, 233);">
    <a href="/englisch-deutsch/web.html">
      <b>web</b>
    </a>
  </td>
  <td class="td7nl">...</td>
  ...
</table>

The results are displayed in a table. The two interesting columns share the class td7nl. The most efficient way would seem to just sweep all the data from inside the cells of these two columns. Fortunately for us, BeautifulSoup implemented just that feature.

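# All table cells that carry the class td7nl
results = pool.findAll('td', attrs={'class' : 'td7nl'})
source = ''
translations = []

for result in results:
    # Join all text fragments inside the cell, ignoring the markup
    word = ''
    for tmp in result.findAll(text=True):
        word = word + " " + unicode(tmp).encode("utf-8")
    if source == '':
        # The first cell is kept as the source term
        source = word
    else:
        # Every later cell is stored as a translation of that source
        translations.append((source, word))

for translation in translations:
    print "%s => %s" % (translation[0], translation[1])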

results will be a BeautifulSoup.ResultSet, which behaves like a list. Each member is the html code of one cell of the class td7nl, and you can access each element just as you would in a list. result.findAll(text=True) returns each embedded textual element of the cell. All we have to do is merge the different text fragments together.
source and word are temporary variables: source keeps the first cell found, while word holds the text of the current cell. Every later cell is saved together with source as a pair (a tuple) inside the translations list.
Finally we iterate over the found translations and write them to the screen.

$ python webscraping_demo.py
 kinky   {adj} =>  9 kraus   [Haar]  
 kinky   {adj} =>  nappy   {adj}   [Am.]
 kinky   {adj} =>  6 kraus   [Haar]  
 kinky   {adj} =>  crinkly   {adj}
 kinky   {adj} =>  kraus  
 kinky   {adj} =>  curly   {adj}
 kinky   {adj} =>  kraus  
 kinky   {adj} =>  frizzily   {adv}

In a regular application those results would need a little lexing, of course. The most important thing, however, is that we just wrote a translation wrapper around a web application, in only 28 lines of code. Did I mention that I’m in love with Python?
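
As a first step towards that lexing, one could collapse the stray whitespace and pair each cell with its neighbour instead of with the very first cell. The following is only a sketch under a few assumptions. It is written for Python 3, it uses the standard library's html.parser rather than BeautifulSoup, and it assumes that the td7nl cells simply alternate between the left and the right column. The class name CellCollector and the sample table are made up and are not real dict.cc output.

```python
from html.parser import HTMLParser

# Made-up result table in the shape of the excerpt (not real dict.cc output).
SAMPLE_HTML = '''
<table>
  <tr><td class="td7nl"><a href="#"><b>web</b></a> {noun}</td>
      <td class="td7nl"><a href="#">Netz</a> {n}</td></tr>
  <tr><td class="td7nl"><a href="#"><b>web</b></a> page</td>
      <td class="td7nl"><a href="#">Webseite</a> {f}</td></tr>
</table>
'''


class CellCollector(HTMLParser):
    """Hypothetical helper: gathers the text of every <td> with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []      # finished cell texts
        self._parts = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            # Collapse runs of whitespace left over from the markup.
            self.cells.append(" ".join("".join(self._parts).split()))
            self._parts = None


collector = CellCollector("td7nl")
collector.feed(SAMPLE_HTML)

# Pair neighbouring cells: (left column, right column) for each table row.
pairs = list(zip(collector.cells[0::2], collector.cells[1::2]))
for source, target in pairs:
    print("%s => %s" % (source, target))
```

The sketch does not handle nested tables, and a trailing cell without a partner is silently dropped by zip(), so a real application would want to check the cell count first.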

All that is left is for me to recommend the BeautifulSoup documentation. What we did here barely scratched the surface of what this module is capable of.

I wish you all the best.



On competence #2

March 14th, 2009 — 08:56 pm

Strange things do happen. They just do. But not all of them are strange bad, some are strange good. For example, I just wept like a baby. It was the end of season 4 of Dr. House – well.. in this respect I’m not quite sure whether it’s a good thing. To be alone with strange feelings on a Saturday evening.

On the other hand, admittedly, that’s not what I wanted to write about.

Yesterday began just like the day before yesterday ended. I saw somebody trying to install Microsoft’s sequelserver on a Windows machine. Mind the fact that current versions of both programs were chosen. Nevertheless it seemed quite impossible just to install it. Crazy thing. One might think that programs from the same vendor should cooperate more easily, at least if both are bought for insane prices and labeled as compatible with one another. Lucky for me, that’s nothing I have to care about: I already won my fight against MSSQL; all I needed were five servers and three versions of Windows. Somebody should probably mention to Microsoft that there are other operating systems available, equipped with sequelservers off the shelf. Some are even free and perform just as well if not better (I mean.. there’s no LIMIT/OFFSET implementation in MSSQL; how good can it be?).

Anyways, to once again feel corrupted by such a pitiful sight is nothing special, as I have just pointed out. Thereafter I asked my dear and sincerely noble friend Carl Duevel to help me make the world a better place. And guess what: my wish instantly came true on Friday the 13th. So much for all that blabber concerning bad luck and stuff. In my opinion people should get punished for reporting sick on this day; it is clearly economically absurd to let all these menial dumbheads carry on doing nothing but guarding their bed sheets. But not Carl. He wrote an excellent article on Java’s RegEx implementation. That alone would have made my day, but there’s more: he was mocking the use of inferior technology (aka Flash *choke*) only a couple of hours later. A great start into a newly brightened future is before us. Behold, world, for Captain Code will show you how it is done! I can’t even tell how much I’m looking forward to frequently reading his further insights. Those have been part of the few sorrows I hold on behalf of leaving Stuttgart; I had to physically leave him behind. On the other hand, maybe, just maybe, that might have been one of the glitches needed for him to finally go public.

The most interesting thing is probably that, while I was writing this post, I experienced something like the tenth power failure in this wonderful building. One of my housemates came home, started her oversized electrical heater, and boom. Well, I understand that she needs it. Our main heating system doesn’t work properly and in her room it doesn’t work at all. Unluckily for us, she then tangled with the main fuses, which thus went bye-bye, too.


On competence

March 12th, 2009 — 11:28 pm

Today was a day of competence, in its pure and inconceivable form. The first 8 hours of my 5.5-hour working day I spent with a “Senior Consultant”. To shorten the story: after four months of working overtime, I have finally reverse engineered enough information to be certain that the product we bought just isn’t going to do what it is supposed to. Instead I have to put up with approx. 10 fully committed days to compensate for all its flaws and inabilities, only to lessen the gap between what has been promised and what will be possible. Notice that “what has been promised” should have been done in about a week’s worth of work and that this week has lasted approximately four months now. By that reasoning, I guess my position is safe for yet another week *phew*

On the other hand I just had a soothing conversation on the phone with the astonishingly unfit almighty administration of the university I used to go to. As always, they didn’t fail to surprise me in matters of stupidity, regression, unfriendliness and a relentless misperception/misapprehension of their own job.
I probably shouldn’t go into too much detail, but let’s assume a situation where you wanted to send an application for something that is perfectly reasonable. What would you do? Also consider the fact that you already consulted the dean of your faculty and that he confirmed your thesis about the application being not only reasonable, but perfectly valid. What I did was go to the university’s website, get the form, fill it out and send it to the fax number I found on the application, which should have been my last pro-active part in this matter. Six weeks later, I’m still not done. What went wrong? It’s easy to figure.

You probably knew it all along, from the moment I mentioned the fax machine. I used moderately modern and therefore too complex, inadequate technology during the first contact; I even went so far as to use an e-mail address I found under “contact -> administration for students -> computer science department” to send an inquiry on whether my former request had been received. Today, when I called and asked for a confirmation that my application had been received, this most certainly hilariously ugly woman spontaneously burst into shouting. I immediately felt as if I had shot her baby. Turns out I was dead wrong. In the next 15 minutes I came to realize that she doesn’t hate me for personal reasons, but she still behaved as if I were claiming she had never paid taxes and I had come to collect them from her all at once. Certainly understandable: I wanted to know whether my application had been received and was perhaps being processed already; that could certainly be considered a matter of existence for her.

In the meantime, she taught me a great many wise things. For example, I would be half a year behind. Behind what, she didn’t tell. Plus I couldn’t do any exams. Which exams exactly, I also don’t know. When I asked her, the shouting escalated into angry yelling, stressing her vocal cords to a level just short of the point where I might have considered it unfriendly. She repeated the aforementioned facts that I would be behind and that I couldn’t do exams, because there would be no sixth semester. Never again, I thought? Great! I heard that one stinks, anyways. Then I made my first mistake: I tried to outsmart her. That’s something people usually don’t like very much. I told her I could take courses from the seventh semester. Oh, baby. She didn’t like that too much. After a long and shiny tirade I thought to myself “So what? Couldn’t hurt to tell her a little about her job, could it?”. I told her that it is possible, that I have colleagues doing something similar, that the examination regulations allowed me to do courses whenever I see fit and that I had planned everything in full agreement with the dean of computer science. What I should have known is that everyone’s colleagues only tell them lies, that taking courses is not as simple as I think it is (yeah, probably she didn’t bother graduating from knitting school, because taking courses was just too much of a grind..) and finally that the dean simply has no say in these things at all.

Her fury began to annoy me a little by now, but when I tried to tell her that I only wanted to know about the status of my application, she told me to shut up and wait. What followed should be considered one of the greatest accomplishments of mankind: complete and utter disregard for humbleness. She asked her colleague (remember, those are the guys always lying to you, so you had better not ask them anything too important) whether it is possible to officially be in one semester, but take courses of another. Surely she was determined to start whatever she had tried to do to me all over again after hearing reassuring words. Well, she didn’t. The nice, and officially most intelligent, person in the office told her that I’m in the main course and that I could do whatever and whenever I wanted to. Hearing this, I expected anything from a sign of insight to an apology of some sort. What I didn’t take into account was that her life already was very confusing and not that pleasing. So she went on hating the phone, me and herself. However, I had had enough of this senseless waste of time. I gave her my best wishes and hung up.

What I still don’t know, after having to put up with this miserable performance of a bureaucrat, is whether my application will ever be processed at all. Today I even received a mail from a professor. He told me that “he heard” I would be taking his classes; he had already assigned me to a group and told me that next Monday there would be a mandatory kick-off meeting. Well, I guess I won’t pay the semester fee and then I will be banned anyway. That’s what I wanted from the beginning, I think.. And one thing is for sure: I won’t be in Stuttgart next Monday, for whatever reasons. Apart from the mail from the professor, I am very glad that I had this experience on the phone. It proved once more that just about any random person living in Stuttgart is miserable, unfriendly, conservative, boring and dumb, a combination of attributes I simply don’t want to face in aggregated form. How I miss Stuttgart! Not.

Coming home, I realized that my fellow housemates were meeting with the landlord. For the last months we had been living like insects in a more cold than warm and more stone-age-ish than electrified cave (well.. it has walls and a ceiling of stone, at least). We were told that the house would be torn down after we leave, granting us the choice of whether or not to clean, to fix stuff and to just leave all the garbage and obsolete furniture inside. Today everything changed. The formerly liberal and avuncular landlord turned into Satan himself, demanding unscrupulous things like painting the stained walls. Of course they were as stained when everyone moved in as they are now, but I guess that’s no argument to make. There’s more, of course, but it’s probably best not to think of it right now. Having rented apartments for the last few years, I know a little about rights and responsibilities in this area, and I just spent my evening funnelling those insights into my fellow housemates.

All in all, I’m pretty impressed. I didn’t even bother to mention that I was at home tonight for just about six hours before heading back to work, even though I had been there for 14 hours straight yesterday. Well, I even had a darn good reason for that. I found code that locked itself out in no fewer than four places, giving my middleware a little bit of trouble. I also won’t go into detail about the databases I found that have configuration tables for transcription tables which lead to statistics tables, only to never be read, but to be redundant to other configuration tables for another transcription table leading to its own statistics table. Since none of the tables have any keys or indices and there are lots of statistics to be saved, one of the tables has outgrown the state where a query going in will bring back any result other than a timeout. The most obvious part here is that both tables are never read in a meaningful way. There is only one daemon process (the same one that’s been filling these tables all along), which reads one column of only one of the tables, sorts it and writes the top result into a third table. There the data is finally read by a program and translated via a third transcription table. Apparently it proved impossible to fill these 4 bytes of information into the third table while skipping all the overhead before it. And good for me that I had to reverse-engineer all that great business logic; it’s not as if my todo-list has been giving me any trouble recently; there’s still some space on the monitor that I can read in between the piles of notes.

I could go on, but then it might seem to you, my dear and noble reader, that I am a bitter old man having a hard time keeping my heart from exploding due to too much pressure. But nothing could be further from the truth. In fact, I don’t think of myself as a truly smart man. I mean, I have my good parts that I have worked pretty hard for, and I am proud of them. But I’m no genius and, as it seems, never will be. On the other hand, being confronted with those massive amounts of stupidity in the world, I feel pretty good about myself. I am deeply grateful for what I have not become. Others missed out on that opportunity and are now stuck in a demeaning life of sluggishness. I look forward to the great journeys of tomorrow; they will undoubtedly be fun.

