Webscraping with Python and BeautifulSoup

Recently my life has been quite a ride, partly due to my budding Python addiction. There’s simply no way around it, so I had better confess it in public: I’m in love with Python. It’s not only mature, business-proof and performant, but also benefits from sleek syntax and is just so much fun to write. It’s as if I were in Star Trek and only had to tell the computer what I wanted, never minding how the job actually gets done. Even my favourite comic artist (besides Scott Adams, of course) took up on it, so my feelings have to be honest.

In this short tutorial, I’m going to show you how to scrape a website with the third-party HTML-parsing module BeautifulSoup, using a practical example. We will search the wonderful translation engine dict.cc, which holds the key to over 700k translations from English to German and vice versa. Note that BeautifulSoup is licensed just like Python, while dict.cc allows external searching.

First off, place BeautifulSoup.py in your modules directory. Alternatively, if you just want to do a quick test, put it in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE (for quick prototyping like we are about to do, I highly recommend a combination of IDLE and Vim) and begin coding. In this tutorial we won’t be doing any design; we won’t even encapsulate anything in a class. How to do that later on is up to your needs.

What we will do:

  1. go to dict.cc
  2. enter a search word into the webform
  3. submit the form
  4. read the result
  5. parse the html code
  6. save all translations
  7. print them

You can either read the needed code on the fly or download it.
Now let’s begin the magic. These are the imports we need.

import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup

urllib and urllib2 are both modules for reading data from various URLs; they will be needed to open the connection and retrieve the website. BeautifulSoup is, as mentioned, an HTML parser.
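(A quick aside for readers on newer Pythons: urllib and urllib2 were merged in Python 3, so there the imports above map roughly to the following – the rest of this tutorial sticks to the Python 2 names.)

```python
# Python 3 equivalents of the Python 2 modules used in this tutorial
from urllib.parse import urlencode            # was: urllib.urlencode
from urllib.request import Request, urlopen   # was: urllib2.Request / urllib2.urlopen
```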

Since we are going to fetch our data from a website, we have to behave like a browser. That’s why we will need to fake a user agent. For our program, I chose to push the web statistics a little in favour of Firefox and Solaris.

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }

Now let’s take a look at the code of dict.cc. We need to know how the webform is constructed if we want to query it.

...
<form style="margin:0px" action="http://www.dict.cc/" method="get">
  <table>
    <tr>
      <td>
        <input id="sinp" maxlength="100" name="s" size="25" type="text"
          style="padding:2px;width:340px" value="" />
      ...</td>
    </tr>
  </table>
</form>
...

The relevant parts are action, method and the name inside the input tag. The action is the webapplication that will get called when the form is submitted. The method shows us how we need to encode the data for the form while the name is our query variable.
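Because the form says method="get", the submitted search is really just a query string appended to the action URL. Here is a quick offline sketch of that encoding step (written with Python 3’s urllib.parse here; in Python 2 the same function lives at urllib.urlencode):

```python
from urllib.parse import urlencode

# 's' is the name attribute of the search input shown above
values = {'s': 'web'}
query = urlencode(values)              # -> 's=web'
url = 'http://www.dict.cc/?' + query
print(url)                             # http://www.dict.cc/?s=web
```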

values = {'s' : sys.argv[1] }
data = urllib.urlencode(values)
request = urllib2.Request("http://www.dict.cc/", data, headers)
response = urllib2.urlopen(request)

Here the data gets URL-encoded and packed into the webform. Notice that values is a dictionary, which makes handling more complex forms a charm. The form then gets submitted by urlopen() – i.e. we virtually pressed the “Search” button. (Strictly speaking, passing data to urllib2.Request makes urllib2 issue a POST rather than the form’s GET – which dict.cc evidently accepts as well; for a strict GET you would append the encoded data to the URL instead.)
See how easy it is? These are only a couple of lines of code, but we have already searched dict.cc for a completely arbitrary word from the command line. The response has also been retrieved. All that is left is to extract the relevant information.

the_page = response.read()
pool = BeautifulSoup(the_page)

The response is read and saved as regular HTML code. This code could now be analyzed via regular string.find() or re.findall() calls, but that would mean hard-coding a lot of the page’s underlying logic. Besides, it would require a lot of reverse engineering of positional parameters and setting up several potentially recursive methods, ultimately producing ugly (i.e. not very Pythonic) code. Luckily for us, there already is a full-fledged HTML parser which lets us ask just about any generic question. Let’s take a look at the resulting HTML code first. If you are not yet familiar with the tool in the screenshot: I’m using Firefox with the Firebug addon, which is very helpful if you ever need to debug a website.

[Screenshot: dict.cc search results for "web"]

Let me show an excerpt of the code.

<table>...
  <td class="td7nl" style="background-color: rgb(233, 233, 233);">
    <a href="/englisch-deutsch/web.html">
      <b>web</b>
    </a>
  </td>
  <td class="td7nl" ...>...</td>
...</table>

The results are displayed in a table. The two interesting columns share the class td7nl. The most efficient approach seems to be to sweep all the data from the cells of these two columns. Fortunately for us, BeautifulSoup provides just that feature.
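To appreciate what that feature saves us, here is a rough sketch of the same class-based cell collection done by hand with nothing but the standard library’s HTML parser (Python 3 naming; the markup fed in is a made-up miniature of the table above):

```python
from html.parser import HTMLParser

class TdCollector(HTMLParser):
    """Collect the text content of every <td class="td7nl"> cell."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        # Start a new cell buffer whenever a matching <td> opens
        if tag == 'td' and dict(attrs).get('class') == 'td7nl':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Accumulate all text fragments inside the current cell
        if self.in_cell:
            self.cells[-1] += data

parser = TdCollector()
parser.feed('<table><tr>'
            '<td class="td7nl"><a><b>web</b></a></td>'
            '<td class="td7nl">Netz</td>'
            '</tr></table>')
print(parser.cells)    # ['web', 'Netz']
```

BeautifulSoup’s findAll collapses all of this state-keeping into a single call.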

# Grab every table cell that carries the result class
results = pool.findAll('td', attrs={'class' : 'td7nl'})
source = ''
translations = []

for result in results:
    word = ''
    # Merge all text fragments inside the cell into one string
    for tmp in result.findAll(text=True):
        word = word + " " + unicode(tmp).encode("utf-8")
    if source == '':
        source = word                        # first cell is the source term
    else:
        translations.append((source, word))  # every later cell pairs with it

for translation in translations:
    print "%s => %s" % (translation[0], translation[1])

results will be a BeautifulSoup.ResultSet. Each member is the HTML code of one cell of the class td7nl. Notice that you can index it just like a list. result.findAll(text=True) returns every embedded textual element of the cell; all we have to do is merge the different text fragments together.
source and word are temporary variables that hold one translation per iteration. Each translation is saved as a pair (tuple) in the translations list.
Finally we iterate over the found translations and write them to the screen.
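Stripped of all scraping, the pairing logic boils down to a few lines. The cell strings here are made up to mimic what the td7nl cells contain; note that, as written, every cell after the first is paired with the very first one:

```python
# Hypothetical cell texts, mimicking the td7nl columns
cells = ['kinky {adj}', 'kraus', 'nappy {adj}', 'curly {adj}']

source = ''
translations = []
for word in cells:
    if source == '':
        source = word                    # first cell becomes the source term
    else:
        translations.append((source, word))

for src, dst in translations:
    print('%s => %s' % (src, dst))
```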

$ python webscraping_demo.py
 kinky   {adj} =>  9 kraus   [Haar]  
 kinky   {adj} =>  nappy   {adj}   [Am.]
 kinky   {adj} =>  6 kraus   [Haar]  
 kinky   {adj} =>  crinkly   {adj}
 kinky   {adj} =>  kraus  
 kinky   {adj} =>  curly   {adj}
 kinky   {adj} =>  kraus  
 kinky   {adj} =>  frizzily   {adv}

In a regular application those results would need a little lexing, of course. The most important thing, however, is that we just wrote a translation wrapper around a web application – in only 28 lines of code. Did I mention that I’m in love with Python?
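That “little lexing” could start as simply as collapsing the stray whitespace in each cell – a one-liner in Python:

```python
raw = '  kinky   {adj} '           # a raw cell string, as printed above
clean = ' '.join(raw.split())      # collapse runs of whitespace
print(clean)                       # kinky {adj}
```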

All that is left for me is to recommend the BeautifulSoup documentation. What we did here barely scratches the surface of what this module is capable of.

I wish you all the best.



38 Responses to “Webscraping with Python and BeautifulSoup”

  1. versatilemind.com » Lesenswert ☛ Websites abfragen mit Python

    [...] Webscraping with Python and BeautifulSoup Kategorie: Uncategorized ∗ [...]

  2. versatilemind.com » versatilemind.com » Lesenswert ☛ Websites abfragen mit Python

    [...] Alain writes about how to parse HTML pages using BeautifulSoup. Very interesting. Category: Uncategorized ∗ [...]

  3. hackbert

    Great article! Hurt me a little bit, because I had to do the same in Java and it’s a lot more code AND harder to understand. Btw: A good plugin for this kind of work is SelectorGadget:http://www.selectorgadget.com/ – really cool tool!

    Keep up the great work!

    *hackbert

  4. Anish chapagain

    Namaste,
    It’s really nice article to get on with motive to use or be familiar with python, thank’s for the roadview..it’s my turn to race upon it now..

    anish

  5. unicode

    Nice example, im trying with rae.es and i have problems on findAll with unicode errors.
    How can BeautifulSoup understand acents and utf-8 on a html content search?

  6. Alain M. Lafon

    @unicode

    Sorry to upset you, but the data you’re getting might be flawed, because BeautifulSoup works automagically with unicode. I’m quoting the documentation here: “By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.”

    You could try
    unicode(your_string).encode("utf-8")
    to get your code to understand that it’s really unicode you are handling.

    Best,
    Alain M. Lafon

  7. unicode

    Well this is a example of the code im trying to use with BeautifulSoup. As you can see the chartset is utf-8 and still have errors.
    You know, unicode is a puzzle for newbies.

    soup = BeautifulSoup(unicode(example).encode('utf-8'))
    soup.findAll('span')

    http://buscon.rae.es/draeI/SrvltGUIBusUsual?TIPO_HTML=2&TIPO_BUS=3&LEMA=camioneta

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 49: ordinal not in range(128)

    Real Academia Española. Diccionario Usual.     REAL  ACADEMIA  ESPAÑOLA     DICCIONARIO DE LA LENGUA ESPAÑOLA – Vigésima segunda edición camioneta. (Del fr. camionette, dim. de camion). 1. f. Vehículo automóvil menor que el camión y que sirve para transporte de toda clase de mercancías. 2. f. autobús.Real Academia Española © Todos los derechos reservados

  8. Alain M. Lafon

    Hi there Marl,

    Your problem is that you try to encode the whole soup; don’t do that. What you want to do is encode the strings you extract from the soup. Doing it like this works like a charm:
    pool = BeautifulSoup(the_page)
    results = pool.findAll("span")
    for result in results:
        print unicode(result.findAll(text=True)[0]).encode("utf-8")

    All characters are correctly encoded now. The letters(like ñ) – that still don’t add up – are html escaped characters which have to be converted differently(i.e. like http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python). This normally shouldn’t happen; your page says it contains content="text/html; charset=UTF-8", but obviously they intermix with html-encodings.

    I’d also recommend reading about Unicode in particular; http://www.amk.ca/python/howto/unicode is a decent tutorial. It can be a little confusing in the beginning, but if you put your head into it; you’ll get it right(;

    Just one thing: if you post an e-mail address, don’t use a spam aggregator – I’m not showing commenters’ mail addresses anyway, so no crawler will get it from here. I wanted to write you a mail about the topic, but I won’t go to the hassle if you are probably never going to read it there. Besides, using a real nick or even name with your problem will be appreciated. Thank you very much for your future efforts.

  9. unicode

    Thank you very much Alan, it seems like a some kind of trap to protect the web content.
    Btw im still confuse with utf-8. I can put the page content on a wxhtmlwindows without errors:

    f = urllib2.urlopen(page)
    content = f.read().decode('utf-8', 'ignore')
    wx.htmlwindow.setpage(content)

    seems that htmlwindow.setpage do the correct translation but a simple unicode(text.encode('utf-8')) has those html scaped chars.

    About the email direction, mailinator aint a spam agregator, it is precisily to evade spam into my account.
    Mi email from gmail is marlborocb….. if you want to contact me.

  10. Alain M. Lafon

    Hi there Marl,

    you didn’t quite get that right. HTML escaped characters are not for protecting web-content; they have been created to display characters that do not belong to the ASCII[1]. They have been created before Unicode[2] doctypes have been allowed in HTML. You can look them up here[3].
    As I pointed out it isn’t necessary to intermix Unicode and html escaped characters in one page – it is bad style; that’s all. Your page obviously wants to use unicode as can be seen in the header(content="text/html; charset=UTF-8"). But this happens all the time; people tend not to know about what they want to do^^

    Your wx.htmlwindow will display the code correctly anyway, because it was designed to decode html escaped characters – that’s why it is called htmlwindow(;
    If you are confused, I recommend reading about this topic – my links cover everything you need to know about encodings in this particular area.

    Best,
    Alain

    1. http://en.wikipedia.org/wiki/ASCII
    2. http://en.wikipedia.org/wiki/Unicode
    3. http://www.utexas.edu/learn/html/spchar.html

  11. josh

    Be careful with firebug, it’s the best addon for firefox (apart from adblock+), but it shows the html source code for a website after it has been rendered by firefox, which is then tidied by firebug.

    It’s nice to view the source code and css titles quickly, but I still have to delve into the source code to find the proper written code that the author wrote.

    I’m not sure if there is someway to disable it from redoing the source code, but i love it all the same.

    Josh

  12. plain-simple-garak

    Also check out lxml; supposedly it’s as good or better than BeautifulSoup in many respects.

  13. HONDATA

    Why do I get a virus warning when I reach your site? You might want to check your ads. As a matter of fact, I’m going to check my computer to make sure it’s not me.

  14. Amelia Dedominicis

    Thank you for share your rather great informations. Your online is great.I am impressed by the details that you’ve on this blog. It exhibits how nicely you realize this subject. Bookmarked the following page, will arrive back for extra. You, my buddy, awesome! I found just the material I already searched all over the place and just couldn’t come across. What a perfect internet site. Such as this webpage your web-site is 1 of my new most liked.I like this data shown and it has provided me some type of inspiration to have accomplishment for some cause, so keep up the superior perform!

  15. Alethea Mchaffie

    have been visiting your site around a few days. absolutely love what you posted. btw i will be conducting a study about this area. do you happen to know other great blogs or forums that I can get more info? thanks in advance.

  16. clockworkpc

    Hi, I’m following your example line by line in a Python script, but this line gives me an error:

    values = {'s' : sys.argv[1] }

    Traceback (most recent call last):
    File "/usr/local/bin/dictscrape.py", line 33, in
    values = {'s' : sys.argv[1] }
    IndexError: list index out of range

    Here’s my Python script exactly as it appears. Am I doing something wrong?

    #!/usr/bin/python
    #/home/clockworkpcasus/Documents/bin/dictscrape.py

    import urllib
    import urllib2
    import string
    import sys
    from BeautifulSoup import BeautifulSoup

    user_agent = 'Mozilla/5 (Solaris 10) Gecko'
    headers = { 'User-Agent' : user_agent}

    values = {'s' : sys.argv[1] }
    data = urllib.urlencode(values)
    request = urllib2.Request('http://www.dict.cc/', data, headers)
    response = urllib2.urlopen(request)

  17. Alain M. Lafon

    Well.. Python tells you that you gave no arguments to the script. Did you tell it which word you wanted to look up?(;

  18. Michael Demus

    Outstanding information over again! Thank you:)

  19. William

    Hi! I’m getting this error when I try to run your example, any ideas?

    File "webscraping_demo.py", line 36
    print "%s => %s" % (translation[0], translation[1])
    ^

  20. Python Quick Hacks and Codes | Pearltrees

    [...] Alternatively, if you just want to do a quick test, put in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE(for quick prototyping like we are about to do, I highly recommend a combination of IDLE and VIM) and begin coding. In this tutorial we won’t be doing any design; we won’t even encapsulate in a class. Webscraping with Python and BeautifulSoup | Alain M. Lafon [...]

  21. Daniel

    Hi, really nice example and exactly what I was looking for.

    However I tried to run it and was not able to produce any results.

    I am a bit ashamed, but yes I don’t know where and how to place the word I want to get translations for.

    I got the same problem as clockworkpc. Could you help me with it?

    Thanks a lot!

  22. Abhinay

    Hi.
    this tutorial was really helpful.
    Thanks a lot!

  23. Mathias' Blog » Python stock quote fetch

    [...] For the price of the ETF I need an HTML parser, since Yahoo! does not offer that quote. With BeautifulSoup I have taken a new favourite parser to heart – so the quote from finanzen.net was integrated in no time. (Python tip: Webscraping with Python and beautifulsoup) [...]

  24. Dorkboy

    Great tutorial!

    I am having trouble with http://www.cboe.com/delayedquote/quotetable.aspx

    This page is a ‘POST’ instead of a ‘GET’. Will that matter?

    I want use the first text box – under where it says:

    “Enter a Stock or Index symbol below for delayed quotes.”
    “CBOE Company News and Reports”

    I am using ‘ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol’ for my ‘name’ variable.

    My code looks like:

    values = {'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol' : 'IBM' }

    Thanks in advance!

  25. Webscraping with Python and BeautifulSoup « Data Meaning…

    [...] http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/ [...]

  26. Dirk Krause

    https://gist.github.com/4094607

  27. lyly0000

    I am the newbie to python. I installed python 3.1 and tried your tutorial. According to the documentation, urllib2 is combined with urllib, but the problem is I can use urllib2 intellisense. But it shows me the error when I run – File “C:\py\scrap.py”, line 2, in
    import urllib2
    ImportError: No module named urllib2. I also checked other solution for urllib2, but it doesn’t need to be changed the code as far as I know. Can you please help me out. I wasted too much time on starting python on windows xp and urllib2 library. Thanks.

  28. Alain M. Lafon

    Hi there

    I can give you a good piece of advice. WinXP is not a really good development
    platform – except if you want to develop legacy Windows Applications in Win32
    C++ or something similar.

    You could install a Linux (like Ubuntu) and your problems will probably go away.
    If you’re uncertain about getting rid of your Windows, try installing Linux just
    as a Virtual Machine to get started.

    Best regards,
    Munen

  29. David Esp Video Technical Blog » Blog Archive » Web-Scraping in Python

    [...] http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/ [...]

  30. Ibtissem

    Hi,

    I’m trying to get a page from a website, unfortunately I get the message that I’m not authorized to do it even if I’m using the user agent, but before using this method I was fetching without the user agent, so I don’t know if the IP adress was detected and is the cause of this error.

    Thank you in advance four your answer

  31. notalentgeek

    I have a newbie question here… How can I determine which word I am going to search?

  32. Brandy

    Hello, just wanted to say, I liked this article.

    It was practical. Keep on posting!

  33. Jacob

    Is it possible to download dynamically generated comments
    like in register.co.uk (disqus) and click on the button
    “Load more comments” as it appear

  34. Jacob

    Sorry it was not register but telegraph.co.uk
    like in /finance /comment /ambroseevans_pritchard/

    I tried it in ruby…..

  35. Alain M. Lafon

    Disqus has an API, have you checked it out? (http://disqus.com/api/docs/)

  36. Jacob

    I think I must be the owner of the site to use API. But now ruby
    version is working…..

  37. Data Scraping

    Nice article……i am also scraper and doing it with scrappy framework but i am not aware about beautifulsoup….now i am gonna try this….

  38. developer android

    You can definitely see your expertise within the work you write.

    The world hopes for even more passionate writers such as you who are not afraid to mention how they believe.
    All the time go after your heart.

