Webscraping with Python and BeautifulSoup

Recently my life has been quite a ride, partly due to my growing Python addiction. There’s simply no way around it, so I had better confess it in public: I’m in love with Python. It’s not only mature, business-proof and performant, but also sleek and just so much fun to write. It’s as if I were in Star Trek and only had to tell the computer what I wanted, never minding how the job actually gets done. Even my favourite comic artist (besides Scott Adams, of course) took up on it, so my feelings have to be honest.

In this short tutorial, I’m going to show you how to scrape a website with the third-party HTML-parsing module BeautifulSoup in a practical example. We will search the wonderful translation engine dict.cc, which holds the key to over 700k translations between English and German. Note that BeautifulSoup is licensed just like Python, while dict.cc allows for external searching.

First off, place BeautifulSoup.py in your modules directory. Alternatively, if you just want to do a quick test, put it in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE (for quick prototyping like we are about to do, I highly recommend a combination of IDLE and Vim) and begin coding. In this tutorial we won’t be doing any design; we won’t even encapsulate the code in a class. Whether to do that later on is up to your needs.
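If you have setuptools around, grabbing the module from PyPI should also do the trick – a one-liner, assuming the package is published under the name BeautifulSoup:

$ easy_install BeautifulSoup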

What we will do:

  1. go to dict.cc
  2. enter a search word into the webform
  3. submit the form
  4. read the result
  5. parse the html code
  6. save all translations
  7. print them

You can either read the needed code on the fly or download it.
Now let’s begin the magic. These are the imports we need.

import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup

urllib and urllib2 are both modules for reading data from various URLs; they will be needed to open the connection and retrieve the website. BeautifulSoup is, as mentioned, an HTML parser.

Since we are going to fetch our data from a website, we have to behave like a browser. That’s why we will be needing to fake a user agent. For our program, I chose to push the web statistics a little in favour of Firefox and Solaris.

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }

Now let’s take a look at the code of dict.cc. We need to know how the webform is constructed if we want to query it.

...
<form style="margin:0px" action="http://www.dict.cc/" method="get">
  <table>
    <tr>
      <td>
        <input id="sinp" maxlength="100" name="s" size="25" type="text" />
        style="padding:2px;width:340px" value="">
      ...</td>
    </tr>
  </table>
</form>
...

The relevant parts are the action, the method and the name inside the input tag. The action is the web application that will be called when the form is submitted. The method tells us how we need to encode the data for the form, while the name is our query variable.

values = {'s' : sys.argv[1] }
data = urllib.urlencode(values)
# The form uses method="get", so the encoded query string is appended to
# the URL; passing it as the second argument would send a POST instead.
request = urllib2.Request("http://www.dict.cc/?" + data, None, headers)
response = urllib2.urlopen(request)

Here the data gets URL-encoded and appended to the address – exactly what submitting a GET form does. Notice that values is a dictionary, which makes handling more complex forms a charm. (Had the form used method="post", we would pass data as the second argument to urllib2.Request instead of gluing it onto the URL.) The form then gets submitted by urlopen() – i.e. we virtually pressed the “Search” button.
See how easy it is? These are only a couple of lines of code, but we have already searched dict.cc for a completely arbitrary word from the command line. The response has also been retrieved. All that is left is to extract the relevant information.
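In case the encoding step seems opaque: this is what urlencode does to such a dictionary in an interactive session (the search phrase is made up, of course):

>>> import urllib
>>> urllib.urlencode({'s': 'web design'})
's=web+design'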

the_page = response.read()
pool = BeautifulSoup(the_page)

The response is read and saved as regular HTML code. This code could now be analyzed with plain string.find() or re.findall() calls, but that would mean hard-coding a lot of the underlying logic of the page. Besides, it would require plenty of reverse engineering of positional parameters and setting up several potentially recursive methods. This would ultimately produce ugly (i.e. not very Pythonic) code. Lucky for us, there already is a full-fledged HTML parser which allows us to ask just about any generic question.
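For contrast, here is a quick sketch of what the string-matching route might look like – note how it hard-codes the markup and breaks as soon as dict.cc reorders attributes or adds a style to the cell (assuming the_page from above):

import re

# Brittle: relies on class="td7nl" being the first attribute of the cell.
cells = re.findall(r'<td class="td7nl"[^>]*>(.*?)</td>', the_page, re.DOTALL)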

Let’s take a look at the resulting HTML code first. If you are not yet familiar with the tool in the screenshot below: I’m using Firefox with the Firebug addon, which is very helpful if you ever need to debug a website.

[Screenshot: dict.cc // search for "web"]

Let me show an excerpt of the code.

<table> ...
  <tr>
    <td class="td7nl" style="background-color: rgb(233, 233, 233);">
      <a href="/englisch-deutsch/web.html">
        <b>web</b>
      </a>
    </td>
    <td class="td7nl" ... ></td>
  </tr>
... </table>

The results are displayed in a table. The two interesting columns share the class td7nl. The most efficient way seems to be to just sweep all the data from inside the cells of these two columns. Fortunately for us, BeautifulSoup provides just that feature.

results = pool.findAll('td', attrs={'class' : 'td7nl'})
source = ''
translations = []

for result in results:
    word = ''
    for tmp in result.findAll(text=True):
        word = word + " " + unicode(tmp).encode("utf-8")
    if source == '':
        source = word
    else:
        translations.append((source, word))
        # reset source so that the next cell starts a new pair
        source = ''

for translation in translations:
    print "%s => %s" % (translation[0], translation[1])

results will be a BeautifulSoup.ResultSet. Each member of the result set is the HTML code of one cell of the class td7nl, and you can index it just like a list. result.findAll(text=True) returns every embedded textual element of a cell; all we have to do is merge those fragments together.
source and word are temporary variables. The cells alternate between the two columns, so source holds the left-hand cell until the matching right-hand cell arrives; the pair is then appended to the translations list and source is reset for the next row.
Finally we iterate over the found translations and write them to the screen.
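If findAll(text=True) seems magical, here is what it does to a single made-up cell in an interactive session:

>>> from BeautifulSoup import BeautifulSoup
>>> cell = BeautifulSoup('<td class="td7nl"><a href="#"><b>web</b></a> {noun}</td>')
>>> cell.findAll(text=True)
[u'web', u' {noun}']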

$ python webscraping_demo.py kinky
 kinky   {adj} =>  9 kraus   [Haar]  
 nappy   {adj}   [Am.] =>  6 kraus   [Haar]  
 crinkly   {adj} =>  kraus  
 curly   {adj} =>  kraus  

In a regular application those results would need a little lexing, of course. The most important thing, however, is that we just wrote a translation wrapper around a web application – in less than 30 lines of code. Did I mention that I’m in love with Python?
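And if you ever want to reuse it, here is a minimal sketch of how the steps above could be folded into one function – just one possible shape, not tested beyond the example above:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def translate(word):
    # Same steps as in the tutorial: fake a user agent, submit the GET
    # form, parse the result and pair up the td7nl cells.
    headers = {'User-Agent': 'Mozilla/5 (Solaris 10) Gecko'}
    data = urllib.urlencode({'s': word})
    request = urllib2.Request("http://www.dict.cc/?" + data, None, headers)
    pool = BeautifulSoup(urllib2.urlopen(request).read())
    translations = []
    source = ''
    for cell in pool.findAll('td', attrs={'class': 'td7nl'}):
        text = ' '.join(unicode(t) for t in cell.findAll(text=True))
        if source == '':
            source = text
        else:
            translations.append((source, text))
            source = ''
    return translations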

All that is left is for me to recommend the BeautifulSoup documentation. What we did here barely scratches the surface of what this module is capable of.

I wish you all the best.



38 Responses to “Webscraping with Python and BeautifulSoup”

  1. versatilemind.com » Worth reading ☛ Querying websites with Python

    […] Webscraping with Python and BeautifulSoup Category: Uncategorized ∗ […]

  2. versatilemind.com » Worth reading ☛ Querying websites with Python

    […] Alain writes about how to parse HTML pages with BeautifulSoup. Very interesting. Category: Uncategorized ∗ […]

  3. hackbert

    Great article! Hurt me a little bit, because I had to do the same in Java and it’s a lot more code AND harder to understand. Btw: a good plugin for this kind of work is SelectorGadget: http://www.selectorgadget.com/ – really cool tool!

    Keep up the great work!

    *hackbert

  4. Anish chapagain

    Namaste,
    It’s a really nice article to get going and become familiar with Python; thanks for the road view… it’s my turn to race on it now.

    anish

  5. unicode

    Nice example. I’m trying it with rae.es and I have problems on findAll with Unicode errors.
    How can BeautifulSoup understand accents and UTF-8 in HTML content?

  6. Alain M. Lafon

    @unicode

    Sorry to upset you, but the data you’re getting might be flawed, because BeautifulSoup works automagically with unicode. I’m quoting the documentation here: “By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.”

    You could try
    unicode(your_string).encode("utf-8")
    to get your code to understand that it’s really unicode you are handling.

    Best,
    Alain M. Lafon

  7. unicode

    Well, this is an example of the code I’m trying to use with BeautifulSoup. As you can see the charset is UTF-8 and I still have errors.
    You know, Unicode is a puzzle for newbies.

    soup = BeautifulSoup(unicode(example).encode('utf-8'))
    soup.findAll('span')

    http://buscon.rae.es/draeI/SrvltGUIBusUsual?TIPO_HTML=2&TIPO_BUS=3&LEMA=camioneta

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 49: ordinal not in range(128)

    Real Academia Española. Diccionario Usual.     REAL  ACADEMIA  ESPAÑOLA     DICCIONARIO DE LA LENGUA ESPAÑOLA – Vigésima segunda edición camioneta. (Del fr. camionette, dim. de camion). 1. f. Vehículo automóvil menor que el camión y que sirve para transporte de toda clase de mercancías. 2. f. autobús.Real Academia Española © Todos los derechos reservados

  8. Alain M. Lafon

    Hi there Marl,

    Your problem is that you try to encode the whole soup; don’t do that. What you want to do is encode the strings you extract from the soup. Doing it like this works like a charm:
    pool = BeautifulSoup(the_page)
    results = pool.findAll("span")
    for result in results:
        print unicode(result.findAll(text=True)[0]).encode("utf-8")

    All characters are correctly encoded now. The letters (like ñ) that still don’t add up are HTML-escaped characters, which have to be converted differently (i.e. like http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python). This normally shouldn’t happen; your page says it contains content="text/html; charset=UTF-8", but they obviously intermix it with HTML encodings.
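    If I remember correctly, BeautifulSoup can also resolve those entities itself while parsing, via its convertEntities argument – a sketch, untested against your page:

    pool = BeautifulSoup(the_page, convertEntities=BeautifulSoup.HTML_ENTITIES)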

    I’d also recommend reading up on Unicode in general; http://www.amk.ca/python/howto/unicode is a decent tutorial. It can be a little confusing in the beginning, but if you put your head into it, you’ll get it right(;

    Just one thing: if you post an e-mail address, don’t use a spam aggregator – I’m not showing commenters’ mail addresses anyway, so no crawler will get it from here. I wanted to write you a mail about the topic, but I won’t go through the hassle if you are probably never going to read it there. Besides, using a real nick or even your name with your problem would be appreciated. Thank you very much for your future efforts.

  9. unicode

    Thank you very much Alain, it seems like some kind of trap to protect the web content.
    Btw I’m still confused by UTF-8. I can put the page content into a wxHtmlWindow without errors:

    f = urllib2.urlopen(page)
    content = f.read().decode('utf-8', 'ignore')
    wx.htmlwindow.setpage(content)

    It seems that htmlwindow.setpage does the correct translation, but a simple unicode(text.encode('utf-8')) still has those HTML-escaped chars.

    About the email address: mailinator isn’t a spam aggregator, it exists precisely to keep spam out of my account.
    My email at gmail is marlborocb….. if you want to contact me.

  10. Alain M. Lafon

    Hi there Marl,

    you didn’t quite get that right. HTML-escaped characters are not for protecting web content; they were created to display characters that do not belong to ASCII [1], and they predate the time when Unicode [2] encodings were allowed in HTML. You can look them up here [3].
    As I pointed out, one shouldn’t intermix Unicode and HTML-escaped characters in one page – it is bad style; that’s all. Your page obviously wants to use Unicode, as can be seen in the header (content="text/html; charset=UTF-8"). But this happens all the time; people tend not to know what they want to do^^

    Your wx.htmlwindow will display the code correctly anyway, because it was designed to decode HTML-escaped characters – that’s why it is called htmlwindow(;
    If you are confused, I recommend reading about this topic – my links cover everything you need to know about encodings in this particular area.

    Best,
    Alain

    1. http://en.wikipedia.org/wiki/ASCII
    2. http://en.wikipedia.org/wiki/Unicode
    3. http://www.utexas.edu/learn/html/spchar.html

  11. josh

    Be careful with Firebug. It’s the best addon for Firefox (apart from Adblock Plus), but it shows the HTML source code of a website after it has been rendered by Firefox, which is then tidied up by Firebug.

    It’s nice for viewing the source code and CSS styles quickly, but I still have to delve into the raw source to find the code as the author actually wrote it.

    I’m not sure if there is some way to keep it from rewriting the source code, but I love it all the same.

    Josh

  12. plain-simple-garak

    Also check out lxml; supposedly it’s as good or better than BeautifulSoup in many respects.

  13. HONDATA

    Why do I get a virus warning when I reach your site? You might want to check your ads. As a matter of fact, I’m going to check my computer to make sure it’s not me.

  14. Amelia Dedominicis

    Thank you for share your rather great informations. Your online is great.I am impressed by the details that you’ve on this blog. It exhibits how nicely you realize this subject. Bookmarked the following page, will arrive back for extra. You, my buddy, awesome! I found just the material I already searched all over the place and just couldn’t come across. What a perfect internet site. Such as this webpage your web-site is 1 of my new most liked.I like this data shown and it has provided me some type of inspiration to have accomplishment for some cause, so keep up the superior perform!

  15. Alethea Mchaffie

    have been visiting your site around a few days. absolutely love what you posted. btw i will be conducting a study about this area. do you happen to know other great blogs or forums that I can get more info? thanks in advance.

  16. clockworkpc

    Hi, I’m following your example line by line in a Python script, but this line gives me an error:

    values = {'s' : sys.argv[1] }

    Traceback (most recent call last):
      File "/usr/local/bin/dictscrape.py", line 33, in <module>
        values = {'s' : sys.argv[1] }
    IndexError: list index out of range

    Here’s my Python script exactly as it appears. Am I doing something wrong?

    #!/usr/bin/python
    #/home/clockworkpcasus/Documents/bin/dictscrape.py

    import urllib
    import urllib2
    import string
    import sys
    from BeautifulSoup import BeautifulSoup

    user_agent = 'Mozilla/5 (Solaris 10) Gecko'
    headers = { 'User-Agent' : user_agent }

    values = {'s' : sys.argv[1] }
    data = urllib.urlencode(values)
    request = urllib2.Request('http://www.dict.cc/', data, headers)
    response = urllib2.urlopen(request)

  17. Alain M. Lafon

    Well.. Python tells you that you gave no arguments to the script. Did you tell it which word you wanted to look up?(;

  18. Michael Demus

    Outstanding information once again! Thank you:)

  19. William

    Hi! I’m getting this error when I try to run your example, any ideas?

    File “webscraping_demo.py”, line 36
    print "%s => %s" % (translation[0], translation[1])
    ^

  20. Python Quick Hacks and Codes | Pearltrees

    […] Alternatively, if you just want to do a quick test, put in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE(for quick prototyping like we are about to do, I highly recommend a combination of IDLE and VIM) and begin coding. In this tutorial we won’t be doing any design; we won’t even encapsulate in a class. Webscraping with Python and BeautifulSoup | Alain M. Lafon […]

  21. Daniel

    Hi, really nice example and exactly what I was looking for.

    However I tried to run it and was not able to produce any results.

    I am a bit ashamed, but yes I don’t know where and how to place the word I want to get translations for.

    I got the same problem as clockworkpc. Could you help me with it?

    Thanks a lot!

  22. Abhinay

    Hi.
    this tutorial was really helpful.
    Thanks a lot!

  23. Mathias' Blog » Python stock quote fetch

    […] For the price of the ETF I need an HTML parser, since Yahoo! does not offer that quote. With BeautifulSoup I have taken a new favourite parser to heart – so the quote from finanzen.net was integrated in no time. (Python tip: Webscraping with Python and beautifulsoup) […]

  24. Dorkboy

    Great tutorial!

    I am having trouble with http://www.cboe.com/delayedquote/quotetable.aspx

    This page is a 'POST' instead of a 'GET'. Will that matter?

    I want to use the first text box – under where it says:

    “Enter a Stock or Index symbol below for delayed quotes.”
    “CBOE Company News and Reports”

    I am using 'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol' for my 'name' variable.

    My code looks like:

    values = {'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol' : 'IBM' }

    Thanks in advance!

  25. Webscraping with Python and BeautifulSoup « Data Meaning…

    […] http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/ […]

  26. Dirk Krause

    https://gist.github.com/4094607

  27. lyly0000

    I am a newbie to Python. I installed Python 3.1 and tried your tutorial. According to the documentation, urllib2 has been merged into urllib, but the problem is that IntelliSense still offers me urllib2. When I run it, it shows me the error – File "C:\py\scrap.py", line 2, in <module>
    import urllib2
    ImportError: No module named urllib2. I also checked other solutions for urllib2, but as far as I know the code doesn’t need to be changed. Can you please help me out? I have wasted too much time on getting started with Python on Windows XP and the urllib2 library. Thanks.

  28. Alain M. Lafon

    Hi there

    I can give you a good piece of advice. WinXP is not a really good development
    platform – except if you want to develop legacy Windows Applications in Win32
    C++ or something similar.

    You could install a Linux (like Ubuntu) and your problems will probably go away.
    If you’re uncertain about getting rid of your Windows, try installing Linux just
    as a Virtual Machine to get started.

    Best regards,
    Munen

  29. David Esp Video Technical Blog » Blog Archive » Web-Scraping in Python

    […] http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/ […]

  30. Ibtissem

    Hi,

    I’m trying to get a page from a website; unfortunately I get the message that I’m not authorized to do it, even though I’m using the user agent. Before using this method I was fetching without the user agent, so I don’t know whether my IP address was detected and is the cause of this error.

    Thank you in advance for your answer

  31. notalentgeek

    I have a newbie question here… How can I determine which word I am going to search?

  32. Brandy

    Hello, just wanted to say, I liked this article.

    It was practical. Keep on posting!

  33. Jacob

    Is it possible to download dynamically generated comments,
    like on register.co.uk (Disqus), and to click on the button
    “Load more comments” as it appears?

  34. Jacob

    Sorry, it was not The Register but telegraph.co.uk,
    like in /finance /comment /ambroseevans_pritchard/

    I tried it in Ruby…..

  35. Alain M. Lafon

    Disqus has an API, have you checked it out? (http://disqus.com/api/docs/)

  36. Jacob

    I think I must be the owner of the site to use the API. But now the Ruby
    version is working…..

  37. Data Scraping

    Nice article…… I am also a scraper and do it with the Scrapy framework, but I was not aware of BeautifulSoup…. now I am going to try this….

  38. developer android

    You can definitely see your expertise within the work you write.

    The world hopes for even more passionate writers such as you who are not afraid to say what they believe.
    Always follow your heart.