Archive for August 2010


Open PDFs at specific entry points in the browser

August 17th, 2010 — 03:27 pm

PDF files can be opened at specific points in the browser: at a given page or at the beginning of a chapter. This is done by appending open parameters to the link URL after a # sign.

Adobe has some documentation concerning start parameters for PDFs here and here.

Open via chapter name

Append a # followed by the complete chapter name to your link URL.

Example: http://blog.dispatched.ch/sps/thesis.pdf#3.4%20Programmiersprache%20und%20Paradigmen

Note: Chapter names have to be URL-encoded, as is common practice in URLs. In this example the spaces are encoded as %20.
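
For illustration, this is how such a link could be built in Python (a minimal sketch, not from the original post; urllib.quote is the Python 2 spelling, Python 3 uses urllib.parse.quote):

import urllib

base = "http://blog.dispatched.ch/sps/thesis.pdf"
chapter = "3.4 Programmiersprache und Paradigmen"
# quote() percent-encodes the chapter title, turning spaces into %20
chapter_link = base + "#" + urllib.quote(chapter)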

Open a specific page

Append #page= followed by the page number to your link URL.

Example: http://blog.dispatched.ch/sps/thesis.pdf#page=38
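
The page variant needs no encoding; a minimal sketch along the lines of the example above:

base = "http://blog.dispatched.ch/sps/thesis.pdf"
page_link = base + "#page=38"  # opens the PDF directly at page 38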


BeautifulSoup vs. lxml benchmark

August 16th, 2010 — 12:17 pm

Until now, I’ve been using BeautifulSoup whenever I had to parse HTML (for example in my dictionary pDict). But this time I’m working on a larger-scale project which involves quite a lot of HTML parsing, and BeautifulSoup disappointed me performance-wise. In fact, the project wouldn’t be possible using it. Well, it would be, if I subscribed to half of Amazon EC2(;

Since the project is in stealth mode right now, I can’t say which pages I am referring to, but let me give you these facts:

  • ~170 KB of HTML code
  • W3C validation shows about 1300 errors and 2600 warnings per page

Considering this many errors and warnings, I previously thought the job had to be done with BeautifulSoup, because it is known for its very error-resistant parser. In fact, BeautifulSoup doesn’t parse the HTML directly, but splits it into a tag soup by applying regular expressions to the tags. Contrary to the common wisdom about regular expressions and HTML, this seems to make BeautifulSoup very resilient towards bad markup.
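
To illustrate that resilience, here is a small sketch (Python 2 with BeautifulSoup 3, matching the code below; the broken markup is made up for the example):

import BeautifulSoup

# Deliberately invalid HTML: unquoted attribute, unclosed <p> and <b> tags
broken = '<div class=box><p>first<p>second <b>bold <a href="/x">link</a></div>'

soup = BeautifulSoup.BeautifulSoup(broken)
hrefs = [a["href"] for a in soup.findAll("a")]  # -> [u'/x']
fixed = soup.prettify()                         # re-nested, properly closed markup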

However, BeautifulSoup doesn’t perform well on the described files. The task: I need to parse 20 links of a particular class off the page. I put the relevant code in a separate method and profiled it using cProfile:

import cProfile
import BeautifulSoup

def parse_with_beautifulsoup(html_data):
    # Build the soup and collect the href of every link with the target class
    soup = BeautifulSoup.BeautifulSoup(html_data)
    links_res = soup.findAll("a", attrs={"class": "detailsViewLink"})
    return [link["href"] for link in links_res]

cProfile.runctx("parse_with_beautifulsoup(html_data)", globals(), locals())

Parsing 20 pages, this takes 167s on my small Debian VPS. That’s more than 8s per page. Incredibly long. Given how BeautifulSoup parses, however, it’s understandable: the overhead of creating the tag soup and parsing via regular expressions leads to a whopping 302’000 method calls for just these few lines of code. I repeat: 302’000 method calls for a few lines of code.
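
For reference, those call counts and timings can be broken down further with the standard pstats module (a sketch; parse_with_beautifulsoup and html_data are the ones from the snippet above):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.runctx("parse_with_beautifulsoup(html_data)", globals(), locals())
# Print the ten most expensive calls, sorted by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)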

Hence, I tried lxml. The corresponding code is:

import lxml.html

# Parse the page and pick the links via a CSS selector
root = lxml.html.fromstring(html_data)
links_lxml_res = root.cssselect("a.detailsViewLink")
links_lxml = [link.get("href") for link in links_lxml_res]
links_lxml = list(set(links_lxml))  # drop duplicate hrefs

On the 20 pages, this takes only 2.4s. That’s only 0.12s per page. lxml needed only 180 method calls for the job. It runs 70x faster than BeautifulSoup and creates 1600x fewer calls.

When you graph these numbers, the performance difference looks ridiculous. Well, let’s have some fun(;

lxml vs BeautifulSoup performance

Considering that lxml supports XPath as well, I’m permanently switching to it as my default HTML parsing library.
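
For comparison, the same selection written with XPath instead of a CSS selector (a sketch with made-up stand-in markup; the plain @class test assumes the attribute holds exactly that one class):

import lxml.html

# Hypothetical stand-in for the real page
html_data = ('<div><a class="detailsViewLink" href="/detail/1">one</a>'
             '<a class="detailsViewLink" href="/detail/2">two</a></div>')

root = lxml.html.fromstring(html_data)
# XPath equivalent of the CSS selector "a.detailsViewLink"
links_xpath = list(set(root.xpath('//a[@class="detailsViewLink"]/@href')))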

Note: Ian Bicking wrote a wonderful summary in 2008 on the performance of several Python HTML parsers which led me to lxml and to this article.

Update (08/17/2010): I planned on implementing my results on Google AppEngine. “Unfortunately” lxml relies heavily on C-code (that’s where the speed comes from^^). AppEngine is a pure Python environment. It will never run modules written in C.


Apple vs. Microsoft – The final frontier

August 13th, 2010 — 02:27 pm

Finally, I no longer have to listen to this sentence: “Well, I’m using Microsoft Windows. You know, that’s the better OS. If it weren’t, why is Microsoft the bigger player?”
My friends, the tables have turned. According to Yahoo Finance, Apple is now worth 230B whereas Microsoft caps out at 212B (as of 08/13/2010).

There’s much joy to extract from this graph from WolframAlpha (note: WA uses data from 06/30/2010. The breakthrough was imminent back then):

Now, please don’t come bothering me with the argument that Microsoft still has Windows and Office to collect the big routine paycheck while Apple has to innovate constantly to keep up its pace. Depending on those two has taught Microsoft nothing but stagnation. In the meantime, Apple, Google, Facebook and Twitter arose. If Microsoft doesn’t change pace, they will suffer Yahoo’s fate.

If you ask me, that would be for the best. In technology, it’s not the biggest player that wins in the long run, but the most innovative one. Do you remember Kodak, whose cameras didn’t go digital in time (1990: $7B, 2010: $0, source)? Geocities was hip once and worth $3.6B in 1999; now it’s dead (source).

The list goes on. Even under market domination, there is still room to manoeuvre. Microsoft didn’t equip the first personal computers, and they certainly won’t equip the last.

