Tag: python


How to become a proficient Python programmer

Spoiler: this post is primarily going to be an excerpt from my bookmarks collection. That’s because people far smarter than me have already written great articles on how to become a great Python programmer.

I will focus on four primary topics: functional programming, performance, testing and code guidelines. A programmer who combines all four is well on the way to greatness.

Functional programming

Writing code in an imperative style has become the de facto standard. Imperative programs consist of statements that describe changes of state. While this can be a performant way of coding, it often isn’t (complexity being one reason), and it is probably not the most intuitive way when compared with declarative programming.

If you don’t know what I’m talking about, that’s great: here are some starter articles to get your mind running. But beware, it’s a little like the red pill – once you’ve tasted functional programming, you don’t want to go back.
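
To make the contrast concrete, here is a tiny sketch of my own (not taken from the linked articles) that builds the same list three ways: imperatively, with a list comprehension, and with the classic functional building blocks map and filter:

# Imperative: describe *how* to build the result, mutating state along the way
squares = []
for n in range(10):
  if n % 2 == 0:
    squares.append(n * n)

# Declarative: describe *what* the result is
squares = [n * n for n in range(10) if n % 2 == 0]

# Functional building blocks: map and filter
squares = list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, range(10))))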

Performance

There’s so much talk going on about how inefficient these ‘scripting languages’ (Python, Ruby, …) are that it’s easy to forget that very often it’s the algorithm chosen by the programmer that leads to horrible runtime behaviour.
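
As a toy illustration of that point (my own example, not from the articles below), here is the same de-duplication task with two different data structures; the algorithmic choice, not the language, dominates the runtime:

import timeit

def dedupe_with_list(items):
  # O(n^2): every membership test scans the whole list
  seen = []
  for item in items:
    if item not in seen:
      seen.append(item)
  return seen

def dedupe_with_set(items):
  # O(n): membership tests on a set are constant time on average
  seen = set()
  result = []
  for item in items:
    if item not in seen:
      seen.add(item)
      result.append(item)
  return result

data = list(range(2000)) * 2
print(timeit.timeit(lambda: dedupe_with_list(data), number=3))
print(timeit.timeit(lambda: dedupe_with_set(data), number=3))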

Those articles are a great place to get a feel for the ins and outs of Python’s runtime behaviour, so you can write your high-performing application in a language that is concise and fun to work with. And if your manager asks about Python’s performance, don’t forget to mention that the second-largest search engine in the world, namely YouTube, runs on Python (see Python quotes).

Testing

Testing is probably one of the most misjudged topics in computer science these days. Some programmers really get it and emphasize TDD (test-driven development) and its successor BDD (behaviour-driven development) wherever possible. Others simply don’t feel it yet and think it’s a waste of time. Well, I’m gonna be that guy and tell you: if you haven’t started out on TDD/BDD yet, you have missed out greatly!
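
If TDD is new to you, a minimal sketch with Python’s built-in unittest module might look like this; in TDD you would write the tests first and only then the code. Note that slugify is a made-up function under test, not something from the linked articles:

import unittest

def slugify(title):
  """Hypothetical function under test: turn a title into a URL slug."""
  return "-".join(title.lower().split())

class SlugifyTest(unittest.TestCase):
  def test_spaces_become_dashes(self):
    self.assertEqual(slugify("Hello World"), "hello-world")

  def test_result_is_lowercase(self):
    self.assertEqual(slugify("PyThOn"), "python")

if __name__ == "__main__":
  unittest.main()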

It’s not about introducing a technology to replace that release-management automaton in your company who mindlessly clicks through the application once in a while; it’s about giving you a tool to deeply understand your own problem domain – to really conquer, manipulate and twist it the way you want and need it to be. If you haven’t yet, give it a shot. These articles will give you some impulses:

Code guidelines

Not all code is created equal. Some can be read and changed by any great programmer out there. But some can only be read and only sometimes changed by the original author – and that maybe only a couple of hours after he or she wrote it. Why is that? Because of missing test coverage (see above) and the lack of proper usage of coding guidelines.

These articles establish an absolute minimum to adhere to. When you follow them, you will write more concise and beautiful code which, as a side effect, will be more readable and adaptable for you and anyone else.
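
As a rough before/after sketch of my own (the guidelines themselves cover much more), here is the same function written carelessly and then with descriptive, PEP 8-style names and an idiomatic comprehension:

# Hard to read: cryptic names and manual index bookkeeping
def f(l):
  r = []
  i = 0
  while i < len(l):
    if l[i] > 0:
      r.append(l[i] * 2)
    i = i + 1
  return r

# Descriptive names plus an idiomatic comprehension
def double_positives(numbers):
  return [n * 2 for n in numbers if n > 0]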

Now go ahead and spread the word. Start with the person sitting right next to you. Maybe you can go to the next hackathon or coding dojo and become proficient programmers together!

All the best on your journey.

If you liked this article, please feel free to re-tweet it and let others know.



BeautifulSoup vs. lxml benchmark

Previously, I’ve been using BeautifulSoup whenever I had to parse HTML (for example in my dictionary pDict). But this time I’m working on a larger-scale project that involves quite a lot of HTML parsing, and BeautifulSoup disappointed me performance-wise. In fact, the project wouldn’t be possible using it. Well, it would be – if I subscribed to half of Amazon EC2 (;

Since the project is in stealth mode right now, I can’t say which pages I am referring to, but let me give you these facts:

  • ~170kb HTML code
  • W3C validation shows about 1300 errors and 2600 warnings per page

Considering this many errors and warnings, I previously thought the job had to be done with BeautifulSoup, because it is known for its very error-resistant parser. In fact, BeautifulSoup doesn’t parse the HTML into a tree directly, but splits it into a tag soup using regular expressions. Contrary to what that might suggest, this seems to make BeautifulSoup very resilient to bad markup.

However, BeautifulSoup doesn’t perform well on the described files. The task: I need to parse 20 links of a particular class off the page. I put the relevant code in a separate method and profiled it using cProfile:

cProfile.runctx("self.parse_with_beautifulsoup(html_data)", globals(), locals())

def parse_with_beautifulsoup(self, html_data):
  soup = BeautifulSoup.BeautifulSoup(html_data)
  links_res = soup.findAll("a", attrs={"class": "detailsViewLink"})
  links = [link["href"] for link in links_res]

Parsing 20 pages, this takes 167s on my small Debian VPS. That’s 8s+ per page. Incredibly long. Considering how BeautifulSoup parses, it’s understandable, however: the overhead of creating the tag soup and parsing via regular expressions leads to a whopping 302’000 method calls for just these four lines of code. I repeat: 302’000 method calls for four lines of code.

Hence, I tried lxml. The corresponding code is:

import lxml.html

root = lxml.html.fromstring(html_data)
links_lxml_res = root.cssselect("a.detailsViewLink")
links_lxml = [link.get("href") for link in links_lxml_res]
links_lxml = list(set(links_lxml))

On the 20 pages, this takes only 2.4s. That’s only 0.12s per page. lxml needed only 180 method calls for the job. It runs 70x faster than BeautifulSoup and creates 1600x fewer calls.

When you graph these numbers, the performance difference looks ridiculous. Well, let’s have some fun (;

lxml vs BeautifulSoup performance

Considering that lxml supports XPath as well, I’m permanently switching to it as my default HTML parsing library.
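
For completeness, the same extraction via XPath might look like this; it is equivalent to the cssselect version above, assuming the class attribute contains only that one class:

import lxml.html

root = lxml.html.fromstring(html_data)
# XPath equivalent of the CSS selector "a.detailsViewLink"
links_xpath = root.xpath('//a[@class="detailsViewLink"]/@href')
links_xpath = list(set(links_xpath))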

Note: Ian Bicking wrote a wonderful summary in 2008 on the performance of several Python HTML parsers which led me to lxml and to this article.

Update (08/17/2010): I planned on implementing my results on Google AppEngine. “Unfortunately” lxml relies heavily on C code (that’s where the speed comes from^^), and AppEngine is a pure Python environment that won’t run modules written in C.


Serving images dynamically with CherryPy (on Google AppEngine)

Google AppEngine (GAE) is great for hosting Python (or Java) web applications. They offer 1.3 million hits and 1 GB of inbound and outbound traffic per day for free. Considering that you get access to Google infrastructure that lets you crawl the web as fast as Google does itself, choosing GAE is a no-brainer for applications that do a lot of web crawling, screen scraping or web indexing. You can even set up cron jobs to run work periodically.

I won’t elaborate on how to get an account, download the SDK and get started, because Google hosts great tutorials for all of that itself. If you are already familiar with Python web development, they will get you started in a matter of minutes.

I personally chose not to use the Google webapp framework, because I’m quite familiar with CherryPy. I fell in love with it because it feels very sleek – very Zen-like. This comes as no surprise, because it was a deliberate design decision, as can be read in The Zen of CherryPy.

Getting started with CherryPy on GAE is no trouble either. GAE supports any Python framework that is WSGI-compliant, including Django, CherryPy, web.py and Pylons. Google doesn’t host these frameworks itself, so all you have to do is copy the whole framework into your GAE project to get the import to work. That’s it. The same goes for any third-party module: need BeautifulSoup? Just copy the py-file into your project. Easy as cake.

Now, if you want to serve images dynamically, you don’t have to store them on the hard disk just to be able to link to them. Save them in the Google Datastore and serve them whenever needed.
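
A minimal sketch of what storing images in the Datastore could look like with the db API of that era follows; StoredImage is a made-up model name, not part of the snippet below:

from google.appengine.ext import db

class StoredImage(db.Model):
  """Hypothetical Datastore model holding raw image bytes."""
  title = db.StringProperty()
  data = db.BlobProperty()   # raw JPEG bytes
  created = db.DateTimeProperty(auto_now_add=True)

# Store an image once:
#   StoredImage(title="example", data=jpeg_bytes).put()
# Fetch it later, e.g. inside GetImage, and serve img.data
# with the Content-Type header set to image/jpeg:
#   img = StoredImage.all().filter("title =", "example").get()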

Using the following snippet you will be able to dynamically serve images with URLs like this:
http://application/handler_name/index/[0-9]*

import cherrypy
from cherrypy import expose
import wsgiref.handlers
import DynamicImage

class Root:
  @expose
  def index(self):
    return ""

class GetImage():
  """ GetImage provides a handler for dynamic images """

  def __init__(self):
    """
      Mockup for getting some images. Datastore or live
      scraping could be done here
    """
    # Note: DynamicImage is just a mockup.
    # There is no such module.
    dynamic_image = DynamicImage.DynamicImage()
    self.pictures = dynamic_image.getImages()

  @expose
  def index(self, num=None):
    """
      Provides the handler for urls:
        application/handler_name/index/[0-9]*
    """
    return self._convert_to_image(self.pictures[0][int(num)])

  def _convert_to_image(self, picture):
    cherrypy.response.headers['Content-Type'] = "image/jpeg"
    return picture

# Root() doesn't do anything here. It normally serves your index page.
root = Root()

# Generate route http://app/img/
root.img = GetImage()

# Start CherryPy app in wsgi mode
app = cherrypy.tree.mount(root, "/")
wsgiref.handlers.CGIHandler().run(app)

One last note: requests running longer than roughly 15-30s are cut off by GAE with a DeadlineExceededError exception. You can catch this exception and try to divide your workload into smaller pieces.
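
A hedged sketch of what catching that exception could look like; process_item and save_progress are made-up placeholders for your own workload handling:

from google.appengine.runtime import DeadlineExceededError

def handle_request(items):
  done = 0
  try:
    for item in items:
      process_item(item)      # hypothetical unit of work
      done += 1
  except DeadlineExceededError:
    # Out of time: persist progress so a follow-up request
    # or cron job can pick up where this one left off.
    save_progress(done)       # hypothetical
    return "partial"
  return "done"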

