<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Alain M. Lafon &#187; python</title>
	<atom:link href="http://blog.dispatched.ch/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.dispatched.ch</link>
	<description>code, life and struggles thereof</description>
	<lastBuildDate>Wed, 18 Aug 2010 14:58:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>BeautifulSoup vs. lxml benchmark</title>
		<link>http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/</link>
		<comments>http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 11:17:00 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[beautifulsoup]]></category>
		<category><![CDATA[cprofile]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[lxml]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[profile]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.dispatched.ch/?p=1241</guid>
		<description><![CDATA[Previously, I&#8217;ve been using BeautifulSoup whenever I had to parse HTML (for example in my dictionary pDict). But this time I&#8217;m working on a larger scale project which involves quite a lot of HTML parsing &#8211; and BeautifulSoup disappointed me performance wise. In fact, the project wouldn&#8217;t be possible using it. Well, it would be [...]]]></description>
			<content:encoded><![CDATA[<p>Previously, I&#8217;ve been using <a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">BeautifulSoup</a> whenever I had to parse HTML (for example in my dictionary <a href="http://dispatched.ch/projects/pdict/" target="_blank">pDict</a>). But this time I&#8217;m working on a larger-scale project which involves quite a lot of HTML parsing &#8211; and BeautifulSoup disappointed me performance-wise. In fact, the project wouldn&#8217;t be possible using it. Well, it would be &#8211; if I subscribed to half of Amazon EC2 (;</p>
<p>Since the project is in stealth mode right now, I can&#8217;t say which pages I am referring to, but let me give you these facts:</p>
<ul>
<li>~170 kB of HTML code</li>
<li><a href="http://validator.w3.org/" target="_blank">W3C validation</a> shows about 1300 errors and 2600 warnings per page</li>
</ul>
<p>Considering this many errors and warnings, I previously thought the job had to be done using BeautifulSoup, because it is known to have a very error-resistant parser. In fact, BeautifulSoup doesn&#8217;t parse the HTML directly, but splits the tags into tag soup by applying regular expressions around them. Contrary to <a href="http://stackoverflow.com/questions/1732348?tab=votes#tab-top" target="_blank">popular stories</a>, this seems to make BeautifulSoup very resilient towards bad code.</p>
<p>However, BeautifulSoup doesn&#8217;t perform well on the described files. The task: I need to parse 20 links of a particular class off the page. I put the relevant code in a separate function and profiled it using <a href="http://docs.python.org/library/profile.html" target="_blank">cProfile</a>:</p>
<pre class="brush: python;">
import cProfile
import BeautifulSoup  # BeautifulSoup 3.x

def parse_with_beautifulsoup(html_data):
  # Collect the href of every link with the class &quot;detailsViewLink&quot;
  soup = BeautifulSoup.BeautifulSoup(html_data)
  links_res = soup.findAll(&quot;a&quot;, attrs={&quot;class&quot;:&quot;detailsViewLink&quot;})
  return [link[&quot;href&quot;] for link in links_res]

# html_data holds the raw HTML of one page
cProfile.runctx(&quot;parse_with_beautifulsoup(html_data)&quot;, globals(), locals())
</pre>
<p>Parsing 20 pages, this takes 167s on my small Debian VPS. That&#8217;s 8s+ per page. Incredibly long. Thinking of how BeautifulSoup parses, it&#8217;s understandable, however. The overhead of creating tag soup and parsing via regular expressions leads to a whopping 302,000 method calls for just these four lines of code. I repeat: 302,000 method calls for four lines of code.</p>
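<p>For readers who want to reproduce such call counts: the following is a minimal sketch, assuming Python 3, of how cProfile and pstats report the total number of function calls. The function <code>work()</code> is a hypothetical stand-in and not part of the project described here.</p>

```python
import cProfile
import pstats


def work():
    """Hypothetical stand-in for the parsing function being measured."""
    return sum(len(str(i)) for i in range(1000))


# Profile a single call and keep its return value
profiler = cProfile.Profile()
result = profiler.runcall(work)

# pstats aggregates the raw profile; total_calls is the overall call count
stats = pstats.Stats(profiler)
print("result:", result)
print("function calls:", stats.total_calls)
```

<p>The same <code>stats</code> object can also print a per-function breakdown via <code>stats.sort_stats("cumulative").print_stats(10)</code>.</p>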
<p>Hence, I tried <a href="http://codespeak.net/lxml/" target="_blank">lxml</a>. The corresponding code is:</p>
<pre class="brush: python;">
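import lxml.html

root = lxml.html.fromstring(html_data)
# Select every link with the class &quot;detailsViewLink&quot;
links_lxml_res = root.cssselect(&quot;a.detailsViewLink&quot;)
links_lxml = [link.get(&quot;href&quot;) for link in links_lxml_res]
# Drop duplicate hrefs
links_lxml = list(set(links_lxml))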
</pre>
<p>On the 20 pages, this takes only 2.4s. That&#8217;s only 0.12s per page. lxml needed only 180 method calls for the job. It runs about 70x faster than BeautifulSoup and makes roughly 1700x fewer calls.</p>
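<p>The ratios follow directly from the measurements quoted in this article. A small sketch of the arithmetic, assuming Python 3 (the variable names are made up for illustration):</p>

```python
# Measurements quoted in the article
pages = 20
bs_seconds, lxml_seconds = 167.0, 2.4
bs_calls, lxml_calls = 302000, 180

# Time per page for each library
bs_per_page = bs_seconds / pages
lxml_per_page = lxml_seconds / pages

# Relative speed and relative number of calls
speedup = bs_seconds / lxml_seconds
call_ratio = bs_calls / lxml_calls

print("BeautifulSoup: %.2fs per page" % bs_per_page)
print("lxml: %.2fs per page" % lxml_per_page)
print("speedup: %.1fx, call ratio: %.0fx" % (speedup, call_ratio))
```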
<p>When you graph these numbers, the performance difference looks ridiculous. Well, let&#8217;s have some fun (;</p>
<div class="wp-caption aligncenter" style="width: 455px"><a href="http://blog.dispatched.ch/wp-content/uploads/2010/08/lxml_vs_beautifulsoup.png" rel="lightbox[1241]"><img class="  " src="http://blog.dispatched.ch/wp-content/uploads/2010/08/lxml_vs_beautifulsoup.png" alt="" width="445" height="531" /></a><p class="wp-caption-text">lxml vs BeautifulSoup performance</p></div>
<p>Considering lxml supports XPath as well, I&#8217;m permanently switching my default HTML parsing library.</p>
<p>Note: Ian Bicking <a href="http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/" target="_blank">wrote</a> a wonderful summary in 2008 on the performance of several Python HTML parsers which led me to lxml and to this article.</p>
<p>Update (08/17/2010): I planned on implementing my results on Google AppEngine. &#8220;Unfortunately&#8221;, lxml relies heavily on C code (that&#8217;s where the speed comes from^^). AppEngine is a <a href="http://code.google.com/appengine/docs/python/overview.html">pure Python</a> environment; it does not run modules written in C.</p>
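<p>For a pure Python environment, one possible fallback is the standard library&#8217;s HTML parser. The following is only a sketch, assuming Python 3; it is not what I benchmarked above, and the class name <code>LinkCollector</code> as well as the sample markup are made up for illustration. How well it copes with badly broken pages is untested here.</p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes and attributes.get("href"):
            self.links.append(attributes["href"])


# Hypothetical sample markup standing in for a real page
sample = '<a class="detailsViewLink" href="/car/1">one</a><a href="/x">skip</a>'

collector = LinkCollector("detailsViewLink")
collector.feed(sample)
print(collector.links)
```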
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Serving images dynamically with CherryPy (on Google AppEngine)</title>
		<link>http://blog.dispatched.ch/2010/08/13/serving-images-dynamically-with-cherrypy-on-google-appengine/</link>
		<comments>http://blog.dispatched.ch/2010/08/13/serving-images-dynamically-with-cherrypy-on-google-appengine/#comments</comments>
		<pubDate>Fri, 13 Aug 2010 09:22:34 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[cherrypy]]></category>
		<category><![CDATA[google appengine]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://blog.dispatched.ch/2010/08/13/serving-dynamic-images-with-cherrypy-on-google-appengine/</guid>
		<description><![CDATA[Google AppEngine (GAE) is great for hosting Python (or Java) web applications. They offer 1.3 million hits and 1 GB of up- and downstream per day for free. Considering that you get access to Google infrastructure that lets you crawl the web as fast as Google itself does, choosing GAE is a no-brainer for applications doing a lot of web crawling, screen [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://code.google.com/appengine/">Google AppEngine</a> (GAE) is great for hosting Python (or Java) web applications. They offer 1.3 million hits and 1 GB of up- and downstream per day for free. Considering that you get access to Google infrastructure that lets you crawl the web as fast as Google itself does, choosing GAE is a no-brainer for applications doing a lot of web crawling, screen scraping or web indexing. You can even set up <a href="http://code.google.com/appengine/docs/python/config/cron.html">cron jobs</a> to get your job done periodically.</p>
<p>I won&#8217;t elaborate on how to get an account, download the SDK and get started, because Google hosts <a href="http://code.google.com/appengine/">great tutorials</a> for these itself. If you are already familiar with Python web development this will get you started in a matter of minutes.</p>
<p>I personally chose not to use the Google <a href="http://code.google.com/appengine/docs/python/tools/webapp/">webapp</a> framework, because I&#8217;m quite familiar with <a href="http://www.cherrypy.org/">CherryPy</a>. I fell in love with it, because it feels very sleek &#8211; very Zen-like. This comes as no surprise, because it was a deliberate design decision, as can be read in <a href="http://www.cherrypy.org/wiki/ZenOfCherryPy">The Zen of CherryPy</a>.</p>
<p>Getting started with CherryPy on GAE is no trouble, either. GAE <a href="http://code.google.com/appengine/docs/python/gettingstarted/usingwebapp.html">supports</a> any Python framework that is WSGI-compliant. Those include Django, CherryPy, web.py and Pylons. Google doesn&#8217;t host these frameworks itself, so all you have to do is copy the whole framework into your GAE project to get the import to work. That&#8217;s it. The same goes for any 3rd party module. Need BeautifulSoup? Just copy the py-file to your project. Easy as pie.</p>
<p>Now, if you want to serve images dynamically, you don&#8217;t have to store them on a hard disk to link to them. Just save them in the Google <a href="http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html">Datastore</a> and serve them whenever needed.</p>
<p>Using the following snippet you will be able to dynamically serve images with URLs like this: </p>
<p>http://application/handler_name/index/[0-9]*</p>
<pre class="brush: python;">
import cherrypy
from cherrypy import expose
import wsgiref.handlers
import DynamicImage

class Root:
  @expose
  def index(self):
    return &quot;&quot;

class GetImage():
  &quot;&quot;&quot; GetImage provides a handler for dynamic images &quot;&quot;&quot;

  def __init__(self):
    &quot;&quot;&quot;
      Mockup for getting some images. Datastore or live
      scraping could be done here
    &quot;&quot;&quot;
    # Note: DynamicImage is just a mockup.
    # There is no such module.
    dynamic_image = DynamicImage.DynamicImage()
    self.pictures = dynamic_image.getImages()

  @expose
  def index(self, num=None):
    &quot;&quot;&quot;
      Provides the handler for urls:
        application/handler_name/index/[0-9]*
    &quot;&quot;&quot;
    return self._convert_to_image(self.pictures[0][int(num)])

  def _convert_to_image(self, picture):
    cherrypy.response.headers['Content-Type'] = &quot;image/jpeg&quot;
    return picture

# Root() doesn't do anything here. It normally serves your index page.
root = Root()

# Generate route http://app/img/
root.img = GetImage()

# Start CherryPy app in wsgi mode
app = cherrypy.tree.mount(root, &quot;/&quot;)
wsgiref.handlers.CGIHandler().run(app)
</pre>
<p>One last note: Processes running longer than 15-30s will be cut off by GAE with the <a href="http://code.google.com/appengine/docs/python/runtime.html">DeadlineExceededError</a> exception. You can catch this exception and try to divide your workload into smaller pieces.</p>
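<p>Dividing the workload can be as simple as slicing the list of work items into batches, each small enough to finish well within the deadline. This is a minimal sketch, assuming Python 3; the helper <code>chunked()</code> and the item names are hypothetical, and nothing here touches the GAE API itself.</p>

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical work items, e.g. pages that still need to be scraped
urls = ["page-%d" % n for n in range(7)]

# Each batch would be handled in its own request or cron run
batches = list(chunked(urls, 3))
for batch in batches:
    print(batch)
```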
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2010/08/13/serving-images-dynamically-with-cherrypy-on-google-appengine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>C# 4.0&#8242;s dynamicity</title>
		<link>http://blog.dispatched.ch/2010/05/10/c-40s-dynamicity/</link>
		<comments>http://blog.dispatched.ch/2010/05/10/c-40s-dynamicity/#comments</comments>
		<pubDate>Mon, 10 May 2010 21:23:21 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.dispatched.ch/?p=985</guid>
		<description><![CDATA[I just found an article ranking highly on Hacker News (my favourite read) concerning the release of C# 4.0 &#8211; you can find it on blogs.msdn.com. On this blog Microsoft claims a couple of highly sophisticated new features. Being the spoiled guy I am, they just seem natural to me. Since they started their article [...]]]></description>
			<content:encoded><![CDATA[<p>I just found an article ranking highly on <a href="http://news.ycombinator.com">Hacker News</a> (my favourite read) concerning the release of C# 4.0 &#8211; you can find it on <a href="http://blogs.msdn.com/csharpfaq/archive/2010/04/12/get-ready-for-c-4-0.aspx">blogs.msdn.com</a>. On this blog Microsoft claims a couple of highly sophisticated new features. Being the spoiled guy I am, they just seem natural to me. Since they started their article with the words &#8220;The <strong>dynamic</strong> keyword is a key feature of this release.&#8221;, let me demonstrate the <em>new</em> features in a really dynamically typed language where they have been around for quite some time: Python.</p>
<h3>Dynamic</h3>
<table>
<tr>
<td>
<h4>C#</h4>
</td>
<td>&nbsp;</td>
<td>
<pre class="brush: cpp;">
                        dynamic contact = new ExpandoObject();
                        contact.Name = &quot;Patrick Hines&quot;;
                        contact.Phone = &quot;206-555-0144&quot;;
                        </pre>
</td>
</tr>
<tr>
<td>
<h4>Python</h4>
</td>
<td>&nbsp;</td>
<td>
<pre class="brush: python;">
                        class contact: pass
                        contact.Name = &quot;Patrick Hines&quot;
                        contact.Phone = &quot;206-555-0144&quot;
                        </pre>
</td>
</tr>
</table>
<h3>Optional (or Default) Parameters</h3>
<table>
<tr>
<td>
<h4>C#</h4>
</td>
<td>&nbsp;</td>
<td>
<pre class="brush: cpp;">
                        public static void SomeMethod(int optional = 0) { }
                        SomeMethod(); // 0 is used in the method.
                        SomeMethod(10);
                        </pre>
</td>
</tr>
<tr>
<td>
<h4>Python</h4>
</td>
<td>&nbsp;</td>
<td>
<pre class="brush: python;">
                        def some_method(optional = 0): pass

                        some_method()
                        some_method(10)
                        </pre>
</td>
</tr>
</table>
<h3>Named Arguments</h3>
<table>
<tr>
<td>
<h4>C#</h4>
</td>
<td>&nbsp;</td>
<td>
<pre class="brush: cpp;">
                        var sample = new List&lt;String&gt;();
                        sample.InsertRange(collection: new List&lt;String&gt;(), index: 0);
                        sample.InsertRange(index: 0, collection: new List&lt;String&gt;());
                        </pre>
</td>
</tr>
<tr>
<td>
<h4>Python</h4>
</td>
<td>&nbsp;</td>
<td>
<pre class="brush: python;">
                        def foo(bar, foobar): pass

                        foo(bar='asdf', foobar=12)
                        foo(foobar=12, bar='asdf')
                        </pre>
</td>
</tr>
</table>
<p>I know full well that a comparison of a statically typed language on the CLR and an interpreted <em>dynamic</em> language doesn&#8217;t count for much. But since Microsoft is making a fuss about dynamic being <b>the</b> keyword of the new release, I felt the urge to drop this note.</p>
<p>The Python code was tested with v2.5 &#8211; that&#8217;s the oldest installation I&#8217;ve got. However, it&#8217;s old enough, because .NET didn&#8217;t even have decent IPC back then (e.g. named pipes were <a href="http://msdn.microsoft.com/en-us/library/system.io.pipes.namedpipeserverstream.aspx[+]" class="broken_link">added</a> in .NET 3.5).</p>
<p>Well, that&#8217;s my rant for the night &#8211; I&#8217;m going back to teaching myself a real programmer&#8217;s language (<a href="http://clojure.org">Clojure</a>) &#8211; as you should, too (;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2010/05/10/c-40s-dynamicity/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Python&#8217;s binascii &#8211; hexlify() and unhexlify(): What the heck?</title>
		<link>http://blog.dispatched.ch/2009/12/09/pythons-binascii-hexlify-unhexlify/</link>
		<comments>http://blog.dispatched.ch/2009/12/09/pythons-binascii-hexlify-unhexlify/#comments</comments>
		<pubDate>Tue, 08 Dec 2009 23:55:11 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[ascii]]></category>
		<category><![CDATA[binary]]></category>
		<category><![CDATA[binascii]]></category>
		<category><![CDATA[conversion]]></category>
		<category><![CDATA[cryptography]]></category>
		<category><![CDATA[hex]]></category>
		<category><![CDATA[hexlify]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unhexlify]]></category>

		<guid isPermaLink="false">http://blog.dispatched.ch/?p=941</guid>
		<description><![CDATA[We]]></description>
			<content:encoded><![CDATA[<p>Today, a dear <a title="versatilemind.com" href="http://www.versatilemind.com" target="_blank">friend of mine</a> came up to me and asked about the Python module binascii &#8211; particularly about the methods hexlify() and unhexlify(). Since he asked for it, I&#8217;m going to share my answer publicly with you.</p>
<p>First of all, let me define the nomenclature used:</p>
<ul>
<li>ASCII characters are written in single quotes</li>
<li>decimal numbers are of the type Long with an <em>L</em> suffix</li>
<li>hex values have an <em>x</em> prefix</li>
</ul>
<p>Next, let me quote the <a href="http://docs.python.org/library/binascii.html" target="_blank">documentation</a>:</p>
<dl>
<dt id="binascii.hexlify"><tt class="descclassname">binascii.</tt><tt class="descname">b2a_hex</tt><big>(</big><em>data</em><big>)</big></dt>
<dt><tt class="descclassname">binascii.</tt><tt class="descname">hexlify</tt><big>(</big><em>data</em><big>)</big></dt>
<dd>Return the hexadecimal representation of the binary <em>data</em>. Every byte of <em>data</em> is converted into the corresponding 2-digit hex representation. The resulting string is therefore twice as long as the length of <em>data</em>.</dd>
<dt id="binascii.a2b_hex"><tt class="descclassname">binascii.</tt><tt class="descname">a2b_hex</tt><big>(</big><em>hexstr</em><big>)</big></dt>
<dt id="binascii.unhexlify"><tt class="descclassname">binascii.</tt><tt class="descname">unhexlify</tt><big>(</big><em>hexstr</em><big>)</big></dt>
<dd>Return the binary data represented by the hexadecimal string <em>hexstr</em>. This function is the inverse of <a class="reference internal" title="binascii.b2a_hex" href="http://docs.python.org/library/binascii.html#binascii.b2a_hex"><tt class="xref docutils literal"><span class="pre">b2a_hex()</span></tt></a>. <em>hexstr</em> must contain an even number of hexadecimal digits (which can be upper or lower case), otherwise a <a class="reference external" title="exceptions.TypeError" href="http://docs.python.org/library/exceptions.html#exceptions.TypeError"><tt class="xref docutils literal"><span class="pre">TypeError</span></tt></a> is raised.</dd>
</dl>
<p>I&#8217;ll begin with hexlify(). As the documentation states, this method converts every byte of its input into the corresponding two-digit hex representation.</p>
<p>The <a href="http://en.wikipedia.org/wiki/ASCII" target="_blank">ASCII</a> character &#8216;A&#8217; has 65L as numerical representation. To verify this in Python:</p>
<pre class="brush: python;">&gt;&gt;&gt; long(ord('A'))
65L</pre>
<p>You might ask &#8220;Why is this even relevant to understand binascii?&#8221; Well, we don&#8217;t know anything about how ord() does its job. But with binascii we can re-calculate manually and verify.</p>
<pre class="brush: python;">&amp;gt;&amp;gt;&amp;gt; binascii.hexlify('A')
'41'</pre>
<p>Now we know that an &#8216;A&#8217; &#8211; interpreted as binary data and shown in hex &#8211; is represented as &#8216;41&#8217;. But wait, &#8216;41&#8217; is a string and not a hex value! That&#8217;s no biggie, hexlify() returns its result as a string.</p>
<p>To stay with the example, let&#8217;s convert 41 into a decimal number and check if it equals 65L.</p>
<pre class="brush: python;">&gt;&gt;&gt; long('41', 16)
65L</pre>
<p>Tada! It seems that <em>&#8216;A&#8217; = 41 = 65L</em>.<br />
You might have known that already, but please, stay with me a minute longer.</p>
<p>To make it look a little more complex:</p>
<pre class="brush: python;">&amp;gt;&amp;gt;&amp;gt; binascii.hexlify('A') == &amp;quot;%X&amp;quot; % long('41', 16)
True</pre>
<p>Be aware that</p>
<pre class="brush: python;">&amp;gt;&amp;gt;&amp;gt; &amp;quot;%X&amp;quot; %n</pre>
<p>converts a decimal number into its hex representation.</p>
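<p>The session above is Python 2. As a minimal sketch, assuming Python 3 (where hexlify() takes and returns bytes and the Long type is gone), the same chain of conversions looks like this:</p>

```python
import binascii

# bytes in, hex digits (as bytes) out
hex_digits = binascii.hexlify(b"A")
print(hex_digits)

# Parse the two hex digits as a base-16 number
code_point = int(hex_digits.decode("ascii"), 16)
print(code_point)

# Formatting the number with %X gives the hex digits back as text
print("%X" % code_point)
```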
<p>&#8212;&#8212;</p>
<p>binascii.unhexlify() naturally does the same thing as hexlify(), but in reverse. It takes a string of two-digit hex values and returns the binary data they represent.</p>
<p>I&#8217;ll start off with an example:</p>
<pre class="brush: python;">	&gt;&gt;&gt; binascii.unhexlify('41')
	'A'

	&gt;&gt;&gt; binascii.unhexlify(&quot;%X&quot; % ord('A'))
	'A'</pre>
<p>Here, ord() returns the numerical representation 65 of the ASCII character &#8216;A&#8217;,</p>
<pre class="brush: python;">	&gt;&gt;&gt; ord('A')
	65</pre>
<p>the format string &quot;%X&quot; converts it into hex 41</p>
<pre class="brush: python;">	&gt;&gt;&gt; &quot;%X&quot; % ord('A')
	'41'</pre>
<p>and unhexlify() turns this pair of hex digits back into the single byte &#8216;A&#8217;.</p>
<p>And now the conclusion &#8211; why might all of this be useful?<br />
Right now, I can think of at least four use cases:</p>
<ul>
<li>cryptography</li>
<li>data transformation (e.g. Base64 for MIME/e-mail attachments)</li>
<li>security (deciphering binary readings off a network, pattern matching, &#8230;)</li>
<li>textual representation of escape sequences</li>
</ul>
<p>Taking up the last example, I&#8217;ll show you how to visualize the Bell escape sequence (you know, that thing that keeps <em>beep</em>ing in your terminal).<br />
Taken from the ASCII table, the numerical representation of the Bell is 7. Programmers might know it better as \a.</p>
<pre class="brush: python;">	&gt;&gt;&gt; '\7' == '\a'
	True</pre>
<p>Presuming you read such a character in some kind of binary data &#8211; for example from a socket</p>
<pre class="brush: python;">	&gt;&gt;&gt; foo = '\7'</pre>
<p>and you want to visualize this data</p>
<pre class="brush: python;">	&gt;&gt;&gt; print foo</pre>
<p>you will not get any results &#8211; at least none visible. You might hear the Bell sound if you&#8217;re not on a silent terminal.</p>
<p>Now, finally &#8211; binascii to the rescue:</p>
<pre class="brush: python;">	&gt;&gt;&gt; binascii.hexlify('\7')
	'07'</pre>
<p>Voilà, the dubious string is decoded.</p>
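<p>A Python 3 take on the Bell example, again only a sketch: it assumes the data arrives as bytes, and the variable names are made up for illustration. It also shows the round trip through unhexlify().</p>

```python
import binascii

# The Bell character, written as an escape sequence
bell = b"\a"

# hexlify() makes the invisible byte visible
visible = binascii.hexlify(bell)
print(visible)

# unhexlify() restores the original byte from its hex digits
restored = binascii.unhexlify(visible)
print(restored == bell)
```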
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2009/12/09/pythons-binascii-hexlify-unhexlify/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disable Mail-Forwarding for Lotus Notes programmatically</title>
		<link>http://blog.dispatched.ch/2009/06/29/disable-mail-forwardin-for-lotus-notes-with-python/</link>
		<comments>http://blog.dispatched.ch/2009/06/29/disable-mail-forwardin-for-lotus-notes-with-python/#comments</comments>
		<pubDate>Mon, 29 Jun 2009 19:53:40 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[forwarding]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[lotus notes]]></category>
		<category><![CDATA[mail]]></category>
		<category><![CDATA[mail forwarding]]></category>
		<category><![CDATA[mail-forward]]></category>
		<category><![CDATA[prevent copying]]></category>
		<category><![CDATA[proprietary]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sensitivity private]]></category>
		<category><![CDATA[smtp]]></category>

		<guid isPermaLink="false">http://blog.dispatched.ch/?p=871</guid>
		<description><![CDATA[Lotus Notes has a nifty feature to lull managers into false safety: for volatile/unsafe e-mails (or users), it lets you disable printing, forwarding and copying to the clipboard. This can be done using rules, on the SMTP server and on a per-e-mail basis. When writing to somebody you really don&#8217;t trust with some information (but in his [...]]]></description>
			<content:encoded><![CDATA[<p>Lotus Notes has a nifty feature to lull managers into <del datetime="2009-06-29T17:07:46+00:00">false</del> safety: for volatile/unsafe e-mails (or users), it lets you <a href="http://www-01.ibm.com/support/docview.wss?rs=0&amp;uid=swg21085758">disable</a> printing, forwarding and copying to the clipboard. This can be done using rules, on the SMTP server and on a per-e-mail basis. When writing to somebody you really don&#8217;t trust with some information (but whom you do trust to be unable to spread the word otherwise &#8211; by copy/pasting for example), writing a mail would look like this:</p>
<p><a href="http://blog.dispatched.ch/wp-content/uploads/2009/06/prevent_copying.png" rel="lightbox[871]"><img class="aligncenter size-full wp-image-872" title="prevent_copying" src="http://blog.dispatched.ch/wp-content/uploads/2009/06/prevent_copying.png" alt="prevent_copying" width="469" height="386" /></a></p>
<p>Now, if your victim wants to forward your mail, Lotus Notes would respond with a little pop-up:</p>
<p style="text-align: center;"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/06/success.png" rel="lightbox[871]"><img class="aligncenter size-full wp-image-874" title="success" src="http://blog.dispatched.ch/wp-content/uploads/2009/06/success.png" alt="success" width="411" height="106" /></a></p>
<p style="text-align: left;">This certainly looks like a magical and proprietary feature, doesn&#8217;t it? Let&#8217;s look at the source of such a &#8220;mail&#8221; (aka memo in Notes&#8217; language) &#8211; you will have to forward it to another mail client though, because memos can&#8217;t be displayed as source:</p>
<pre>...
Subject: Testnachricht
MIME-Version: 1.0
<span style="color: #ff9900;"><span style="color: #993300;">Sensitivity: Private</span>
</span>X-Mailer: Lotus Notes Release 6.5.5  CCH1 March 07, 2006
...</pre>
<p>As you can see, the whole feature boils down to a single header field, <em>Sensitivity: Private</em>. It can be reproduced with any decent mail user agent or programmatically. What follows is a little Python code snippet that does just that:</p>
<p style="text-align: left;">
<pre class="brush: python;">
import smtplib
from email.message import Message

# Build a plain message and add the Sensitivity header by hand
msg = Message()
msg.set_payload(&quot;Testmessage Body&quot;)
msg[&quot;Subject&quot;] = &quot;Testmessage from Python&quot;
msg[&quot;From&quot;] = &quot;preek@dispatched.ch&quot;
msg[&quot;To&quot;] = &quot;somebody@somewhere.com&quot;
msg[&quot;Sensitivity&quot;] = &quot;Private&quot;

# Hand the message to a local SMTP server and close the connection
smtp = smtplib.SMTP(&quot;localhost&quot;)
smtp.sendmail(&quot;preek@dispatched.ch&quot;, &quot;somebody@somewhere.com&quot;, msg.as_string())
smtp.quit()
</pre>
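<p>To check the header without sending anything, the message can be built and inspected offline. This is a minimal sketch, assuming Python 3 and its <code>EmailMessage</code> class rather than the Python 2 code above; the addresses are the same placeholders.</p>

```python
from email.message import EmailMessage

# Build the message without touching the network
msg = EmailMessage()
msg["Subject"] = "Testmessage from Python"
msg["From"] = "preek@dispatched.ch"
msg["To"] = "somebody@somewhere.com"
msg["Sensitivity"] = "Private"
msg.set_content("Testmessage Body")

# The serialized message carries the header as a plain text line
raw = msg.as_string()
print(raw)
```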
<p>But please, don&#8217;t use this information unless you absolutely have to. Lotus Notes.. *brr*.</p>
<p>Enjoy(;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2009/06/29/disable-mail-forwardin-for-lotus-notes-with-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>VIM as Python IDE</title>
		<link>http://blog.dispatched.ch/2009/05/24/vim-as-python-ide/</link>
		<comments>http://blog.dispatched.ch/2009/05/24/vim-as-python-ide/#comments</comments>
		<pubDate>Sat, 23 May 2009 23:04:59 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[ctags]]></category>
		<category><![CDATA[exuberant ctags]]></category>
		<category><![CDATA[ide]]></category>
		<category><![CDATA[minibuf]]></category>
		<category><![CDATA[omni completion]]></category>
		<category><![CDATA[pep 8]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[python ide]]></category>
		<category><![CDATA[taglist]]></category>
		<category><![CDATA[tasklist]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[vi]]></category>
		<category><![CDATA[vim]]></category>
		<category><![CDATA[vimpdb]]></category>
		<category><![CDATA[walkthrough]]></category>

		<guid isPermaLink="false">http://blog.dispatched.ch/?p=777</guid>
		<description><![CDATA[Finding the perfect IDE for Python isn&#8217;t an easy feat. There are a great many to choose from, but even though some of them offer really nifty features, I can&#8217;t help but feel attracted to VIM anyway. I feel that no IDE accomplishes the task of giving the comfort of complete power over the [...]]]></description>
			<content:encoded><![CDATA[<p>Finding the perfect IDE for Python isn&#8217;t an easy feat. There are a great many to choose from, but even though some of them offer really nifty features, I can&#8217;t help but feel attracted to VIM anyway. I feel that no IDE accomplishes the task of giving the comfort of complete power over the code &#8211; something is always missing. This is why I always come back to using IDLE and VIM. Those two seem to be the best companions when doing some quick and agile hacking &#8211; but when it comes to managing bigger and longer-term projects, this combo needs some tweaking. Once that&#8217;s done, VIM will be a powerful IDE for Python &#8211; including code completion (with pydoc display), graphical debugging, task management and a project view.</p>
<p>This is where we are going:</p>
<p style="text-align: center;"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/05/vim-as-python-ide.png" rel="lightbox[777]"><img class="size-full wp-image-799 aligncenter" title="vim-as-python-ide" src="http://blog.dispatched.ch/wp-content/uploads/2009/05/vim-as-python-ide.png" alt="VIM as Python IDE" width="491" height="401" /></a></p>
<p>So, these are my thoughts on a VIM setup for coding (Python).</p>
<p>Modern GUI VIM implementations like GVIM or MacVIM give the user the opportunity to organize their open files in tabs. This might look convenient, but to me it is rather bad practice, because a second tab will not be in the same buffer scope as the first one, which takes away from future interaction options between the two. Using <a title="MiniBuf" href="http://www.vim.org/scripts/script.php?script_id=159" target="_blank">MiniBufExplorer</a>, however, gives the user tabs (not only in the GUI, but also on the command line) and leaves the classic buffer interaction intact.</p>
<p style="text-align: center;"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/05/minibuf.png" rel="lightbox[777]"><img class="size-full wp-image-784 aligncenter" title="minibuf" src="http://blog.dispatched.ch/wp-content/uploads/2009/05/minibuf.png" alt="MiniBuf Explorer" width="484" height="87" /></a></p>
<p>Even when able to work neatly on multiple files, the user still misses what his favourite IDE offers for visualizing classes, functions and variables. Luckily there are quite a few plugins around that accomplish this task just as well. My favourite one is <a title="TagList" href="http://vim-taglist.sourceforge.net/" target="_blank">TagList</a>. TagList uses <a title="Exuberant CTags" href="http://ctags.sourceforge.net/" target="_blank">Exuberant Ctags</a> for actually generating the tags (note: it really relies on this specific version of ctags &#8211; preinstalled implementations on UNIX systems won&#8217;t work).</p>
<p style="text-align: center;"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/05/taglist.png" rel="lightbox[777]"><img class="size-full wp-image-787 aligncenter" title="taglist" src="http://blog.dispatched.ch/wp-content/uploads/2009/05/taglist.png" alt="TagList" width="481" height="260" /></a></p>
<p>A lot of coders have the habit of using TODO or FIXME statements in their code. Other IDEs often rely on having good third party project management software, but not VIM. There are great plugins like <a title="TaskList" href="http://www.vim.org/scripts/script.php?script_id=2607" target="_blank">Tasklist</a> reminding the programmer of those lines of code. Tasklist even implements custom lists &#8211; to me that&#8217;s an incredible productivity gain.</p>
<p style="text-align: center;"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/05/tasklist.png" rel="lightbox[777]"><img class="size-full wp-image-781 aligncenter" title="tasklist" src="http://blog.dispatched.ch/wp-content/uploads/2009/05/tasklist.png" alt="TaskList" width="491" height="163" /></a></p>
<p>These days, programmers learn their language more or less by interactively finding out what it can do. Therefore code completion (sometimes also called IntelliSense *ugh*) is a major feature. I have heard many people say that this is where VIM fails &#8211; but luckily they are plain wrong ;) In version 7, VIM introduced <a title="Omni Completion" href="http://vim.wikia.com/wiki/Omni_completion" target="_blank">omni completion</a>. Given it is configured to recognize Python (if not, this feature is only a <a title="Python Omni Completion" href="http://www.vim.org/scripts/script.php?script_id=1542" target="_blank">plugin</a> away), Ctrl+x Ctrl+o opens a drop-down menu like in any other IDE &#8211; and even the whole Pydoc gets displayed in a split window.</p>
<p style="text-align: center;"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/05/omnicompletion.png" rel="lightbox[777]"><img class="size-full wp-image-791 aligncenter" title="omnicompletion" src="http://blog.dispatched.ch/wp-content/uploads/2009/05/omnicompletion.png" alt="Omni Completion" width="436" height="312" /></a></p>
<p>Probably the most wanted feature (besides code completion) is graphical debugging. <a title="VimPDB" href="http://code.google.com/p/vimpdb/" target="_blank">VimPDB</a> is a plugin that lets you do just that. I acknowledge it is no complete substitute for a full-fledged graphical debugger, but I honour the thought that having to rely on a debugger (often) is a hint of bad design.</p>
<p style="text-align: center;"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/05/vimpdb.png" rel="lightbox[777]"><img class="size-full wp-image-794 aligncenter" title="vimpdb" src="http://blog.dispatched.ch/wp-content/uploads/2009/05/vimpdb.png" alt="VimPDB" width="498" height="134" /></a></p>
<p>&#8211;</p>
<p style="text-align: left;">From the eye-candy to the implementation. Don&#8217;t worry, it&#8217;s no sorcery.</p>
<p style="text-align: left;">First of all, make sure you have VIM version 7.x installed, compiled with Python support. To check for the latter, enter <em>:python print &quot;hello, world&quot;</em> into VIM. If you see an error message like <em>&#8220;E319: Sorry, the command is not available in this version&#8221;</em>, then it&#8217;s time to get a new one. If you&#8217;re on a Mac, just install MacVIM (there&#8217;s also a binary for the console in /Applications/MacVim.app/Contents/MacOS/). If you&#8217;re on Windows, GVIM will suffice (for Python versions other than 2.4, search for the right <a title="Vim for Windows32" href="http://www.gooli.org/blog/gvim-72-with-python-2526-support-windows-binaries/" target="_blank">build</a>). If you&#8217;re on any other machine, you will probably know how to compile your very own VIM with Python support.</p>
<p style="text-align: left;">Second, check if you have a plugin directory. On Unix it is typically located in <em>$HOME/.vim/plugin</em>, on Windows inside the VIM installation in the <em>Program Files </em>directory. If it doesn&#8217;t exist, create it.</p>
<p style="text-align: left;">Now, let&#8217;s start with the MiniBufExplorer. <a title="MiniBuf Explorer" href="http://www.vim.org/scripts/script.php?script_id=159" target="_blank">Get</a> it and copy it into your plugin directory. To start it automatically when needed and be able to use it with keyboard and mouse commands, append these lines to your vimrc configuration:</p>
<p><code>let g:miniBufExplMapWindowNavVim = 1<br />
let g:miniBufExplMapWindowNavArrows = 1<br />
let g:miniBufExplMapCTabSwitchBufs = 1<br />
let g:miniBufExplModSelTarget = 1</code></p>
<p style="text-align: left;">For a project view, get <a title="TagList" href="http://vim-taglist.sourceforge.net/" target="_blank">TagList</a> and <a title="Exuberant CTags" href="http://ctags.sourceforge.net/" target="_blank">Exuberant Ctags</a>. To install Ctags, unpack it, go into the directory and do a compile/install via:</p>
<p><code>./configure &amp;&amp; sudo make install</code></p>
<p>Ctags will then be installed in /usr/local/bin. When using a Windows machine, I recommend <a href="http://cygwin.com/">Cygwin</a> with GCC and Make; it&#8217;ll work just fine. If you don&#8217;t want to tamper with your original ctags installation, you can propagate the location to VIM by appending the following line to vimrc:</p>
<p><code>let Tlist_Ctags_Cmd='/usr/local/bin/ctags'</code></p>
<p>To install TagList, just drop it into VIM&#8217;s plugin directory. You will now be able to use the project view by typing the command <em>:TlistToggle</em>.</p>
<p><a title="TaskList" href="http://www.vim.org/scripts/script.php?script_id=2607" target="_blank">Tasklist</a> is a simple plugin, too. Copying it into the plugin directory will suffice. I like to have shortcuts and have added<br />
<code><br />
map T :TaskList&lt;CR&gt;<br />
map P :TlistToggle&lt;CR&gt;<br />
</code></p>
<p>to vimrc. Pressing <em>T </em>will then open the TaskList if there are any tasks to process, and <em>q </em>quits it again. Pressing <em>P </em>toggles the TagList project view.</p>
<p><a title="VimPDB" href="http://code.google.com/p/vimpdb/" target="_blank">VimPDB</a> is a plugin as well. Install it as before and see the readme for documentation. If it doesn&#8217;t work out of the box, check the known <a title="Issues VimPDB" href="http://code.google.com/p/vimpdb/issues/list" target="_blank">issues</a>.</p>
<p>To enable code (omni) completion, add this line to your vimrc:</p>
<p><code>autocmd FileType python set omnifunc=pythoncomplete#Complete</code></p>
<p>If it still doesn&#8217;t work, you&#8217;ll need this <a title="Python Omni Completion" href="http://www.vim.org/scripts/script.php?script_id=1542" target="_blank">plugin</a>.</p>
<p style="text-align: left;">My last two recommendations are adding these lines to comply with <a title="PEP 8" href="http://www.python.org/dev/peps/pep-0008/" target="_blank">PEP 8</a> (Python&#8217;s style guide) and to have decent eye candy:</p>
<p><code>set expandtab<br />
set textwidth=79<br />
set tabstop=8<br />
set softtabstop=4<br />
set shiftwidth=4<br />
set autoindent<br />
:syntax on</code></p>
<p>There are certainly a lot more flags to help productivity, but those will probably be more user specific.</p>
<p>Have fun coding Python while not being bound to a specific IDE, but having all the benefits of VIM bundled with a few helping hands. Enjoy, everyone.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2009/05/24/vim-as-python-ide/feed/</wfw:commentRss>
		<slash:comments>66</slash:comments>
		</item>
		<item>
		<title>Juno on Solaris 10</title>
		<link>http://blog.dispatched.ch/2009/05/18/juno-on-solaris-10/</link>
		<comments>http://blog.dispatched.ch/2009/05/18/juno-on-solaris-10/#comments</comments>
		<pubDate>Mon, 18 May 2009 13:23:30 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[Compile Python]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[Juno]]></category>
		<category><![CDATA[lightweight]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[Solaris 10]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[webframework]]></category>

		<guid isPermaLink="false">http://blog.dispatched.ch/?p=753</guid>
		<description><![CDATA[Juno is an incredibly lightweight web framework. Using Python as backend, it fulfills my every need for just about every small application I want to deploy to the web. It needs no big runtime on the server and no great many files to configure, and most importantly: there&#8217;s no coding overhead &#8211; the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://brianreily.com/project/juno" class="broken_link">Juno</a> is an incredibly lightweight web framework. Using Python as backend, it fulfills my every need for just about every small application I want to deploy to the web. It needs no big runtime on the server and no great many files to configure, and most importantly: there&#8217;s no coding overhead &#8211; the programmer defines only the features that are actually wanted.<br />
However, installing Juno on Solaris 10 isn&#8217;t quite as easy as described in Juno&#8217;s documentation. Solaris ships with Python 2.4, but Juno depends on Jinja2 (a templating engine), which itself depends on Python 2.5+. Even installing Blastwave&#8217;s or Sunfreeware&#8217;s version won&#8217;t help. But that&#8217;s no biggie since compiling your own Python is incredibly easy.</p>
<ol>
<li>Get, compile and install Python (I have used version 2.5.4)
<ul>
<li><a href="http://www.python.org/download/releases/" target="_blank">http://www.python.org/download/releases/</a></li>
<li>unpack</li>
<li>make sure you have a recent version of GCC installed</li>
<li>./configure &amp;&amp; make &amp;&amp; make install</li>
<li>as a result Python will be installed in /usr/local</li>
</ul>
</li>
<p></p>
<li>Get, compile and install Setuptools
<ul>
<li><a href="http://pypi.python.org/pypi/setuptools" target="_self">http://pypi.python.org/pypi/setuptools</a></li>
<li>unpack</li>
<li>python setup.py install</li>
</ul>
<p>
</li>
<li>Get, compile and install pysqlite
<ul>
<li><a href="http://oss.itsystementwicklung.de/trac/pysqlite/wiki/WikiStart#Downloads" target="_blank">http://oss.itsystementwicklung.de/trac/pysqlite/wiki/WikiStart#Downloads</a></li>
<li>unpack</li>
<li>add line &#8220;library_dirs=/usr/local/lib&#8221; to pysqlite-x.y.z/setup.cfg</li>
<li>globally export your library paths:<br />
LD_LIBRARY_PATH=/opt/csw/lib/:/usr/lib/:/lib/:/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH</li>
<li>python setup.py install</li>
</ul>
</li>
<li>easy_install sqlalchemy
</li>
<p></p>
<li>easy_install jinja2</li>
<p></p>
<li>Get, compile and install Juno
<ul>
<li><a href="http://brianreily.com/project/juno" target="_blank" class="broken_link"> http://brianreily.com/project/juno</a></li>
<li>python setup.py install</li>
</ul>
<p>
</li>
</ol>
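<p>Once the last step is done, a quick import check can tell you whether everything landed in the new Python. This is only a minimal sketch and not part of Juno. The helper <em>missing_modules</em> is made up for this purpose, and the module names in <em>REQUIRED</em> are assumptions about what the packages above install (pysqlite as <em>pysqlite2</em>, for instance), so adjust them if your versions differ.</p>
<pre class="brush: python;">
def missing_modules(names):
    # Return the module names from `names` that cannot be imported.
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed module names for the packages installed in the steps above.
REQUIRED = ['setuptools', 'pysqlite2', 'sqlalchemy', 'jinja2', 'juno']

if __name__ == '__main__':
    for name in missing_modules(REQUIRED):
        print('missing: ' + name)
</pre>
<p>Run it with the freshly built interpreter (for example /usr/local/bin/python), not with the Python 2.4 that ships with Solaris. If nothing is printed, all listed modules were found.</p>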
<p>Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2009/05/18/juno-on-solaris-10/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Webscraping with Python and BeautifulSoup</title>
		<link>http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/</link>
		<comments>http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/#comments</comments>
		<pubDate>Sun, 15 Mar 2009 10:05:08 +0000</pubDate>
		<dc:creator>Alain M. Lafon</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[beautifulsoup]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[webscraping]]></category>

		<guid isPermaLink="false">http://gefechtsdienst.de/?p=567</guid>
		<description><![CDATA[Recently my life has been a hype; partly due to my growing Python addiction. There&#8217;s simply no way around it, so I had better confess it in public: I&#8217;m in love with Python. It&#8217;s not only mature, business-proof and performant, but also sleek and just so much fun to write. [...]]]></description>
			<content:encoded><![CDATA[<p>Recently my life has been a hype; partly due to my growing Python addiction. There&#8217;s simply no way around it, so I had better confess it in public: I&#8217;m in love with Python. It&#8217;s not only mature, business-proof and performant, but also sleek and just so much fun to write. It&#8217;s as if I were in Star Trek and only had to tell the computer what I wanted, never minding how the job actually gets done. Even my favourite comic artist (besides Scott Adams, of course..) <a href="http://xkcd.com/353/" target="_blank">picked up</a> on it; so my feelings have to be honest.</p>
<p>In this short tutorial, I&#8217;m going to show you how to scrape a website with the third-party HTML parsing module <a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">BeautifulSoup</a> in a practical example. We will search the wonderful translation engine <a href="http://www.dict.cc/" target="_blank">dict.cc</a>, which holds the key to over 700k translations from English to German and vice versa. Note that BeautifulSoup is <a href="http://www.crummy.com/software/BeautifulSoup/#Download" target="_blank">licensed</a> just like Python, while dict.cc allows for <a href="http://www.dict.cc/?s=about%3Afaq#faq15" target="_blank">external searching</a>.</p>
<p>First off, place BeautifulSoup.py in your modules directory. Alternatively, if you just want to do a quick test, put it in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE (for quick prototyping like we are about to do, I highly recommend a combination of IDLE and VIM) and begin coding. In this tutorial we won&#8217;t be doing any design; we won&#8217;t even encapsulate the code in a class. How to do that later on is up to your needs.</p>
<p>What we will do:</p>
<ol>
<li>go to dict.cc</li>
<li>enter a search word into the webform</li>
<li>submit the form</li>
<li>read the result</li>
<li>parse the html code</li>
<li>save all translations</li>
<li>print them</li>
</ol>
<p>You can either read the code as we go or <a href='http://blog.dispatched.ch/wp-content/uploads/2009/03/webscraping_demo.py'>download</a> it.<br />
Now let&#8217;s begin the magic. These are the imports we need.</p>
<pre class="brush: python;">
import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup
</pre>
<p><a href="http://docs.python.org/library/urllib.html" target="_blank">urllib</a> and <a href="http://docs.python.org/library/urllib2.html" target="_blank">urllib2</a> are both modules for reading data from URLs; they will be needed to open the connection and retrieve the website. BeautifulSoup is, as mentioned, an HTML parser.</p>
<p>Since we are going to fetch our data from a website, we have to behave like a browser. That&#8217;s why we need to fake a <a href="http://de.wikipedia.org/wiki/User_Agent" target="_blank">user agent</a>. For our program, I chose to push the web statistics a little in favour of Firefox and Solaris.</p>
<pre class="brush: python;">
user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }
</pre>
<p>Now let&#8217;s take a look at the code of dict.cc. We need to know how the webform is constructed if we want to query it.</p>
<pre class="brush: xml;">
...
&lt;form style=&quot;margin:0px&quot; action=&quot;http://www.dict.cc/&quot; method=&quot;get&quot;&gt;
  &lt;table&gt;
    &lt;tr&gt;
      &lt;td&gt;
        &lt;input id=&quot;sinp&quot; maxlength=&quot;100&quot; name=&quot;s&quot; size=&quot;25&quot; type=&quot;text&quot;
          style=&quot;padding:2px;width:340px&quot; value=&quot;&quot; /&gt;
      ...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/table&gt;
&lt;/form&gt;
...
</pre>
<p>The relevant parts are <em>action</em>, <em>method</em> and the <em>name</em> inside the <em>input</em> tag. The action is the web application that gets called when the form is submitted. The method tells us how the form data has to be sent, while the <em>name</em> is our query variable.</p>
<pre class="brush: python;">
values = {'s' : sys.argv[1] }
data = urllib.urlencode(values)
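# The form uses method=&quot;get&quot;, so the encoded data belongs in the query string.
# Passing it as Request's second argument would make urllib2 send a POST instead.
request = urllib2.Request(&quot;http://www.dict.cc/?&quot; + data, None, headers)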
response = urllib2.urlopen(request)
</pre>
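<p>One detail worth knowing: urllib2 picks the HTTP method from the second argument of <em>Request</em>. If data is passed there, the request goes out as a POST; if it is <em>None</em>, the request is a GET and the encoded data has to travel in the URL. The following sketch only builds both kinds of request and prints them, without touching the network. The helper names <em>build_get_request</em> and <em>build_post_request</em> are made up for this illustration, and the import fallback is only there so the snippet also runs on Python 3.</p>
<pre class="brush: python;">
try:                                   # Python 2, as used in this tutorial
    from urllib import urlencode
    from urllib2 import Request
except ImportError:                    # Python 3 moved both names
    from urllib.parse import urlencode
    from urllib.request import Request

BASE_URL = 'http://www.dict.cc/'
HEADERS = {'User-Agent': 'Mozilla/5 (Solaris 10) Gecko'}


def build_get_request(word):
    # GET: the encoded form data travels in the query string of the URL.
    query = urlencode({'s': word})
    return Request(BASE_URL + '?' + query, None, HEADERS)


def build_post_request(word):
    # POST: the encoded form data travels in the request body.
    body = urlencode({'s': word}).encode('ascii')
    return Request(BASE_URL, body, HEADERS)


get_request = build_get_request('web')
post_request = build_post_request('web')
print(get_request.get_method() + ' ' + get_request.get_full_url())
print(post_request.get_method() + ' ' + post_request.get_full_url())
</pre>
<p>This prints <em>GET http://www.dict.cc/?s=web</em> followed by <em>POST http://www.dict.cc/</em>. Since the dict.cc form declares <em>method=&quot;get&quot;</em>, the first variant is the one that mirrors what a browser would send.</p>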
<p>In our script, the data gets encapsulated in a <a href="http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol" target="_blank">GET request</a> and packed into the webform. Notice that <em>values</em> is a dictionary, which makes handling more complex forms a charm. Then the form gets submitted by urlopen() &#8211; i.e. we virtually pressed the &#8220;Search&#8221; button.<br />
See how easy it is? These are only a couple of lines of code, but we have already searched dict.cc for a completely arbitrary word from the command line. The <em>response</em> has also been retrieved. All that is left is to extract the relevant information.</p>
<pre class="brush: python;">
the_page = response.read()
pool = BeautifulSoup(the_page)
</pre>
<p>The <em>response</em> is read into a string of regular HTML code. This code could now be analyzed via regular string.find() or re.findall() calls, but that would mean hard-coding a lot of the page&#8217;s underlying logic. Besides, it would require a lot of reverse engineering of positions in the markup and several potentially recursive helper functions. This would ultimately produce ugly (i.e. not very pythonic) code. Lucky for us, there already is a full-fledged HTML parser which allows us to ask just about any generic question. Let&#8217;s take a look at the resulting HTML code first. In case you are not yet familiar with the tool that can be seen in the screenshot: I&#8217;m using Firefox with the <a href="https://addons.mozilla.org/de/firefox/addon/1843" target="_blank">Firebug</a> addon. It is very helpful if you ever need to debug a website.</p>
<dl id="attachment_606" class="wp-caption aligncenter" style="width: 449px;">
<dt class="wp-caption-dt"><a href="http://blog.dispatched.ch/wp-content/uploads/2009/03/picture-2.png" rel="lightbox[567]"><img class="size-full wp-image-606" title="dict_cc_search_for_web" src="http://blog.dispatched.ch/wp-content/uploads/2009/03/picture-2.png" alt="dict.cc // search for &quot;web&quot;" width="439" height="334" /></a></dt>
</dl>
<p>Let me show an excerpt of the code.</p>
<pre class="brush: xml;">
&lt;table&gt;..
  &lt;td class=&quot;td7nl&quot; style=&quot;background-color: rgb(233, 233, 233);&quot;&gt;
    &lt;a href=&quot;/englisch-deutsch/web.html&quot;&gt;
      &lt;b&gt;web&lt;/b&gt;
    &lt;/a&gt;
  &lt;/td&gt;
  &lt;td class=&quot;td7nl&quot;&gt; ... &lt;/td&gt;
&lt;/table&gt;..
</pre>
<p>The results are displayed in a table. The two interesting columns share the class <em>td7nl</em>. The simplest approach is to sweep up all the text inside the cells of these two columns. Fortunately for us, BeautifulSoup implements just that feature.</p>
<pre class="brush: python;">
results = pool.findAll('td', attrs={'class' : 'td7nl'})
source = ''
translations = []

for result in results:
    word = ''
    for tmp in result.findAll(text=True):
        word = word + &quot; &quot; + unicode(tmp).encode(&quot;utf-8&quot;)
    if source == '':
        source = word
    else:
        translations.append((source, word))

for translation in translations:
    print &quot;%s =&gt; %s&quot; % (translation[0], translation[1])
</pre>
<p><em>results</em> will be a BeautifulSoup.ResultSet, which behaves like a list. Each member is the HTML code of one cell of the class <em>td7nl</em>, and you can access each element by index like you would expect. <em>result.findAll(text=True)</em> returns every text node embedded in that cell. All we have to do is join the pieces together.<br />
<em>word</em> is a temporary variable that holds the text of the current cell. <em>source</em> is set once, from the first cell, and every later cell is saved together with it as a (source, word) tuple in the <em>translations</em> list.<br />
Finally we iterate over the found translations and write them to the screen.</p>
<pre class="box">
$ python webscraping_demo.py kinky
 kinky   {adj} =>  9 kraus   [Haar]
 kinky   {adj} =>  nappy   {adj}   [Am.]
 kinky   {adj} =>  6 kraus   [Haar]
 kinky   {adj} =>  crinkly   {adj}
 kinky   {adj} =>  kraus
 kinky   {adj} =>  curly   {adj}
 kinky   {adj} =>  kraus
 kinky   {adj} =>  frizzily   {adv}
</pre>
<p>In a regular application those results would need a little lexing, of course. The most important thing, however, is that we just wrote a translation wrapper around a web application &#8211; in only 28 lines of code. Did I mention that I&#8217;m in love with Python?</p>
<p>All that is left is for me to recommend the <a href="http://www.crummy.com/software/BeautifulSoup/documentation.html">BeautifulSoup documentation</a>. What we did here barely scratched the surface of what this module is capable of.</p>
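<p>The little lexing mentioned above could start like this. It is a sketch under one assumption that the page itself does not guarantee: that the <em>td7nl</em> cells arrive alternating between the English and the German column. The names <em>clean</em> and <em>pair_columns</em> and the sample cell texts are made up for illustration.</p>
<pre class="brush: python;">
def clean(text):
    # Collapse runs of whitespace and trim both ends.
    return ' '.join(text.split())


def pair_columns(cells):
    # Pair cell 0 with cell 1, cell 2 with cell 3, and so on.
    cleaned = [clean(cell) for cell in cells]
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Hypothetical cell texts, shaped like the output shown above.
cells = [' kinky   {adj}', ' kraus   [Haar]',
         ' nappy   {adj}   [Am.]', ' kraus']
for english, german in pair_columns(cells):
    print('%s : %s' % (english, german))
</pre>
<p>Pairing the cells two by two gives each English entry its own German counterpart, instead of attaching every cell to the first one found.</p>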
<p>I wish you all the best.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>
