<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>Alain M. Lafon &#187; howto</title> <atom:link href="http://blog.dispatched.ch/tag/howto/feed/" rel="self" type="application/rss+xml" /><link>http://blog.dispatched.ch</link> <description>code, life and struggles thereof</description> <lastBuildDate>Mon, 16 Jan 2012 13:44:17 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <item><title>Juno on Solaris 10</title><link>http://blog.dispatched.ch/2009/05/18/juno-on-solaris-10/</link> <comments>http://blog.dispatched.ch/2009/05/18/juno-on-solaris-10/#comments</comments> <pubDate>Mon, 18 May 2009 13:23:30 +0000</pubDate> <dc:creator>Alain M. Lafon</dc:creator> <category><![CDATA[articles]]></category> <category><![CDATA[Compile Python]]></category> <category><![CDATA[howto]]></category> <category><![CDATA[Juno]]></category> <category><![CDATA[lightweight]]></category> <category><![CDATA[python]]></category> <category><![CDATA[Solaris 10]]></category> <category><![CDATA[tutorial]]></category> <category><![CDATA[webframework]]></category> <guid
isPermaLink="false">http://blog.dispatched.ch/?p=753</guid> <description><![CDATA[Juno is an incredibly lightweight web framework. Using Python as its backend, it fulfills my every need for just about every small application I want to deploy on the web. It has no need for big runtimes on the server, no great many files to configure and, most importantly, there&#8217;s no coding overhead &#8211; the [...]]]></description> <content:encoded><![CDATA[<p><a
href="http://brianreily.com/project/juno" class="broken_link">Juno</a> is an incredibly lightweight web framework. Using Python as its backend, it fulfills my every need for just about every small application I want to deploy on the web. It has no need for big runtimes on the server, no great many files to configure and, most importantly, there&#8217;s no coding overhead &#8211; the programmer defines only the features that are actually wanted.<br
/> However, installing Juno on Solaris 10 isn&#8217;t quite as easy as described in Juno&#8217;s documentation. Solaris ships with Python 2.4, but Juno depends on Jinja2 (a templating engine), which itself depends on Python 2.5+. Even installing Blastwave&#8217;s or Sunfreeware&#8217;s version won&#8217;t help. But that&#8217;s no biggie since compiling your own Python is incredibly easy.</p><ol><li>Get, compile and install Python (I have used version 2.5.4)<ul><li><a
href="http://www.python.org/download/releases/" target="_blank">http://www.python.org/download/releases/</a></li><li>unpack</li><li>make sure you have a recent version of GCC installed</li><li>./configure &amp;&amp; make &amp;&amp; make install</li><li>as a result Python will be installed in /usr/local</li></ul></li><p></p><li>Get, compile and install Setuptools<ul><li><a
href="http://pypi.python.org/pypi/setuptools" target="_self">http://pypi.python.org/pypi/setuptools</a></li><li>unpack</li><li>python setup.py install</li></ul><p></li><li> Get, compile and install  pysqlite<ul><li><a
href="http://oss.itsystementwicklung.de/trac/pysqlite/wiki/WikiStart#Downloads" target="_blank">http://oss.itsystementwicklung.de/trac/pysqlite/wiki/WikiStart#Downloads</a></li><li>unpack</li><li>add line &#8220;library_dirs=/usr/local/lib&#8221; to pysqlite-x.y.z/setup.cfg</li><li>globally export your library paths:</li><li>LD_LIBRARY_PATH=/opt/csw/lib/:/usr/lib/:/lib/:/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH</li><li>python setup.py install</li></ul></li><li>easy_install sqlalchemy</li><p></p><li>easy_install jinja2</li><p></p><li>Get, compile and install Juno<ul><li><a
href="http://brianreily.com/project/juno" target="_blank" class="broken_link"> http://brianreily.com/project/juno</a></li><li>python setup.py install</li></ul><p></li></ol><p>Enjoy.</p> ]]></content:encoded> <wfw:commentRss>http://blog.dispatched.ch/2009/05/18/juno-on-solaris-10/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Webscraping with Python and BeautifulSoup</title><link>http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/</link> <comments>http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/#comments</comments> <pubDate>Sun, 15 Mar 2009 10:05:08 +0000</pubDate> <dc:creator>Alain M. Lafon</dc:creator> <category><![CDATA[articles]]></category> <category><![CDATA[beautifulsoup]]></category> <category><![CDATA[howto]]></category> <category><![CDATA[python]]></category> <category><![CDATA[scraping]]></category> <category><![CDATA[tutorial]]></category> <category><![CDATA[web scraping]]></category> <category><![CDATA[webscraping]]></category> <guid
isPermaLink="false">http://gefechtsdienst.de/?p=567</guid> <description><![CDATA[Recently my life has been a hype, partly due to my growing Python addiction. There&#8217;s simply no way around it, so I had better confess it in public: I&#8217;m in love with Python. It&#8217;s not only mature, business-proof and performant, but it is also sleek and just so much fun to write. [...]]]></description> <content:encoded><![CDATA[<p>Recently my life has been a hype, partly due to my growing Python addiction. There&#8217;s simply no way around it, so I had better confess it in public: I&#8217;m in love with Python. It&#8217;s not only mature, business-proof and performant, but it is also sleek and just so much fun to write. It&#8217;s as if I were in Star Trek and only had to tell the computer what I wanted, never minding how the job actually gets done. Even my favourite comic artist (besides Scott Adams, of course..) <a
href="http://xkcd.com/353/" target="_blank">picked up</a> on it; so my feelings have to be honest.</p><p>In this short tutorial, I&#8217;m going to show you how to scrape a website with the third-party HTML-parsing module <a
href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">BeautifulSoup</a> in a practical example. We will search the wonderful translation engine <a
href="http://www.dict.cc/" target="_blank">dict.cc</a>, which holds the key to over 700k translations from English to German and vice versa. Note that BeautifulSoup is <a
href="http://www.crummy.com/software/BeautifulSoup/#Download" target="_blank">licensed</a> just like Python while dict.cc allows for <a
href="http://www.dict.cc/?s=about%3Afaq#faq15" target="_blank">external searching</a>.</p><p>First off, place BeautifulSoup.py in your modules directory. Alternatively, if you just want to do a quick test, put it in the same directory where you will be writing your program. Then start your favourite text editor/Python IDE (for quick prototyping like we are about to do, I highly recommend a combination of IDLE and VIM) and begin coding. In this tutorial we won&#8217;t be doing any design; we won&#8217;t even encapsulate the code in a class. How to do that later on is up to your needs.</p><p>What we will do:</p><ol><li>go to dict.cc</li><li>enter a search word into the webform</li><li>submit the form</li><li>read the result</li><li>parse the html code</li><li>save all translations</li><li>print them</li></ol><p>You can either read the needed code on the fly or <a
href='http://blog.dispatched.ch/wp-content/uploads/2009/03/webscraping_demo.py'>download </a>it.<br
/> Now let&#8217;s begin the magic. These are the imports we need.</p><pre class="brush: python; title: ; notranslate">
import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup
</pre><p><a
href="http://docs.python.org/library/urllib.html" target="_blank">urllib</a> and <a
href="http://docs.python.org/library/urllib2.html" target="_blank">urllib2</a> are both modules offering the possibility to read data from various URLs; they will be needed to open the connection and retrieve the website. BeautifulSoup is, as mentioned, an HTML parser.</p><p>Since we are going to fetch our data from a website, we have to behave like a browser. That&#8217;s why we need to fake a <a
href="http://de.wikipedia.org/wiki/User_Agent" target="_blank">user agent</a>. For our program, I chose to push the webstatistics a little in favour of Firefox and Solaris.</p><pre class="brush: python; title: ; notranslate">
user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }
</pre><p>Now let&#8217;s take a look at the code of dict.cc. We need to know how the webform is constructed if we want to query it.</p><pre class="brush: xml; title: ; notranslate">
...
&lt;form style=&quot;margin:0px&quot; action=&quot;http://www.dict.cc/&quot; method=&quot;get&quot;&gt;
  &lt;table&gt;
    &lt;tr&gt;
      &lt;td&gt;
        &lt;input id=&quot;sinp&quot; maxlength=&quot;100&quot; name=&quot;s&quot; size=&quot;25&quot; type=&quot;text&quot;
        style=&quot;padding:2px;width:340px&quot; value=&quot;&quot; /&gt;
      ...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/table&gt;
&lt;/form&gt;
...
</pre><p>The relevant parts are <em>action</em>, <em>method</em> and the <em>name</em> inside the <em>input</em> tag. The action is the webapplication that will get called when the form is submitted. The method shows us how we need to encode the data for the form while the <em>name</em> is our query variable.</p><pre class="brush: python; title: ; notranslate">
values = {'s' : sys.argv[1] }
data = urllib.urlencode(values)
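# Append the encoded query to the URL: passing it as the data argument would send a POST, not a GET
request = urllib2.Request(&quot;http://www.dict.cc/?&quot; + data, None, headers)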
response = urllib2.urlopen(request)
</pre><p>Here the data gets encapsulated in a <a
href="http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol" target="_blank">GET request</a> and packed into the webform. Notice that <em>values</em> is a dictionary, which makes handling more complex forms a charm. Then the form gets submitted by urlopen() &#8211; i.e. we virtually pressed the &#8220;Search&#8221; button.<br
/> See how easy it is? These are only a couple lines of code, but we already have searched on dict.cc for a completely arbitrary word from the commandline. The <em>response</em> has also been retrieved. All that is left, is to extract the relevant information.</p><pre class="brush: python; title: ; notranslate">
the_page = response.read()
pool = BeautifulSoup(the_page)
</pre><p>The <em>response</em> is read and stored as regular html code. This code could now be analyzed via regular string.find() or re.findall() methods, but this implies hard-coding a lot of the underlying logic of the page. Besides, it would require a lot of reverse engineering of the positional parameters and setting up several potentially recursive methods. This would ultimately produce ugly (i.e. not very pythonic) code. Lucky for us, there already is a full-fledged html parser which allows us to ask just about any generic question. Let&#8217;s take a look at the resulting html code first. If you are not yet familiar with the tool that can be seen in the screenshot: I&#8217;m using Firefox with the <a
href="https://addons.mozilla.org/de/firefox/addon/1843" target="_blank">Firebug</a> addon. This one is very helpful if you ever need to debug a website.</p><dl
id="attachment_606" class="wp-caption aligncenter" style="width: 449px;"><dt
class="wp-caption-dt"><a
href="http://blog.dispatched.ch/wp-content/uploads/2009/03/picture-2.png" rel="lightbox[567]"><img
class="size-full wp-image-606" title="dict_cc_search_for_web" src="http://blog.dispatched.ch/wp-content/uploads/2009/03/picture-2.png" alt="dict.cc // search for &quot;web&quot;" width="439" height="334" /></a></dt></dl><p>Let me show an excerpt of the code.</p><pre class="brush: xml; title: ; notranslate">
&lt;table&gt;..
  &lt;td class=&quot;td7nl&quot; style=&quot;background-color: rgb(233, 233, 233);&quot;&gt;
    &lt;a href=&quot;/englisch-deutsch/web.html&quot;&gt;
      &lt;b&gt;web&lt;/b&gt;
    &lt;/a&gt;
  &lt;/td&gt;
&lt;td class=&quot;td7nl&quot; ... /td&gt;
&lt;/table&gt;..
</pre><p>The results are displayed in a table. The two interesting columns share the class <em>td7nl</em>. The most efficient way would seem to just sweep all the data from inside the cells of these two columns. Fortunately for us, BeautifulSoup implemented just that feature.</p><pre class="brush: python; title: ; notranslate">
results = pool.findAll('td', attrs={'class' : 'td7nl'})
source = ''
translations = []
for result in results:
    word = ''
    for tmp in result.findAll(text=True):
        word = word + &quot; &quot; + unicode(tmp).encode(&quot;utf-8&quot;)
    if source == '':
        source = word
    else:
        translations.append((source, word))
for translation in translations:
    print &quot;%s =&gt; %s&quot; % (translation[0], translation[1])
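
# A sketch, not part of the original script: the sample output suggests that the
# td7nl cells alternate between the two languages. Assuming that holds, neighbouring
# cells can be paired instead of pairing every cell with the first one.
# pair_neighbours is a hypothetical helper name.
def pair_neighbours(cells):
    # cells is a list holding the merged text of each cell, in page order;
    # a trailing cell without a partner is dropped
    return list(zip(cells[0::2], cells[1::2]))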
</pre><p><em>results</em> will be a BeautifulSoup.ResultSet, which behaves like a list. Each member is the html code of one cell of the class <em>td7nl</em>, and you can access each element as you would expect in a list. <em>result.findAll(text=True)</em> will return each embedded textual element of that cell. All we have to do is merge the text of the different tags together.<br
/> <em>word</em> collects the text of the current cell, while <em>source</em> keeps the text of the first matching cell. Every later cell is paired with <em>source</em> and saved as a tuple inside the <em>translations</em> list.<br
/> Finally we iterate over the found translations and write them to the screen.</p><pre class="box">
$ python webscraping_demo.py
 kinky   {adj} =>  9 kraus   [Haar]
 kinky   {adj} =>  nappy   {adj}   [Am.]
 kinky   {adj} =>  6 kraus   [Haar]
 kinky   {adj} =>  crinkly   {adj}
 kinky   {adj} =>  kraus
 kinky   {adj} =>  curly   {adj}
 kinky   {adj} =>  kraus
 kinky   {adj} =>  frizzily   {adv}
</pre><p>In a regular application those results would need a little lexing, of course. The most important thing, however, is that we just wrote a translation wrapper around a web application &#8211; in only 28 lines of code. Did I mention that I&#8217;m in love with Python?</p><p>All that is left is for me to recommend the <a
href="http://www.crummy.com/software/BeautifulSoup/documentation.html">BeautifulSoup documentation</a>. What we did here really didn&#8217;t cover what this module is capable of.</p><p>I wish you all the best.</p> ]]></content:encoded> <wfw:commentRss>http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/feed/</wfw:commentRss> <slash:comments>19</slash:comments> </item> </channel> </rss>

