web2py at pycon 2012
2 month(s) ago uolter comments
web2py is the most productive web framework I have used so far.
And this is the presentation that Massimo Di Pierro, its author, gave at PyCon 2012.
Crawling with mongo db
3 month(s) ago uolter comments

I've been inspired by the examples in chapter 4, Searching and Ranking, of the amazing O'Reilly book Programming Collective Intelligence.
This chapter explains how to build a simple crawler and search engine. I don't want to go through the code or pretend to explain how it works, because that has already been done excellently in the book. Instead I want to track my experiments and their results here:
The code in the book is meant for teaching and uses SQLite as a local database, which is probably not the best choice for a production environment.
Basically 5 tables have been defined:
- urllist: a simple list of urls filled in by the crawler.
- wordlist: the list of words fetched from each page the urls point to.
- link: joins two urls from the urllist table. The first url is the page which contains the link, and the second url is the target page.
- wordlocation: a mapping table that records the position of a word in a page.
- linkwords: joins the words with the link text in each page.
This is the table definition:
1. | # Create the database tables |
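Since only the first line of the listing survives here, the following is a minimal sketch of what the five tables could look like with Python's sqlite3 module. The column names (urlid, wordid, location, fromid, toid, linkid) and the index names are assumptions based on the descriptions above, not the original code.

```python
import sqlite3

def create_index_tables(con):
    """Create the five tables described above.

    Column and index names are assumptions made for this sketch.
    """
    con.executescript("""
        CREATE TABLE urllist(url TEXT);
        CREATE TABLE wordlist(word TEXT);
        CREATE TABLE wordlocation(urlid INTEGER, wordid INTEGER, location INTEGER);
        CREATE TABLE link(fromid INTEGER, toid INTEGER);
        CREATE TABLE linkwords(wordid INTEGER, linkid INTEGER);
        CREATE INDEX wordidx ON wordlist(word);
        CREATE INDEX urlidx ON urllist(url);
        CREATE INDEX wordurlidx ON wordlocation(wordid);
    """)
    con.commit()

# Use an in-memory database so the sketch leaves no file behind.
con = sqlite3.connect(":memory:")
create_index_tables(con)
tables = sorted(
    row[0]
    for row in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

SQLite assigns an implicit rowid to every row, which is what the id columns in link, wordlocation and linkwords would refer to in this sketch.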
I wanted to replace SQLite with the NoSQL database MongoDB and compare their performance.
I also used pymongo as the Python driver to call the MongoDB API.
SQLite
I'll report here just the code for the query method, since from my point of view it is the key point of the solution. You can find the full code here:
1. | def getmatchrows(self, q): |
You can test it by running it in a Python shell:
1. | import mysearch |
This step will crawl a list of pages. In the example above it will take more than one hour.
After that you can start to run some queries:
1. | e = mysearch.searchengine('searchindex.db') |
where 849 is the id of the word python in the wordlist table.
Something a little more complicated:
1. | e.query('python java ruby') |
In this case it took around 2 seconds to present the result, and the SQL statement joins three tables.
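To make the three-way join concrete, here is a minimal sketch of how such a statement could be built for any number of words. It assumes that wordlocation has the columns urlid, wordid and location, and that the table is joined to itself once per word; build_match_query is a hypothetical helper written for this illustration, not the method from the full code.

```python
import sqlite3

def build_match_query(wordids):
    """Build a self-join over wordlocation, one alias (w0, w1, ...) per word.

    Hypothetical helper: the column names are assumptions for this sketch.
    """
    fields = ["w0.urlid"]
    tables = []
    clauses = []
    for n, _ in enumerate(wordids):
        alias = "w%d" % n
        fields.append(alias + ".location")
        tables.append("wordlocation " + alias)
        if n > 0:
            # Every word must occur in the same page as the previous one.
            clauses.append("w%d.urlid = %s.urlid" % (n - 1, alias))
        clauses.append(alias + ".wordid = ?")
    sql = "SELECT %s FROM %s WHERE %s" % (
        ", ".join(fields), ", ".join(tables), " AND ".join(clauses))
    return sql, list(wordids)

# Tiny in-memory data set: (urlid, wordid, location)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE wordlocation(urlid INTEGER, wordid INTEGER, location INTEGER)")
con.executemany(
    "INSERT INTO wordlocation VALUES (?, ?, ?)",
    [(1, 10, 0), (1, 20, 5), (1, 30, 9), (2, 10, 3), (2, 20, 4)],
)

sql, params = build_match_query([10, 20, 30])
rows = con.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Only page 1 contains all three word ids, so it is the only page returned, together with the location of each word.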
MongoDB
The next step is to figure out how to adapt the table schemas already discussed, and the SQL statement used for the query, to the MongoDB world.
This is what I did. The simplest and fastest solution is a collection for each table:
1. | # urllist document: => list of urls |
and finally the query function:
1. | def getmatchrows(self, q): |
Like the SQL solution, where 2n queries are required (n is the number of words), the MongoDB solution hits the database the same number of times.
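The 2n count can be illustrated with a minimal sketch: one lookup in wordlist plus one lookup in wordlocation for each word, with the intersection done in Python. The document shapes and the match_urls function are assumptions made for this illustration. FakeCollection is an in-memory stand-in so that the sketch runs without a mongod process; it only mimics the find and find_one method names of a pymongo collection, with equality filters only.

```python
class FakeCollection:
    """In-memory stand-in for a MongoDB collection (equality filters only)."""

    def __init__(self, docs):
        self.docs = docs
        self.queries = 0  # counts how many times the "database" is hit

    def _matches(self, doc, spec):
        return all(doc.get(key) == value for key, value in spec.items())

    def find(self, spec):
        self.queries += 1
        return [doc for doc in self.docs if self._matches(doc, spec)]

    def find_one(self, spec):
        self.queries += 1
        for doc in self.docs:
            if self._matches(doc, spec):
                return doc
        return None


def match_urls(wordlist, wordlocation, q):
    """Return the urls containing every word in q (hypothetical sketch)."""
    urlsets = []
    for word in q.split():
        entry = wordlist.find_one({"word": word})            # query 1 for this word
        if entry is None:
            return set()
        hits = wordlocation.find({"wordid": entry["_id"]})   # query 2 for this word
        urlsets.append({hit["urlid"] for hit in hits})
    return set.intersection(*urlsets) if urlsets else set()


# Assumed document shapes, one collection per table.
wordlist = FakeCollection([
    {"_id": 1, "word": "python"},
    {"_id": 2, "word": "java"},
])
wordlocation = FakeCollection([
    {"urlid": "a", "wordid": 1, "location": 0},
    {"urlid": "a", "wordid": 2, "location": 3},
    {"urlid": "b", "wordid": 1, "location": 7},
])

result = match_urls(wordlist, wordlocation, "python java")
total_queries = wordlist.queries + wordlocation.queries
print(result, total_queries)
```

With two words the stand-in collections are hit four times in total, which is the 2n behaviour described above.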
Start the mongod process and start a python console:
1. | import mongosearch |
I got the results in less than one second, while I had to wait more than 2 seconds with the SQLite solution.
You can find the full code here
To get good performance with MongoDB too, it's really important to have the required indexes defined.
Use the mongodb client and add the following indexes:
1. | use searchindex |
Note that crawling Toby Segaran's site took me more than one hour. Try something smaller or bigger depending on the time you have and how fast your network is.
Once we were developers...
7 month(s) ago uolter comments
I started coding when my parents gave me a Sinclair ZX Spectrum while I was at secondary school, so my first programming language was BASIC. While most of my friends with a computer were very keen on gaming, I was spending plenty of time learning to program and inventing small projects to solve my day-to-day needs.

Then I went to university and studied Computer Science: there I learned to code in Pascal, C and C++.
I got my degree while the first Internet bubble was growing, and in my profession I had the chance to gain strong experience in Java architectures.
Nowadays I code mainly in Python, for fun and for, let's say, my non-working stuff, while my OS is Linux, in probably its easiest distribution: Ubuntu.
A few days ago I was typing some Java code in order to test a bloody SSO which wasn't working, and in a provoking way I asked one of the young developers sitting beside me: "How can I comment out these 3 lines of code?"
He came back to me saying: "select the three lines and press ctrl+k"
WHAT??? ctrl+k???
He was actually right, but I was expecting something like:
1. | /* |
ctrl+k is an IDE shortcut, not the language syntax.
Even if I spend very little time coding in the office, since I moved closer to coordination work, I'm still on the front line with developers, and I have the feeling that most of them, who are actually great developers, have moved from knowing how to code to knowing an IDE.
My favourite IDE is still vim ;-)
singleton with decorator
8 month(s) ago uolter comments
I had to implement the Singleton pattern in a small Python program. The idea under the hood was to have a single instance of an object responsible for loading a heavy XML file which doesn't change during the life cycle of any of the possible use cases.
Definition: the singleton pattern is a design pattern used to implement the mathematical concept of a singleton, by restricting the instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. (ref: Wikipedia)
And this is the best I found. Very cool:
1. | class SingletonDecorator: |
As you can guess this is a decorator too.
1. | @SingletonDecorator |
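Only the first line of each listing survives here, so the following is a minimal sketch of one common way to write such a decorator class, not necessarily the exact code of the post. XmlCatalog and its path argument are hypothetical names standing in for the class that loads the heavy XML file.

```python
class SingletonDecorator:
    """Wrap a class so that calling it always returns the same instance."""

    def __init__(self, klass):
        self.klass = klass
        self.instance = None

    def __call__(self, *args, **kwargs):
        # Build the wrapped class only on the first call; later calls
        # ignore their arguments and return the cached instance.
        if self.instance is None:
            self.instance = self.klass(*args, **kwargs)
        return self.instance


@SingletonDecorator
class XmlCatalog:
    """Hypothetical business class: it knows nothing about the pattern."""

    def __init__(self, path):
        self.path = path  # a real class would parse the XML file here


first = XmlCatalog("catalog.xml")
second = XmlCatalog("other.xml")
print(first is second, second.path)
```

One consequence of this design is that after decoration the name XmlCatalog refers to a SingletonDecorator instance rather than to the class, so isinstance checks against XmlCatalog no longer work as usual.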
From my point of view this approach is very nice, since the pattern implementation is completely separated from the business logic, which can live inside a different class.
You can compare this solution with the Java implementation below, where all the business rules are mixed with the pattern itself.
This is the Java version:
1. | public class Singleton { |
your tweets with web2py and google app engine
9 month(s) ago uolter comments
As you may have noticed, my latest posts on Twitter are shown at the bottom right of this page.
At the moment this blog is deployed on Google App Engine and built on top of web2py.
I used more or less the same code shown in the book:
1. | @cache(request.env.path_info, time_expire=20*60, cache_model=cache.ram) |
I introduced the decorator at line 1 in order to cache the results with memcached for 20 minutes. In fact, it is not possible to call the Twitter API more than 150 times per hour, and caching the results makes sense for performance reasons too. I am not tweeting all day.
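To show the idea behind the 20-minute cache outside web2py, here is a minimal plain-Python sketch. timed_cache is a hypothetical decorator written for this illustration and is not web2py's @cache; the clock parameter is an assumption added only so that the example can simulate the passing of time.

```python
import functools
import time

def timed_cache(seconds, clock=time.time):
    """Cache the result of a zero-argument function for `seconds` seconds."""
    def decorator(func):
        state = {"value": None, "expires": None}

        @functools.wraps(func)
        def wrapper():
            now = clock()
            if state["expires"] is None or now >= state["expires"]:
                # Cache is empty or stale: call the real function again.
                state["value"] = func()
                state["expires"] = now + seconds
            return state["value"]
        return wrapper
    return decorator


calls = []        # records every real (uncached) call
fake_now = [0.0]  # simulated clock, in seconds

@timed_cache(20 * 60, clock=lambda: fake_now[0])
def get_tweets():
    calls.append(1)  # stands in for the remote API request
    return ["tweet batch %d" % len(calls)]

get_tweets()              # t = 0 min: real call
fake_now[0] = 10 * 60
get_tweets()              # t = 10 min: served from the cache
fake_now[0] = 25 * 60
latest = get_tweets()     # t = 25 min: cache expired, real call again
print(len(calls), latest)
```

With a 20-minute expiry a single cache makes at most three real calls per hour, far below the limit of 150.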
As you can imagine from the code, I call the method via ajax and get the result back as a list.
1. | {{=LOAD(c='default', f='get_tweets', target='tweets', ajax=True)}} |
Everything went fine for a month or more, then the calls to the Twitter API started coming back with an error saying that more than 150 requests per hour are not allowed.
"How is it possible?" I'm caching every call for 20 minutes. Is memcache on Google App Engine doing it wrong?
Then I realized that my application has no static IP of its own, and that the same addresses were probably being used by other applications calling the Twitter API. My application is not alone :-)
So I had to introduce authenticated calls to the Twitter API with OAuth.
The oauth.py library, which you can grab from GitHub, helped me achieve the goal. I put the Python module inside the modules directory of my web2py app and changed the code in this way:
1. | if request.env.web2py_runtime_gae: |
1. | def __gae_get_tweet(): |
Before this code can run successfully you must register the application on the Twitter developer site, get the access token and access token secret, and assign them to the variables:
1. | twitter_user = 'uollter' |
A model file is the right place to put them.
