
web2py at pycon 2012

2 month(s) ago uolter   comments 


web2py is the most productive web framework I have used so far.

And this is the presentation that Massimo Di Pierro, its author, gave at PyCon 2012.



Crawling with mongo db

3 month(s) ago uolter   comments 


search with mongodb

I was inspired by the examples in chapter 4, "Searching and Ranking", of the amazing O'Reilly book Programming Collective Intelligence.

This chapter explains how to build a simple crawler and search engine. I don't want to go through the code or pretend to explain how it works, because that has already been done excellently in the book. Instead I want to track my experiments and results here:

The code in the book is meant for teaching and uses SQLite as a local database, which is probably not the best choice for a production environment.

Basically 5 tables have been defined:

  1. urllist: a simple list of URLs filled in by the crawler.
  2. wordlist: the list of words fetched from the pages the URLs point to.
  3. link: joins two URLs from the urllist table. The first URL is the page that contains the link, the second is the target page.
  4. wordlocation: a mapping table that records the position of each word in a page.
  5. linkwords: joins words with the link text they appear in.

This is the table definition:

    # Create the database tables
    def createindextables(self):
        self.con.execute('create table urllist(url)')
        self.con.execute('create table wordlist(word)')
        self.con.execute('create table wordlocation(urlid,wordid,location)')
        self.con.execute('create table link(fromid integer,toid integer)')
        self.con.execute('create table linkwords(wordid,linkid)')
        self.con.execute('create index wordidx on wordlist(word)')
        self.con.execute('create index urlidx on urllist(url)')
        self.con.execute('create index wordurlidx on wordlocation(wordid)')
        self.con.execute('create index urltoidx on link(toid)')
        self.con.execute('create index urlfromidx on link(fromid)')
        self.dbcommit()
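
To see how these tables relate to each other, here is a minimal sketch that builds part of the same schema in an in-memory SQLite database with the standard sqlite3 module, indexes one page and reads its words back in page order. The URL and the text are made up, and `get_or_create` and `add_page` are hypothetical helpers of mine, not the crawler from the book.

```python
import sqlite3

SCHEMA = [
    'create table urllist(url)',
    'create table wordlist(word)',
    'create table wordlocation(urlid, wordid, location)',
    'create index wordidx on wordlist(word)',
    'create index wordurlidx on wordlocation(wordid)',
]

def get_or_create(con, table, field, value):
    """Return the rowid of value in table, inserting it first if needed."""
    row = con.execute('select rowid from %s where %s = ?' % (table, field),
                      (value,)).fetchone()
    if row is not None:
        return row[0]
    cur = con.execute('insert into %s (%s) values (?)' % (table, field), (value,))
    return cur.lastrowid

def add_page(con, url, text):
    """Hypothetical helper: store every word of a page with its position."""
    urlid = get_or_create(con, 'urllist', 'url', url)
    for location, word in enumerate(text.lower().split()):
        wordid = get_or_create(con, 'wordlist', 'word', word)
        con.execute('insert into wordlocation(urlid, wordid, location) values (?, ?, ?)',
                    (urlid, wordid, location))
    return urlid

con = sqlite3.connect(':memory:')
for statement in SCHEMA:
    con.execute(statement)

urlid = add_page(con, 'http://example.com/python.html', 'python is a language python')

# Join wordlocation back to wordlist to rebuild the page, word by word
words_in_order = [row[0] for row in con.execute(
    'select w.word from wordlocation l join wordlist w on w.rowid = l.wordid '
    'where l.urlid = ? order by l.location', (urlid,))]
print(words_in_order)
```

Each distinct word is stored once in wordlist, while wordlocation gets one row per occurrence.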

I wanted to replace SQLite with the NoSQL database MongoDB and compare the performance.

I used pymongo as the Python driver to call the MongoDB API.

SQLite

I'll report here just the code for the query method, since from my point of view it is the key point of the solution. You can find the full code here:

full code

    def getmatchrows(self, q):
        # Pieces of the final query, built up one word at a time
        fieldlist = 'w0.urlid'
        tablelist = ''
        clauselist = ''
        wordids = []

        words = q.split(' ')
        tablenumber = 0

        for word in words:
            # Look up the word id; the placeholder avoids quoting problems
            wordrow = self.con.execute(
                "select rowid from wordlist where word = ?", (word,)).fetchone()

            if wordrow is not None:
                wordid = wordrow[0]
                wordids.append(wordid)

                if tablenumber > 0:
                    tablelist += ','
                    clauselist += ' and '
                    clauselist += 'w%d.urlid = w%d.urlid and ' % (tablenumber - 1, tablenumber)

                fieldlist += ', w%d.location' % tablenumber
                tablelist += 'wordlocation w%d' % tablenumber
                clauselist += 'w%d.wordid=%d' % (tablenumber, wordid)

                tablenumber += 1

        # None of the words is indexed: there is nothing to join
        if not wordids:
            return [], wordids

        fullquery = 'select %s from %s where %s ' % (fieldlist, tablelist, clauselist)

        print('Query: fullquery: %s' % fullquery)

        cur = self.con.execute(fullquery)
        rows = [row for row in cur]

        return rows, wordids

You can test it by running this in a Python shell:

    import mysearch
    pages = ['http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html']
    crawler = mysearch.crawler('searchindex.db')
    crawler.crawl(pages)

This step crawls a list of pages. For the example above it takes more than one hour.

After that you can start to run some queries:

    e = mysearch.searchengine('searchindex.db')
    e.query('python')
    Query: fullquery:
    select w0.urlid, w0.location
    from wordlocation w0
    where
    w0.wordid=849
    1.000000 http://kiwitobes.com/wiki/Python_programming_language.html
    0.070968 http://kiwitobes.com/wiki/Ruby_programming_language.html
    0.045161 http://kiwitobes.com/wiki/Categorical_list_of_programming_langua
    .............................................................................
    .............................................................................
    Query executed in 0 sec

where 849 is the id of the word "python" in the wordlist table.

Something a little more complicated:

    e.query('python java ruby')
    Query: fullquery:
    select w0.urlid, w0.location, w1.location, w2.location
    from wordlocation w0,wordlocation w1,wordlocation w2
    where
    w0.wordid=849
    and w0.urlid = w1.urlid
    and w1.wordid=489
    and w1.urlid = w2.urlid
    and w2.wordid=985
    1.000000 http://kiwitobes.com/wiki/Ruby_programming_language.html
    0.242947 http://kiwitobes.com/wiki/Python_programming_language.html
    ............................................................
    .............................................................
    0.010972 http://kiwitobes.com/wiki/XSL_Transformations.html
    Query executed in 2 sec

In this case it took around 2 seconds to present the result, and the SQL statement joins three instances of the wordlocation table.
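
The self-join is easier to follow on a tiny data set. This is a minimal sketch, assuming an in-memory SQLite database with made-up rows; `build_match_query` is a hypothetical name that mirrors the string building in getmatchrows but binds the word ids as parameters.

```python
import sqlite3

def build_match_query(wordids):
    """Build one self-join over wordlocation, with one alias per word id."""
    fields = ['w0.urlid']
    tables = []
    clauses = []
    for n in range(len(wordids)):
        if n > 0:
            # Each alias must refer to the same page as the previous one
            clauses.append('w%d.urlid = w%d.urlid' % (n - 1, n))
        fields.append('w%d.location' % n)
        tables.append('wordlocation w%d' % n)
        clauses.append('w%d.wordid = ?' % n)
    return 'select %s from %s where %s' % (
        ', '.join(fields), ', '.join(tables), ' and '.join(clauses))

con = sqlite3.connect(':memory:')
con.execute('create table wordlocation(urlid, wordid, location)')
con.executemany('insert into wordlocation values (?, ?, ?)', [
    (1, 10, 0), (1, 20, 3),               # page 1 contains both words
    (2, 10, 5),                           # page 2 contains only word 10
    (3, 20, 1), (3, 10, 2), (3, 10, 7),   # page 3 contains word 10 twice
])

wordids = [10, 20]
query = build_match_query(wordids)
rows = sorted(con.execute(query, wordids).fetchall())
print(query)
print(rows)
```

Page 2 drops out because it lacks the second word, and page 3 produces two rows, one for each combination of positions. That multiplication of rows is part of why the query gets slower as words are added.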

MongoDB

The next step was to figure out how to adapt the table schemas and the SQL query discussed above to the MongoDB world.

This is what I did, the simplest and fastest solution: one collection for each table:

    # urllist: the list of urls
    db.urllist.find_one()
    {u'url': u'http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html',
     u'_id': ObjectId('4f2c26837db8360819000000')}

    # wordlist: the list of words
    db.wordlist.find_one()
    {u'_id': ObjectId('4f2c26837db8360819000001'), u'word': u'doctype'}

    # link: a link from one url to another
    db.link.find_one()
    {u'toid': u'4f2c26ac7db8360819001456', u'fromid': u'4f2c26837db8360819000000',
     u'_id': ObjectId('4f2c27177db8360819002360')}

    # linkwords: the words in the text of a link
    db.linkwords.find_one()
    {u'linkid': u'4f2c27177db8360819002360', u'_id': ObjectId('4f2c27177db8360819002361'),
     u'wordid': u'4f2c26a27db8360819000029'}

    # wordlocation: the words in a page (identified by its url) and their position in the page.
    # The position matters for ranking: the most relevant words tend to be near the top of
    # the page, and for a query with a few words the pages where the words are closer
    # together have to be put on top of the results.
    db.wordlocation.find_one()
    {u'urlid': u'4f2c26837db8360819000000', u'_id': ObjectId('4f2c26a27db8360819000002'),
     u'location': 0, u'wordid': u'4f2c26837db8360819000001'}
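
The ranking idea in the last comment (words near the top of the page, and query words close to each other) can be sketched in a few lines. This is a simplified version of mine, loosely in the spirit of the scoring functions in the book but not their exact formulas. It assumes rows shaped like the ones the SQL version returns, (urlid, location of word 0, location of word 1, ...), and all the data is made up.

```python
def normalize_small_is_better(raw):
    """Map raw values to scores in (0, 1], giving 1.0 to the smallest value."""
    best = min(raw.values())
    # Adding 1 to both sides avoids dividing by zero when a value is 0
    return dict((key, (best + 1.0) / (value + 1.0)) for key, value in raw.items())

def location_score(rows):
    """Pages where the query words appear early score higher."""
    best = {}
    for row in rows:
        total = sum(row[1:])
        if row[0] not in best or total < best[row[0]]:
            best[row[0]] = total
    return normalize_small_is_better(best)

def distance_score(rows):
    """Pages where the query words sit close together score higher."""
    best = {}
    for row in rows:
        dist = sum(abs(row[i] - row[i - 1]) for i in range(2, len(row)))
        if row[0] not in best or dist < best[row[0]]:
            best[row[0]] = dist
    return normalize_small_is_better(best)

# (urlid, location of first word, location of second word)
rows = [('a', 0, 3), ('b', 40, 41), ('b', 2, 90)]
loc = location_score(rows)
dist = distance_score(rows)
print(loc)
print(dist)
```

Page 'a' wins on location because both words are near the top, while page 'b' wins on distance because it has the two words next to each other. A real ranking would combine several scores like these with weights.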

And finally the query function:

    def getmatchrows(self, q):

        wordids = []
        words = q.split(' ')

        # Cursor over the wordlocation documents matched so far
        p = None

        for word in words:
            w = self.db.wordlist.find_one({'word': word})
            if w:
                wid = str(w['_id'])
                wordids.append(wid)

                if p is None:
                    p = self.db.wordlocation.find({'wordid': wid})
                else:
                    # Keep only the pages that also matched the previous words
                    urlids = [a['urlid'] for a in p]
                    p = self.db.wordlocation.find({'wordid': wid, 'urlid': {'$in': urlids}})

        # p is still None when none of the words is indexed
        rows = [row for row in p] if p is not None else []

        return rows, wordids

The SQL solution needs n + 1 queries for n words (one wordlist lookup per word plus the final join). The Mongo solution hits the database 2n times: one wordlist lookup and one wordlocation query per word.
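
To make the narrowing step concrete without a running mongod, here is a minimal sketch that replays the same logic on plain Python lists and dicts. The `find` function is only a stand-in for `wordlocation.find()` with a `$in` filter, not the pymongo API, and the documents are made up. It also counts the lookups, so the number of round trips per query is visible.

```python
def find(collection, wordid, urlids=None):
    """Stand-in for wordlocation.find(): filter by wordid and, optionally, by urlid."""
    return [doc for doc in collection
            if doc['wordid'] == wordid
            and (urlids is None or doc['urlid'] in urlids)]

def match_rows(wordlist, wordlocation, query):
    """Narrow the candidate pages one query word at a time."""
    rows = None
    wordids = []
    lookups = 0
    for word in query.split():
        lookups += 1                  # one wordlist lookup per word
        wordid = wordlist.get(word)
        if wordid is None:
            continue
        wordids.append(wordid)
        # Pages that matched all the previous words, if any
        urlids = None if rows is None else set(doc['urlid'] for doc in rows)
        rows = find(wordlocation, wordid, urlids)
        lookups += 1                  # one wordlocation query per known word
    return rows or [], wordids, lookups

wordlist = {'python': 'w1', 'ruby': 'w2'}
wordlocation = [
    {'urlid': 'u1', 'wordid': 'w1', 'location': 0},
    {'urlid': 'u1', 'wordid': 'w2', 'location': 4},
    {'urlid': 'u2', 'wordid': 'w1', 'location': 2},
    {'urlid': 'u3', 'wordid': 'w2', 'location': 9},
]

rows, wordids, lookups = match_rows(wordlist, wordlocation, 'python ruby')
print(rows)
print(lookups)
```

Only page u1 contains both words, so it is the only one left at the end. Note that the rows returned carry the locations of the last word only, unlike the SQL join, which returns one location per word.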

Start the mongod process and start a python console:

    import mongosearch
    pages = ['http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html']
    crawler = mongosearch.crawler('searchindex')
    crawler.crawl(pages)

    e = mongosearch.searchengine('searchindex')
    e.query('python java ruby')

    1.000000 {u'url': u'http://kiwitobes.com/wiki/Ruby_programming_language.html', u'_id': ObjectId('4f2c271d7db8360819002862')}
    0.103448 {u'url': u'http://kiwitobes.com
    0.025862 {u'url': u'http://kiwitobes.com/wiki/Interpreted_language.html', u'_id': ObjectId('4f2c271a7db83608190026e2')}
    ------------------------------------------------------------
    ------------------------------------------------------------
    0.017241 {u'url': u'http://kiwitobes.com/wiki/Smalltalk_programming_language.html', u'_id': ObjectId('4f2c271f7db8360819002992')}
    Query executed in 0 sec

I got the results in less than one second, while I had to wait about 2 seconds with the SQLite solution.
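
Whole-second timings hide most of the difference between the two backends. Here is a minimal sketch of a finer-grained timer, assuming Python 3 for `time.perf_counter`. `timed` is a hypothetical helper of mine, and `sum` just stands in for a call such as `e.query('python java ruby')`.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds) using a high-resolution clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# sum() is only a placeholder for the real query call
result, elapsed = timed(sum, range(1000))
print('Query executed in %.4f sec' % elapsed)
```

Running each query a few times and keeping the best time would also reduce the noise from caching and other processes.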

You can find the full code here

To get good performance with MongoDB too, it's really important to have the required indexes defined.

Use the mongo shell to add the following indexes:

    use searchindex

    db.wordlocation.ensureIndex({wordid:1})
    db.wordlist.ensureIndex({word:1})
    db.urllist.ensureIndex({url:1})

Note that crawling Toby Segaran's site took me more than one hour. Try something smaller or bigger depending on the time you have and how fast your network is.



Once we were developers...

7 month(s) ago uolter   comments 


I started coding when my parents gave me a Sinclair ZX Spectrum while I was in secondary school, so my first programming language was BASIC. While most of my friends with a computer were keen on gaming, I spent plenty of time learning to program and inventing small projects to solve my day-to-day needs.

Sinclair

Then I went to university and studied Computer Science, where I learned to code in Pascal, C and C++.

I got my degree while the first Internet bubble was growing, and in my profession I had the chance to gain solid experience with Java architectures.

Nowadays I code mainly in Python, for fun and for, let's say, my non-work stuff, and my OS is Linux, in probably its easiest distribution: Ubuntu.

A few days ago I was typing some Java code to test a bloody SSO that wasn't working, and in a provocative way I asked one of the young developers sitting beside me: "How can I comment out these 3 lines of code?"

He answered: "Select the three lines and press ctrl+k."

WHAT??? ctrl+k???

He was actually right, but I was expecting something like:

/*
line 1
line 2
line 3
*/

ctrl+k is an IDE shortcut, not language syntax.

Even though I spend very little time coding in the office, since I moved closer to coordination work, I'm still on the front line with developers, and I have the feeling that most of them, who are actually great developers, have moved from coding to knowing an IDE.

My favourite IDE is still vim ;-)



singleton with decorator

8 month(s) ago uolter   comments 


I had to implement the Singleton pattern in a small Python program. The idea under the hood was to have a single instance of an object responsible for loading a heavy XML file that doesn't change during the life cycle of any of the possible use cases.

Definition: the singleton pattern is a design pattern used to implement the mathematical concept of a singleton, by restricting the instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Ref: Wikipedia

And this is the best implementation I found. Very cool:

    class SingletonDecorator:
        def __init__(self, klass):
            self.klass = klass
            self.instance = None

        def __call__(self, *args, **kwds):
            # Create the wrapped class only on the first call
            if self.instance is None:
                self.instance = self.klass(*args, **kwds)
            return self.instance

As you can guess from the name, it is meant to be used as a decorator:

    @SingletonDecorator
    class foo:
        def __init__(self):
            self.value = 0
        def set_value(self, value):
            self.value = value
        def print_value(self):
            print(self.value)

    >>> a = foo()
    >>> a.set_value(5)
    >>> b = foo()
    >>> b.print_value()
    5
    >>> b.set_value(13)
    >>> a.print_value()
    13
    >>>

This approach is very nice from my point of view, since the pattern implementation is completely separated from the business logic, which can live in a different class.
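
Going back to the original motivation, here is a minimal sketch of the decorator applied to an XML loader. `Catalog`, the `load_count` counter and the XML snippets are hypothetical, invented for the example. The standard `xml.etree.ElementTree` module does the parsing, and the decorator is repeated so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

class SingletonDecorator:
    def __init__(self, klass):
        self.klass = klass
        self.instance = None

    def __call__(self, *args, **kwds):
        if self.instance is None:
            self.instance = self.klass(*args, **kwds)
        return self.instance

load_count = 0  # how many times the XML is actually parsed

@SingletonDecorator
class Catalog:
    """Hypothetical business class: it knows nothing about the pattern."""
    def __init__(self, xml_text):
        global load_count
        load_count += 1
        self.root = ET.fromstring(xml_text)

    def titles(self):
        return [item.text for item in self.root.findall('item')]

first = Catalog('<catalog><item>a</item><item>b</item></catalog>')
second = Catalog('<catalog><item>ignored</item></catalog>')

print(first is second)
print(second.titles())
print(load_count)
```

Two caveats are worth keeping in mind. Arguments passed after the first call are silently ignored, as the second call shows. And the name `Catalog` now refers to the decorator instance rather than to a class, so it can no longer be used with `isinstance`.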

You can compare this solution with the Java implementation below, where the business rules end up mixed with the pattern itself.

This is the Java version:

    public class Singleton {
        private static final Singleton instance = new Singleton();

        // Private constructor prevents instantiation from other classes
        private Singleton() {
            // put your initialization code here
        }

        public static Singleton getInstance() {
            return instance;
        }
    }



your tweets with web2py and google app engine

9 month(s) ago uolter   comments 


As you can see, at the bottom right of this page you can find my latest posts on Twitter.

This blog is currently deployed on Google App Engine and built on top of web2py.

I used more or less the same code shown in the book:

    import json
    import logging
    from gluon.tools import fetch

    @cache(request.env.path_info, time_expire=20*60, cache_model=cache.ram)
    def get_tweets():

        try:
            page = fetch('http://api.twitter.com/1/statuses/user_timeline.json?screen_name=%s' % twitter_user)
        except Exception:
            return 'Error reading from Twitter'

        tweets = json.loads(page)

        try:
            # One list item per tweet, with links added to the text
            ret = UL([LI(XML(__add_link__(item['text'])), _class='twitter')
                      for item in tweets]).xml()
            return ret
        except Exception:
            logging.error('Error getting tweets')
            logging.error(tweets)
            return 'Error getting tweets'

I introduced the decorator on the first line in order to cache the results with memcached for 20 minutes. In fact, the Twitter API cannot be called more than 150 times per hour, and caching the results makes sense for performance reasons too. I am not tweeting all day.
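
What a time-based cache does can be sketched without web2py. This is a minimal illustration of the idea, not how web2py's `@cache` is implemented. `cache_for` is a hypothetical decorator, the clock is injected so the expiry can be shown without waiting 20 minutes, and `get_tweets` here only pretends to call the API.

```python
import functools
import time

def cache_for(seconds, clock=time.time):
    """Cache the result of a zero-argument function for the given number of seconds."""
    def decorator(func):
        state = {'value': None, 'expires': None}

        @functools.wraps(func)
        def wrapper():
            now = clock()
            if state['expires'] is None or now >= state['expires']:
                # Nothing cached yet, or the cached value is too old
                state['value'] = func()
                state['expires'] = now + seconds
            return state['value']
        return wrapper
    return decorator

fake_now = [0]   # a controllable clock, in seconds
calls = []       # records every simulated call to the remote API

@cache_for(20 * 60, clock=lambda: fake_now[0])
def get_tweets():
    calls.append(1)
    return 'tweets #%d' % len(calls)

a = get_tweets()          # first call: goes to the "API"
fake_now[0] = 19 * 60
b = get_tweets()          # 19 minutes later: served from the cache
fake_now[0] = 20 * 60
c = get_tweets()          # 20 minutes later: expired, fetched again
print(a, b, c, len(calls))
```

With a 20-minute expiry a single cache makes at most three upstream calls per hour, far below a limit of 150.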

As you can imagine from the code, I call the method via Ajax and return the result as an HTML list.

    {{=LOAD(c='default', f='get_tweets', target='tweets', ajax=True)}}
    <div id='tweets'></div>

Everything went fine for a month or more, then the calls to the Twitter API started coming back with an error saying that more than 150 requests per hour are not allowed.

"How is it possible?" I'm caching every call for 20 minutes. Is memcache on Google App Engine doing it wrong?

Then I realized that my application doesn't have a static IP assigned, and probably other applications sharing the same addresses were calling the Twitter API too. My application is not alone :-)

So I had to switch to authenticated calls to the Twitter API with OAuth.

The oauth.py library, which you can grab from GitHub, helped me achieve the goal. I put the Python module inside the modules directory of my web2py app and changed the code in this way:

        # at the top of get_tweets(): use the authenticated call on App Engine
        if request.env.web2py_runtime_gae:
            return __gae_get_tweet()

        try:
            page = fetch('http://api.twitter.com/1/statuses/user_timeline.json?screen_name=%s' % twitter_user)
        except Exception:
            return 'Error reading from Twitter'

The helper functions:

    def __gae_get_tweet():

        oauth = local_import('oauth')

        client = oauth.TwitterClient(twitter_consumer_key, twitter_consumer_secret,
                                     twitter_callback_url)

        timeline_url = "http://twitter.com/statuses/user_timeline.json"
        result = client.make_request(url=timeline_url, token=twitter_access_token_key,
                                     secret=twitter_access_token_secret)

        tweets = json.loads(result.content)

        return UL([LI(XML(__add_link__(item['text'])), _class='twitter')
                   for item in tweets]).xml()

    def __add_link__(text):
        # Turn urls, @mentions and #hashtags into links, word by word
        l = text.split()

        ret = ''
        try:
            for i, w in enumerate(l):
                if w.startswith('http'):
                    l[i] = A(w, _href=w).xml()
                elif w.startswith('@'):
                    l[i] = A(w, _href='http://twitter.com/%s' % w.replace('@', '')).xml()
                elif w.startswith('#'):
                    l[i] = A(w, _href='http://twitter.com/#!/search?q=%s' % w.replace('#', '')).xml()

                ret += l[i] + ' '
        except UnicodeDecodeError:
            logging.error('Error encoding tweet message')
            return text

        return ret
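
The link formatting can also be tried outside web2py. Here is a minimal standalone sketch for Python 3, with `html.escape` from the standard library taking the place of the `A(...)` helper. `add_links` is a hypothetical name, and the URL patterns are the ones used in the code above.

```python
from html import escape

def add_links(text):
    """Wrap urls, @mentions and #hashtags in anchor tags and escape everything else."""
    out = []
    for word in text.split():
        if word.startswith(('http://', 'https://')):
            href = word
        elif word.startswith('@') and len(word) > 1:
            href = 'http://twitter.com/%s' % word[1:]
        elif word.startswith('#') and len(word) > 1:
            href = 'http://twitter.com/#!/search?q=%s' % word[1:]
        else:
            # Plain word: escape it so tweet text cannot inject markup
            out.append(escape(word))
            continue
        out.append('<a href="%s">%s</a>' % (escape(href, quote=True), escape(word)))
    return ' '.join(out)

html = add_links('reading @uolter on #web2py <b>')
print(html)
```

Escaping the plain words matters because the result is inserted into the page as raw markup.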

Before this code can run successfully you must register the application on the Twitter developer site, get the access token and the access token secret, and assign them to these variables:

twitter_user = 'uollter'

twitter_access_token_key = '32822337...........................................'
twitter_access_token_secret = '8JIvA...........................................'

A model file is the right place to put them.



 