
Coding Challenge with TDD

3 week(s) ago uolter   comments 


Write a function DashInsert(num) that inserts dashes ('-') between each two odd numbers and asterisks ('*') between each two even numbers in num. For example, if num is 4546793, the output should be 454*67-9-3. Don't count zero as an odd or even number.

import unittest


def dash_insert(num):
    """Insert dashes ('-') between each two odd numbers
    and asterisks ('*') between each two even numbers in num.
    Zero counts as neither odd nor even.
    """
    new_num = []
    num = str(num)
    for i, n in enumerate(num):
        if i > 0 and int(n) != 0:
            prev = int(num[i - 1])
            if int(n) % 2:
                # odd digit
                if prev != 0 and prev % 2:
                    # an odd digit follows an odd digit
                    new_num.append('-')
            else:
                # even digit
                if prev != 0 and not prev % 2:
                    # an even digit follows an even digit
                    new_num.append('*')
        new_num.append(n)
    return ''.join(new_num)


class TestInsertDash(unittest.TestCase):

    def test_ok(self):
        self.assertEqual(dash_insert(4546793), '454*67-9-3')

    def test_with_zeros(self):
        self.assertEqual(dash_insert(45406793), '454067-9-3')


if __name__ == '__main__':
    unittest.main()
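The same rule can also be written by comparing each digit with its neighbour. The following is only a sketch of one possible variant, not the original solution; the name dash_insert_pairs is made up for this illustration.

```python
def dash_insert_pairs(num):
    """Hypothetical variant: walk over adjacent digit pairs."""
    digits = str(num)
    out = [digits[0]]
    for prev, cur in zip(digits, digits[1:]):
        a, b = int(prev), int(cur)
        # Zero is neither odd nor even, so a pair containing 0 gets no separator.
        if a and b and a % 2 == b % 2:
            out.append('-' if a % 2 else '*')
        out.append(cur)
    return ''.join(out)


print(dash_insert_pairs(4546793))
print(dash_insert_pairs(45406793))
```

In a TDD spirit, edge cases such as a single digit or a number made only of odd digits are natural next tests to write.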


Debug Bash Script How-To

2 month(s) ago uolter   comments 




What Does Open Source Mean to You?

3 month(s) ago uolter   comments 




Term Frequency-Inverse Document Frequency

3 month(s) ago uolter   comments 


This is a very simple example that illustrates how one of the most widely used techniques works for determining which document is most relevant to a given topic, defined as a query term.

You can refer to the Wikipedia page for an exhaustive description of the tf-idf technique. The Python code for a basic implementation is available on GitHub.
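To give an idea of what such a basic implementation might look like, here is a minimal sketch. It assumes the textbook definitions (term count divided by document length for tf, natural log of the number of documents over the number of documents containing the term for idf). The tokenize helper and these exact formulas are assumptions for illustration; the code on GitHub may weight or tokenize differently, so the numbers it produces are not guaranteed to match.

```python
import math
import re


def tokenize(text):
    """Assumed helper: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf(term, document):
    """Term frequency: occurrences of term divided by the document length."""
    words = tokenize(document)
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


def idf(term, documents):
    """Inverse document frequency: log(N / number of documents containing term)."""
    documents = list(documents)
    containing = sum(1 for d in documents if term.lower() in tokenize(d))
    if containing == 0:
        # Term appears nowhere; return 0 rather than dividing by zero.
        return 0.0
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    """Product of the two measures above."""
    return tf(term, document) * idf(term, documents)
```

A term scores high for a document when it is frequent in that document but rare across the whole collection.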

Example

Now suppose you have 3 documents and you want to find out which one is most relevant to the topic "Open Source".

Document 1

In production and development, open source as a development model promotes a) universal access via free license to a product's design or blueprint, and b) universal redistribution of that design or blueprint, including subsequent improvements to it by anyone.

Document 2

Open source software is software that can be freely used, changed, and shared (in modified or unmodified form) by anyone. Open source software is made by many people, and distributed under licenses that comply with the Open Source Definition.

Document 3

Leaders of the nation’s biggest technology firms warned President Obama during a lengthy meeting at the White House on Tuesday that National Security Agency spying programs are damaging their reputations and could harm the broader economy.

You can already see for yourself that the third document is less relevant than the others.

The full code is below:

from tfidf import tf, idf, tf_idf


doc1 = "In production and development, open source as a development model \
promotes a) universal access via free license to a product's design or \
blueprint, and b) universal redistribution of that design or blueprint, \
including subsequent improvements to it by anyone."

doc2 = "Open source software is software that can be freely used, changed, and \
shared (in modified or unmodified form) by anyone. Open source software is \
made by many people, and distributed under licenses that comply with the Open \
Source Definition."

doc3 = "Leaders of the nation’s biggest technology firms warned President Obama \
during a lengthy meeting at the White House on Tuesday that National Security \
Agency spying programs are damaging their reputations and could harm the broader economy."

# which of these three documents best reflects the "Open Source" topic?
QUERY_TERMS = ['Open', 'Source']
corpus = {'1': doc1, '2': doc2, '3': doc3}

# tf-idf score per document, summed over all query terms
query_scores = {i: 0 for i in corpus}

for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print('TF(doc%s): %s' % (doc, term), tf(term, corpus[doc]))
    print('IDF: %s' % (term, ), idf(term, list(corpus.values())))
    print()

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], list(corpus.values()))
        print('TF-IDF(%s): %s' % (doc, term), score)
        query_scores[doc] += score
    print()

print("Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), ))
for (doc, score) in sorted(query_scores.items()):
    print(doc, score)

The most relevant document


Overall TF-IDF scores for query 'Open Source'

  1. 0.0720751337491
  2. 0.216225401247
  3. 0.0

As you can see, the most relevant document is the second, which has the highest tf-idf score. The third document is completely irrelevant, with a tf-idf score of 0.
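If you want the script to pick the winner instead of reading it off the list, the scores can be sorted. This is a small sketch that reuses the overall scores reported above; the ranking and best names are just for illustration.

```python
# Overall scores as reported above, keyed by document id.
query_scores = {'1': 0.0720751337491, '2': 0.216225401247, '3': 0.0}

# Document ids ordered from most to least relevant.
ranking = sorted(query_scores, key=query_scores.get, reverse=True)
best = ranking[0]

print(ranking)
print(best)
```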



 