how to debug a #bash script: insert into it set -x trap read debug— Walter Traspadini (@uollter) February 10, 2014
This is a very simple example that illustrates how one of the most widely used techniques works for determining which document is most relevant to a given topic, expressed as a query term.
Now suppose you have 3 documents and you want to find out which one is most relevant to the topic "Open Source".
In production and development, open source as a development model promotes a) universal access via free license to a product's design or blueprint, and b) universal redistribution of that design or blueprint, including subsequent improvements to it by anyone.
Open source software is software that can be freely used, changed, and shared (in modified or unmodified form) by anyone. Open source software is made by many people, and distributed under licenses that comply with the Open Source Definition.
Leaders of the nation’s biggest technology firms warned President Obama during a lengthy meeting at the White House on Tuesday that National Security Agency spying programs are damaging their reputations and could harm the broader economy.
You can already see for yourself that the third document is less relevant than the others.
Below is the full code:
from tfidf import tf, idf, tf_idf
The most relevant document
Overall TF-IDF scores for query 'Open Source'
As you can see, the most relevant document is the second, which has the highest term frequency-inverse document frequency (TF-IDF) score. The third document is completely irrelevant, with a TF-IDF score of zero.
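The scoring described above can be sketched in a few lines. This is a minimal sketch and not the tfidf module imported in the post: the function bodies, the `tokenize` and `score` helpers, and the shortened toy documents are all assumptions made for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term, tokens, corpus):
    """TF-IDF of one term for one document of the corpus."""
    return tf(term, tokens) * idf(term, corpus)


def score(query, tokens, corpus):
    """Sum the TF-IDF of every query term for one document."""
    return sum(tf_idf(term, tokens, corpus) for term in tokenize(query))


# Toy stand-ins for the three documents (assumption: heavily shortened).
docs = [
    "open source is a development model with universal access",
    "open source software is software made by many people under open source licenses",
    "technology firms warned the president about spying programs",
]
corpus = [tokenize(d) for d in docs]
scores = [score("Open Source", tokens, corpus) for tokens in corpus]
best = max(range(len(docs)), key=lambda i: scores[i])

for i, s in enumerate(scores, start=1):
    print("document %d: %.4f" % (i, s))
print("most relevant: document %d" % (best + 1))
```

A document that contains none of the query terms gets a term frequency of zero, so its overall score is zero no matter how rare the terms are in the rest of the corpus.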
Dear recruiter, if you refer to people who work with you as "resources" I'm just not going to respond, okay?— Hilary Mason (@hmason) November 19, 2013
First of all you need the data, which you can grab from the Coursera site. This curl command returns a json with the list of classes, already filtered by the categories we are interested in: 'cs-programming', 'infotech', 'math', 'stats', 'cs-systems', 'ee'
curl "https://d1hpa2gdx2lr6r.cloudfront.net/maestro/api/topic/list2?unis=id%2Cname%2Cshort_name%2Cpartner_type%2Cfavicon%2Chome_link%2Cdisplay&topics=id%2Clanguage%2Cname%2Cshort_name%2Csubtitle_languages_csv%2Cself_service_course_id%2Csmall_icon_hover%2Clarge_icon&cats=id%2Cname%2Cshort_name&insts=id%2Cfirst_name%2Cmiddle_name%2Clast_name%2Cshort_name%2Cuser_profile__user__instructor__id&courses=id%2Cstart_day%2Cstart_month%2Cstart_year%2Cstatus%2Csignature_track_open_time%2Csignature_track_close_time%2Celigible_for_signature_track%2Cduration_string%2Chome_link%2Ctopic_id%2Cactive&__cf_version=da421ffddfb68aa26f2dbacc9e37dee6f3f119d8&__cf_origin=https%3A%2F%2Fwww.coursera.org"
From the json, for each class we get the short name, the name, the language, and the URL of the full class description. We save them in a csv file, courses.csv.
In one shot you can get the csv file with this command:
./get_coursera_courses.sh | python extract_courses.py
To find out which programming languages are mentioned, we need the full class description page, which we can get by calling the URLs in the 4th column of the csv file. From the returned json we extract the text from:
This is a heavy task, since we need to run an HTTP request for every class considered.
To get the programming languages from the text we use the intersection between the words in the text and a set of defined programming languages:
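The intersection step can be sketched as follows. This is only an illustration under stated assumptions: the `LANGUAGES` set is a short hypothetical list, and `find_languages` is a made-up helper name, not the code from find_languages.py.

```python
import re

# Hypothetical, deliberately short set of language names (assumption).
LANGUAGES = {"python", "java", "r", "c", "matlab", "scala", "php"}


def find_languages(text, languages=LANGUAGES):
    """Return the sorted language names that appear as words in `text`."""
    # Lowercase and keep letters plus '+' and '#', so that tokens such as
    # "c++" or "c#" survive as single words.
    words = set(re.findall(r"[a-z+#]+", text.lower()))
    # Set intersection: words of the text that are also language names.
    return sorted(words & languages)


description = "Assignments are in Python or Java; some R is helpful."
print(find_languages(description))
```

Matching whole words keeps the lookup cheap, but one-letter names such as "c" or "r" can match text that has nothing to do with programming, so the counts for those names are probably best read with some caution.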
The output of this step is again a csv file with the list of the programming languages for each course.
You can build it yourself by running the following command:
cat courses.csv | python find_languages.py
In order to count the languages given the csv file we can use a dictionary of languages and their occurrences.
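The counting step can be sketched like this. It assumes the languages of each course have already been parsed from the csv file into one list per course; the `count_languages` name and the sample rows are hypothetical.

```python
def count_languages(rows):
    """Count how many courses mention each language.

    `rows` is an iterable with one list of language names per course.
    """
    counts = {}
    for languages in rows:
        for lang in languages:
            # Start at 0 for a language not seen before, then add one.
            counts[lang] = counts.get(lang, 0) + 1
    return counts


# Hypothetical sample: four courses, the last one mentions no language.
rows = [["python", "r"], ["java"], ["python"], []]
counts = count_languages(rows)
print(counts)
```

The standard library's `collections.Counter` does the same bookkeeping, so it would be a reasonable replacement for the hand-written dictionary.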
Now that we have counted the language occurrences, it would be great to visualize them as a histogram. There are different ways to do that and, at the moment, D3.js is one of the best choices. I chose to use the Series object from pandas together with matplotlib.
R is more widespread in the academic world than Java or PHP. C is still widely used, while f is something I had not heard of before.