First of all you need the data, which you can grab from the Coursera site. This curl command returns a json with the list of classes, already filtered by the categories we are interested in: 'cs-programming', 'infotech', 'math', 'stats', 'cs-systems', 'ee'.
curl "https://d1hpa2gdx2lr6r.cloudfront.net/maestro/api/topic/list2?unis=id%2Cname%2Cshort_name%2Cpartner_type%2Cfavicon%2Chome_link%2Cdisplay&topics=id%2Clanguage%2Cname%2Cshort_name%2Csubtitle_languages_csv%2Cself_service_course_id%2Csmall_icon_hover%2Clarge_icon&cats=id%2Cname%2Cshort_name&insts=id%2Cfirst_name%2Cmiddle_name%2Clast_name%2Cshort_name%2Cuser_profile__user__instructor__id&courses=id%2Cstart_day%2Cstart_month%2Cstart_year%2Cstatus%2Csignature_track_open_time%2Csignature_track_close_time%2Celigible_for_signature_track%2Cduration_string%2Chome_link%2Ctopic_id%2Cactive&__cf_version=da421ffddfb68aa26f2dbacc9e37dee6f3f119d8&__cf_origin=https%3A%2F%2Fwww.coursera.org"
From the json we take, for each class, the short name, the name, the language and the URL of the full class description. We save them in a csv file, courses.csv.
In one shot you can get the csv file with this command:
./get_coursera_courses.sh | python extract_courses.py
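The extract_courses.py script itself is not shown here. As a rough sketch of what such a step might look like, the snippet below assumes the json has a top-level `topics` list whose items carry the `short_name`, `name` and `language` fields requested in the curl query. The `DESCRIPTION_URL` template is purely hypothetical, since the real description endpoint does not appear in the text.

```python
import csv
import io
import json

# Hypothetical URL template: the real description endpoint is not shown in the text.
DESCRIPTION_URL = "https://example.invalid/course?short_name={short_name}"


def extract_courses(raw_json, out):
    """Write one csv row per topic: short name, name, language, description URL."""
    data = json.loads(raw_json)
    writer = csv.writer(out)
    rows = 0
    # Assumption: the payload is an object with a top-level "topics" list.
    for topic in data.get("topics", []):
        short_name = topic.get("short_name", "")
        writer.writerow([
            short_name,
            topic.get("name", ""),
            topic.get("language", ""),
            DESCRIPTION_URL.format(short_name=short_name),
        ])
        rows += 1
    return rows


# Tiny in-memory example standing in for the real curl output.
sample = json.dumps({"topics": [
    {"short_name": "ml", "name": "Machine Learning", "language": "en"},
]})
buffer = io.StringIO()
extract_courses(sample, buffer)
print(buffer.getvalue().strip())
```

In a real pipeline the same function could be fed `sys.stdin.read()` and `sys.stdout` so that it works on the right-hand side of the pipe.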
To find out which programming languages are mentioned, we need the full description page of each class. We can get it by calling the URL in the 4th column of the csv file and, from the json it returns, extracting the text from:
This is a heavy task, since we need to run an HTTP request for every class considered.
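A minimal sketch of that loop, one request per csv row, might look like the following. It assumes each URL returns a json object and that the descriptive text lives in string fields whose names are passed in as `text_keys`; the field name used in the example (`description`) is a placeholder, not the real one. The `fetch` parameter is injectable so the logic can be tried without touching the network.

```python
import csv
import json
import time
import urllib.request


def http_get(url):
    """Fetch a URL and return the decoded response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def fetch_descriptions(csv_lines, text_keys, fetch=http_get, delay=0.0):
    """Yield (short_name, text) for every csv row, issuing one request per course."""
    for row in csv.reader(csv_lines):
        if len(row) < 4:
            continue  # skip rows without a URL in the 4th column
        short_name, url = row[0], row[3]
        payload = json.loads(fetch(url))  # assumption: the response is a json object
        # Join whichever of the requested fields are present and hold strings.
        parts = [payload[key] for key in text_keys if isinstance(payload.get(key), str)]
        yield short_name, " ".join(parts)
        time.sleep(delay)  # optional pause between requests


# Offline example: a fake fetcher replaces the real HTTP call.
def fake_fetch(url):
    return json.dumps({"description": "We will use Python and R."})


rows = ["ml,Machine Learning,en,https://example.invalid/ml"]
result = list(fetch_descriptions(rows, ["description"], fetch=fake_fetch))
print(result)
```

Because every course costs one round trip, a small `delay` keeps the load on the server reasonable at the price of a longer run.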
To get the programming languages from the text we use the intersection between the words in the text and a set of defined programming languages:
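A minimal sketch of that intersection, assuming the text is lowercased and split into simple word tokens. The `LANGUAGES` set below is only an illustrative subset, since the set actually used is not shown here.

```python
import re

# Illustrative subset only: the real set of languages is not shown in the text.
LANGUAGES = {"c", "java", "python", "r", "php", "matlab"}


def find_languages(text, languages=LANGUAGES):
    """Return the languages whose names appear as whole words in the text."""
    # Lowercase, then keep runs of letters, digits, '+' and '#' as tokens.
    words = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return words & languages


found = find_languages("Assignments are in Python or R; some C is helpful.")
print(sorted(found))
```

One consequence of matching whole words is that single-letter names match any standalone letter in the description, so some false positives are possible.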
The output of this step is again a csv file with the list of the programming languages for each course.
You can build it yourself by running the following command:
cat courses.csv | python find_languages.py
To count the languages in the csv file, we can use a dictionary that maps each language to its number of occurrences.
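As a sketch of that counting step, the snippet below assumes each csv row holds the course short name in the first column followed by one language per column; the actual layout of the file is not shown here, so treat it as an assumption.

```python
import csv


def count_languages(csv_lines):
    """Return a dict mapping each language to the number of courses that mention it."""
    counts = {}
    for row in csv.reader(csv_lines):
        # Assumption: column 0 is the course, the remaining columns are languages.
        for language in row[1:]:
            if language:
                counts[language] = counts.get(language, 0) + 1
    return counts


rows = ["ml,python,r", "algo,java,c,python", "history"]
counts = count_languages(rows)
print(counts)
```

`collections.Counter` would do the same job with less code; a plain dictionary is used here only to mirror the description above.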
Now that we have counted the language occurrences, it would be great to visualize them as a histogram. There are different ways to do that and, at the time being, D3.js is one of the best choices. I chose to use the Series object in pandas together with matplotlib.
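A minimal sketch of the pandas and matplotlib route, assuming the counts are available as a dictionary of language to occurrences. The function name, the output file name and the axis labels are illustrative choices, not taken from the original code.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_language_counts(counts, path="languages.png"):
    """Draw the counts as a bar chart, save it to path, and return the sorted Series."""
    series = pd.Series(counts).sort_values(ascending=False)
    ax = series.plot(kind="bar", title="Programming languages in course descriptions")
    ax.set_xlabel("language")
    ax.set_ylabel("number of courses")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
    return series
```

Calling `plot_language_counts(counts)` writes the chart to `languages.png` in the current directory; sorting the Series first puts the most frequently mentioned languages on the left.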
R is more widespread in the academic world than Java or PHP. C is still widely used, while f is something I hadn't heard of before.