Python Package Popularity Contest

I wanted to find out which Python packages are used the most. That would give a rough idea of the order in which to process them if you were to, say, build packages for your favorite Linux distribution, starting with the most widely used.

I know that Debian runs a popularity contest that users can opt in to, and you can easily query it for your favorite package. Unfortunately, I don’t know if they offer this data as a list, or in any form I could easily download. Other distros may run something similar, but I don’t know if their databases are publicly available.

The Python Package Index (PyPI) has download counters for the packages it hosts, but not every package listed on PyPI has its files uploaded there. Even for those that do, the page often shows download statistics only for the latest uploaded version. And PyPI does not list every Python package anyway.

With those caveats I thought I would still get some interesting data from the PyPI download numbers. I wrote a little script to go through the roughly 5000 package pages and count the download numbers if any. Here’s my script (it cannot handle all of the URLs it encounters, and it has some other bugs as well, but I was not interested in complete accuracy at this time):

#!/usr/bin/env python
 
import urllib2, time, sys
 
from BeautifulSoup import BeautifulSoup
 
BASE_URL = 'http://pypi.python.org/pypi/'
 
soup = BeautifulSoup(urllib2.urlopen('http://pypi.python.org/simple/'))
 
for i, a in enumerate(soup('a')):
    name = a.contents[0]
    url = a.attrs[0][1]
 
    url = BASE_URL + url
 
    try:
        package_soup = BeautifulSoup(urllib2.urlopen(url))
 
        try:
            # Build a list (not a lazy generator) so parse errors are raised inside this try
            values = [int(td.contents[0]) for td in package_soup('td', style='text-align: right;') if td.contents and td.contents[0].isdigit()]
        except Exception:
            values = [-1]
    except Exception:
        values = [-2]
 
    print sum(values), name
    sys.stdout.flush() # So that I can tee to file and watch stdout
 
    # Be nice and don't hit PyPI too frequently
    time.sleep(1)
 
    # Break early while you are debugging the script
    # Comment this out once you are ready to run for real
    if i > 9:
        break
 
print 'Fetched statistics from %d packages' % (i + 1)
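
For readers on Python 3, the counting step can be sketched with only the standard library. This is a rough illustration, not the script I actually ran: it assumes the download counts sit in `td` cells styled `text-align: right;` (as the script above does), and the `DownloadCounter` class, `count_downloads` function, and sample HTML are made up for the example. It works on a string, so it does not touch PyPI at all.

```python
from html.parser import HTMLParser


class DownloadCounter(HTMLParser):
    """Sum the integer text found inside right-aligned <td> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.total = 0

    def handle_starttag(self, tag, attrs):
        # Assumption: download counts live in cells with this exact style.
        if tag == "td" and dict(attrs).get("style") == "text-align: right;":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        text = data.strip()
        # Skip cells that do not hold a plain number.
        if self.in_cell and text.isdecimal():
            self.total += int(text)


def count_downloads(html):
    """Return the summed download count for one package page."""
    parser = DownloadCounter()
    parser.feed(html)
    parser.close()
    return parser.total


# Hypothetical page fragment, not real PyPI markup.
SAMPLE_PAGE = """
<table>
  <tr><td>example-1.0.tar.gz</td><td style="text-align: right;">120</td></tr>
  <tr><td>example-1.0.egg</td><td style="text-align: right;">30</td></tr>
  <tr><td>example-1.0.zip</td><td style="text-align: right;">n/a</td></tr>
</table>
"""

print(count_downloads(SAMPLE_PAGE))
```

Cells without a plain number are skipped rather than treated as errors, which mirrors the `isdigit()` filter in the original script.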

The winner is zc.buildout at over 93,000 downloads! I guess now I really need to learn how to use it :)

The top 20 entries are as follows:

93627 zc.buildout
65512 zope.interface
50544 setuptools
47690 zope.event
40940 zope.dottedname
40318 Paste
39446 zope.configuration
38995 kid
38306 zope.dublincore
37698 zope.formlib
37255 zope.location
37118 PasteDeploy
36403 zope.copypastemove
35943 zope.filerepresentation
35257 plone.recipe.distros
35126 RestrictedPython
35051 zope.error
34569 zope.app.error
32811 zope.pagetemplate
32722 zope.tales
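
The script prints packages in index order, so a ranking like the one above needs a separate sorting step. Here is a minimal Python 3 sketch of that step, assuming the saved output has one `count name` pair per line, as the script's print statement produces. The `top_packages` helper and the `some-unreachable-package` entry are hypothetical, added only for illustration.

```python
def top_packages(lines, n=20):
    """Return the n (count, name) pairs with the highest counts."""
    results = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        try:
            count = int(parts[0])
        except ValueError:
            # Not a result line, e.g. the trailing "Fetched statistics" summary.
            continue
        if count < 0:
            # The script uses -1 and -2 to mark pages it could not process.
            continue
        results.append((count, parts[1].strip()))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results[:n]


SAMPLE_OUTPUT = [
    "50544 setuptools",
    "93627 zc.buildout",
    "-2 some-unreachable-package",
    "65512 zope.interface",
    "38995 kid",
    "Fetched statistics from 5 packages",
]

for count, name in top_packages(SAMPLE_OUTPUT, n=3):
    print(count, name)
```

Dropping the negative markers keeps failed fetches out of the ranking instead of letting them appear as the least popular packages.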

To prevent everyone from running this script and bringing PyPI to its knees, you can download the full results here: pypi-popularity-contest-2008-10-23.txt.

In retrospect it would be nice if PyPI could provide download statistics as a simple list itself.

6 Comments

  1. Tarek Ziadé:

    You can get this info here: http://pypi.python.org/webstats/

    Also, notice that zc.buildout is automatically downloaded every time someone in the world builds a Plone or any other buildout-based application, so these stats do not reflect ‘who’ did a download.

    But there’s some work going on for this.

    Cheers

  2. Seo Sanghyeon:

    Entire Debian Popularity Contest raw data is available at http://popcon.debian.org/

  3. Christopher Arndt:

    These statistics are heavily skewed because some packages make more intensive usage of the PyPI infrastructure for installation than others.

    For example, I suspect the reason that kid is so high in this list is that it is downloaded whenever TurboGears 1.x is installed via easy_install/tgsetup.py, and it is one of the few packages that are not hosted on turbogears.org itself.

  4. Heikki Toivonen:

    Thank you for the comments, everyone; this is useful information I was not aware of.

  5. mike bayer:

    This also only takes the most recent version of each project into account, so projects that have released more recently are bumped down. There’s no temporal element to the number of downloads in general, so it’s a fairly useless metric overall.

  6. kevin gill:

    Great script, thanks.

    There are now thousands of packages on PyPI, over 700 of them Zope-related. It is impossible (for me) to track new useful stuff coming out. This script is a great help.

    However, I agree with your comments: PyPI should be extended to provide this kind of data easily.