I wanted to find out which Python packages are used most. That would give a rough idea of the order in which to process Python packages if you were to, say, build packages for your favorite Linux distribution, starting from the most widely used.
I know that Debian runs a popularity contest that users can opt in to, and you can easily query it for your favorite package. Unfortunately, I don’t know whether they offer this data as a list or in any form I could easily download. Other distros may run something similar, but I don’t know if their databases are publicly available.
The Python Package Index (PyPI) has download counters for the packages it hosts, but not every project listed on PyPI has uploaded its files there. Even those that have often show download statistics only for the latest uploaded version. And PyPI does not list every Python package anyway.
With those caveats I thought I would still get some interesting data from the PyPI download numbers. I wrote a little script to go through the roughly 5000 package pages and count the download numbers if any. Here’s my script (it cannot handle all of the URLs it encounters, and it has some other bugs as well, but I was not interested in complete accuracy at this time):
#!/usr/bin/env python
import urllib2, time, sys
from BeautifulSoup import BeautifulSoup

BASE_URL = 'http://pypi.python.org/pypi/'

soup = BeautifulSoup(urllib2.urlopen('http://pypi.python.org/simple/'))

for i, a in enumerate(soup('a')):
    name = a.contents[0]
    url = BASE_URL + a['href']
    try:
        package_soup = BeautifulSoup(urllib2.urlopen(url))
        try:
            # Build a list (not a generator) so that parse errors
            # are raised here, inside the try block
            values = [int(td.contents[0])
                      for td in package_soup('td', style='text-align: right;')
                      if td.contents and td.contents[0].isdigit()]
        except Exception:
            values = [-1]  # Could not parse the package page
    except Exception:
        values = [-2]  # Could not fetch the package page

    print sum(values), name
    sys.stdout.flush()  # So that I can tee to file and watch stdout

    # Be nice and don't hit PyPI too frequently
    time.sleep(1)

    # Break early when you are debugging the script
    # Comment this out once you are ready to run for real
    if i > 9:
        break

print 'Fetched statistics from %d packages' % (i + 1)
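The counting step can be tried without touching PyPI at all. Here is a rough sketch using only the standard library of a current Python 3, so it is not the script I actually ran. It applies the same selector (td cells styled 'text-align: right;' whose text is all digits) to a made-up HTML snippet; the names DownloadCounter, count_downloads and SAMPLE are mine, and the sample markup is an assumption rather than a copy of a real package page.

```python
from html.parser import HTMLParser


class DownloadCounter(HTMLParser):
    """Sum the integers found in right-aligned <td> cells."""

    def __init__(self):
        super().__init__()
        self.in_count_cell = False
        self.total = 0

    def handle_starttag(self, tag, attrs):
        # Same selector as the script: <td style="text-align: right;">
        if tag == 'td' and dict(attrs).get('style') == 'text-align: right;':
            self.in_count_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_count_cell = False

    def handle_data(self, data):
        # Only purely numeric cell text counts as a download number
        text = data.strip()
        if self.in_count_cell and text.isdigit():
            self.total += int(text)


def count_downloads(html):
    """Return the sum of all numeric right-aligned cells in html."""
    parser = DownloadCounter()
    parser.feed(html)
    parser.close()
    return parser.total


# Made-up sample markup (an assumption, not a real PyPI page)
SAMPLE = """
<table>
  <tr><td>example-1.0.tar.gz</td><td style="text-align: right;">120</td></tr>
  <tr><td>example-1.1.tar.gz</td><td style="text-align: right;">30</td></tr>
  <tr><td>example-1.1.egg</td><td style="text-align: right;">12KB</td></tr>
</table>
"""

print(count_downloads(SAMPLE))
```

Any other right-aligned cell that happens to contain only digits would be added to the total as well, which is one reason the numbers should be read as rough.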
The winner is zc.buildout at over 93,000 downloads! I guess now I really need to learn how to use it.
The top 20 entries are as follows:
93627 zc.buildout
65512 zope.interface
50544 setuptools
47690 zope.event
40940 zope.dottedname
40318 Paste
39446 zope.configuration
38995 kid
38306 zope.dublincore
37698 zope.formlib
37255 zope.location
37118 PasteDeploy
36403 zope.copypastemove
35943 zope.filerepresentation
35257 plone.recipe.distros
35126 RestrictedPython
35051 zope.error
34569 zope.app.error
32811 zope.pagetemplate
32722 zope.tales
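A ranking like this can be pulled out of the script's output with a few lines of Python. This is only a sketch for a current Python 3: it assumes every useful line has the "count name" shape that the script prints, and the helper name top_packages and the entry broken.package are made up for illustration.

```python
import heapq


def top_packages(lines, n=20):
    """Return the n (count, name) pairs with the highest counts."""
    entries = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # Skip blank or malformed lines
        count, name = parts
        try:
            entries.append((int(count), name.strip()))
        except ValueError:
            continue  # E.g. the trailing "Fetched statistics ..." line
    return heapq.nlargest(n, entries)


# A few lines in the script's output format; broken.package is hypothetical
sample = [
    "93627 zc.buildout",
    "-2 broken.package",
    "50544 setuptools",
    "65512 zope.interface",
    "Fetched statistics from 4 packages",
]

for count, name in top_packages(sample, n=3):
    print(count, name)
```

Packages that failed to fetch or parse carry the -2 and -1 markers, so they end up at the bottom of the ranking instead of being dropped.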
To prevent everyone from running this script and bringing PyPI to its knees, you can download the full results here: pypi-popularity-contest-2008-10-23.txt.
In retrospect it would be nice if PyPI could provide download statistics as a simple list itself.