Debugging Python Regular Expressions

Typically, when I have needed to write complex regular expressions, I have turned to the QuickREx plugin for Eclipse. Unfortunately it does not have a Python mode, which can be a problem for patterns that use Python-specific syntax.

While I was reading the new re module documentation, I noticed a reference to a program called Kodos, a Python regex debugger. What sets Kodos apart from the QuickREx plugin (besides being Python-specific) is that it shows all of the matched groups at once, lets you specify replacement patterns and inspect the results, and displays sample code that uses the regex pattern you are testing. It also provides handy references and links to documentation.
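For readers without Kodos at hand, the "all matched groups at once" view can be roughly approximated in plain Python. This is only a sketch: `describe_groups` is a hypothetical helper of my own naming, not part of Kodos or the re module, and the sample pattern is made up for illustration.

```python
import re


def describe_groups(pattern, text, flags=0):
    """Return (group number, span, matched text) for the first match.

    Hypothetical helper, not part of Kodos or the re module.
    Group 0 is the whole match; a group that did not take part in
    the match is reported with span (-1, -1) and text None.
    """
    m = re.search(pattern, text, flags)
    if m is None:
        return []
    return [(i, m.span(i), m.group(i)) for i in range(m.re.groups + 1)]


# Illustrative pattern and input only.
for number, span, value in describe_groups(r'(\w+)@(\w+)\.com', 'heikki@example.com'):
    print(number, span, repr(value))
```

Printing the span next to each group makes it easier to see which part of the input each pair of parentheses actually captured.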

I do have some issues with Kodos as well. It does not currently let you really debug the regex the way ActiveState Komodo does, for example. (I think; it has been a while since I used Komodo.) In other words, you cannot execute the regex in steps and see what is matched at each point. The UI has some minor issues, like not being able to resize the regex pattern and search string fields. I’d also like it if there were a separate library that provided the actual regex “debugging” functionality so that you could integrate it into other tools, for example the QuickREx plugin.

Even with these issues, I think I will continue to use Kodos in addition to QuickREx. Kodos helped me create my first regex lookahead assertion.

The problem I needed to solve was how to parse an email sent from a Subversion post-commit hook, copy any bug number references from the commit message to the subject line, and drop the diff. So given an email like the one below:

Date: Fri, 07 Nov 2008 18:15:46 -0800
From: Heikki Toivonen <heikki@example.com>
To: test <test@example.com>
Subject: SVN Revision: r99 - path/to/file
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Author: heikki
Date: 2008-11-07 18:14:54 -0800 (Fri, 07 Nov 2008)
New Revision: 99

Added:
   path/to/file
Log:
[Bug 1234] My commit message.
Second line of commit message.

Added: path/to/file
===================================================================
rest of diff follows

I needed to change that into:

Date: Fri, 07 Nov 2008 18:15:46 -0800
From: Heikki Toivonen <heikki@example.com>
To: test <test@example.com>
Subject: [Bug 1234] SVN Revision: r99 - path/to/file
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Author: heikki
Date: 2008-11-07 18:14:54 -0800 (Fri, 07 Nov 2008)
New Revision: 99

Added:
   path/to/file
Log:
[Bug 1234] My commit message.
Second line of commit message.

I did this with the following code:

import re

# msg contains the whole email
m = re.search(r'^(Subject: )(.*)(\[Bug .*?\])(.*(?=^\w+:))', msg, re.MULTILINE | re.DOTALL)
if m:  # leave msg untouched when there is no bug reference
    msg = msg[:m.start()] + m.expand(r'\g<1>\g<3> \g<2>\g<3>\g<4>')

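The MULTILINE flag makes ^ match at the start of every line rather than only at the start of the message, which is what lets the pattern anchor on the Subject: header and on the ^\w+: line in the lookahead. The DOTALL flag lets the dot (.) match newlines, so .* can span multiple lines. The *? quantifier makes the match non-greedy, meaning it matches as short a string as possible. The (?= sequence starts the lookahead assertion, which I am using here to stop the preceding .* from matching all the way to the end of the message. Also, since it is a lookahead assertion, the text it matches is not included in the last group, which is what I wanted. Because everything after the end of the match is left out when msg is rebuilt, the diff is dropped.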
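To see how the flags and the lookahead interact, here is a minimal, self-contained sketch that runs the pattern against a shortened version of the sample email. The function name `move_bug_to_subject` and the trimmed message are assumptions of mine for illustration; the pattern and the replacement template are the ones from this post.

```python
import re

# The pattern from the post, compiled once.
BUG_RE = re.compile(
    r'^(Subject: )(.*)(\[Bug .*?\])(.*(?=^\w+:))',
    re.MULTILINE | re.DOTALL,
)


def move_bug_to_subject(msg):
    """Copy the bug reference into the subject and drop the diff.

    Hypothetical helper; returns msg unchanged if nothing matches.
    """
    m = BUG_RE.search(msg)
    if m is None:
        return msg
    # Everything after m.end() (the diff) is deliberately not appended.
    return msg[:m.start()] + m.expand(r'\g<1>\g<3> \g<2>\g<3>\g<4>')


# Shortened stand-in for the sample email, for illustration only.
SAMPLE = (
    "To: test <test@example.com>\n"
    "Subject: SVN Revision: r99 - path/to/file\n"
    "\n"
    "Author: heikki\n"
    "Log:\n"
    "[Bug 1234] My commit message.\n"
    "Second line of commit message.\n"
    "\n"
    "Added: path/to/file\n"
    "=====\n"
    "rest of diff follows\n"
)

result = move_bug_to_subject(SAMPLE)
print(result)
```

One thing worth keeping in mind: because the final .* is greedy, the match extends to the last line in the message that starts with a word followed by a colon. In this sample that is the Added: line that opens the diff, but a diff that itself contains such a line would be cut later than intended.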

There are obviously many ways to write a regex pattern, and I haven’t run this pattern much on real data yet, so I may still end up tweaking it.

2 Comments

  1. Kent Johnson:

    A little-known secret is that Python includes a regex program similar to Kodos at /Tools/scripts/redemo.py. redemo does not have all the features of Kodos but it does show matched groups. redemo uses Tkinter so it is easier to install than Kodos.

  2. C. Collis:

    I usually use this: http://www.radsoftware.com.au/regexdesigner/ which is simply excellent when designing regular expressions. Although it is written for C# you will have no problem using the regular expressions with Python.
