Peter's Blog

Redefining the Impossible

Regular expression term matches too many characters


Consider the following regular expression, which searches for a word followed by an opening parenthesis:

(.*)(\w+)\(

If this expression is fed a string like:

int fred()

it will match, but group 1 will contain 'int fre' and group 2 will contain 'd'. This is because '.*' is greedy: it grabs as much of the string as it can, including characters that '\w+' could have matched, and then gives back only the single character that '\w+' needs in order to match. To stop the greediness, follow the quantifier with a question mark, which makes it match as little as possible, e.g.

(.*?)(\w+)\(

With this, group 1 becomes 'int ' and group 2 becomes 'fred'.

This works with the Python 're' module.
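
To check the behaviour described above, here is a minimal sketch that runs both patterns against the same string. The variable names (text, greedy, lazy) are just illustrative, and re.match is one of several 're' functions that would do here.

```python
import re

text = "int fred()"

# Greedy: '.*' takes as much as it can and leaves '\w+' a single character.
greedy = re.match(r"(.*)(\w+)\(", text)

# Lazy: '.*?' takes as little as it can, so '\w+' gets the whole word.
lazy = re.match(r"(.*?)(\w+)\(", text)

print(greedy.groups())  # ('int fre', 'd')
print(lazy.groups())    # ('int ', 'fred')
```

Note that group 1 in the lazy version still includes the trailing space after 'int', because '\w+' cannot match a space.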


Filed under: python
