Better Python Code: A Guide for Aspiring Experts by David Mertz

Author: David Mertz
Language: eng
Format: epub
ISBN: 9780138320997
Publisher: Addison-Wesley Professional
Published: 2023-12-12T00:00:00+00:00


5.6. Regular Expressions And Catastrophic Backtracking

Regular expressions can be extremely nuanced, and are often a concise and powerful way to express patterns in text. This book cannot get far into an explanation or tutorial on regular expressions, but my title Regular Expression Puzzles and AI Coding Assistants (February 2023, ISBN 978-1633437814) contains a tutorial introduction in its appendix; obviously I recommend that title.

Readers might have worked with regular expressions to a fair extent without having fallen into the trap of catastrophic backtracking. When you do hit this issue, it can be a very unpleasant surprise. Patterns that work well and quickly in many circumstances can start taking longer, with the running time growing exponentially as the strings matched against grow longer.

For this example, suppose that we have a file in which each line contains a non-descending list of (two-digit) numbers, each separated by a space. We’d like to identify all the numbers up to, but not including, 90 from each line. Some lines will match and others will not. In this hypothetical file format, each line also has a label at its start. Let’s look at an example of such a file (in the presentation here, some lines are wrapped because of book margins; in the file itself each labeled line is a physical line):

Data in file numbers.txt

A: 08 12 22 27 29 38 39 43 47 51 52 73 74 78 78 79 80 83 86 87 88 89
B: 03 04 04 05 16 18 23 26 30 31 33 34 35 36 52 61 63 68 69 72 75 80 82 83 83 90 92 92 92 95 97
C: 01 07 14 19 27 30 34 36 36 38 44 47 47 50 51 54 58 60 61 62 82 83 83 95
D: 05 10 13 17 30 31 42 50 56 61 63 66 76 90 91 91 93
E: 03 21 23 24 26 31 31 31 33 36 38 38 39 42 49 55 68 79 81
F: 04 08 13 14 14 16 19 21 25 26 27 34 36 39 43 45 45 50 51 62 66 67 71 75 79 82 88
G: 03 10 27 49 51 64 70 71 82 86 94
H: 27 31 38 42 43 43 48 50 63 72 83 87 90 92
I: 12 16 18 19 38 39 40 43 54 55 63 73 74 74 75 77 78 79 88
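As a point of reference, the stated goal can be sketched without any regular expression. This is only a minimal sketch under one reading of the task: a line "matches" when it contains at least one number of 90 or above, and the result is every number before that point. The function name `numbers_below_90` is hypothetical and does not come from the book.

```python
def numbers_below_90(line):
    """Return (label, numbers below 90), or None if no number reaches 90.

    Hypothetical helper; assumes the 'LABEL: nn nn nn ...' format shown above.
    """
    label, _, rest = line.partition(": ")
    numbers = [int(token) for token in rest.split()]
    # The list is non-descending, so the last number is the largest one.
    if not numbers or numbers[-1] < 90:
        return None
    return label, [n for n in numbers if n < 90]


print(numbers_below_90("G: 03 10 27 49 51 64 70 71 82 86 94"))
print(numbers_below_90("E: 03 21 23 24 26 31 31 31 33 36 38 38 39 42 49 55 68 79 81"))
```

Because this version reads each line once, its cost grows linearly with line length, which makes it a convenient baseline when judging a regular expression for the same job.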

As a naive version of this program, we might try defining the pattern:

pat = re.compile(r"^(.+: )(.+ )+(?=9.)")
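To see what this pattern captures when it does succeed, a short line is enough. The sketch below uses only the standard `re` module; the two sample strings are made up for illustration and are not taken from numbers.txt.

```python
import re

pat = re.compile(r"^(.+: )(.+ )+(?=9.)")

# A short line that contains numbers in the 90s: the pattern succeeds.
hit = pat.search("D: 05 10 13 90 91")
print(repr(hit.group(1)))  # 'D: '
print(repr(hit.group(2)))  # '05 10 13 90 '

# A short line with nothing in the 90s: the lookahead can never succeed.
miss = pat.search("E: 03 21 81")
print(miss)  # None
```

The greedy `.+` carries the second group up to the space before the last number in the 90s, so 90 itself ends up inside the capture here. On a line this short the failing case returns immediately.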

Now let’s try to process this file using this pattern. Presumably in real code we would take some action using the groups in the match, beyond printing out the fact it matched or failed.

Timing the regular expression matching

>>> from time import monotonic
>>> for line in open("data/numbers.txt"):
...     start = monotonic()
...     if match := re.search(pat, line):
...         print(f"Matched line {line.split(':')[0]} "
...               f"in {monotonic()-start:0.3f} seconds")
...     else:
...         print(f"Fail on line {line.split(':')[0]} " .
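The listing above breaks off mid-statement, so what follows is a separate, self-contained sketch of the same timing idea rather than a reconstruction of it. It assumes CPython's backtracking `re` engine and times the pattern against growing prefixes of line A, which has no number in the 90s and therefore must fail. The helper name `time_prefixes` and the chosen prefix lengths are hypothetical; the prefixes are kept short on purpose so the run finishes quickly.

```python
import re
from time import monotonic

pat = re.compile(r"^(.+: )(.+ )+(?=9.)")

# Line A from numbers.txt: no number in the 90s, so the match must fail.
LINE_A = "A: 08 12 22 27 29 38 39 43 47 51 52 73 74 78 78 79 80 83 86 87 88 89"


def time_prefixes(line, counts):
    """Time pat.search on prefixes of `line` holding `count` numbers each."""
    label, *numbers = line.split()
    results = []
    for count in counts:
        text = " ".join([label, *numbers[:count]])
        start = monotonic()
        matched = pat.search(text) is not None
        results.append((count, matched, monotonic() - start))
    return results


# Deliberately short prefixes; the full line takes far longer to fail.
for count, matched, seconds in time_prefixes(LINE_A, [8, 10, 12, 14]):
    print(f"{count:2d} numbers: matched={matched} in {seconds:0.4f} seconds")
```

Every prefix fails to match, and the elapsed time is expected to climb steeply as numbers are added, since the engine tries many ways of splitting the numbers among repetitions of `(.+ )` before giving up. Exact timings depend on the machine and Python version.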


