9 % \usepackage{hyperref} |
9 % \usepackage{hyperref} |
10 % \usepackage[margin=0.5in]{geometry} |
10 % \usepackage[margin=0.5in]{geometry} |
11 %\usepackage{pmboxdraw} |
11 %\usepackage{pmboxdraw} |
12 |
12 |
13 \title{POSIX Regular Expression Matching and Lexing} |
13 \title{POSIX Regular Expression Matching and Lexing} |
14 \author[1]{Chengsong Tan \\ King's College London\\chengsong.tan@kcl.ac.uk} |
14 \author[1]{Chengsong Tan} |
|
15 \affil[1]{\\ Department of Informatics, King's College London\\ |
|
16 London, UK\\ |
|
17 \texttt{chengsong.tan@kcl.ac.uk}} |
|
18 \authorrunning{Chengsong Tan} |
|
19 \Copyright{Chengsong Tan} |
15 |
20 |
16 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% |
21 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% |
17 \newcommand{\ZERO}{\mbox{\bf 0}} |
22 \newcommand{\ZERO}{\mbox{\bf 0}} |
18 \newcommand{\ONE}{\mbox{\bf 1}} |
23 \newcommand{\ONE}{\mbox{\bf 1}} |
19 \def\lexer{\mathit{lexer}} |
24 \def\lexer{\mathit{lexer}} |
40 |
45 |
41 |
46 |
42 \begin{document} |
47 \begin{document} |
43 |
48 |
44 \maketitle |
49 \maketitle |
|
50 |
45 \begin{abstract} |
51 \begin{abstract} |
46 Brzozowski introduced in 1964 a beautifully simple algorithm for |
52 Brzozowski introduced in 1964 a beautifully simple algorithm for |
47 regular expression matching based on the notion of derivatives of |
53 regular expression matching based on the notion of derivatives of |
48 regular expressions. In 2014, Sulzmann and Lu extended this |
54 regular expressions. In 2014, Sulzmann and Lu extended this |
49 algorithm to not just give a YES/NO answer for whether or not a regular |
55 algorithm to not just give a YES/NO answer for whether or not a regular |
156 frequent enough that a separate name has been created for |
162 frequent enough that a separate name has been created for |
157 them---\emph{evil regular expressions}. In empiric work, Davis et al |
163 them---\emph{evil regular expressions}. In empiric work, Davis et al |
158 report that they have found thousands of such evil regular expressions |
164 report that they have found thousands of such evil regular expressions |
159 in the JavaScript and Python ecosystems \cite{Davis18}. |
165 in the JavaScript and Python ecosystems \cite{Davis18}. |
160 |
166 |
161 This exponential blowup sometimes causes real pain in ``real life'': |
167 This exponential blowup sometimes causes real pain in real life: |
162 for example one evil regular expression brought on 20 July 2016 the |
168 for example on 20 July 2016 one evil regular expression brought the |
163 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees\cite{SE16}. |
169 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees\cite{SE16}. |
164 In this instance, a regular expression intended to just trim white |
170 In this instance, a regular expression intended to just trim white |
165 spaces from the beginning and the end of a line actually consumed |
171 spaces from the beginning and the end of a line actually consumed |
166 massive amounts of CPU-resources and because of this the web servers |
172 massive amounts of CPU-resources and because of this the web servers |
167 ground to a halt. This happened when a post with 20,000 white spaces |
173 ground to a halt. This happened when a post with 20,000 white spaces |
194 ``fire''---so is it an identifier or a keyword? While in applications |
200 ``fire''---so is it an identifier or a keyword? While in applications |
195 there is a well-known strategy to decide these questions, called POSIX |
201 there is a well-known strategy to decide these questions, called POSIX |
196 matching, only relatively recently precise definitions of what POSIX |
202 matching, only relatively recently precise definitions of what POSIX |
197 matching actually means have been formalised |
203 matching actually means have been formalised |
198 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly, |
204 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly, |
199 POSIX matching means to match the longest initial substring and |
205 POSIX matching means matching the longest initial substring and |
200 possible ties are solved according to some priorities attached to the |
206 in the case of a tie, the initial submatch is chosen according to some priorities attached to the |
201 regular expressions (e.g.~keywords have a higher priority than |
207 regular expressions (e.g.~keywords have a higher priority than |
202 identifiers). This sounds rather simple, but according to Grathwohl et |
208 identifiers). This sounds rather simple, but according to Grathwohl et |
203 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote: |
209 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote: |
204 |
210 |
205 \begin{quote} |
211 \begin{quote} |
436 The main point of the bitsequences and annotated regular expressions |
442 The main point of the bitsequences and annotated regular expressions |
437 is that we can apply rather aggressive (in terms of size) |
443 is that we can apply rather aggressive (in terms of size) |
438 simplification rules in order to keep derivatives small. We have |
444 simplification rules in order to keep derivatives small. We have |
439 developed such ``aggressive'' simplification rules and generated test |
445 developed such ``aggressive'' simplification rules and generated test |
440 data that show that the expected bound can be achieved. Obviously we |
446 data that show that the expected bound can be achieved. Obviously we |
441 could only cover partially the search space as there are infinitely |
447 could only partially cover the search space as there are infinitely |
442 many regular expressions and strings. One modification we introduced |
448 many regular expressions and strings. One modification we introduced |
443 is to allow a list of annotated regular expressions in the |
449 is to allow a list of annotated regular expressions in the |
444 \textit{ALTS} constructor. This allows us to not just delete |
450 \textit{ALTS} constructor. This allows us to not just delete |
445 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also |
451 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also |
446 unnecessary ``copies'' of regular expressions (very similar to |
452 unnecessary ``copies'' of regular expressions (very similar to |