author | Chengsong |
Wed, 23 Aug 2023 03:02:31 +0100 | |
changeset 668 | 3831621d7b14 |
permissions | -rw-r--r-- |
668
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
1 |
\chapter{Technical Overview} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
2 |
\label{Overview} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
3 |
We start with a technical overview of the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
4 |
catastrophic backtracking problem, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
5 |
motivating rigorous approaches to regular expression matching and lexing. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
6 |
In doing so we also briefly |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
7 |
introduce common terminology such as |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
8 |
bounded repetitions and back-references. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
9 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
10 |
\section{Motivating Examples} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
11 |
Consider for example the regular expression $(a^*)^*\,b$ and |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
12 |
strings of the form $aa..a$. These strings cannot be matched by this regular |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
13 |
expression: obviously the expected $b$ in the last |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
14 |
position is missing. One would assume that modern regular expression |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
15 |
matching engines can find this out very quickly. Surprisingly, if one tries |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
16 |
this example in JavaScript, Python or Java 8, even with small strings, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
17 |
say of lenght of around 30 $a$'s, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
18 |
the decision takes large amounts of time to finish. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
19 |
This is inproportional |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
20 |
to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
21 |
The algorithms clearly show exponential behaviour, and as can be seen |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
22 |
is triggered by some relatively simple regular expressions. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
23 |
Java 9 and newer |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
24 |
versions mitigates this behaviour by several magnitudes, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
25 |
but are still slow compared |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
26 |
with the approach we are going to use in this thesis. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
27 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
28 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
29 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
30 |
This superlinear blowup in regular expression engines |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
31 |
has caused grief in ``real life'' where it is |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
32 |
given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
33 |
A less serious example is a bug in the Atom editor: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
34 |
a user found out when writing the following code |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
35 |
\begin{verbatim} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
36 |
vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
37 |
\end{verbatim} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
38 |
\begin{verbatim} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
39 |
&& vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() == |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
40 |
\end{verbatim} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
41 |
\begin{verbatim} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
42 |
reflect.Slice |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
43 |
\end{verbatim} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
44 |
it took the editor half a hour to finish computing. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
45 |
This was subsequently fixed by Galbraith \cite{fixedAtom}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
46 |
and the issue was due to the regex engine of Atom. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
47 |
But evil regular expressions can be more than a nuisance in a text editor: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
48 |
on 20 July 2016 one evil |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
49 |
regular expression brought the webpage |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
50 |
\href{http://stackexchange.com}{Stack Exchange} to its |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
51 |
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
52 |
In this instance, a regular expression intended to just trim white |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
53 |
spaces from the beginning and the end of a line actually consumed |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
54 |
massive amounts of CPU resources---causing the web servers to grind to a |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
55 |
halt. In this example, the time needed to process |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
56 |
the string was |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
57 |
$O(n^2)$ with respect to the string length $n$. This |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
58 |
quadratic overhead was enough for the homepage of Stack Exchange to |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
59 |
respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
60 |
attack and therefore stopped the servers from responding to any |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
61 |
requests. This made the whole site become unavailable. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
62 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
63 |
\begin{figure}[p] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
64 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
65 |
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
66 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
67 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
68 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
69 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
70 |
ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
71 |
enlargelimits=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
72 |
xtick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
73 |
xmax=33, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
74 |
ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
75 |
ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
76 |
scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
77 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
78 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
79 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
80 |
legend entries={JavaScript}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
81 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
82 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
83 |
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
84 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
85 |
\end{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
86 |
& |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
87 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
88 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
89 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
90 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
91 |
%ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
92 |
enlargelimits=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
93 |
xtick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
94 |
xmax=33, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
95 |
ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
96 |
ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
97 |
scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
98 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
99 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
100 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
101 |
legend entries={Python}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
102 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
103 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
104 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
105 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
106 |
\end{tikzpicture}\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
107 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
108 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
109 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
110 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
111 |
ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
112 |
enlargelimits=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
113 |
xtick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
114 |
xmax=33, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
115 |
ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
116 |
ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
117 |
scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
118 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
119 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
120 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
121 |
legend entries={Java 8}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
122 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
123 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
124 |
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
125 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
126 |
\end{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
127 |
& |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
128 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
129 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
130 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
131 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
132 |
%ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
133 |
enlargelimits=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
134 |
xtick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
135 |
xmax=33, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
136 |
ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
137 |
ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
138 |
scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
139 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
140 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
141 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
142 |
legend entries={Dart}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
143 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
144 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
145 |
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
146 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
147 |
\end{tikzpicture}\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
148 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
149 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
150 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
151 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
152 |
ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
153 |
enlargelimits=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
154 |
xtick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
155 |
xmax=33, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
156 |
ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
157 |
ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
158 |
scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
159 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
160 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
161 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
162 |
legend entries={Swift}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
163 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
164 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
165 |
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
166 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
167 |
\end{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
168 |
& |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
169 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
170 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
171 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
172 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
173 |
%ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
174 |
enlargelimits=true, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
175 |
%xtick={0,5000,...,40000}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
176 |
%xmax=40000, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
177 |
%ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
178 |
restrict x to domain*=0:40000, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
179 |
restrict y to domain*=0:35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
180 |
%ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
181 |
%scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
182 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
183 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
184 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
185 |
legend entries={Java9+}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
186 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
187 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
188 |
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
189 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
190 |
\end{tikzpicture}\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
191 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
192 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
193 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
194 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
195 |
ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
196 |
enlargelimits=true, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
197 |
%xtick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
198 |
%xmax=33, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
199 |
%ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
200 |
restrict x to domain*=0:60000, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
201 |
restrict y to domain*=0:35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
202 |
%ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
203 |
%scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
204 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
205 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
206 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
207 |
legend entries={Scala}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
208 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
209 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
210 |
\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
211 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
212 |
\end{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
213 |
& |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
214 |
\begin{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
215 |
\begin{axis}[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
216 |
xlabel={$n$}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
217 |
x label style={at={(1.05,-0.05)}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
218 |
%ylabel={time in secs}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
219 |
enlargelimits=true, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
220 |
%xtick={0,5000,...,40000}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
221 |
%xmax=40000, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
222 |
%ymax=35, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
223 |
restrict x to domain*=0:60000, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
224 |
restrict y to domain*=0:45, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
225 |
%ytick={0,5,...,30}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
226 |
%scaled ticks=false, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
227 |
axis lines=left, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
228 |
width=5cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
229 |
height=4cm, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
230 |
legend style={cells={align=left}}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
231 |
legend entries={Isabelle \\ Extracted}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
232 |
legend pos=north west, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
233 |
legend cell align=left] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
234 |
\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
235 |
\end{axis} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
236 |
\end{tikzpicture}\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
237 |
\multicolumn{2}{c}{Graphs} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
238 |
\end{tabular} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
239 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
240 |
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
241 |
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
242 |
The reason for their fast growth %superlinear behaviour |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
243 |
is that they do a depth-first-search |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
244 |
using NFAs. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
245 |
If the string does not match, the regular expression matching |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
246 |
engine starts to explore all possibilities. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
247 |
The last two graphs are for our implementation in Scala, one manual |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
248 |
and one extracted from the verified lexer in Isabelle by $\textit{codegen}$. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
249 |
Our lexer |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
250 |
performs better in this case, and |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
251 |
is formally verified. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
252 |
Despite being almost identical, the codegen-generated lexer |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
253 |
% is generated directly from Isabelle using $\textit{codegen}$, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
254 |
is slower than the manually written version since the code synthesised by |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
255 |
$\textit{codegen}$ does not use native integer or string |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
256 |
types of the target language. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
257 |
%Note that we are comparing our \emph{lexer} against other languages' matchers. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
258 |
}\label{fig:aStarStarb} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
259 |
\end{figure}\afterpage{\clearpage} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
260 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
261 |
A more recent example is a global outage of all Cloudflare servers on |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
262 |
2 July 2019. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
263 |
The traffic Cloudflare services each day is more than |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
264 |
Twitter, Amazon, Instagram, Apple, Bing and Wikipedia combined, yet |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
265 |
it became completely unavailable for half an hour because of |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
266 |
a poorly-written regular expression roughly of the form $^*.^*=.^*$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
267 |
exhausted CPU resources that serve HTTP traffic. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
268 |
Although the outage |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
269 |
had several causes, at the heart was a regular expression that |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
270 |
was used to monitor network |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
271 |
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
272 |
These problems with regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
273 |
are not isolated events that happen |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
274 |
very rarely, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
275 |
%but actually widespread. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
276 |
%They occur so often that they have a |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
277 |
but they occur actually often enough that they have a |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
278 |
name: Regular-Expression-Denial-Of-Service (ReDoS) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
279 |
attacks. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
280 |
Davis et al. \cite{Davis18} detected more |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
281 |
than 1000 evil regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
282 |
in Node.js, Python core libraries, npm and pypi. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
283 |
They therefore concluded that evil regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
284 |
are a real problem rather than just "a parlour trick". |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
285 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
286 |
The work in this thesis aims to address this issue |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
287 |
with the help of formal proofs. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
288 |
We describe a lexing algorithm based |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
289 |
on Brzozowski derivatives with verified correctness |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
290 |
and a finiteness property for the size of derivatives |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
291 |
(which are all done in Isabelle/HOL). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
292 |
Such properties %guarantee the absence of |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
293 |
are an important step in preventing |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
294 |
catastrophic backtracking once and for all. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
295 |
We will give more details in the next sections |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
296 |
on (i) why the slow cases in graph \ref{fig:aStarStarb} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
297 |
can occur in traditional regular expression engines |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
298 |
and (ii) why we choose our |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
299 |
approach based on Brzozowski derivatives and formal proofs. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
300 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
301 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
302 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
303 |
\section{Preliminaries}%Regex, and the Problems with Regex Matchers} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
304 |
Regular expressions and regular expression matchers |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
305 |
have been studied for many years. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
306 |
Theoretical results in automata theory state |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
307 |
that basic regular expression matching should be linear |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
308 |
w.r.t the input. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
309 |
This assumes that the regular expression |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
310 |
$r$ was pre-processed and turned into a |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
311 |
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
312 |
By basic we mean textbook definitions such as the one |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
313 |
below, involving only regular expressions for characters, alternatives, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
314 |
sequences, and Kleene stars: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
315 |
\[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
316 |
r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
317 |
\] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
318 |
Modern regular expression matchers used by programmers, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
319 |
however, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
320 |
support much richer constructs, such as bounded repetitions, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
321 |
negations, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
322 |
and back-references. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
323 |
To differentiate, we use the word \emph{regex} to refer |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
324 |
to those expressions with richer constructs while reserving the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
325 |
term \emph{regular expression} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
326 |
for the more traditional meaning in formal languages theory. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
327 |
We follow this convention |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
328 |
in this thesis. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
329 |
In the future, we aim to support all the popular features of regexes, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
330 |
but for this work we mainly look at basic regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
331 |
and bounded repetitions. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
332 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
333 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
334 |
%Most modern regex libraries |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
335 |
%the so-called PCRE standard (Peral Compatible Regular Expressions) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
336 |
%has the back-references |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
337 |
Regexes come with a number of constructs |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
338 |
that make it more convenient for |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
339 |
programmers to write regular expressions. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
340 |
Depending on the types of constructs |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
341 |
the task of matching and lexing with them |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
342 |
will have different levels of complexity. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
343 |
Some of those constructs are syntactic sugars that are |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
344 |
simply short hand notations |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
345 |
that save the programmers a few keystrokes. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
346 |
These will not cause problems for regex libraries. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
347 |
For example the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
348 |
non-binary alternative involving three or more choices just means: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
349 |
\[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
350 |
(a | b | c) \stackrel{means}{=} ((a + b)+ c) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
351 |
\] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
352 |
Similarly, the range operator |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
353 |
%used to express the alternative |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
354 |
%of all characters between its operands, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
355 |
is just a concise way |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
356 |
of expressing an alternative of consecutive characters: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
357 |
\[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
358 |
[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
359 |
\] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
360 |
for an alternative. The |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
361 |
wildcard character '$.$' is used to refer to any single character, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
362 |
\[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
363 |
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
364 |
\] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
365 |
except the newline. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
366 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
367 |
\subsection{Bounded Repetitions} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
368 |
More interesting are bounded repetitions, which can |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
369 |
make the regular expressions much |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
370 |
more compact. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
371 |
Normally there are four kinds of bounded repetitions: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
372 |
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
373 |
(where $n$ and $m$ are constant natural numbers). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
374 |
Like the star regular expressions, the set of strings or language |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
375 |
a bounded regular expression can match |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
376 |
is defined using the power operation on sets: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
377 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
378 |
\begin{tabular}{lcl} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
379 |
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
380 |
$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
381 |
$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
382 |
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
383 |
\end{tabular} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
384 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
385 |
The attraction of bounded repetitions is that they can be |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
386 |
used to avoid a size blow up: for example $r^{\{n\}}$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
387 |
is a shorthand for |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
388 |
the much longer regular expression: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
389 |
\[ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
390 |
\underbrace{r\ldots r}_\text{n copies of r}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
391 |
\] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
392 |
%Therefore, a naive algorithm that simply unfolds |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
393 |
%them into their desugared forms |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
394 |
%will suffer from at least an exponential runtime increase. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
395 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
396 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
397 |
The problem with matching |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
398 |
such bounded repetitions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
399 |
is that tools based on the classic notion of |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
400 |
automata need to expand $r^{\{n\}}$ into $n$ connected |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
401 |
copies of the automaton for $r$. This leads to very inefficient matching |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
402 |
algorithms or algorithms that consume large amounts of memory. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
403 |
Implementations using $\DFA$s will |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
404 |
in such situations |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
405 |
either become very slow |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
406 |
(for example Verbatim++ \cite{Verbatimpp}) or run |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
407 |
out of memory (for example $\mathit{LEX}$ and |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
408 |
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
409 |
in C and JAVA that generate $\mathit{DFA}$-based |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
410 |
lexers. The user provides a set of regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
411 |
and configurations, and then |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
412 |
gets an output program encoding a minimized $\mathit{DFA}$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
413 |
that can be compiled and run. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
414 |
When given the above countdown regular expression, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
415 |
a small $n$ (say 20) would result in a program representing a |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
416 |
DFA |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
417 |
with millions of states.}) for large counters. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
418 |
A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
419 |
where the minimal DFA requires at least $2^{n+1}$ states. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
420 |
For example, when $n$ is equal to 2, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
421 |
the corresponding $\mathit{NFA}$ looks like: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
422 |
\vspace{6mm} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
423 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
424 |
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
425 |
\node[state,initial] (q_0) {$q_0$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
426 |
\node[state, red] (q_1) [right=of q_0] {$q_1$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
427 |
\node[state, red] (q_2) [right=of q_1] {$q_2$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
428 |
\node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
429 |
\path[->] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
430 |
(q_0) edge node {a} (q_1) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
431 |
edge [loop below] node {a,b} () |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
432 |
(q_1) edge node {a,b} (q_2) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
433 |
(q_2) edge node {a,b} (q_3); |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
434 |
\end{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
435 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
436 |
and when turned into a DFA by the subset construction |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
437 |
requires at least $2^3$ states.\footnote{The |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
438 |
red states are "countdown states" which count down |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
439 |
the number of characters needed in addition to the current |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
440 |
string to make a successful match. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
441 |
For example, state $q_1$ indicates a match that has |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
442 |
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
443 |
and just consumed the "delimiter" $a$ in the middle, and |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
444 |
needs to match 2 more iterations of $(a|b)$ to complete. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
445 |
State $q_2$ on the other hand, can be viewed as a state |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
446 |
after $q_1$ has consumed 1 character, and just waits |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
447 |
for 1 more character to complete. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
448 |
The state $q_3$ is the last (accepting) state, requiring 0 |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
449 |
more characters. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
450 |
Depending on the suffix of the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
451 |
input string up to the current read location, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
452 |
the states $q_1$ and $q_2$, $q_3$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
453 |
may or may |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
454 |
not be active. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
455 |
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
456 |
contain at least $2^3$ non-equivalent states that cannot be merged, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
457 |
because the subset construction during determinisation will generate |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
458 |
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
459 |
Generalizing this to regular expressions with larger |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
460 |
bounded repetitions number, we have that |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
461 |
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
462 |
would require at least $2^{n+1}$ states, if $r$ itself contains |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
463 |
more than 1 string. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
464 |
This is to represent all different |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
465 |
scenarios in which "countdown" states are active.} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
466 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
467 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
468 |
Bounded repetitions are important because they |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
469 |
tend to occur frequently in practical use, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
470 |
for example in the regex library RegExLib, in |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
471 |
the rules library of Snort \cite{Snort1999}\footnote{ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
472 |
Snort is a network intrusion detection (NID) tool |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
473 |
for monitoring network traffic. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
474 |
The network security community curates a list |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
475 |
of malicious patterns written as regexes, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
476 |
which is used by Snort's detection engine |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
477 |
to match against network traffic for any hostile |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
478 |
activities such as buffer overflow attacks.}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
479 |
as well as in XML Schema definitions (XSDs). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
480 |
According to Bj\"{o}rklund et al \cite{xml2015}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
481 |
more than half of the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
482 |
XSDs they found on the Maven.org central repository |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
483 |
have bounded regular expressions in them. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
484 |
Often the counters are quite large, with the largest being |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
485 |
close to ten million. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
486 |
A smaller sample XSD they gave |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
487 |
is: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
488 |
\lstset{ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
489 |
basicstyle=\fontsize{8.5}{9}\ttfamily, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
490 |
language=XML, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
491 |
morekeywords={encoding, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
492 |
xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
493 |
} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
494 |
\begin{lstlisting} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
495 |
<sequence minOccurs="0" maxOccurs="65535"> |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
496 |
<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
497 |
<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
498 |
</sequence> |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
499 |
\end{lstlisting} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
500 |
This can be seen as the regex |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
501 |
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
502 |
regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
503 |
satisfying certain constraints (such as |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
504 |
satisfying the floating point number format). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
505 |
It is therefore quite unsatisfying that |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
506 |
some regular expressions matching libraries |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
507 |
impose adhoc limits |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
508 |
for bounded regular expressions: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
509 |
For example, in the regular expression matching library in the Go |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
510 |
language the regular expression $a^{1001}$ is not permitted, because no counter |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
511 |
can be above 1000, and in the built-in Rust regular expression library |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
512 |
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
513 |
for being too big~\footnote{Try it out here: https://rustexp.lpil.uk}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
514 |
As Becchi and Crawley \cite{Becchi08} have pointed out, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
515 |
the reason for these restrictions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
516 |
is that they simulate a non-deterministic finite |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
517 |
automata (NFA) with a breadth-first search. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
518 |
This way the number of active states could |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
519 |
be equal to the counter number. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
520 |
When the counters are large, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
521 |
the memory requirement could become |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
522 |
infeasible, and a regex engine |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
523 |
like in Go will reject this pattern straight away. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
524 |
\begin{figure}[H] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
525 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
526 |
\begin{tikzpicture} [node distance = 2cm, on grid, auto] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
527 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
528 |
\node (q0) [state, initial] {$0$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
529 |
\node (q1) [state, right = of q0] {$1$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
530 |
%\node (q2) [state, right = of q1] {$2$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
531 |
\node (qdots) [right = of q1] {$\ldots$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
532 |
\node (qn) [state, right = of qdots] {$n$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
533 |
\node (qn1) [state, right = of qn] {$n+1$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
534 |
\node (qn2) [state, right = of qn1] {$n+2$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
535 |
\node (qn3) [state, accepting, right = of qn2] {$n+3$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
536 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
537 |
\path [-stealth, thick] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
538 |
(q0) edge [loop above] node {a} () |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
539 |
(q0) edge node {a} (q1) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
540 |
%(q1) edge node {.} (q2) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
541 |
(q1) edge node {.} (qdots) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
542 |
(qdots) edge node {.} (qn) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
543 |
(qn) edge node {.} (qn1) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
544 |
(qn1) edge node {b} (qn2) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
545 |
(qn2) edge node {$c$} (qn3); |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
546 |
\end{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
547 |
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
548 |
% \node[state,initial] (q_0) {$0$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
549 |
% \node[state, ] (q_1) [right=of q_0] {$1$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
550 |
% \node[state, ] (q_2) [right=of q_1] {$2$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
551 |
% \node[state, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
552 |
% \node[state, accepting, ](q_3) [right=of q_2] {$3$}; |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
553 |
% \path[->] |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
554 |
% (q_0) edge node {a} (q_1) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
555 |
% edge [loop below] node {a,b} () |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
556 |
% (q_1) edge node {a,b} (q_2) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
557 |
% (q_2) edge node {a,b} (q_3); |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
558 |
%\end{tikzpicture} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
559 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
560 |
\caption{The example given by Becchi and Crawley |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
561 |
that NFA simulation can consume large |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
562 |
amounts of memory: $.^*a.^{\{n\}}bc$ matching |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
563 |
strings of the form $aaa\ldots aaaabc$. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
564 |
When traversing in a breadth-first manner, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
565 |
all states from 0 till $n+1$ will become active.}\label{fig:inj} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
566 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
567 |
\end{figure} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
568 |
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
569 |
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
570 |
%in terms of input string length. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
571 |
%TODO:try out these lexers |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
572 |
These problems can of course be solved in matching algorithms where |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
573 |
automata go beyond the classic notion and for instance include explicit |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
574 |
counters \cite{Turo_ov__2020}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
575 |
These solutions can be quite efficient, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
576 |
with the ability to process |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
577 |
gigabits of strings input per second |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
578 |
even with large counters \cite{Becchi08}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
579 |
These practical solutions do not come with |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
580 |
formal guarantees, and as pointed out by |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
581 |
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
582 |
%But formal reasoning about these automata especially in Isabelle |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
583 |
%can be challenging |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
584 |
%and un-intuitive. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
585 |
%Therefore, we take correctness and runtime claims made about these solutions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
586 |
%with a grain of salt. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
587 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
588 |
In the work reported in \cite{ITP2023} and here, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
589 |
we add better support using derivatives |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
590 |
for bounded regular expression $r^{\{n\}}$. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
591 |
Our results |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
592 |
extend straightforwardly to |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
593 |
repetitions with intervals such as |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
594 |
$r^{\{n\ldots m\}}$. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
595 |
The merit of Brzozowski derivatives (more on this later) |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
596 |
on this problem is that |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
597 |
it can be naturally extended to support bounded repetitions. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
598 |
Moreover these extensions are still made up of only small |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
599 |
inductive datatypes and recursive functions, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
600 |
making it handy to deal with them in theorem provers. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
601 |
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
602 |
%straightforwardly extended to deal with bounded regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
603 |
%and moreover the resulting code still consists of only simple |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
604 |
%recursive functions and inductive datatypes. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
605 |
Finally, bounded regular expressions do not destroy our finite |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
606 |
boundedness property, which we shall prove later on. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
607 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
608 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
609 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
610 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
611 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
612 |
\subsection{Back-References} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
613 |
The other way to simulate an $\mathit{NFA}$ for matching is choosing |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
614 |
a single transition each time, keeping all the other options in |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
615 |
a queue or stack, and backtracking if that choice eventually |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
616 |
fails. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
617 |
This method, often called a "depth-first-search", |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
618 |
is efficient in many cases, but could end up |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
619 |
with exponential run time. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
620 |
The backtracking method is employed in regex libraries |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
621 |
that support \emph{back-references}, for example |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
622 |
in Java and Python. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
623 |
%\section{Back-references and The Terminology Regex} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
624 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
625 |
%When one constructs an $\NFA$ out of a regular expression |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
626 |
%there is often very little to be done in the first phase, one simply |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
627 |
%construct the $\NFA$ states based on the structure of the input regular expression. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
628 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
629 |
%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
630 |
%one by keeping track of all active states after consuming |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
631 |
%a character, and update that set of states iteratively. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
632 |
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
633 |
%for a path terminating |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
634 |
%at an accepting state. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
635 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
636 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
637 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
638 |
Consider the following regular expression where the sequence |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
639 |
operator is omitted for brevity: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
640 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
641 |
$r_1r_2r_3r_4$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
642 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
643 |
In this example, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
644 |
one could label sub-expressions of interest |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
645 |
by parenthesizing them and giving |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
646 |
them a number in the order in which their opening parentheses appear. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
647 |
One possible way of parenthesizing and labelling is: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
648 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
649 |
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
650 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
651 |
The sub-expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
652 |
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
653 |
by 1 to 4, and can be ``referred back'' by their respective numbers. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
654 |
%These sub-expressions are called "capturing groups". |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
655 |
To do so, one uses the syntax $\backslash i$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
656 |
to denote that we want the sub-string |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
657 |
of the input matched by the i-th |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
658 |
sub-expression to appear again, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
659 |
exactly the same as it first appeared: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
660 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
661 |
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
662 |
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
663 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
664 |
Once the sub-string $s_i$ for the sub-expression $r_i$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
665 |
has been fixed, there is no variability on what the back-reference |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
666 |
label $\backslash i$ can be---it is tied to $s_i$. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
667 |
The overall string will look like $\ldots s_i \ldots s_i \ldots $ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
668 |
%The backslash and number $i$ are the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
669 |
%so-called "back-references". |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
670 |
%Let $e$ be an expression made of regular expressions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
671 |
%and back-references. $e$ contains the expression $e_i$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
672 |
%as its $i$-th capturing group. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
673 |
%The semantics of back-reference can be recursively |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
674 |
%written as: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
675 |
%\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
676 |
% \begin{tabular}{c} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
677 |
% $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
678 |
% $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
679 |
% \end{tabular} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
680 |
%\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
681 |
A concrete example |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
682 |
for back-references is |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
683 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
684 |
$(.^*)\backslash 1$, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
685 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
686 |
which matches |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
687 |
strings that can be split into two identical halves, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
688 |
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
689 |
Note that this is different from |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
690 |
repeating the sub-expression verbatim like |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
691 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
692 |
$(.^*)(.^*)$, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
693 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
694 |
which does not impose any restrictions on what strings the second |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
695 |
sub-expression $.^*$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
696 |
might match. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
697 |
Another example for back-references is |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
698 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
699 |
$(.)(.)\backslash 2\backslash 1$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
700 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
701 |
which matches four-character palindromes |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
702 |
like $abba$, $x??x$ and so on. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
703 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
704 |
Back-references are a regex construct |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
705 |
that programmers find quite useful. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
706 |
According to Becchi and Crawley \cite{Becchi08}, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
707 |
6\% of Snort rules (up until 2008) use them. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
708 |
The most common use of back-references |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
709 |
is to express well-formed html files, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
710 |
where back-references are convenient for matching |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
711 |
opening and closing tags like |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
712 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
713 |
$\langle html \rangle \ldots \langle / html \rangle$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
714 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
715 |
A regex describing such a format |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
716 |
is |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
717 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
718 |
$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
719 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
720 |
Despite being useful, the expressive power of regexes |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
721 |
go beyond regular languages |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
722 |
once back-references are included. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
723 |
In fact, they allow the regex construct to express |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
724 |
languages that cannot be contained in context-free |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
725 |
languages either. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
726 |
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
727 |
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
728 |
which cannot be expressed by |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
729 |
context-free grammars \cite{campeanu2003formal}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
730 |
Such a language is contained in the context-sensitive hierarchy |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
731 |
of formal languages. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
732 |
Also solving the matching problem involving back-references |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
733 |
is known to be NP-complete \parencite{alfred2014algorithms}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
734 |
Regex libraries supporting back-references such as |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
735 |
PCRE \cite{pcre} therefore have to |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
736 |
revert to a depth-first search algorithm including backtracking. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
737 |
What is unexpected is that even in the cases |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
738 |
not involving back-references, there is still |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
739 |
a (non-negligible) chance they might backtrack super-linearly, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
740 |
as shown in the graphs in figure \ref{fig:aStarStarb}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
741 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
742 |
Summing up, we can categorise existing |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
743 |
practical regex libraries into two kinds: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
744 |
(i) The ones with linear |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
745 |
time guarantees like Go and Rust. The downside with them is that |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
746 |
they impose restrictions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
747 |
on the regular expressions (not allowing back-references, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
748 |
bounded repetitions cannot exceed an ad hoc limit etc.). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
749 |
And (ii) those |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
750 |
that allow large bounded regular expressions and back-references |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
751 |
at the expense of using backtracking algorithms. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
752 |
They can potentially ``grind to a halt'' |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
753 |
on some very simple cases, resulting |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
754 |
ReDoS attacks if exposed to the internet. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
755 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
756 |
The problems with both approaches are the motivation for us |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
757 |
to look again at the regular expression matching problem. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
758 |
Another motivation is that regular expression matching algorithms |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
759 |
that follow the POSIX standard often contain errors and bugs |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
760 |
as we shall explain next. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
761 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
762 |
%We would like to have regex engines that can |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
763 |
%deal with the regular part (e.g. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
764 |
%bounded repetitions) of regexes more |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
765 |
%efficiently. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
766 |
%Also we want to make sure that they do it correctly. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
767 |
%It turns out that such aim is not so easy to achieve. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
768 |
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
769 |
% For example, the Rust regex engine claims to be linear, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
770 |
% but does not support lookarounds and back-references. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
771 |
% The GoLang regex library does not support over 1000 repetitions. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
772 |
% Java and Python both support back-references, but shows |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
773 |
%catastrophic backtracking behaviours on inputs without back-references( |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
774 |
%when the language is still regular). |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
775 |
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
776 |
%TODO: verify the fact Rust does not allow 1000+ reps |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
777 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
778 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
779 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
780 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
781 |
%The time cost of regex matching algorithms in general |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
782 |
%involve two different phases, and different things can go differently wrong on |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
783 |
%these phases. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
784 |
%$\DFA$s usually have problems in the first (construction) phase |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
785 |
%, whereas $\NFA$s usually run into trouble |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
786 |
%on the second phase. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
787 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
788 |
|
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
789 |
\section{Error-prone POSIX Implementations} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
790 |
Very often there are multiple ways of matching a string |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
791 |
with a regular expression. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
792 |
In such cases the regular expressions matcher needs to |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
793 |
disambiguate. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
794 |
The more widely used strategy is called POSIX, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
795 |
which roughly speaking always chooses the longest initial match. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
796 |
The POSIX strategy is widely adopted in many regular expression matchers |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
797 |
because it is a natural strategy for lexers. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
798 |
However, many implementations (including the C libraries |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
799 |
used by Linux and OS X distributions) contain bugs |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
800 |
or do not meet the specification they claim to adhere to. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
801 |
Kuklewicz maintains a unit test repository which lists some |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
802 |
problems with existing regular expression engines \cite{KuklewiczHaskell}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
803 |
In some cases, they either fail to generate a |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
804 |
result when there exists a match, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
805 |
or give results that are inconsistent with the POSIX standard. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
806 |
A concrete example is the regex: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
807 |
\begin{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
808 |
$(aba + ab + a)^* \text{and the string} \; ababa$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
809 |
\end{center} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
810 |
The correct POSIX match for the above |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
811 |
involves the entire string $ababa$, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
812 |
split into two Kleene star iterations, namely $[ab], [aba]$ at positions |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
813 |
$[0, 2), [2, 5)$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
814 |
respectively. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
815 |
But feeding this example to the different engines |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
816 |
listed at regex101 \footnote{ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
817 |
regex101 is an online regular expression matcher which |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
818 |
provides API for trying out regular expression |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
819 |
engines of multiple popular programming languages like |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
820 |
Java, Python, Go, etc.} \parencite{regex101}. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
821 |
yields |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
822 |
only two incomplete matches: $[aba]$ at $[0, 3)$ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
823 |
and $a$ at $[4, 5)$. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
824 |
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
825 |
commented that most regex libraries are not |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
826 |
correctly implementing the central POSIX |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
827 |
rule, called the maximum munch rule. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
828 |
Grathwohl \parencite{grathwohl2014crash} wrote: |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
829 |
\begin{quote}\it |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
830 |
``The POSIX strategy is more complicated than the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
831 |
greedy because of the dependence on information about |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
832 |
the length of matched strings in the various subexpressions.'' |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
833 |
\end{quote} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
834 |
%\noindent |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
835 |
People have recognised that the implementation complexity of POSIX rules also come from |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
836 |
the specification being not very precise. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
837 |
The developers of the regexp package of Go |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
838 |
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
839 |
commented that |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
840 |
\begin{quote}\it |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
841 |
`` |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
842 |
The POSIX rule is computationally prohibitive |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
843 |
and not even well-defined. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
844 |
`` |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
845 |
\end{quote} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
846 |
There are many informal summaries of this disambiguation |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
847 |
strategy, which are often quite long and delicate. |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
848 |
For example Kuklewicz \cite{KuklewiczHaskell} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
849 |
described the POSIX rule as (section 1, last paragraph): |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
850 |
\begin{quote} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
851 |
\begin{itemize} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
852 |
\item |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
853 |
regular expressions (REs) take the leftmost starting match, and the longest match starting there |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
854 |
earlier subpatterns have leftmost-longest priority over later subpatterns\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
855 |
\item |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
856 |
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
857 |
\item |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
858 |
REs have right associative concatenation which can be changed with parenthesis\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
859 |
\item |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
860 |
parenthesized subexpressions return the match from their last usage\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
861 |
\item |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
862 |
text of component subexpressions must be contained in the text of the |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
863 |
higher-level subexpressions\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
864 |
\item |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
865 |
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
866 |
\item |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
867 |
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\ |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
868 |
\end{itemize} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
869 |
\end{quote} |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
870 |
%The text above |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
871 |
%is trying to capture something very precise, |
3831621d7b14
added technical Overview section, almost done introduction
Chengsong
parents:
diff
changeset
|
872 |
%and is crying out for formalising. |