1 % Chapter Template |
1 % Chapter Template |
2 |
2 |
3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title |
3 \chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title |
4 |
4 |
5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts |
5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts |
6 %and notations we |
6 %and notations we |
7 % used for describing the lexing algorithm by Sulzmann and Lu, |
7 % used for describing the lexing algorithm by Sulzmann and Lu, |
8 %and then give the algorithm and its variant and discuss |
8 %and then give the algorithm and its variant and discuss |
9 %why more aggressive simplifications are needed. |
9 %why more aggressive simplifications are needed. |
10 |
10 We start with a technical overview of the catastrophic backtracking problem, |
11 In this chapter, we define the basic notions |
11 motivating rigorous approaches to regular expression matching and lexing. |
12 for regular languages and regular expressions. |
12 Then we introduce basic notations and definitions of our problem. |
13 This is essentially a description in ``English'' |
|
14 the functions and datatypes of our formalisation in Isabelle/HOL. |
|
15 We also define what $\POSIX$ lexing means, |
|
16 followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} |
|
17 that produces the output conforming |
|
18 to the $\POSIX$ standard\footnote{In what follows |
|
19 we choose to use the Isabelle-style notation |
|
20 for function applications, where |
|
21 the parameters of a function are not enclosed |
|
22 inside a pair of parentheses (e.g. $f \;x \;y$ |
|
23 instead of $f(x,\;y)$). This is mainly |
|
24 to make the text visually more concise.}. |
|
25 |
|
26 |
13 |
27 \section{Technical Overview} |
14 \section{Technical Overview} |
28 |
|
29 Consider for example the regular expression $(a^*)^*\,b$ and |
15 Consider for example the regular expression $(a^*)^*\,b$ and |
30 strings of the form $aa..a$. These strings cannot be matched by this regular |
16 strings of the form $aa..a$. These strings cannot be matched by this regular |
31 expression: Obviously the expected $b$ in the last |
17 expression: obviously the expected $b$ in the last |
32 position is missing. One would assume that modern regular expression |
18 position is missing. One would assume that modern regular expression |
33 matching engines can find this out very quickly. Surprisingly, if one tries |
19 matching engines can find this out very quickly. Surprisingly, if one tries |
34 this example in JavaScript, Python or Java 8, even with small strings, |
20 this example in JavaScript, Python or Java 8, even with small strings, |
35 say of lenght of around 30 $a$'s, |
21 say of lenght of around 30 $a$'s, |
36 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}). |
22 the decision takes large amounts of time to finish. |
|
23 This is inproportional |
|
24 to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}). |
37 The algorithms clearly show exponential behaviour, and as can be seen |
25 The algorithms clearly show exponential behaviour, and as can be seen |
38 is triggered by some relatively simple regular expressions. |
26 is triggered by some relatively simple regular expressions. |
39 Java 9 and newer |
27 Java 9 and newer |
40 versions improve this behaviour somewhat, but are still slow compared |
28 versions mitigates this behaviour by several magnitudes, |
|
29 but are still slow compared |
41 with the approach we are going to use in this thesis. |
30 with the approach we are going to use in this thesis. |
42 |
31 |
43 |
32 |
44 |
33 |
45 This superlinear blowup in regular expression engines |
34 This superlinear blowup in regular expression engines |
186 legend pos=north west, |
175 legend pos=north west, |
187 legend cell align=left] |
176 legend cell align=left] |
188 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
177 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
189 \end{axis} |
178 \end{axis} |
190 \end{tikzpicture}\\ |
179 \end{tikzpicture}\\ |
|
180 \begin{tikzpicture} |
|
181 \begin{axis}[ |
|
182 xlabel={$n$}, |
|
183 x label style={at={(1.05,-0.05)}}, |
|
184 ylabel={time in secs}, |
|
185 enlargelimits=true, |
|
186 %xtick={0,5,...,30}, |
|
187 %xmax=33, |
|
188 %ymax=35, |
|
189 restrict x to domain*=0:60000, |
|
190 restrict y to domain*=0:35, |
|
191 %ytick={0,5,...,30}, |
|
192 %scaled ticks=false, |
|
193 axis lines=left, |
|
194 width=5cm, |
|
195 height=4cm, |
|
196 legend entries={Scala}, |
|
197 legend pos=north west, |
|
198 legend cell align=left] |
|
199 \addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data}; |
|
200 \end{axis} |
|
201 \end{tikzpicture} |
|
202 & |
|
203 \begin{tikzpicture} |
|
204 \begin{axis}[ |
|
205 xlabel={$n$}, |
|
206 x label style={at={(1.05,-0.05)}}, |
|
207 %ylabel={time in secs}, |
|
208 enlargelimits=true, |
|
209 %xtick={0,5000,...,40000}, |
|
210 %xmax=40000, |
|
211 %ymax=35, |
|
212 restrict x to domain*=0:60000, |
|
213 restrict y to domain*=0:45, |
|
214 %ytick={0,5,...,30}, |
|
215 %scaled ticks=false, |
|
216 axis lines=left, |
|
217 width=5cm, |
|
218 height=4cm, |
|
219 legend style={cells={align=left}}, |
|
220 legend entries={Isabelle \\ Extracted}, |
|
221 legend pos=north west, |
|
222 legend cell align=left] |
|
223 \addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data}; |
|
224 \end{axis} |
|
225 \end{tikzpicture}\\ |
191 \multicolumn{2}{c}{Graphs} |
226 \multicolumn{2}{c}{Graphs} |
192 \end{tabular} |
227 \end{tabular} |
193 \end{center} |
228 \end{center} |
194 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
229 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
195 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
230 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
196 The reason for their superlinear behaviour is that they do a depth-first-search |
231 The reason for their fast growth %superlinear behaviour |
|
232 is that they do a depth-first-search |
197 using NFAs. |
233 using NFAs. |
198 If the string does not match, the regular expression matching |
234 If the string does not match, the regular expression matching |
199 engine starts to explore all possibilities. |
235 engine starts to explore all possibilities. |
|
236 The last two graphs are for our implementation in Scala, one manual |
|
237 and one extracted from the verified lexer in Isabelle by $\textit{codegen}$. |
|
238 Our lexer |
|
239 performs better in this case, and |
|
240 is formally verified. |
|
241 Despite being almost identical, the codegen-generated lexer |
|
242 % is generated directly from Isabelle using $\textit{codegen}$, |
|
243 is slower than the manually written version since the code synthesised by |
|
244 $\textit{codegen}$ does not use native integer or string |
|
245 types of the target language. |
|
246 %Note that we are comparing our \emph{lexer} against other languages' matchers. |
200 }\label{fig:aStarStarb} |
247 }\label{fig:aStarStarb} |
201 \end{figure}\afterpage{\clearpage} |
248 \end{figure}\afterpage{\clearpage} |
202 |
249 |
203 A more recent example is a global outage of all Cloudflare servers on 2 July |
250 A more recent example is a global outage of all Cloudflare servers on 2 July |
204 2019. A poorly written regular expression exhibited catastrophic backtracking |
251 2019. A poorly written regular expression exhibited catastrophic backtracking |
234 can occur in traditional regular expression engines |
281 can occur in traditional regular expression engines |
235 and (ii) why we choose our |
282 and (ii) why we choose our |
236 approach based on Brzozowski derivatives and formal proofs. |
283 approach based on Brzozowski derivatives and formal proofs. |
237 |
284 |
238 |
285 |
|
286 |
|
287 |
|
288 In this chapter, we define the basic notions |
|
289 for regular languages and regular expressions. |
|
290 This is essentially a description in ``English'' |
|
291 the functions and datatypes of our formalisation in Isabelle/HOL. |
|
292 We also define what $\POSIX$ lexing means, |
|
293 followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} |
|
294 that produces the output conforming |
|
295 to the $\POSIX$ standard. |
|
296 This is a preliminary chapter which describes the results of |
|
297 Sulzmann and Lu and Ausaf et al \ref{AusafDyckhoffUrban2016}, |
|
298 but the proof details and function definitions are needed to motivate our work |
|
299 as we show in chapter \ref{Bitcoded2} how the proofs break down when |
|
300 simplifications are applied. |
|
301 %TODO: Actually show this in chapter 4. |
|
302 |
|
303 In what follows |
|
304 we choose to use the Isabelle-style notation |
|
305 for function and datatype constructor applications, where |
|
306 the parameters of a function are not enclosed |
|
307 inside a pair of parentheses (e.g. $f \;x \;y$ |
|
308 instead of $f(x,\;y)$). This is mainly |
|
309 to make the text visually more concise. |
|
310 |
|
311 |
|
312 |
|
313 |
239 \section{Preliminaries}%Regex, and the Problems with Regex Matchers} |
314 \section{Preliminaries}%Regex, and the Problems with Regex Matchers} |
240 Regular expressions and regular expression matchers |
315 Regular expressions and regular expression matchers |
241 have clearly been studied for many, many years. |
316 have clearly been studied for many, many years. |
242 Theoretical results in automata theory state |
317 Theoretical results in automata theory state |
243 that basic regular expression matching should be linear |
318 that basic regular expression matching should be linear |
337 automata need to expand $r^{\{n\}}$ into $n$ connected |
412 automata need to expand $r^{\{n\}}$ into $n$ connected |
338 copies of the automaton for $r$. This leads to very inefficient matching |
413 copies of the automaton for $r$. This leads to very inefficient matching |
339 algorithms or algorithms that consume large amounts of memory. |
414 algorithms or algorithms that consume large amounts of memory. |
340 Implementations using $\DFA$s will |
415 Implementations using $\DFA$s will |
341 in such situations |
416 in such situations |
342 either become excruciatingly slow |
417 either become very slow |
343 (for example Verbatim++ \cite{Verbatimpp}) or run |
418 (for example Verbatim++ \cite{Verbatimpp}) or run |
344 out of memory (for example $\mathit{LEX}$ and |
419 out of memory (for example $\mathit{LEX}$ and |
345 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
420 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
346 in C and JAVA that generate $\mathit{DFA}$-based |
421 in C and JAVA that generate $\mathit{DFA}$-based |
347 lexers. The user provides a set of regular expressions |
422 lexers. The user provides a set of regular expressions |
940 or identifiers). |
1015 or identifiers). |
941 Second, derivative-based matchers need to be more efficient in terms |
1016 Second, derivative-based matchers need to be more efficient in terms |
942 of the sizes of derivatives. |
1017 of the sizes of derivatives. |
943 Elegant and beautiful |
1018 Elegant and beautiful |
944 as many implementations are, |
1019 as many implementations are, |
945 they can be excruciatingly slow. |
1020 they can be still quite slow. |
946 For example, Sulzmann and Lu |
1021 For example, Sulzmann and Lu |
947 claim a linear running time of their proposed algorithm, |
1022 claim a linear running time of their proposed algorithm, |
948 but that was falsified by our experiments. The running time |
1023 but that was falsified by our experiments. The running time |
949 is actually $\Omega(2^n)$ in the worst case. |
1024 is actually $\Omega(2^n)$ in the worst case. |
950 A similar claim about a theoretical runtime of $O(n^2)$ |
1025 A similar claim about a theoretical runtime of $O(n^2)$ |
1982 $\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\ |
2057 $\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\ |
1983 $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\ |
2058 $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\ |
1984 $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\ |
2059 $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\ |
1985 \end{tabular} |
2060 \end{tabular} |
1986 \end{center} |
2061 \end{center} |
1987 |
|
1988 \noindent |
2062 \noindent |
1989 The function recurses on |
2063 The function recurses on |
1990 the shape of regular |
2064 the shape of regular |
1991 expressions and values. |
2065 expressions and values. |
1992 Intuitively, each clause analyses |
2066 Intuitively, each clause analyses |
1996 to inject the character back into. |
2070 to inject the character back into. |
1997 Once the character is |
2071 Once the character is |
1998 injected back to that sub-value; |
2072 injected back to that sub-value; |
1999 $\inj$ assembles all parts |
2073 $\inj$ assembles all parts |
2000 to form a new value. |
2074 to form a new value. |
2001 |
2075 \subsection{An Example of how Injection Works} |
|
2076 We will provide a few examples on why |
|
2077 |
|
2078 \begin{center} |
|
2079 \begin{tabular}{lcl} |
|
2080 $\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\ |
|
2081 $A$ & $B$ & $C$ |
|
2082 \end{tabular} |
|
2083 \end{center} |
2002 For instance, the last clause is an |
2084 For instance, the last clause is an |
2003 injection into a sequence value $v_{i+1}$ |
2085 injection into a sequence value $v_{i+1}$ |
2004 whose second child |
2086 whose second child |
2005 value is a star and the shape of the |
2087 value is a star and the shape of the |
2006 regular expression $r_i$ before injection |
2088 regular expression $r_i$ before injection |