ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 666 6da4516ea87d
parent 657 00171b627b8d
child 667 660cf698eb26
equal deleted inserted replaced
665:3bedbdce3a3b 666:6da4516ea87d
     1 % Chapter Template
     1 % Chapter Template
     2 
     2 
     3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title
     3 \chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title
     4 
     4 
     5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
     5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
     6 %and notations we 
     6 %and notations we 
     7 % used for describing the lexing algorithm by Sulzmann and Lu,
     7 % used for describing the lexing algorithm by Sulzmann and Lu,
     8 %and then give the algorithm and its variant and discuss
     8 %and then give the algorithm and its variant and discuss
     9 %why more aggressive simplifications are needed. 
     9 %why more aggressive simplifications are needed. 
    10 
    10 We start with a technical overview of the catastrophic backtracking problem,
    11 In this chapter, we define the basic notions 
    11 motivating rigorous approaches to regular expression matching and lexing.
    12 for regular languages and regular expressions.
    12 Then we introduce basic notations and definitions of our problem.
    13 This is essentially a description in ``English''
       
    14 the functions and datatypes of our formalisation in Isabelle/HOL.
       
    15 We also define what $\POSIX$ lexing means, 
       
    16 followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
       
    17 that produces the output conforming
       
    18 to the $\POSIX$ standard\footnote{In what follows 
       
    19 we choose to use the Isabelle-style notation
       
    20 for function applications, where
       
    21 the parameters of a function are not enclosed
       
    22 inside a pair of parentheses (e.g. $f \;x \;y$
       
    23 instead of $f(x,\;y)$). This is mainly
       
    24 to make the text visually more concise.}.
       
    25 
       
    26 
    13 
    27 \section{Technical Overview}
    14 \section{Technical Overview}
    28 
       
    29 Consider for example the regular expression $(a^*)^*\,b$ and 
    15 Consider for example the regular expression $(a^*)^*\,b$ and 
    30 strings of the form $aa..a$. These strings cannot be matched by this regular
    16 strings of the form $aa..a$. These strings cannot be matched by this regular
    31 expression: Obviously the expected $b$ in the last
    17 expression: obviously the expected $b$ in the last
    32 position is missing. One would assume that modern regular expression
    18 position is missing. One would assume that modern regular expression
    33 matching engines can find this out very quickly. Surprisingly, if one tries
    19 matching engines can find this out very quickly. Surprisingly, if one tries
    34 this example in JavaScript, Python or Java 8, even with small strings, 
    20 this example in JavaScript, Python or Java 8, even with small strings, 
    35 say of lenght of around 30 $a$'s,
    21 say of lenght of around 30 $a$'s,
    36 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
    22 the decision takes large amounts of time to finish.
       
    23 This is inproportional
       
    24 to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
    37 The algorithms clearly show exponential behaviour, and as can be seen
    25 The algorithms clearly show exponential behaviour, and as can be seen
    38 is triggered by some relatively simple regular expressions.
    26 is triggered by some relatively simple regular expressions.
    39 Java 9 and newer
    27 Java 9 and newer
    40 versions improve this behaviour somewhat, but are still slow compared 
    28 versions mitigates this behaviour by several magnitudes, 
       
    29 but are still slow compared 
    41 with the approach we are going to use in this thesis.
    30 with the approach we are going to use in this thesis.
    42 
    31 
    43 
    32 
    44 
    33 
    45 This superlinear blowup in regular expression engines
    34 This superlinear blowup in regular expression engines
   186     legend pos=north west,
   175     legend pos=north west,
   187     legend cell align=left]
   176     legend cell align=left]
   188 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
   177 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
   189 \end{axis}
   178 \end{axis}
   190 \end{tikzpicture}\\ 
   179 \end{tikzpicture}\\ 
       
   180 \begin{tikzpicture}
       
   181 \begin{axis}[
       
   182     xlabel={$n$},
       
   183     x label style={at={(1.05,-0.05)}},
       
   184     ylabel={time in secs},
       
   185     enlargelimits=true,
       
   186     %xtick={0,5,...,30},
       
   187     %xmax=33,
       
   188     %ymax=35,
       
   189     restrict x to domain*=0:60000,
       
   190     restrict y to domain*=0:35,
       
   191     %ytick={0,5,...,30},
       
   192     %scaled ticks=false,
       
   193     axis lines=left,
       
   194     width=5cm,
       
   195     height=4cm, 
       
   196     legend entries={Scala},  
       
   197     legend pos=north west,
       
   198     legend cell align=left]
       
   199 \addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
       
   200 \end{axis}
       
   201 \end{tikzpicture}
       
   202   & 
       
   203 \begin{tikzpicture}
       
   204 \begin{axis}[
       
   205     xlabel={$n$},
       
   206     x label style={at={(1.05,-0.05)}},
       
   207     %ylabel={time in secs},
       
   208     enlargelimits=true,
       
   209     %xtick={0,5000,...,40000},
       
   210     %xmax=40000,
       
   211     %ymax=35,
       
   212     restrict x to domain*=0:60000,
       
   213     restrict y to domain*=0:45,
       
   214     %ytick={0,5,...,30},
       
   215     %scaled ticks=false,
       
   216     axis lines=left,
       
   217     width=5cm,
       
   218     height=4cm, 
       
   219     legend style={cells={align=left}},
       
   220     legend entries={Isabelle \\ Extracted},
       
   221     legend pos=north west,
       
   222     legend cell align=left]
       
   223 \addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
       
   224 \end{axis}
       
   225 \end{tikzpicture}\\ 
   191 \multicolumn{2}{c}{Graphs}
   226 \multicolumn{2}{c}{Graphs}
   192 \end{tabular}    
   227 \end{tabular}    
   193 \end{center}
   228 \end{center}
   194 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   229 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   195            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   230            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   196    The reason for their superlinear behaviour is that they do a depth-first-search
   231    The reason for their fast growth %superlinear behaviour 
       
   232    is that they do a depth-first-search
   197    using NFAs.
   233    using NFAs.
   198    If the string does not match, the regular expression matching
   234    If the string does not match, the regular expression matching
   199    engine starts to explore all possibilities. 
   235    engine starts to explore all possibilities. 
       
   236    The last two graphs are for our implementation in Scala, one manual
       
   237    and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
       
   238    Our lexer  
       
   239    performs better in this case, and
       
   240    is formally verified.
       
   241    Despite being almost identical, the codegen-generated lexer
       
   242    % is generated directly from Isabelle using $\textit{codegen}$,
       
   243    is slower than the manually written version since the code synthesised by
       
   244    $\textit{codegen}$ does not use native integer or string
       
   245    types of the target language.
       
   246    %Note that we are comparing our \emph{lexer} against other languages' matchers.
   200 }\label{fig:aStarStarb}
   247 }\label{fig:aStarStarb}
   201 \end{figure}\afterpage{\clearpage}
   248 \end{figure}\afterpage{\clearpage}
   202 
   249 
   203 A more recent example is a global outage of all Cloudflare servers on 2 July
   250 A more recent example is a global outage of all Cloudflare servers on 2 July
   204 2019. A poorly written regular expression exhibited catastrophic backtracking
   251 2019. A poorly written regular expression exhibited catastrophic backtracking
   234 can occur in traditional regular expression engines
   281 can occur in traditional regular expression engines
   235 and (ii) why we choose our 
   282 and (ii) why we choose our 
   236 approach based on Brzozowski derivatives and formal proofs.
   283 approach based on Brzozowski derivatives and formal proofs.
   237 
   284 
   238 
   285 
       
   286 
       
   287 
       
   288 In this chapter, we define the basic notions 
       
   289 for regular languages and regular expressions.
       
   290 This is essentially a description in ``English''
       
   291 the functions and datatypes of our formalisation in Isabelle/HOL.
       
   292 We also define what $\POSIX$ lexing means, 
       
   293 followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
       
   294 that produces the output conforming
       
   295 to the $\POSIX$ standard. 
       
   296 This is a preliminary chapter which describes the results of
       
   297 Sulzmann and Lu and Ausaf et al \ref{AusafDyckhoffUrban2016},
       
   298 but the proof details and function definitions are needed to motivate our work
       
   299 as we show in chapter \ref{Bitcoded2} how the proofs break down when
       
   300 simplifications are applied.
       
   301 %TODO: Actually show this in chapter 4.
       
   302 
       
   303 In what follows 
       
   304 we choose to use the Isabelle-style notation
       
   305 for function and datatype constructor applications, where
       
   306 the parameters of a function are not enclosed
       
   307 inside a pair of parentheses (e.g. $f \;x \;y$
       
   308 instead of $f(x,\;y)$). This is mainly
       
   309 to make the text visually more concise.
       
   310 
       
   311 
       
   312 
       
   313 
   239 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
   314 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
   240 Regular expressions and regular expression matchers 
   315 Regular expressions and regular expression matchers 
   241 have clearly been studied for many, many years.
   316 have clearly been studied for many, many years.
   242 Theoretical results in automata theory state 
   317 Theoretical results in automata theory state 
   243 that basic regular expression matching should be linear
   318 that basic regular expression matching should be linear
   337 automata need to expand $r^{\{n\}}$ into $n$ connected 
   412 automata need to expand $r^{\{n\}}$ into $n$ connected 
   338 copies of the automaton for $r$. This leads to very inefficient matching
   413 copies of the automaton for $r$. This leads to very inefficient matching
   339 algorithms  or algorithms that consume large amounts of memory.
   414 algorithms  or algorithms that consume large amounts of memory.
   340 Implementations using $\DFA$s will
   415 Implementations using $\DFA$s will
   341 in such situations
   416 in such situations
   342 either become excruciatingly slow 
   417 either become very slow 
   343 (for example Verbatim++ \cite{Verbatimpp}) or run
   418 (for example Verbatim++ \cite{Verbatimpp}) or run
   344 out of memory (for example $\mathit{LEX}$ and 
   419 out of memory (for example $\mathit{LEX}$ and 
   345 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
   420 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
   346 in C and JAVA that generate $\mathit{DFA}$-based
   421 in C and JAVA that generate $\mathit{DFA}$-based
   347 lexers. The user provides a set of regular expressions
   422 lexers. The user provides a set of regular expressions
   940 or identifiers). 
  1015 or identifiers). 
   941 Second, derivative-based matchers need to be more efficient in terms
  1016 Second, derivative-based matchers need to be more efficient in terms
   942 of the sizes of derivatives.
  1017 of the sizes of derivatives.
   943 Elegant and beautiful
  1018 Elegant and beautiful
   944 as many implementations are,
  1019 as many implementations are,
   945 they can be excruciatingly slow. 
  1020 they can be still quite slow. 
   946 For example, Sulzmann and Lu
  1021 For example, Sulzmann and Lu
   947 claim a linear running time of their proposed algorithm,
  1022 claim a linear running time of their proposed algorithm,
   948 but that was falsified by our experiments. The running time 
  1023 but that was falsified by our experiments. The running time 
   949 is actually $\Omega(2^n)$ in the worst case.
  1024 is actually $\Omega(2^n)$ in the worst case.
   950 A similar claim about a theoretical runtime of $O(n^2)$ 
  1025 A similar claim about a theoretical runtime of $O(n^2)$ 
  1528 $ using Brzozowski's original algorithm}\label{NaiveMatcher}
  1603 $ using Brzozowski's original algorithm}\label{NaiveMatcher}
  1529 \end{center}
  1604 \end{center}
  1530 \end{figure}
  1605 \end{figure}
  1531 \noindent
  1606 \noindent
  1532 If we implement the above algorithm naively, however,
  1607 If we implement the above algorithm naively, however,
  1533 the algorithm can be excruciatingly slow, as shown in 
  1608 the algorithm can be as slow as backtracking lexers, as shown in 
  1534 \ref{NaiveMatcher}.
  1609 \ref{NaiveMatcher}.
  1535 Note that both axes are in logarithmic scale.
  1610 Note that both axes are in logarithmic scale.
  1536 Around two dozen characters
  1611 Around two dozen characters
  1537 this algorithm already ``explodes'' with the regular expression 
  1612 this algorithm already ``explodes'' with the regular expression 
  1538 $(a^*)^*b$.
  1613 $(a^*)^*b$.
  1982   $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
  2057   $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
  1983   $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
  2058   $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
  1984   $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
  2059   $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
  1985 \end{tabular}
  2060 \end{tabular}
  1986 \end{center}
  2061 \end{center}
  1987 
       
  1988 \noindent 
  2062 \noindent 
  1989 The function recurses on 
  2063 The function recurses on 
  1990 the shape of regular
  2064 the shape of regular
  1991 expressions and values.
  2065 expressions and values.
  1992 Intuitively, each clause analyses 
  2066 Intuitively, each clause analyses 
  1996 to inject the character back into.
  2070 to inject the character back into.
  1997 Once the character is
  2071 Once the character is
  1998 injected back to that sub-value; 
  2072 injected back to that sub-value; 
  1999 $\inj$ assembles all parts
  2073 $\inj$ assembles all parts
  2000 to form a new value.
  2074 to form a new value.
  2001 
  2075 \subsection{An Example of how Injection Works}
       
  2076 We will provide a few examples on why
       
  2077 
       
  2078 \begin{center}
       
  2079 	\begin{tabular}{lcl}
       
  2080 		$\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\
       
  2081 		$A$ & $B$ & $C$
       
  2082 	\end{tabular}
       
  2083 \end{center}
  2002 For instance, the last clause is an
  2084 For instance, the last clause is an
  2003 injection into a sequence value $v_{i+1}$
  2085 injection into a sequence value $v_{i+1}$
  2004 whose second child
  2086 whose second child
  2005 value is a star and the shape of the 
  2087 value is a star and the shape of the 
  2006 regular expression $r_i$ before injection 
  2088 regular expression $r_i$ before injection