2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the
outage had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
|

It turns out that regex libraries not only suffer from
exponential backtracking problems,
but can also produce undesired (or even buggy) outputs.
%TODO: comment from who
xxx commented that most regex libraries do not
correctly implement the POSIX (maximal-munch)
rule of regular expression matching.
A concrete example is
the regex
|
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
and the string
\begin{verbatim}
baabaabababaabaaaaaaaaababaa
aababababaaaabaaabaaaaaabaab
aabababaababaaaaaaaaababaaaa
babababaaaaaaaaaaaaac
\end{verbatim}
|

This seemingly complex regex simply says ``some $a$'s
followed by some $b$'s, then a single $b$,
with this pattern iterated 20 times, and finally followed by a $c$''.
A POSIX match would involve the entire string, ``eating up''
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly

This regex triggers catastrophic backtracking in
languages like Python and Java,
whereas it gives a correct but uninformative (non-POSIX)
match in languages like Go or .NET: the match consisting of only the
character $c$.
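
The exponential blow-up of such backtracking can be reproduced with a toy matcher. The sketch below (a hand-rolled illustration, not the engine of any particular library) counts the recursive calls a naive backtracking matcher makes when it tries, and fails, to match the simpler pattern $(a|aa)^*b$ against a string of $n$ $a$'s:

```python
# Toy backtracking matcher for (a|aa)*b, counting recursive calls.
# On a string of only a's every path fails, so the entire search
# space is explored. Illustrative sketch, not a real regex engine.

def count_steps(n):
    s = "a" * n
    steps = 0

    def match_star(i):
        nonlocal steps
        steps += 1
        if i < len(s) and s[i] == "b":   # stop iterating, match the final 'b'
            return True
        if i < len(s) and s[i] == "a" and match_star(i + 1):  # branch: 'a'
            return True
        if s[i:i + 2] == "aa" and match_star(i + 2):          # branch: 'aa'
            return True
        return False

    match_star(0)
    return steps

print(count_steps(5), count_steps(10), count_steps(20))  # 20 232 28656
```

The counts grow like a Fibonacci sequence, that is, exponentially in $n$; nesting and bounded repetitions, as in the regex above, only make the blow-up worse.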
|

Backtracking, or depth-first search, can give us
exponential running time, and quite a few tools
avoid that by determinising the $\mathit{NFA}$
into a $\mathit{DFA}$ and minimising it.
For example, $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in C and Java that generate $\mathit{DFA}$-based
lexers.
However, they do not scale well with bounded repetitions.
A bounded repetition, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denotes a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring matches $r$.
|
%TODO: draw example NFA.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
\node[state, red] (q_1) [right=of q_0] {$q_1$};
\node[state, red] (q_2) [right=of q_1] {$q_2$};
\node[state,accepting](q_3) [right=of q_2] {$q_3$};
\path[->]
(q_0) edge node {a} (q_1)
edge [loop below] node {a,b} ()
(q_1) edge node {a,b} (q_2)
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
|
The red states are ``counter states'', which count down
the number of characters needed, in addition to the current
ones, to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $a|b$ to complete.
State $q_2$, on the other hand, can be viewed as the state
after $q_1$ has consumed 1 character, and waits
for just 1 more character to complete the match.
Depending on the actual characters in the
input string, the states $q_1$ and $q_2$ may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ must
contain at least 4 non-equivalent states that cannot be merged,
because the subset states recording which of $q_1$ and $q_2$
are active are at least four: $\emptyset$, $\{q_1\}$,
$\{q_2\}$ and $\{q_1, q_2\}$.
Generalising this to regular expressions with a larger
bounded repetition number, $r^*ar^{\{n\}}$
in general requires at least $2^n$ states
in a $\mathit{DFA}$, in order to represent all the different
configurations of ``countdown'' states.
For such regexes, tools like $\mathit{JFLEX}$
generate gigantic $\mathit{DFA}$s or even
run out of memory.
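
The blow-up can be observed by running the subset construction directly. The sketch below hard-codes the $\mathit{NFA}$ of $(a|b)^*a(a|b)^{\{n\}}$ (state $0$ for the $(a|b)^*$ loop, states $1,\ldots,n+1$ for the countdown, with $n+1$ accepting; this numbering is our own assumption for the illustration) and counts the reachable $\mathit{DFA}$ states:

```python
# Subset construction for the NFA of (a|b)* a (a|b){n}; counts the
# reachable DFA states, which grow exponentially in n.

def dfa_size(n):
    def step(states, ch):
        out = set()
        for q in states:
            if q == 0:
                out.add(0)            # stay in the (a|b)* loop
                if ch == "a":
                    out.add(1)        # guess this 'a' is the delimiter
            elif q <= n:
                out.add(q + 1)        # count down one more character
        return frozenset(out)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for ch in "ab":
            t = step(s, ch)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

print([dfa_size(n) for n in range(1, 7)])  # doubles with each increment of n
```

The counts come out as $2^{n+1}$ here: every combination of active countdown states is reachable, so none of the subset states can be avoided.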
|

For this reason, regex libraries that support
bounded repetitions often choose the $\mathit{NFA}$
approach instead.
One can simulate an $\mathit{NFA}$ in two ways.
The first is to keep track of all active states after consuming
a character, and to update that set of states iteratively.
This is a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like Go and Rust use this
type of $\mathit{NFA}$ simulation, which guarantees a linear runtime
in terms of the input string length.
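
This set-of-states simulation can be sketched concretely for a small example, say $(a|aa)^*b$. The toy below hand-codes the states and transitions (the state numbering is our own assumption) and advances the whole set of active states one character at a time:

```python
# Breadth-first NFA simulation of (a|aa)*b: keep the set of active states
# and update it once per input character. State 0 = start of an iteration,
# 1 = middle of an 'aa' iteration, 2 = accept (after the final 'b').

def bfs_match(s):
    current = {0}
    for ch in s:
        nxt = set()
        for q in current:
            if q == 0 and ch == "a":
                nxt |= {0, 1}      # finish an 'a' iteration, or start 'aa'
            elif q == 1 and ch == "a":
                nxt.add(0)         # finish an 'aa' iteration
            elif q == 0 and ch == "b":
                nxt.add(2)         # the final 'b'
        current = nxt              # at most 3 states, whatever the input
    return 2 in current

print(bfs_match("aaab"), bfs_match("aaa"))  # True False
```

The work per character is bounded by the (fixed) number of NFA states, which is what gives the linear overall runtime.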
|
The other way to use an $\mathit{NFA}$ for matching is to take
a single path at a time, and backtrack if that path
fails. This is a depth-first search that can end up
with an exponential runtime.
The reason languages like
Java and Python use backtracking algorithms is that they support back-references.
|
\subsection{Back-references in Regexes--the Non-Regular Part}
|
If we label sub-expressions by parenthesising them and give
them numbers in the order in which their opening parentheses appear,
$\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$
then we can use the following syntax to require that the string just matched by a
sub-expression appears again, exactly, at a certain location:
$(.*)\backslash 1$
matches strings like $\mathit{bobo}$, $\mathit{weewee}$, etc.
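
Python's re library supports back-references with exactly this syntax, so the example can be tried directly:

```python
import re

# Group 1 captures some prefix; \1 then requires that captured text to
# repeat immediately. fullmatch anchors the pattern to the whole string.
doubled = re.compile(r"(.*)\1")

print(bool(doubled.fullmatch("bobo")))    # "bo" captured, then repeated
print(bool(doubled.fullmatch("weewee")))  # "wee" captured, then repeated
print(bool(doubled.fullmatch("abc")))     # no way to split into w + w
```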
|

Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but they are not
regular any more.
In fact, they allow regexes to express
languages that are not even contained in the context-free
languages.
For example, the back-references in $(a^*)\backslash 1 \backslash 1$
express the language $\{a^na^na^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by a context-free grammar.
Expressing such a language requires a context-sensitive
grammar, and matching/parsing for such grammars is NP-hard in
general.
%TODO:cite reference for NP-hardness of CSG matching
For such constructs, the most intuitive way of matching is
to use a backtracking algorithm, and there is no known algorithm
that does not backtrack.
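
This, too, can be observed with Python's re library:

```python
import re

# (a*)\1\1 forces the captured run of a's to appear three times in total,
# so a full match exists precisely when the length is divisible by 3.
triple = re.compile(r"(a*)\1\1")

print([n for n in range(10) if triple.fullmatch("a" * n)])  # multiples of 3
```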
|
%TODO:read a bit more about back reference algorithms




\section{Our Solution--Brzozowski Derivatives}

Is it possible to have a regex lexing algorithm with proven correctness and
time complexity, which allows easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references?

We propose Brzozowski's derivatives as a solution to this problem.

The main contribution of this thesis is a proven-correct lexing algorithm
with formalized time bounds.
|
To the best of our knowledge, there are no lexing libraries using Brzozowski derivatives
that come with a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu proposed an algorithm for which they
claimed a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
Its authors formalized the correctness of the lexer, but not its complexity.
In the performance evaluation section, they simply analysed the running time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
lexing only 40 $a$'s took 5 minutes.
We therefore believe that our proof of performance on general
inputs, rather than on specific examples, is a novel contribution.\\
|

\section{Preliminaries about Lexing Using Brzozowski Derivatives}
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.
The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easy to define and reason about
in theorem provers.

The derivative of a regular expression with respect to a character can be
defined by the following clauses:
|
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
|
\noindent
The derivative function, written $\backslash c$,
defines how a regular expression evolves into
a new regular expression after the head character $c$
is chopped off all the strings it matches.
The most involved cases are the sequence
and star cases.
The sequence clause says that if the first regular expression
can match the empty string, then the second component of the sequence
may also be chosen as the target regular expression from which the head
character is chopped off.
The star clause unwraps one iteration of
the regular expression and attaches the star regular expression
to its back again, to make sure that 0 or more iterations
can follow this unfolded iteration.
|


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}
|

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
|

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
shown relatively easily that this matcher is correct (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

A beautiful and simple definition.
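
The matcher is indeed only a few lines of code. Here is a direct transcription into Python of nullable, the derivative clauses and match, using a tuple encoding of regular expressions of our own choosing:

```python
# Regular expressions as tuples: ("0",) = ZERO, ("1",) = ONE,
# ("c", ch) = character, ("+", r1, r2), (".", r1, r2), ("*", r).
ZERO, ONE = ("0",), ("1",)

def nullable(r):
    op = r[0]
    if op in ("1", "*"): return True
    if op == "+": return nullable(r[1]) or nullable(r[2])
    if op == ".": return nullable(r[1]) and nullable(r[2])
    return False  # ZERO and characters

def der(r, c):
    op = r[0]
    if op in ("0", "1"): return ZERO
    if op == "c": return ONE if r[1] == c else ZERO
    if op == "+": return ("+", der(r[1], c), der(r[2], c))
    if op == ".":
        left = (".", der(r[1], c), r[2])
        return ("+", left, der(r[2], c)) if nullable(r[1]) else left
    return (".", der(r[1], c), r)  # star: unwrap one iteration

def match(s, r):
    for c in s:
        r = der(r, c)
    return nullable(r)

# (a + aa)* matches exactly the strings of a's, including the empty string
r = ("*", ("+", ("c", "a"), (".", ("c", "a"), ("c", "a"))))
print(match("aaaa", r), match("aab", r))  # True False
```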
|

If we implement the above algorithm naively, however,
it can be excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\backslash$ and $\nullable$ need to traverse such trees and,
consequently, the bigger the size of the derivative, the slower the
algorithm.
|

Brzozowski was quick to notice that this process generates a lot of useless
$\ONE$s and $\ZERO$s and is therefore not optimal.
He introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes.
|

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that avoids cumbersome regular expressions.
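
The effect of interleaving even a weak simplification with the derivative steps can be measured. The sketch below uses a tuple encoding of regexes of our own choosing; its simp implements only the $\ZERO$/$\ONE$ removals and the merging of syntactically equal alternatives, and compares the tree sizes of the 12th derivative of $(a + aa)^*$ with and without simplification:

```python
# Derivatives with and without an interleaved simplification step.
# Regexes as tuples: ("0",)=ZERO, ("1",)=ONE, ("c",ch), ("+",r1,r2),
# (".",r1,r2), ("*",r).
ZERO, ONE = ("0",), ("1",)

def nullable(r):
    op = r[0]
    if op in ("1", "*"): return True
    if op == "+": return nullable(r[1]) or nullable(r[2])
    if op == ".": return nullable(r[1]) and nullable(r[2])
    return False

def der(r, c):
    op = r[0]
    if op in ("0", "1"): return ZERO
    if op == "c": return ONE if r[1] == c else ZERO
    if op == "+": return ("+", der(r[1], c), der(r[2], c))
    if op == ".":
        left = (".", der(r[1], c), r[2])
        return ("+", left, der(r[2], c)) if nullable(r[1]) else left
    return (".", der(r[1], c), r)

def simp(r):
    op = r[0]
    if op == "+":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO: return r2
        if r2 == ZERO: return r1
        if r1 == r2: return r1        # merge identical alternatives
        return ("+", r1, r2)
    if op == ".":
        r1, r2 = simp(r[1]), simp(r[2])
        if ZERO in (r1, r2): return ZERO
        if r1 == ONE: return r2       # ONE . r = r
        if r2 == ONE: return r1
        return (".", r1, r2)
    return r

def size(r):
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

r0 = ("*", ("+", ("c", "a"), (".", ("c", "a"), ("c", "a"))))  # (a + aa)*
plain = simplified = r0
for _ in range(12):
    plain = der(plain, "a")
    simplified = simp(der(simplified, "a"))
print(size(plain), size(simplified))
```

Even with this simp the size still grows with each step, only slower; this is exactly why the more aggressive rules, such as opening up nested alternatives, are needed.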
|


If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
achieve a very tight size bound, namely
the same size bound as that of the \emph{partial derivatives}.
|

So we build derivatives and then simplify them. So far so good.
But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back for a moment to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant solution for this (arguably as beautiful as the original
derivatives definition).
|

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined the datatypes for storing the
lexing information, called a \emph{value} or
sometimes also a \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:
|

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
|

\noindent
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and so on.
Even when a regular expression matches a certain string, there can
still be more than one value corresponding to the match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching,
without allowing any value under a star to be flattened
to an empty string, can be given by the following formula:
\begin{center}
$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed-form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.
A lexer aimed at producing all possible values has an exponential
worst-case runtime. Therefore it is impractical to try to generate
all possible matches in one run. In practice, we are usually only
interested in the POSIX value, which by intuition always
matches the leftmost regular expression when there is a choice,
and always matches a sub-part maximally before proceeding
to the next token. For example, the above example has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output of the algorithm we want is such a POSIX match
encoded as a value.
|
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}
|


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
|

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
|


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is the character ${c_{i-1}}$ we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:
|

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
|

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses essentially do one thing: they identify the ``hole'' in the
value into which the character has to be injected back.
For instance, consider the last clause, which injects back into a value
corresponding to a star. We know the value must be a sequence value,
whose first component corresponds to the child regex of the star
with its first character chopped off: an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations collected before. So we inject the character
back into the first value and form a new value, with this new iteration
added to the front of the previous list of iterations, all under the
$\Stars$ constructor.
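
The complete two-phase algorithm fits in a short program. The following sketch (the tuple encodings of regexes and values are our own choice; the clauses are transcriptions of the derivative, mkeps and inj definitions above) derives forward and then injects backward:

```python
# Two-phase lexer: derivatives forward, mkeps at the end, inj back.
# Regexes: ("0",),("1",),("c",ch),("+",r1,r2),(".",r1,r2),("*",r).
# Values:  ("Empty",),("Char",c),("Left",v),("Right",v),
#          ("Seq",v1,v2),("Stars",[...]).
ZERO, ONE = ("0",), ("1",)

def nullable(r):
    op = r[0]
    if op in ("1", "*"): return True
    if op == "+": return nullable(r[1]) or nullable(r[2])
    if op == ".": return nullable(r[1]) and nullable(r[2])
    return False

def der(r, c):
    op = r[0]
    if op in ("0", "1"): return ZERO
    if op == "c": return ONE if r[1] == c else ZERO
    if op == "+": return ("+", der(r[1], c), der(r[2], c))
    if op == ".":
        left = (".", der(r[1], c), r[2])
        return ("+", left, der(r[2], c)) if nullable(r[1]) else left
    return (".", der(r[1], c), r)

def mkeps(r):  # how a nullable regex matches the empty string, POSIXly
    op = r[0]
    if op == "1": return ("Empty",)
    if op == "+":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if op == ".": return ("Seq", mkeps(r[1]), mkeps(r[2]))
    return ("Stars", [])  # star

def inj(r, c, v):  # undo one derivative step on the value
    op = r[0]
    if op == "c": return ("Char", c)
    if op == "+":
        return (v[0], inj(r[1] if v[0] == "Left" else r[2], c, v[1]))
    if op == ".":
        if v[0] == "Seq":  return ("Seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left": return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))  # Right case
    return ("Stars", [inj(r[1], c, v[1])] + v[2][1])  # star: v = Seq(v', Stars vs)

def lex(r, s):
    if s == "":
        return mkeps(r)  # assumes the match succeeded, i.e. r is nullable
    return inj(r, s[0], lex(der(r, s[0]), s[1:]))

print(lex(("*", ("c", "a")), "aa"))
```

On $a^*$ and the string $aa$ this returns the value $\Stars\,[\Char(a),\Char(a)]$, as expected.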
|


We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well: by definition,
they mirror the size of their regular expressions.

One can introduce simplification of the regexes and values, but one has to
be careful not to break the correctness, as the injection
function relies heavily on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run during the injection phase.
We can then prove that the POSIX value of how
regular expressions match strings is not affected---although this is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
|

%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating those intermediate values $v_1,\ldots, v_n$,
and get the lexing information, which should already be there, while
doing derivatives in one pass, without a second injection phase?
Can we at the same time make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information into the
regular expressions, called \emph{bitcodes}.
|

\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
$b ::= 1 \mid 0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be seen straightforwardly in the following
coding function from values to bitcodes:
|

\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
code(\Stars\,vs)$
\end{tabular}
\end{center}
|

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration with $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and it also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that
bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
|


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
(\Left\,v, bs_1)$\\
$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
$\textit{decode}\,bs\,r$ & $\dn$ &
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
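
The two functions can be transcribed directly, and a round-trip through code and decode (with the corresponding regular expression supplied) reconstructs the value. The tuple encodings below are our own; $\textit{Some}\,v$ and $\textit{None}$ are modelled as the value itself and Python's None:

```python
# code : value -> bits, and decode' : bits x regex -> value x leftover bits,
# transcribed from the tables above. Bits are plain ints 0/1.

def code(v):
    t = v[0]
    if t in ("Empty", "Char"): return []
    if t == "Left":  return [0] + code(v[1])
    if t == "Right": return [1] + code(v[1])
    if t == "Seq":   return code(v[1]) + code(v[2])
    bits = []                        # Stars: 1 before each iteration, 0 at end
    for x in v[1]:
        bits += [1] + code(x)
    return bits + [0]

def decode1(bs, r):                  # decode'
    op = r[0]
    if op == "1": return ("Empty",), bs
    if op == "c": return ("Char", r[1]), bs
    if op == "+":
        v, rest = decode1(bs[1:], r[1] if bs[0] == 0 else r[2])
        return (("Left", v) if bs[0] == 0 else ("Right", v)), rest
    if op == ".":
        v1, rest = decode1(bs, r[1])
        v2, rest = decode1(rest, r[2])
        return ("Seq", v1, v2), rest
    if bs[0] == 0:                   # star, bit 0: no more iterations
        return ("Stars", []), bs[1:]
    v, rest = decode1(bs[1:], r[1])  # star, bit 1: one more iteration
    stars, rest = decode1(rest, r)
    return ("Stars", [v] + stars[1]), rest

def decode(bs, r):
    v, rest = decode1(bs, r)
    return v if rest == [] else None

# round-trip: the value for how a* matches "aa"
r = ("*", ("c", "a"))
v = ("Stars", [("Char", "a"), ("Char", "a")])
print(code(v), decode(code(v), r) == v)  # [1, 1, 0] True
```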
|

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}
|

\begin{center}
\begin{tabular}{lcl}
$\textit{a}$ & $::=$ & $\ZERO$\\
& $\mid$ & $_{bs}\ONE$\\
& $\mid$ & $_{bs}{\bf c}$\\
& $\mid$ & $_{bs}\sum\,as$\\
& $\mid$ & $_{bs}a_1\cdot a_2$\\
& $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

|
\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
is defined as follows:

|
%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
$_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
\textit{fuse}\,[1]\,r_2^\uparrow]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ &
$_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}

|
\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:
|
\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
$_{bs @ bs'}\ONE$\\
$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
$_{bs@bs'}{\bf c}$\\
$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
$_{bs@bs'}\sum\textit{as}$\\
$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
$_{bs@bs'}a_1 \cdot a_2$\\
$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
$_{bs @ bs'}a^*$
\end{tabular}
\end{center}
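To make internalisation and $\textit{fuse}$ concrete, here is a small executable sketch in Python (our implementation is in Scala; the tuple-based representation and all names below are ours, for illustration only). Bitcodes are lists of $0$/$1$ bits; every annotated constructor except $\ZERO$ stores its bitcode at index 1:

```python
# Annotated regexes as tagged tuples:
#   ("AZERO",), ("AONE", bs), ("ACHAR", bs, c),
#   ("AALTS", bs, [a1, a2, ...]), ("ASEQ", bs, a1, a2), ("ASTAR", bs, a)

def fuse(bs, a):
    """Attach the bitcode bs to the front of an annotated regex."""
    return a if a[0] == "AZERO" else (a[0], bs + a[1]) + a[2:]

def internalise(r):
    """Lift an un-annotated regex (the up-arrow operation in the text)."""
    tag = r[0]
    if tag == "ZERO":
        return ("AZERO",)
    if tag == "ONE":
        return ("AONE", [])
    if tag == "CHAR":
        return ("ACHAR", [], r[1])
    if tag == "ALT":   # tag the two branches with bits 0 and 1
        return ("AALTS", [], [fuse([0], internalise(r[1])),
                              fuse([1], internalise(r[2]))])
    if tag == "SEQ":
        return ("ASEQ", [], internalise(r[1]), internalise(r[2]))
    if tag == "STAR":
        return ("ASTAR", [], internalise(r[1]))

# internalising a + b records which alternative was taken
print(internalise(("ALT", ("CHAR", "a"), ("CHAR", "b"))))
# → ('AALTS', [], [('ACHAR', [0], 'a'), ('ACHAR', [1], 'b')])
```
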
|
\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:
|
|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
_{bs}\ONE\;\textit{else}\;\ZERO$\\
$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}(\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
(_{[]}a^*)$
\end{tabular}
\end{center}
|

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to
$c$, we need to unfold it into a sequence, and attach an additional bit
$0$ to the front of $a \backslash c$ to indicate that there is one more
star iteration. The sequence clause is more subtle: when $a_1$ is
$\textit{bnullable}$ (where \textit{bnullable} is exactly the same as
$\textit{nullable}$, except that it applies to annotated regular
expressions, so we omit its definition), the derivative is an
alternative. Assuming that $\textit{bmkeps}$ correctly extracts the
bitcode for how $a_1$ matches the string prior to character $c$ (more
on this later), the right branch of the alternative, namely
$\textit{fuse} \; (\textit{bmkeps} \; a_1)\, (a_2 \backslash c)$,
collapses the regular expression $a_1$ (as it has already been fully
matched) and stores the parsing information at the head of the regular
expression $a_2 \backslash c$ by fusing it there. The bitsequence
$\textit{bs}$, which was initially attached to the first element of the
sequence $a_1 \cdot a_2$, has now been lifted to the top level of the
$\sum$, as this information is needed whichever way the sequence is
matched, no matter whether $c$ belongs to $a_1$ or $a_2$. After
building these derivatives and maintaining all the lexing information,
we complete the lexing by collecting the bitcodes using a generalised
version of the $\textit{mkeps}$ function for annotated regular
expressions, called $\textit{bmkeps}$:
|

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a$\\
& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
& &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
$bs \,@\, [1]$
\end{tabular}
\end{center}
|
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value,
collecting all the bitcodes and marking the ends of star iterations as
it goes. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
|
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
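The whole pipeline up to this point can be sketched executably. The following Python transcription is ours (the thesis's code is in Scala): annotated regular expressions are tagged tuples with the bitcode at index 1, the bit $0$ opens each star iteration and the bit $1$ closes the star, and instead of decoding we simply return the raw bitcode, since $\textit{decode}$ is omitted here:

```python
def fuse(bs, a):
    """Attach bitcode bs to the front of an annotated regex."""
    return a if a[0] == "AZERO" else (a[0], bs + a[1]) + a[2:]

def internalise(r):
    """Lift an un-annotated regex into an annotated one."""
    tag = r[0]
    if tag == "ZERO": return ("AZERO",)
    if tag == "ONE":  return ("AONE", [])
    if tag == "CHAR": return ("ACHAR", [], r[1])
    if tag == "ALT":
        return ("AALTS", [], [fuse([0], internalise(r[1])),
                              fuse([1], internalise(r[2]))])
    if tag == "SEQ":  return ("ASEQ", [], internalise(r[1]), internalise(r[2]))
    if tag == "STAR": return ("ASTAR", [], internalise(r[1]))

def bnullable(a):
    """Same as nullable, but on annotated regexes."""
    tag = a[0]
    if tag == "AALTS": return any(bnullable(x) for x in a[2])
    if tag == "ASEQ":  return bnullable(a[2]) and bnullable(a[3])
    return tag in ("AONE", "ASTAR")

def bmkeps(a):
    """Extract the bitcode of the POSIX value from a bnullable regex."""
    tag = a[0]
    if tag == "AONE":
        return a[1]
    if tag == "AALTS":            # the first nullable alternative wins
        first, rest = a[2][0], a[2][1:]
        return a[1] + bmkeps(first) if bnullable(first) \
               else bmkeps(("AALTS", a[1], rest))
    if tag == "ASEQ":
        return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if tag == "ASTAR":
        return a[1] + [1]         # mark the end of the star iterations

def bder(c, a):
    """Bitcoded derivative of the annotated regex a w.r.t. character c."""
    tag = a[0]
    if tag == "ACHAR":
        return ("AONE", a[1]) if a[2] == c else ("AZERO",)
    if tag == "AALTS":
        return ("AALTS", a[1], [bder(c, x) for x in a[2]])
    if tag == "ASEQ":
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):
            return ("AALTS", bs, [("ASEQ", [], bder(c, a1), a2),
                                  fuse(bmkeps(a1), bder(c, a2))])
        return ("ASEQ", bs, bder(c, a1), a2)
    if tag == "ASTAR":
        return ("ASEQ", a[1], fuse([0], bder(c, a[2])), ("ASTAR", [], a[2]))
    return ("AZERO",)             # AZERO and AONE

def blexer(r, s):
    """Lex s with r; returns the bitcode of the POSIX value (not decoded)."""
    a = internalise(r)
    for c in s:
        a = bder(c, a)
    return bmkeps(a) if bnullable(a) else None

# (a + b)* on "ab": 0 opens an iteration, then the branch bit,
# and a final 1 closes the star
r = ("STAR", ("ALT", ("CHAR", "a"), ("CHAR", "b")))
print(blexer(r, "ab"))   # → [0, 0, 0, 1, 1]
```

Decoding $[0,0,0,1,1]$ against $(a+b)^*$ yields exactly the POSIX value $\textit{Stars}\,[\textit{Left}\,a,\;\textit{Right}\,b]$.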
|

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).

Recall that one of the important reasons we introduced bitcodes
is that they make simplification more structured and therefore
easier to prove correct.
|
\subsection*{Our Simplification Rules}
|
In this section we introduce aggressive (in terms of size)
simplification rules on annotated regular expressions in order to keep
derivatives small. Such simplifications are promising because the test
data we generated show that a tight size bound can be achieved---though
obviously such tests can only cover a small part of the search space,
as there are infinitely many regular expressions and strings.
|
One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\
|
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
$\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse} \; bs \; a$ \\
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
|

\noindent
The simplification does a pattern match on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its children regular expressions
recursively and then checks whether one of the children has turned into
$\ZERO$ or $\ONE$, which might trigger further simplification at the
current level. The most involved part is the $\sum$ clause, where we use
two auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all
later duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
|

\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flatten} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
$\textit{flatten} \; (\ZERO :: as')$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}
|

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
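The $\textit{simp}$, $\textit{flatten}$ and $\textit{distinct}$ rules can also be sketched in Python (the thesis's code is in Scala; the representation and two details below are our own choices: duplicates are compared after \emph{erasing} bitcodes, since copies that differ only in their bits are still duplicates, and when a $\ONE$ is eliminated on the left of a sequence we keep its own bits, a detail elided in the informal rules above):

```python
# Annotated regexes as tagged tuples with the bitcode at index 1,
# e.g. ("AALTS", bs, [a1, a2]), ("ASEQ", bs, a1, a2).

def fuse(bs, a):
    return a if a[0] == "AZERO" else (a[0], bs + a[1]) + a[2:]

def erase(a):
    """Forget the bitcodes (used to compare regexes for duplication)."""
    tag = a[0]
    if tag == "AALTS": return ("ALTS",) + tuple(erase(x) for x in a[2])
    if tag == "ASEQ":  return ("SEQ", erase(a[2]), erase(a[3]))
    if tag == "ASTAR": return ("STAR", erase(a[2]))
    return (tag[1:],) + a[2:]          # AZERO / AONE / ACHAR

def distinct(xs):
    """Keep only the first copy of each regex (modulo bitcodes)."""
    seen, out = set(), []
    for x in xs:
        if erase(x) not in seen:
            seen.add(erase(x))
            out.append(x)
    return out

def flatten(xs):
    """Open up nested alternatives (fusing their bits) and drop ZEROs."""
    out = []
    for x in xs:
        if x[0] == "AALTS":
            out += [fuse(x[1], y) for y in x[2]]
        elif x[0] != "AZERO":
            out.append(x)
    return out

def simp(a):
    tag = a[0]
    if tag == "ASEQ":
        bs, a1, a2 = a[1], simp(a[2]), simp(a[3])
        if a1[0] == "AZERO" or a2[0] == "AZERO":
            return ("AZERO",)
        if a1[0] == "AONE":            # keep the ONE's own bits too
            return fuse(bs + a1[1], a2)
        return ("ASEQ", bs, a1, a2)
    if tag == "AALTS":
        xs = distinct(flatten([simp(x) for x in a[2]]))
        if not xs:
            return ("AZERO",)
        if len(xs) == 1:
            return fuse(a[1], xs[0])
        return ("AALTS", a[1], xs)
    return a

# (a + b) + a: the nested alternative is opened up and the second
# copy of a is pruned away
nested = ("AALTS", [], [("AALTS", [0], [("ACHAR", [0], "a"),
                                        ("ACHAR", [1], "b")]),
                        ("ACHAR", [1], "a")])
print(simp(nested))
# → ('AALTS', [], [('ACHAR', [0, 0], 'a'), ('ACHAR', [0, 1], 'b')])
```
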
|
Having defined the $\simp$ function,
we can use the previous notation of natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:%\comment{simp in the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
|
\noindent
to obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
|
\noindent
This algorithm keeps the regular expression size small: with this
simplification, the derivative in our previous $(a + aa)^*$ example,
which previously grew to 8000 nodes, is reduced to just 6 nodes and
stays constant, no matter how long the input string is.
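This size behaviour can be observed experimentally. The Python sketch below is our own transcription (Scala in our implementation; duplicates are compared after erasing bitcodes): it interleaves $\textit{simp}$ with the bitcoded derivative and records the size of the derivative of $(a + aa)^*$ after each character of a long string of $a$'s. The size quickly reaches a small fixed point instead of growing without bound (the exact constant depends on how aggressively one simplifies):

```python
def fuse(bs, a):
    return a if a[0] == "AZERO" else (a[0], bs + a[1]) + a[2:]

def internalise(r):
    tag = r[0]
    if tag == "ZERO": return ("AZERO",)
    if tag == "ONE":  return ("AONE", [])
    if tag == "CHAR": return ("ACHAR", [], r[1])
    if tag == "ALT":
        return ("AALTS", [], [fuse([0], internalise(r[1])),
                              fuse([1], internalise(r[2]))])
    if tag == "SEQ":  return ("ASEQ", [], internalise(r[1]), internalise(r[2]))
    if tag == "STAR": return ("ASTAR", [], internalise(r[1]))

def bnullable(a):
    tag = a[0]
    if tag == "AALTS": return any(bnullable(x) for x in a[2])
    if tag == "ASEQ":  return bnullable(a[2]) and bnullable(a[3])
    return tag in ("AONE", "ASTAR")

def bmkeps(a):
    tag = a[0]
    if tag == "AONE":  return a[1]
    if tag == "ASEQ":  return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if tag == "ASTAR": return a[1] + [1]
    first, rest = a[2][0], a[2][1:]                    # AALTS
    return a[1] + bmkeps(first) if bnullable(first) \
           else bmkeps(("AALTS", a[1], rest))

def bder(c, a):
    tag = a[0]
    if tag == "ACHAR":
        return ("AONE", a[1]) if a[2] == c else ("AZERO",)
    if tag == "AALTS":
        return ("AALTS", a[1], [bder(c, x) for x in a[2]])
    if tag == "ASEQ":
        if bnullable(a[2]):
            return ("AALTS", a[1], [("ASEQ", [], bder(c, a[2]), a[3]),
                                    fuse(bmkeps(a[2]), bder(c, a[3]))])
        return ("ASEQ", a[1], bder(c, a[2]), a[3])
    if tag == "ASTAR":
        return ("ASEQ", a[1], fuse([0], bder(c, a[2])), ("ASTAR", [], a[2]))
    return ("AZERO",)

def erase(a):
    tag = a[0]
    if tag == "AALTS": return ("ALTS",) + tuple(erase(x) for x in a[2])
    if tag == "ASEQ":  return ("SEQ", erase(a[2]), erase(a[3]))
    if tag == "ASTAR": return ("STAR", erase(a[2]))
    return (tag[1:],) + a[2:]

def simp(a):
    tag = a[0]
    if tag == "ASEQ":
        bs, a1, a2 = a[1], simp(a[2]), simp(a[3])
        if a1[0] == "AZERO" or a2[0] == "AZERO": return ("AZERO",)
        if a1[0] == "AONE": return fuse(bs + a1[1], a2)
        return ("ASEQ", bs, a1, a2)
    if tag == "AALTS":
        xs, seen = [], set()
        for x in [simp(y) for y in a[2]]:              # flatten + distinct
            ys = ([fuse(x[1], z) for z in x[2]] if x[0] == "AALTS"
                  else [] if x[0] == "AZERO" else [x])
            for y in ys:
                if erase(y) not in seen:
                    seen.add(erase(y))
                    xs.append(y)
        if not xs: return ("AZERO",)
        return fuse(a[1], xs[0]) if len(xs) == 1 else ("AALTS", a[1], xs)
    return a

def size(a):
    tag = a[0]
    if tag in ("AZERO", "AONE", "ACHAR"): return 1
    if tag == "AALTS": return 1 + sum(size(x) for x in a[2])
    if tag == "ASEQ":  return 1 + size(a[2]) + size(a[3])
    return 1 + size(a[2])                              # ASTAR

# derivative sizes of (a + aa)* over 30 a's, simplifying at each step
r = ("STAR", ("ALT", ("CHAR", "a"),
              ("SEQ", ("CHAR", "a"), ("CHAR", "a"))))
a = internalise(r)
sizes = []
for c in "a" * 30:
    a = simp(bder(c, a))
    sizes.append(size(a))
print(sizes[:5], sizes[-1])    # the sizes reach a small constant
```

Without the call to $\textit{simp}$ in the loop, the same experiment shows the size growing with every character consumed.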
|

Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,