lexing: comparison ChengsongTanPhdThesis/Chapters/Chapter2.tex

equal deleted inserted replaced

-:2c907b118f78
+:28751de4b4ba
 % Chapter Template
-\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title
+\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
 \label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
 %and notations we
 %use for describing the lexing algorithm by Sulzmann and Lu,
 %and then give the algorithm and its variant, and discuss
 %why more aggressive simplifications are needed.
-\section{Preliminaries}
+\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
+We have a built-in datatype char, made up of characters, which we do not define
+on top of anything else.
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
+\end{tabular}
+\end{center}
+They can form strings by lists:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
+& & $(c\; \text{has char type})$
+\end{tabular}
+\end{center}
+And strings can be concatenated to form longer strings:
+\begin{center}
+\begin{tabular}{lcl}
+$s_1 @ s_2$ & $\rightarrow$ & $s'$\\
+\end{tabular}
+\end{center}
+A set of strings can operate with another set of strings:
+\begin{center}
+\begin{tabular}{lcl}
+$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
+\end{tabular}
+\end{center}
+We also call the above "language concatenation".
+The power of a language is defined recursively, using the
+concatenation operator $@$:
+\begin{center}
+\begin{tabular}{lcl}
+$A^0 $ & $\dn$ & $\{ [] \}$\\
+$A^{n+1}$ & $\dn$ & $A^n @ A$
+\end{tabular}
+\end{center}
+The infinite set of all the power of a language unioned together
+is defined using the power operator, also in recursive function:
+\begin{center}
+\begin{tabular}{lcl}
+$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
+\end{tabular}
+\end{center}
+We also define an operation of chopping off a character from all the strings
+in a set:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
+\end{tabular}
+\end{center}
+With the above definitions, it becomes natural to define regular expressions
+as a concise way for expressing the languages.
+\section{Regular Expressions and Their Language Interpretation}
 Suppose we have an alphabet $\Sigma$, the strings  whose characters
 are from $\Sigma$
 can be expressed as $\Sigma^*$.
 We use patterns to define a set of strings concisely. Regular expressions
 \begin{center}
 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
 \end{center}
 Mathematically, it can be expressed as the
-If the $\textit{StringSet}$ happen to have some structure, for example,
+If the $\textit{A}$ happen to have some structure, for example,
 if it is regular, then we have that it
 % Derivatives of a
 %regular expression, written $r \backslash c$, give a simple solution
 %to the problem of matching a string $s$ with a regular
 %expression $r$: if the derivative of $r$ w.r.t.\ (in
 %succession) all the characters of the string matches the empty string,
 %then $r$ matches $s$ (and {\em vice versa}).
+\section{Brzozowski Derivatives of Regular Expressions}
 The the derivative of regular expression, denoted as
 $r \backslash c$, is a function that takes parameters
 $r$ and $c$, and returns another regular expression $r'$,
 which is computed by the following recursive function:
 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
 \end{tabular}
 \end{center}
 \noindent
-\noindent
+The function derivative, written $r\backslash c$,
-The $\nullable$ function tests whether the empty string $""$
-is in the language of $r$:
-\begin{center}
-		\begin{tabular}{lcl}
-			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
-			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-		\end{tabular}
-\end{center}
-\noindent
-The empty set does not contain any string and
-therefore not the empty string, the empty string
-regular expression contains the empty string
-by definition, the character regular expression
-is the singleton that contains character only,
-and therefore does not contain the empty string,
-the alternative regular expression(or "or" expression)
-might have one of its children regular expressions
-being nullable and any one of its children being nullable
-would suffice. The sequence regular expression
-would require both children to have the empty string
-to compose an empty string and the Kleene star
-operation naturally introduced the empty string.
-We can give the meaning of regular expressions derivatives
-by language interpretation:
-\begin{center}
-\begin{tabular}{lcl}
-		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
-		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-		$d \backslash c$     & $\dn$ &
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
-	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-\noindent
-\noindent
-The function derivative, written $\backslash c$,
 defines how a regular expression evolves into
 a new regular expression after all the string it contains
 is chopped off a certain head character $c$.
 The most involved cases are the sequence
 and star case.
 regular expression and attaches the star regular expression
 to the sequence's second element to make sure a copy is retained
 for possible more iterations in later phases of lexing.
+The $\nullable$ function tests whether the empty string $""$
+is in the language of $r$:
+\begin{center}
+		\begin{tabular}{lcl}
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
+		\end{tabular}
+\end{center}
+\noindent
+The empty set does not contain any string and
+therefore not the empty string, the empty string
+regular expression contains the empty string
+by definition, the character regular expression
+is the singleton that contains character only,
+and therefore does not contain the empty string,
+the alternative regular expression (or "or" expression)
+might have one of its children regular expressions
+being nullable and any one of its children being nullable
+would suffice. The sequence regular expression
+would require both children to have the empty string
+to compose an empty string and the Kleene star
+operation naturally introduced the empty string.
+We have the following property where the derivative on regular
+expressions coincides with the derivative on a set of strings:
+\begin{lemma}
+$\textit{Der} \; c \; L(r) = L (r\backslash c)$
+\end{lemma}
+\noindent
 The main property of the derivative operation
 that enables us to reason about the correctness of
 an algorithm using derivatives is
 \begin{center}
 \end{center}
 \noindent
 and then define Brzozowski's  regular-expression matching algorithm as:
-\[
+\begin{definition}
-match\;s\;r \;\dn\; nullable(r\backslash s)
+$match\;s\;r \;\dn\; nullable(r\backslash s)$
-\]
+\end{definition}
 \noindent
 Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$,
 this algorithm presented graphically is as follows:
 w.r.t.~the character $a$, one obtains a derivative regular expression
 with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
 The reason why $(a + aa) ^*$ explodes so drastically is that without
 pruning, the algorithm will keep records of all possible ways of matching:
 \begin{center}
-$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
+$(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
 \end{center}
 \noindent
 Each of the above alternative branches correspond to the match
 $aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
 More simplifications are possible, such as deleting duplicates
 and opening up nested alternatives to trigger even more simplifications.
 And suppose we apply simplification after each derivative step, and compose
 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
 \textit{simp}(a \backslash c)$. Then we can build
-a matcher without having  cumbersome regular expressions.
+a matcher with simpler regular expressions.
 If we want the size of derivatives in the algorithm to
 stay even lower, we would need more aggressive simplifications.
 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
 deleting duplicates whenever possible. For example, the parentheses in
 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
+$a^*+a+\ONE$.  These more aggressive simplification rules are for
-to achieve a very tight size bound, namely,
+a very tight size bound, possibly as low
-the same size bound as that of the \emph{partial derivatives}.
+as that of the \emph{partial derivatives}\parencite{Antimirov1995}.
 Building derivatives and then simplify them.
 So far so good. But what if we want to
-do lexing instead of just a YES/NO answer?
+do lexing instead of just  getting a YES/NO answer?
 This requires us to go back again to the world
 without simplification first for a moment.
 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
 elegant(arguably as beautiful as the original
 derivatives definition) solution for this.
 $\ldots$, and vice versa.
 Even for the regular expression matching a certain string, there could
 still be more than one value corresponding to it.
 Take the example where $r= (a^*\cdot a^*)^*$ and the string
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-The number of different ways of matching
+If we do not allow any empty iterations in its lexical values,
-without allowing any value under a star to be flattened
+there will be $n - 1$ "splitting points" on $s$ we can choose to
-to an empty string can be given by the following formula:
+split or not so that each sub-string
-\begin{equation}
+segmented by those chosen splitting points will form different iterations:
-	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
+\begin{center}
-\end{equation}
+\begin{tabular}{lcr}
-and a closed form formula can be calculated to be
+$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
-\begin{equation}
+$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
-	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
-\end{equation}
+& $\textit{etc}.$ &
-which is clearly in exponential order.
+\end{tabular}
+\end{center}
+And for each iteration, there are still multiple ways to split
+between the two $a^*$s.
+It is not surprising there are exponentially many lexical values
+that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
 A lexer aimed at getting all the possible values has an exponential
 worst case runtime. Therefore it is impractical to try to generate
 all possible matches in a run. In practice, we are usually
 interested about POSIX values, which by intuition always
 $\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
 \end{tabular}
 \end{center}
-%----------------------------------------------------------------------------------------
-%	SECTION  correctness proof
-%----------------------------------------------------------------------------------------
-\section{Correctness of Bit-coded Algorithm (Without Simplification)}
-We now give the proof the correctness of the algorithm with bit-codes.
-Ausaf and Urban cleverly defined an auxiliary function called $\flex$,
-defined as
-\[
-\flex \; r \; f \; [] \; v \; = \; f\; v
-\flex \; r \; f \; c :: s \; v =  \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v
-\]
-which accumulates the characters that needs to be injected back,
-and does the injection in a stack-like manner (last taken derivative first injected).
-$\flex$ is connected to the $\lexer$:
-\begin{lemma}
-$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
-\end{lemma}
-$\flex$ provides us a bridge between $\lexer$ and $\blexer$.
-What is even better about $\flex$ is that it allows us to
-directly operate on the value $\mkeps (r\backslash v)$,
-which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
-When the value created by $\mkeps$ becomes available, one can
-prove some stepwise properties of lexing nicely:
-\begin{lemma}\label{flexStepwise}
-$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
-\end{lemma}
-And for $\blexer$ we have a function with stepwise properties like $\flex$ as well,
-called $\retrieve$\ref{retrieveDef}.
-$\retrieve$ takes bit-codes from annotated regular expressions
-guided by a value.
-$\retrieve$ is connected to the $\blexer$ in the following way:
-\begin{lemma}\label{blexer_retrieve}
-$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
-\end{lemma}
-If you take derivative of an annotated regular expression,
-you can $\retrieve$ the same bit-codes as before the derivative took place,
-provided that you use the corresponding value:
-\begin{lemma}\label{retrieveStepwise}
-$\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
-\end{lemma}
-The other good thing about $\retrieve$ is that it can be connected to $\flex$:
-%centralLemma1
-\begin{lemma}\label{flex_retrieve}
-$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
-\end{lemma}
-\begin{proof}
-By induction on $s$. The induction tactic is reverse induction on strings.
-$v$ is allowed to be arbitrary.
-The crucial point is to rewrite
-\[
-\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c])
-\]
-as
-\[
-\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
-\].
-This enables us to equate
-\[
-\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c])
-\]
-with
-\[
-\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
-\],
-which in turn can be rewritten as
-\[
-\flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
-\].
-\end{proof}
-With the above lemma we can now link $\flex$ and $\blexer$.
-\begin{lemma}\label{flex_blexer}
-$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
-\end{lemma}
-\begin{proof}
-Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
-\end{proof}
-Finally

changeset 528	28751de4b4ba
parent 526	cb702fb4227f
child 529	96e93df60954