ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 528 28751de4b4ba
parent 526 cb702fb4227f
child 529 96e93df60954
--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Sat May 28 17:17:18 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Mon May 30 14:41:09 2022 +0100
@@ -1,6 +1,6 @@
 % Chapter Template
 
-\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title
+\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
 
 \label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
 %and notations we 
@@ -9,8 +9,60 @@
 %why more aggressive simplifications are needed. 
 
 
-\section{Preliminaries}
+\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
+We have a built-in datatype char, made up of characters, which we do not define
+on top of anything else.
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
+\end{tabular}
+\end{center}
+They can form strings by lists:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
+& & $(c\; \text{has char type})$
+\end{tabular}
+\end{center}
+And strings can be concatenated to form longer strings:
+\begin{center}
+\begin{tabular}{lcl}
+$s_1 @ s_2$ & $\rightarrow$ & $s'$\\
+\end{tabular}
+\end{center}
 
+A set of strings can operate with another set of strings:
+\begin{center}
+\begin{tabular}{lcl}
+$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
+\end{tabular}
+\end{center}
+We also call the above "language concatenation".
+The power of a language is defined recursively, using the 
+concatenation operator $@$:
+\begin{center}
+\begin{tabular}{lcl}
+$A^0 $ & $\dn$ & $\{ [] \}$\\
+$A^{n+1}$ & $\dn$ & $A^n @ A$
+\end{tabular}
+\end{center}
+The infinite set of all the power of a language unioned together 
+is defined using the power operator, also in recursive function:
+\begin{center}
+\begin{tabular}{lcl}
+$A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
+\end{tabular}
+\end{center}
+We also define an operation of chopping off a character from all the strings
+in a set:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
+\end{tabular}
+\end{center}
+With the above definitions, it becomes natural to define regular expressions
+as a concise way for expressing the languages.
+\section{Regular Expressions and Their Language Interpretation}
 Suppose we have an alphabet $\Sigma$, the strings  whose characters
 are from $\Sigma$
 can be expressed as $\Sigma^*$.
@@ -49,7 +101,7 @@
 \end{center}
 Mathematically, it can be expressed as the 
 
-If the $\textit{StringSet}$ happen to have some structure, for example,
+If the $\textit{A}$ happen to have some structure, for example,
 if it is regular, then we have that it
 
 % Derivatives of a
@@ -59,6 +111,8 @@
 %succession) all the characters of the string matches the empty string,
 %then $r$ matches $s$ (and {\em vice versa}).  
 
+
+\section{Brzozowski Derivatives of Regular Expressions}
 The the derivative of regular expression, denoted as
 $r \backslash c$, is a function that takes parameters
 $r$ and $c$, and returns another regular expression $r'$,
@@ -78,7 +132,21 @@
 \end{tabular}
 \end{center}
 \noindent
-\noindent
+The function derivative, written $r\backslash c$, 
+defines how a regular expression evolves into
+a new regular expression after all the string it contains
+is chopped off a certain head character $c$.
+The most involved cases are the sequence 
+and star case.
+The sequence case says that if the first regular expression
+contains an empty string then the second component of the sequence
+might be chosen as the target regular expression to be chopped
+off its head character.
+The star regular expression's derivative unwraps the iteration of
+regular expression and attaches the star regular expression
+to the sequence's second element to make sure a copy is retained
+for possible more iterations in later phases of lexing.
+
 
 The $\nullable$ function tests whether the empty string $""$ 
 is in the language of $r$:
@@ -101,49 +169,22 @@
 by definition, the character regular expression
 is the singleton that contains character only,
 and therefore does not contain the empty string,
-the alternative regular expression(or "or" expression)
+the alternative regular expression (or "or" expression)
 might have one of its children regular expressions
 being nullable and any one of its children being nullable
 would suffice. The sequence regular expression
 would require both children to have the empty string
 to compose an empty string and the Kleene star
-operation naturally introduced the empty string.
-
-We can give the meaning of regular expressions derivatives
-by language interpretation:
-
-
- 
+operation naturally introduced the empty string. 
   
-\begin{center}
-\begin{tabular}{lcl}
-		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
-		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-		$d \backslash c$     & $\dn$ & 
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
-	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-\noindent
+We have the following property where the derivative on regular 
+expressions coincides with the derivative on a set of strings:
+
+\begin{lemma}
+$\textit{Der} \; c \; L(r) = L (r\backslash c)$
+\end{lemma}
+
 \noindent
-The function derivative, written $\backslash c$, 
-defines how a regular expression evolves into
-a new regular expression after all the string it contains
-is chopped off a certain head character $c$.
-The most involved cases are the sequence 
-and star case.
-The sequence case says that if the first regular expression
-contains an empty string then the second component of the sequence
-might be chosen as the target regular expression to be chopped
-off its head character.
-The star regular expression's derivative unwraps the iteration of
-regular expression and attaches the star regular expression
-to the sequence's second element to make sure a copy is retained
-for possible more iterations in later phases of lexing.
 
 
 The main property of the derivative operation
@@ -169,9 +210,9 @@
 \noindent
 and then define Brzozowski's  regular-expression matching algorithm as:
 
-\[
-match\;s\;r \;\dn\; nullable(r\backslash s)
-\]
+\begin{definition}
+$match\;s\;r \;\dn\; nullable(r\backslash s)$
+\end{definition}
 
 \noindent
 Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
@@ -234,7 +275,7 @@
 The reason why $(a + aa) ^*$ explodes so drastically is that without
 pruning, the algorithm will keep records of all possible ways of matching:
 \begin{center}
-$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
+$(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
 \end{center}
 
 \noindent
@@ -260,8 +301,7 @@
 And suppose we apply simplification after each derivative step, and compose
 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
 \textit{simp}(a \backslash c)$. Then we can build
-a matcher without having  cumbersome regular expressions.
-
+a matcher with simpler regular expressions.
 
 If we want the size of derivatives in the algorithm to
 stay even lower, we would need more aggressive simplifications.
@@ -270,13 +310,13 @@
 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
-to achieve a very tight size bound, namely,
- the same size bound as that of the \emph{partial derivatives}. 
+$a^*+a+\ONE$.  These more aggressive simplification rules are for
+ a very tight size bound, possibly as low
+  as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
 
 Building derivatives and then simplify them.
 So far so good. But what if we want to 
-do lexing instead of just a YES/NO answer?
+do lexing instead of just  getting a YES/NO answer?
 This requires us to go back again to the world 
 without simplification first for a moment.
 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
@@ -342,17 +382,24 @@
 still be more than one value corresponding to it.
 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-The number of different ways of matching 
-without allowing any value under a star to be flattened
-to an empty string can be given by the following formula:
-\begin{equation}
-	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
-\end{equation}
-and a closed form formula can be calculated to be
-\begin{equation}
-	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
-\end{equation}
-which is clearly in exponential order.
+If we do not allow any empty iterations in its lexical values,
+there will be $n - 1$ "splitting points" on $s$ we can choose to 
+split or not so that each sub-string
+segmented by those chosen splitting points will form different iterations:
+\begin{center}
+\begin{tabular}{lcr}
+$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
+$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
+$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
+ & $\textit{etc}.$ &
+ \end{tabular}
+\end{center}
+
+And for each iteration, there are still multiple ways to split
+between the two $a^*$s.
+It is not surprising there are exponentially many lexical values
+that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
 
 A lexer aimed at getting all the possible values has an exponential
 worst case runtime. Therefore it is impractical to try to generate
@@ -909,89 +956,5 @@
 \end{center}
 
 
-%----------------------------------------------------------------------------------------
-%	SECTION  correctness proof
-%----------------------------------------------------------------------------------------
-\section{Correctness of Bit-coded Algorithm (Without Simplification)}
-We now give the proof the correctness of the algorithm with bit-codes.
-
-Ausaf and Urban cleverly defined an auxiliary function called $\flex$,
-defined as
-\[
-\flex \; r \; f \; [] \; v \; = \; f\; v
-\flex \; r \; f \; c :: s \; v =  \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v
-\]
-which accumulates the characters that needs to be injected back, 
-and does the injection in a stack-like manner (last taken derivative first injected).
-$\flex$ is connected to the $\lexer$:
-\begin{lemma}
-$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
-\end{lemma}
-$\flex$ provides us a bridge between $\lexer$ and $\blexer$.
-What is even better about $\flex$ is that it allows us to 
-directly operate on the value $\mkeps (r\backslash v)$,
-which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
-When the value created by $\mkeps$ becomes available, one can 
-prove some stepwise properties of lexing nicely:
-\begin{lemma}\label{flexStepwise}
-$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
-\end{lemma}
-
-And for $\blexer$ we have a function with stepwise properties like $\flex$ as well,
-called $\retrieve$\ref{retrieveDef}.
-$\retrieve$ takes bit-codes from annotated regular expressions
-guided by a value.
-$\retrieve$ is connected to the $\blexer$ in the following way:
-\begin{lemma}\label{blexer_retrieve}
-$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
-\end{lemma}
-If you take derivative of an annotated regular expression, 
-you can $\retrieve$ the same bit-codes as before the derivative took place,
-provided that you use the corresponding value:
-
-\begin{lemma}\label{retrieveStepwise}
-$\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
-\end{lemma}
-The other good thing about $\retrieve$ is that it can be connected to $\flex$:
-%centralLemma1
-\begin{lemma}\label{flex_retrieve}
-$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
-\end{lemma}
-\begin{proof}
-By induction on $s$. The induction tactic is reverse induction on strings.
-$v$ is allowed to be arbitrary.
-The crucial point is to rewrite 
-\[
-\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
-\]
-as
-\[
-\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
-\].
-This enables us to equate 
-\[
-\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
-\] 
-with 
-\[
-\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
-\],
-which in turn can be rewritten as
-\[
-\flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
-\].
-\end{proof}
-
-With the above lemma we can now link $\flex$ and $\blexer$.
-
-\begin{lemma}\label{flex_blexer}
-$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
-\end{lemma}
-\begin{proof}
-Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
-\end{proof}
-Finally 
-
-