lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

-:3a1fd5ea2484
+:5bf9f94c02e1
 %and then give the algorithm and its variant, and discuss
 %why more aggressive simplifications are needed.
 In this chapter, we define the basic notions
 for regular languages and regular expressions.
+This is essentially a description in "English"
+of our formalisation in Isabelle/HOL.
 We also give the definition of what $\POSIX$ lexing means.
 \section{Basic Concepts}
-Usually in formal language theory there is an alphabet
+Usually formal language theory starts with an alphabet
 denoting a set of characters.
-Here we only use the datatype of characters from Isabelle,
+Here we just use the datatype of characters from Isabelle,
-which roughly corresponds to the ASCII character.
+which roughly corresponds to the ASCII characters.
-Then using the usual $[]$ notation for lists,
+In what follows we shall leave the information about the alphabet
-we can define strings using chars:
+implicit.
-\begin{center}
+Then using the usual bracket notation for lists,
-\begin{tabular}{lcl}
+we can define strings made up of characters:
-$\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
+\begin{center}
-& & $(c\; \text{has char type})$
+\begin{tabular}{lcl}
-\end{tabular}
+$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
-\end{center}
+\end{tabular}
-And strings can be concatenated to form longer strings,
+\end{center}
-in the same way as we concatenate two lists,
+where $c$ is a variable ranging over characters.
-which we denote as $@$. We omit the precise
+Strings can be concatenated to form longer strings in the same
+way as we concatenate two lists, which we write as @.
+We omit the precise
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
 \begin{center}
 \begin{tabular}{lcl}
-$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
+$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
 \end{tabular}
 \end{center}
 We also call the above \emph{language concatenation}.
 The power of a language is defined recursively, using the
 concatenation operator $@$:
 \begin{center}
 \begin{tabular}{lcl}
 $A^0 $ & $\dn$ & $\{ [] \}$\\
-$A^{n+1}$ & $\dn$ & $A^n @ A$
+$A^{n+1}$ & $\dn$ & $A @ A^n$
 \end{tabular}
 \end{center}
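 \noindent
 As a small worked example (the language $\{a, bb\}$ is chosen here purely for illustration,
 and strings are written without list brackets), take $A = \{a,\; bb\}$:
 \begin{center}
 \begin{tabular}{lcl}
 $A^1$ & $=$ & $A @ A^0 \;=\; \{a,\; bb\}$\\
 $A^2$ & $=$ & $A @ A^1 \;=\; \{aa,\; abb,\; bba,\; bbbb\}$
 \end{tabular}
 \end{center}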
 The union of all the natural number powers of a language
-is defined as the Kleene star operator:
+is usually defined as the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
 \end{tabular}
 \end{center}
 \inferrule{}{[] \in A*\\}
 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
 \end{mathpar}
 \end{center}
+\ChristianComment{Yes, used the inferrule command in mathpar}
-We also define an operation of "chopping of" a character from
+We also define an operation of "chopping off" a character from
-a language, which we call $\Der$, meaning "Derivative for a language":
+a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
 \end{tabular}
 \end{center}
 \noindent
 This can be generalised to "chopping off" a string from all strings within set $A$,
-with the help of the concatenation operator:
+namely:
 \begin{center}
 \begin{tabular}{lcl}
-$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
+$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
 \end{tabular}
 \end{center}
 \noindent
-which is essentially the left quotient $A \backslash L'$ of $A$ against
+which is essentially the left quotient $A \backslash L$ of $A$ against
-the singleton language $L' = \{w\}$
+the singleton language $L = \{s\}$
 in formal language theory.
-For this dissertation the $\textit{Ders}$ definition with
+However, for our purposes here, the $\textit{Ders}$ definition with
-a single string suffices.
+a single string is sufficient.
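 \noindent
 To illustrate both operations (the set below is only an assumed example, not taken from
 the formalisation), let $A = \{\textit{foo},\; \textit{bar},\; \textit{frak}\}$. Then
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;f \;A$ & $=$ & $\{\textit{oo},\; \textit{rak}\}$\\
 $\textit{Der} \;b \;A$ & $=$ & $\{\textit{ar}\}$\\
 $\textit{Ders} \;\textit{fo} \;A$ & $=$ & $\{\textit{o}\}$
 \end{tabular}
 \end{center}
 \noindent
 since only $\textit{foo}$ has $\textit{fo}$ as a prefix.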
 With the sequencing, Kleene star, and $\textit{Der}$ operators on languages,
 we have a few properties describing how the derivative of a language can be
 expressed using the derivatives of its sub-languages.
 \begin{lemma}
 The reason why we are interested in $\POSIX$ values is that they can
 be practically used in the lexing phase of a compiler front end.
 For instance, when lexing a code snippet
 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
 as an identifier rather than a keyword.
+\ChristianComment{Do I also introduce lexical values $LV$ here?}
+We know that $\POSIX$ values are also part of the normal values:
+\begin{lemma}
+$(s, r) \rightarrow v \implies \vdash v: r$
+\end{lemma}
+\noindent
 The good property about a $\POSIX$ value is that
 given the same regular expression $r$ and string $s$,
 one can always uniquely determine the $\POSIX$ value for it:
 \begin{lemma}
 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
 \end{tabular}
 \end{center}
 \noindent
 The central property of the $\lexer$ is that it gives the correct result by
 $\POSIX$ standards:
-\begin{lemma}
+\begin{theorem}
 \begin{tabular}{l}
-$s \in L(r) \Longleftrightarrow  (\exists v. \; r \; s = \Some(v) \land (r, \; s) \rightarrow v)$\\
+$\lexer \; r \; s = \Some(v) \Longleftrightarrow (s, \; r) \rightarrow v$\\
-$s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
+$\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (s, r) \rightarrow v)$
 \end{tabular}
-\end{lemma}
+\end{theorem}
 \begin{proof}
 By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
 The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
 by lemma \ref{injPosix}.
 \end{proof}
-Pictorially, the algorithm is as follows (
+We now give a pictorial view of the algorithm (
 For convenience, we employ the following notations: the regular
 expression we start with is $r_0$, and the given string $s$ is composed
 of characters $c_0 c_1 \ldots c_{n-1}$. The
 values built incrementally by \emph{injecting} back the characters into the
 earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
 \end{tikzpicture}
 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
 \label{fig:BetterWaterloo}
 \end{figure}
 That is because our lexing algorithm currently keeps a lot of
-"useless values that will never not be used.
+"useless" values that will not be used.
 The number of these different ways of matching grows exponentially with the string length.
 For $r= (a^*\cdot a^*)^*$ and
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
 if we do not allow any empty iterations in its lexical values,
 there will be $n - 1$ "splitting points" on $s$ we can independently choose to
 split or not so that each sub-string
 segmented by those chosen splitting points will form different iterations.
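 Counting only these segmentations into iterations (and ignoring the further choices
 of how each iteration is divided within the inner $a^*\cdot a^*$), the $n-1$ independent
 binary choices already give a lower bound on the number of lexical values:
 \begin{center}
 $\underbrace{2 \times 2 \times \cdots \times 2}_{n-1 \text{ splitting points}} \;=\; 2^{n-1},
 \qquad \text{e.g. } 2^{3} = 8 \text{ segmentations for } n = 4.$
 \end{center}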
-For example when $n=4$,
+For example, when $n=4$, a few of the many possible splittings are:
 \begin{center}
 \begin{tabular}{lcr}
-$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, this iteration will be divided between the inner sequence $a^*\cdot a^*$)\\
+$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
 $a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a}, \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
 & $\textit{etc}.$ &

changeset 541	5bf9f94c02e1
parent 539	7cf9f17aa179
child 543	b2bea5968b89