lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

equal deleted inserted replaced

-:e71a6e2aca2d
+:35df9cdd36ca
 \begin{center}
 	$s \in S_{start} \iff [] \in S_{end}$
 \end{center}
 \noindent
 Brzozowski noticed that this operation
-can be ``mirrored" on regular expressions which
+can be ``mirrored'' on regular expressions which
 he calls the derivative of a regular expression $r$
 with respect to a character $c$, written
 $r \backslash c$. This infix operator
 takes an original regular expression $r$ as input
 and a character as a right operand and
 outputs a result, which is a new regular expression.
 The derivative operation on regular expression
 is defined such that the language of the derivative result
 coincides with the language of the original
-regular expression's language being taken the language
+regular expression being taken
 derivative with respect to the same character:
-\begin{center}
+\begin{property}
+\[
+	L \; (r \backslash c) = \Der \; c \; (L \; r)
+\]
+\end{property}
+\noindent
+Pictorially, this looks as follows:
 \parskip \baselineskip
 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
 \def\rlwd{.5pt}
 \newcommand\notate[3]{%
 \unskip\def\useanchorwidth{T}%
 \Longstack{
 	\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1,
 \;\ldots\}$}
 }
 \Longstack{
-	$\stackrel{\backslash c}{\longrightarrow}$
+	$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
 }
 \Longstack{
-	\notate{$r\backslash c$}{2}{$L \; (r\backslash c)=
+	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
 	\{\ldots,\;s_1,\;\ldots\}$}
 }
-\end{center}
+\\
-\begin{center}
+The derivatives on regular expression can again be
+generalised to a string.
-\[
+One could compute $r\backslash s$  and test membership of $s$
-L(r \backslash c) = \Der \; c \; L(r)
+in $L \; r$ by checking
-\]
-\end{center}
-\noindent
-where we do derivatives on the regular expression
-$r$ and test membership of $s$ by checking
 whether the empty string is in the language of
 $r\backslash s$.
-For example in the sequence case we have
+\Longstack{
+	\notate{$r_{start}$}{4}{
+		\Longstack{$L \; r_{start} = \{\ldots, \;$
+			\notate{$s$}{1}{$c_1::s_1$}
+		$, \ldots\} $
+		}
+	}
+}
+\Longstack{
+	$\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$
+}
+\Longstack{
+	\notate{$r_1$}{3}{
+		$r_1 = r_{start}\backslash c_1$,
+		$L \; r_1 = $
+	\Longstack{
+		$\{ \ldots,\;$  \notate{$s_1$}{1}{$c_2::s_2$}
+		$,\; \ldots \}$
+	}
+}
+}
+\Longstack{
+	$\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$
+}
+\Longstack{
+	$r_2$
+}
+\Longstack{
+	$  \xdashrightarrow{\hspace*{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace*{0.3cm}} $
+}
+\Longstack{
+	\notate{$r_{end}$}{1}{
+	$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
+}
+\begin{property}
+	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
+\end{property}
+\noindent
+Now we give the recursive definition of derivative on
+regular expressions, so that it satisfies the properties above.
+The derivative function, written $r\backslash c$,
+defines how a regular expression evolves into
+a new one after all the string it contains is acted on:
+if it starts with $c$, then the character is chopped of,
+if not, that string is removed.
+\begin{center}
+\begin{tabular}{lcl}
+		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
+		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
+		$d \backslash c$     & $\dn$ &
+		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
+	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
+\end{tabular}
+\end{center}
+\noindent
+The most involved cases are the sequence case
+and the star case.
+The sequence case says that if the first regular expression
+contains an empty string, then the second component of the sequence
+needs to be considered, as its derivative will contribute to the
+result of this derivative:
+\begin{center}
+	\begin{tabular}{lcl}
+		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1))\; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
+	\end{tabular}
+\end{center}
+\noindent
+Notice how this closely resembles
+the language derivative operation $\Der$:
 \begin{center}
 	\begin{tabular}{lcl}
 		$\Der \; c \; (A @ B)$ & $\dn$ &
 		$ \textit{if} \;\,  [] \in A \;
 		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
 		\Der \; c\; B$\\
 		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
 	\end{tabular}
 \end{center}
 \noindent
-This can be translated to
+The star regular expression $r^*$'s derivative
-regular expressions in the following
+unwraps one iteration of $r$, turns it into $r\backslash c$,
-manner:
+and attaches the original $r^*$
-\begin{center}
+after $r\backslash c$, so that
-	\begin{tabular}{lcl}
+we can further unfold it as many times as needed:
-		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+\[
-		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
+	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
-	\end{tabular}
+\]
-\end{center}
+Again,
+the structure is the same as the semantic derivative of Kleene star:
-\noindent
-And similarly, the Kleene star's semantic derivative
-can be expressed as
 \[
 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
-\]
-which translates to
-\[
-	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
 In the above definition of $(r_1\cdot r_2) \backslash c$,
 the $\textit{if}$ clause's
 boolean condition
 $[] \in L(r_1)$ needs to be
 if at least one of its children is nullable.
 The sequence regular expression
 would require both children to have the empty string
 to compose an empty string, and the Kleene star
 is always nullable because it naturally
 contains the empty string.
+\noindent
-The derivative function, written $r\backslash c$,
-defines how a regular expression evolves into
-a new one after all the string it contains is acted on:
-if it starts with $c$, then the character is chopped of,
-if not, that string is removed.
-\begin{center}
-\begin{tabular}{lcl}
-		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
-		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-		$d \backslash c$     & $\dn$ &
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
-	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-\noindent
-The most involved cases are the sequence case
-and the star case.
-The sequence case says that if the first regular expression
-contains an empty string, then the second component of the sequence
-needs to be considered, as its derivative will contribute to the
-result of this derivative.
-The star regular expression $r^*$'s derivative
-unwraps one iteration of $r$, turns it into $r\backslash c$,
-and attaches the original $r^*$
-after $r\backslash c$, so that
-we can further unfold it as many times as needed.
 We have the following correspondence between
 derivatives on regular expressions and
 derivatives on a set of strings:
 \begin{lemma}\label{derDer}
+	\begin{itemize}
+		\item
 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
+\item
+	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
+	\end{itemize}
 \end{lemma}
+\begin{proof}
+	By induction on $r$.
+\end{proof}
 \noindent
 The main property of the derivative operation
 (that enables us to reason about the correctness of
 derivative-based matching)
 is
-\begin{lemma}\label{derStepwise}
-	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
-\end{lemma}
 \noindent
 We can generalise the derivative operation shown above for single characters
 to strings as follows:
 relatively  easily shown that this matcher is correct:
 \begin{lemma}
 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-	By the stepwise property of derivatives (lemma \ref{derStepwise})
+	By the stepwise property of derivatives
-	and lemma \ref{derDer}.
+	(lemma \ref{derDer}).
 \end{proof}
 \noindent
 \begin{center}
 	\begin{figure}
 \begin{tikzpicture}
 	(v::vs)}
 \end{mathpar}
 \caption{POSIX Lexing Rules}
 \end{figure}
 \noindent
+\begin{figure}
+\begin{tikzpicture}[]
+\node [minimum width = 6cm, rectangle split, rectangle split horizontal,
+	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+	    (node1)
+	    {$r_{token1}$
+	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
+	    %\node [left = 6.0cm of node1] (start1) {hi};
+	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
+\node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
+	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+	    (node2)
+	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
+	    \nodepart{two}  $r_{token2}$ };
+	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
+		\node [above = 1.5cm of middle, minimum width = 6cm,
+			rectangle, style={draw, rounded corners, inner sep=10pt}]
+			(topNode) {$s$};
+	    \path[->,draw]
+	        (topNode) edge node {split $A$} (node2)
+	        (topNode) edge node {split $B$} (node1)
+		;
+\end{tikzpicture}
+\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
+\end{figure}
 The above $\POSIX$ rules follows the intuition described below:
 \begin{itemize}
 	\item (Left Priority)\\
 		Match the leftmost regular expression when multiple options of matching
 		are available.
 	\item (Maximum munch)\\
 		Always match a subpart as much as possible before proceeding
 		to the next token.
+		For example, when the string $s$ matches
+		$r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split:
+		Then the split that matches a longer string for the first token
+		$r_{token1}$ is preferred by this maximum munch rule (See
+		\ref{munch} for an illustration).
+\noindent
 \end{itemize}
 \noindent
 These disambiguation strategies can be
 quite practical.
 For instance, when lexing a code snippet
 that will result in a POSIX value is only
 a minority among the many other terms,
 and one can remove ones that are absolutely not possible to
 be POSIX.
 In the above example,
-\[
+\begin{equation}\label{eqn:growth2}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
-\]
+\end{equation}
 can be further simplified by
 removing the underlined term first,
 which would open up possibilities
 of further simplification that removes the
 underbraced part.

changeset 579	35df9cdd36ca
parent 577	f47fc4840579
child 583	4aabb0629e4b