ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 579 35df9cdd36ca
parent 577 f47fc4840579
child 583 4aabb0629e4b
equal deleted inserted replaced
578:e71a6e2aca2d 579:35df9cdd36ca
   276 \begin{center}
   276 \begin{center}
   277 	$s \in S_{start} \iff [] \in S_{end}$
   277 	$s \in S_{start} \iff [] \in S_{end}$
   278 \end{center}
   278 \end{center}
   279 \noindent
   279 \noindent
   280 Brzozowski noticed that this operation
   280 Brzozowski noticed that this operation
   281 can be ``mirrored" on regular expressions which
   281 can be ``mirrored'' on regular expressions which
   282 he calls the derivative of a regular expression $r$
   282 he calls the derivative of a regular expression $r$
   283 with respect to a character $c$, written
   283 with respect to a character $c$, written
   284 $r \backslash c$. This infix operator
   284 $r \backslash c$. This infix operator
   285 takes an original regular expression $r$ as input
   285 takes an original regular expression $r$ as input
   286 and a character as a right operand and
   286 and a character as a right operand and
   287 outputs a result, which is a new regular expression.
   287 outputs a result, which is a new regular expression.
   288 The derivative operation on regular expression
   288 The derivative operation on regular expression
   289 is defined such that the language of the derivative result 
   289 is defined such that the language of the derivative result 
   290 coincides with the language of the original 
   290 coincides with the language of the original 
   291 regular expression's language being taken the language 
   291 regular expression being taken 
   292 derivative with respect to the same character:
   292 derivative with respect to the same character:
   293 \begin{center}
   293 \begin{property}
       
   294 
       
   295 \[
       
   296 	L \; (r \backslash c) = \Der \; c \; (L \; r)
       
   297 \]
       
   298 \end{property}
       
   299 \noindent
       
   300 Pictorially, this looks as follows:
       
   301 
   294 \parskip \baselineskip
   302 \parskip \baselineskip
   295 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
   303 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
   296 \def\rlwd{.5pt}
   304 \def\rlwd{.5pt}
   297 \newcommand\notate[3]{%
   305 \newcommand\notate[3]{%
   298   \unskip\def\useanchorwidth{T}%
   306   \unskip\def\useanchorwidth{T}%
   306 \Longstack{
   314 \Longstack{
   307 	\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, 
   315 	\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, 
   308 \;\ldots\}$}
   316 \;\ldots\}$}
   309 }
   317 }
   310 \Longstack{
   318 \Longstack{
   311 	$\stackrel{\backslash c}{\longrightarrow}$
   319 	$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
   312 }
   320 }
   313 \Longstack{
   321 \Longstack{
   314 	\notate{$r\backslash c$}{2}{$L \; (r\backslash c)=
   322 	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
   315 	\{\ldots,\;s_1,\;\ldots\}$}
   323 	\{\ldots,\;s_1,\;\ldots\}$}
   316 }
   324 }
   317 \end{center}
   325 \\
   318 \begin{center}
   326 The derivatives on regular expression can again be 
   319 
   327 generalised to a string.
   320 \[
   328 One could compute $r\backslash s$  and test membership of $s$
   321 L(r \backslash c) = \Der \; c \; L(r)
   329 in $L \; r$ by checking
   322 \]
       
   323 \end{center}
       
   324 \noindent
       
   325 where we do derivatives on the regular expression
       
   326 $r$ and test membership of $s$ by checking
       
   327 whether the empty string is in the language of 
   330 whether the empty string is in the language of 
   328 $r\backslash s$.
   331 $r\backslash s$.
   329 For example in the sequence case we have 
   332 
       
   333 
       
   334 \Longstack{
       
   335 	\notate{$r_{start}$}{4}{
       
   336 		\Longstack{$L \; r_{start} = \{\ldots, \;$ 
       
   337 			\notate{$s$}{1}{$c_1::s_1$} 
       
   338 		$, \ldots\} $
       
   339 		}
       
   340 	} 
       
   341 }
       
   342 \Longstack{
       
   343 	$\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$
       
   344 }
       
   345 \Longstack{
       
   346 	\notate{$r_1$}{3}{
       
   347 		$r_1 = r_{start}\backslash c_1$,
       
   348 		$L \; r_1 = $
       
   349 	\Longstack{
       
   350 		$\{ \ldots,\;$  \notate{$s_1$}{1}{$c_2::s_2$} 
       
   351 		$,\; \ldots \}$
       
   352 	}
       
   353 }
       
   354 }
       
   355 \Longstack{
       
   356 	$\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$ 
       
   357 }
       
   358 \Longstack{
       
   359 	$r_2$	
       
   360 }
       
   361 \Longstack{
       
   362 	$  \xdashrightarrow{\hspace*{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace*{0.3cm}} $	
       
   363 }
       
   364 \Longstack{
       
   365 	\notate{$r_{end}$}{1}{
       
   366 	$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
       
   367 }
       
   368 
       
   369 
       
   370 
       
   371 \begin{property}
       
   372 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
       
   373 \end{property}
       
   374 \noindent
       
   375 Now we give the recursive definition of derivative on
       
   376 regular expressions, so that it satisfies the properties above.
       
   377 The derivative function, written $r\backslash c$, 
       
   378 defines how a regular expression evolves into
       
   379 a new one after all the string it contains is acted on: 
       
   380 if it starts with $c$, then the character is chopped of,
       
   381 if not, that string is removed.
       
   382 \begin{center}
       
   383 \begin{tabular}{lcl}
       
   384 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   385 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   386 		$d \backslash c$     & $\dn$ & 
       
   387 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   388 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   389 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
       
   390 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   391 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   392 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   393 \end{tabular}
       
   394 \end{center}
       
   395 \noindent
       
   396 The most involved cases are the sequence case
       
   397 and the star case.
       
   398 The sequence case says that if the first regular expression
       
   399 contains an empty string, then the second component of the sequence
       
   400 needs to be considered, as its derivative will contribute to the
       
   401 result of this derivative:
       
   402 \begin{center}
       
   403 	\begin{tabular}{lcl}
       
   404 		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1))\; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
       
   405 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
       
   406 	\end{tabular}
       
   407 \end{center}
       
   408 \noindent
       
   409 Notice how this closely resembles
       
   410 the language derivative operation $\Der$:
   330 \begin{center}
   411 \begin{center}
   331 	\begin{tabular}{lcl}
   412 	\begin{tabular}{lcl}
   332 		$\Der \; c \; (A @ B)$ & $\dn$ & 
   413 		$\Der \; c \; (A @ B)$ & $\dn$ & 
   333 		$ \textit{if} \;\,  [] \in A \; 
   414 		$ \textit{if} \;\,  [] \in A \; 
   334 		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
   415 		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
   335 		\Der \; c\; B$\\
   416 		\Der \; c\; B$\\
   336 		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
   417 		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
   337 	\end{tabular}
   418 	\end{tabular}
   338 \end{center}
   419 \end{center}
   339 \noindent
   420 \noindent
   340 This can be translated to  
   421 The star regular expression $r^*$'s derivative 
   341 regular expressions in the following 
   422 unwraps one iteration of $r$, turns it into $r\backslash c$,
   342 manner:
   423 and attaches the original $r^*$
   343 \begin{center}
   424 after $r\backslash c$, so that 
   344 	\begin{tabular}{lcl}
   425 we can further unfold it as many times as needed:
   345 		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
   426 \[
   346 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
   427 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
   347 	\end{tabular}
   428 \]
   348 \end{center}
   429 Again,
   349 
   430 the structure is the same as the semantic derivative of Kleene star: 
   350 \noindent
       
   351 And similarly, the Kleene star's semantic derivative
       
   352 can be expressed as
       
   353 \[
   431 \[
   354 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
   432 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
   355 \]
       
   356 which translates to
       
   357 \[
       
   358 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
       
   359 \]
   433 \]
   360 In the above definition of $(r_1\cdot r_2) \backslash c$,
   434 In the above definition of $(r_1\cdot r_2) \backslash c$,
   361 the $\textit{if}$ clause's
   435 the $\textit{if}$ clause's
   362 boolean condition 
   436 boolean condition 
   363 $[] \in L(r_1)$ needs to be 
   437 $[] \in L(r_1)$ needs to be 
   388 if at least one of its children is nullable.
   462 if at least one of its children is nullable.
   389 The sequence regular expression
   463 The sequence regular expression
   390 would require both children to have the empty string
   464 would require both children to have the empty string
   391 to compose an empty string, and the Kleene star
   465 to compose an empty string, and the Kleene star
   392 is always nullable because it naturally
   466 is always nullable because it naturally
   393 contains the empty string. 
   467 contains the empty string.  
   394   
   468 \noindent
   395 The derivative function, written $r\backslash c$, 
       
   396 defines how a regular expression evolves into
       
   397 a new one after all the string it contains is acted on: 
       
   398 if it starts with $c$, then the character is chopped of,
       
   399 if not, that string is removed.
       
   400 \begin{center}
       
   401 \begin{tabular}{lcl}
       
   402 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   403 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   404 		$d \backslash c$     & $\dn$ & 
       
   405 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   406 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   407 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
       
   408 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   409 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   410 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   411 \end{tabular}
       
   412 \end{center}
       
   413 \noindent
       
   414 The most involved cases are the sequence case
       
   415 and the star case.
       
   416 The sequence case says that if the first regular expression
       
   417 contains an empty string, then the second component of the sequence
       
   418 needs to be considered, as its derivative will contribute to the
       
   419 result of this derivative.
       
   420 The star regular expression $r^*$'s derivative 
       
   421 unwraps one iteration of $r$, turns it into $r\backslash c$,
       
   422 and attaches the original $r^*$
       
   423 after $r\backslash c$, so that 
       
   424 we can further unfold it as many times as needed.
       
   425 We have the following correspondence between 
   469 We have the following correspondence between 
   426 derivatives on regular expressions and
   470 derivatives on regular expressions and
   427 derivatives on a set of strings:
   471 derivatives on a set of strings:
   428 \begin{lemma}\label{derDer}
   472 \begin{lemma}\label{derDer}
       
   473 	\begin{itemize}
       
   474 		\item
   429 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
   475 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
       
   476 \item
       
   477 	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
       
   478 	\end{itemize}
   430 \end{lemma}
   479 \end{lemma}
   431 
   480 \begin{proof}
       
   481 	By induction on $r$.
       
   482 \end{proof}
   432 \noindent
   483 \noindent
   433 The main property of the derivative operation
   484 The main property of the derivative operation
   434 (that enables us to reason about the correctness of
   485 (that enables us to reason about the correctness of
   435 derivative-based matching)
   486 derivative-based matching)
   436 is 
   487 is 
   437 
   488 
   438 \begin{lemma}\label{derStepwise}
       
   439 	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
       
   440 \end{lemma}
       
   441 
   489 
   442 \noindent
   490 \noindent
   443 We can generalise the derivative operation shown above for single characters
   491 We can generalise the derivative operation shown above for single characters
   444 to strings as follows:
   492 to strings as follows:
   445 
   493 
   476 relatively  easily shown that this matcher is correct:
   524 relatively  easily shown that this matcher is correct:
   477 \begin{lemma}
   525 \begin{lemma}
   478 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
   526 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
   479 \end{lemma}
   527 \end{lemma}
   480 \begin{proof}
   528 \begin{proof}
   481 	By the stepwise property of derivatives (lemma \ref{derStepwise})
   529 	By the stepwise property of derivatives 
   482 	and lemma \ref{derDer}. 
   530 	(lemma \ref{derDer}).
   483 \end{proof}
   531 \end{proof}
   484 \noindent
   532 \noindent
   485 \begin{center}
   533 \begin{center}
   486 	\begin{figure}
   534 	\begin{figure}
   487 \begin{tikzpicture}
   535 \begin{tikzpicture}
   687 	(v::vs)}
   735 	(v::vs)}
   688 \end{mathpar}
   736 \end{mathpar}
   689 \caption{POSIX Lexing Rules}
   737 \caption{POSIX Lexing Rules}
   690 \end{figure}
   738 \end{figure}
   691 \noindent
   739 \noindent
       
   740 
       
   741 \begin{figure}
       
   742 \begin{tikzpicture}[]
       
   743     \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
       
   744 	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
       
   745 	    (node1)
       
   746 	    {$r_{token1}$
       
   747 	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
       
   748 	    %\node [left = 6.0cm of node1] (start1) {hi};
       
   749 	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
       
   750     \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
       
   751 	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
       
   752 	    (node2)
       
   753 	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
       
   754 	    \nodepart{two}  $r_{token2}$ };
       
   755 	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
       
   756 		\node [above = 1.5cm of middle, minimum width = 6cm, 
       
   757 			rectangle, style={draw, rounded corners, inner sep=10pt}] 
       
   758 			(topNode) {$s$};
       
   759 	    \path[->,draw]
       
   760 	        (topNode) edge node {split $A$} (node2)
       
   761 	        (topNode) edge node {split $B$} (node1)
       
   762 		;
       
   763 			
       
   764 
       
   765 \end{tikzpicture}
       
   766 \caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
       
   767 \end{figure}
   692 The above $\POSIX$ rules follows the intuition described below: 
   768 The above $\POSIX$ rules follows the intuition described below: 
   693 \begin{itemize}
   769 \begin{itemize}
   694 	\item (Left Priority)\\
   770 	\item (Left Priority)\\
   695 		Match the leftmost regular expression when multiple options of matching
   771 		Match the leftmost regular expression when multiple options of matching
   696 		are available.
   772 		are available.
   697 	\item (Maximum munch)\\
   773 	\item (Maximum munch)\\
   698 		Always match a subpart as much as possible before proceeding
   774 		Always match a subpart as much as possible before proceeding
   699 		to the next token.
   775 		to the next token.
       
   776 		For example, when the string $s$ matches 
       
   777 		$r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split:
       
   778 		Then the split that matches a longer string for the first token
       
   779 		$r_{token1}$ is preferred by this maximum munch rule (See
       
   780 		\ref{munch} for an illustration).
       
   781 \noindent
       
   782 
   700 \end{itemize}
   783 \end{itemize}
   701 \noindent
   784 \noindent
   702 These disambiguation strategies can be 
   785 These disambiguation strategies can be 
   703 quite practical.
   786 quite practical.
   704 For instance, when lexing a code snippet 
   787 For instance, when lexing a code snippet 
  1203 that will result in a POSIX value is only
  1286 that will result in a POSIX value is only
  1204 a minority among the many other terms,
  1287 a minority among the many other terms,
  1205 and one can remove ones that are absolutely not possible to 
  1288 and one can remove ones that are absolutely not possible to 
  1206 be POSIX.
  1289 be POSIX.
  1207 In the above example,
  1290 In the above example,
  1208 \[
  1291 \begin{equation}\label{eqn:growth2}
  1209 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
  1292 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
  1210 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
  1293 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
  1211 \]
  1294 \end{equation}
  1212 can be further simplified by 
  1295 can be further simplified by 
  1213 removing the underlined term first,
  1296 removing the underlined term first,
  1214 which would open up possibilities
  1297 which would open up possibilities
  1215 of further simplification that removes the
  1298 of further simplification that removes the
  1216 underbraced part.
  1299 underbraced part.