ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 530 823d9b19d21c
parent 529 96e93df60954
child 531 c334f0b3ef52
equal deleted inserted replaced
529:96e93df60954 530:823d9b19d21c
   275 \begin{definition}
   275 \begin{definition}
   276 $match\;s\;r \;\dn\; nullable(r\backslash s)$
   276 $match\;s\;r \;\dn\; nullable(r\backslash s)$
   277 \end{definition}
   277 \end{definition}
   278 
   278 
   279 \noindent
   279 \noindent
   280 Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
   280 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
   281 this algorithm presented graphically is as follows:
   281 this algorithm presented graphically is as follows:
   282 
   282 
   283 \begin{equation}\label{graph:*}
   283 \begin{equation}\label{graph:successive_ders}
   284 \begin{tikzcd}
   284 \begin{tikzcd}
   285 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   285 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   286 \end{tikzcd}
   286 \end{tikzcd}
   287 \end{equation}
   287 \end{equation}
   288 
   288 
   475 are available  
   475 are available  
   476 \item 
   476 \item 
   477 always match a subpart as much as possible before proceeding
   477 always match a subpart as much as possible before proceeding
   478 to the next token.
   478 to the next token.
   479 \end{itemize}
   479 \end{itemize}
   480 The formal definition of a $\POSIX$ value can be described 
   480 The formal definition of a $\POSIX$ value $v$ for a regular expression
       
   481 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   481 in the following set of rules:
   482 in the following set of rules:
   482 \
   483 (TODO: write the entire set of inference rules for POSIX )
   483 
   484 \newcommand*{\inference}[3][t]{%
   484 
   485    \begingroup
   485  For example, the above example has the POSIX value
   486    \def\and{\\}%
   486 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
   487    \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
   487 The output of an algorithm we want would be a POSIX matching
   488    #2 \\
   488 encoded as a value.
   489    \hline
       
   490    #3
       
   491    \end{tabular}%
       
   492    \endgroup
       
   493 }
       
   494 \begin{center}
       
   495 \inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
       
   496 \end{center}
       
   497 
   489 The reason why we are interested in $\POSIX$ values is that they can
   498 The reason why we are interested in $\POSIX$ values is that they can
   490 be practically used in the lexing phase of a compiler front end.
   499 be practically used in the lexing phase of a compiler front end.
   491 For instance, when lexing a code snippet 
   500 For instance, when lexing a code snippet 
   492 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
   501 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
   493 as an identifier rather than a keyword.
   502 as an identifier rather than a keyword.
   494 
   503 
       
   504  For example, the above $r= (a^*\cdot a^*)^*$  and 
       
   505 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
       
   506 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   507 The output of an algorithm we want would be a POSIX matching
       
   508 encoded as a value.
       
   509 
       
   510 
       
   511 
       
   512 
   495 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   513 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   496 algorithm by a second phase (the first phase being building successive
   514 algorithm by a second phase (the first phase being building successive
   497 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   515 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
   498 is generated in case the regular expression matches  the string. 
   516 is generated in case the regular expression matches  the string.
   499 Pictorially, the Sulzmann and Lu algorithm is as follows:
   517 How can we construct a value out of regular expressions and character
       
   518 sequences only?
       
   519 Two functions are involved: $\inj$ and $\mkeps$.
       
   520 The function $\mkeps$ constructs a value from the last
       
   521 one of all the successive derivatives:
       
   522 \begin{ceqn}
       
   523 \begin{equation}\label{graph:mkeps}
       
   524 \begin{tikzcd}
       
   525 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
       
   526 	        & 	              & 	            & v_n       
       
   527 \end{tikzcd}
       
   528 \end{equation}
       
   529 \end{ceqn}
       
   530 
       
   531 It tells us how can an empty string be matched by a 
       
   532 regular expression, in a $\POSIX$ way:
       
   533 
       
   534 	\begin{center}
       
   535 		\begin{tabular}{lcl}
       
   536 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   537 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   538 			& \textit{if} $\nullable(r_{1})$\\ 
       
   539 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   540 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   541 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
   542 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   543 		\end{tabular}
       
   544 	\end{center}
       
   545 
       
   546 
       
   547 \noindent 
       
   548 We favour the left to match an empty string in case there is a choice.
       
   549 When there is a star for us to match the empty string,
       
   550 we simply give the $\Stars$ constructor an empty list, meaning
       
   551 no iterations are taken.
       
   552 
       
   553 
       
   554 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   555 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   556 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   557 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   558 matches $s$. The POSIX value is maintained throught out the process.
       
   559 For this Sulzmann and Lu defined a function that reverses
       
   560 the ``chopping off'' of characters during the derivative phase. The
       
   561 corresponding function is called \emph{injection}, written
       
   562 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   563 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   564 is a character ${c_{i-1}}$, the character we want to inject and the
       
   565 third argument is the value ${v_i}$, into which one wants to inject the
       
   566 character (it corresponds to the regular expression after the character
       
   567 has been chopped off). The result of this function is a new value. 
       
   568 \begin{ceqn}
       
   569 \begin{equation}\label{graph:inj}
       
   570 \begin{tikzcd}
       
   571 r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
       
   572 v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
       
   573 \end{tikzcd}
       
   574 \end{equation}
       
   575 \end{ceqn}
       
   576 
       
   577 
       
   578 \noindent
       
   579 The
       
   580 definition of $\textit{inj}$ is as follows: 
       
   581 
       
   582 \begin{center}
       
   583 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   584   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   585   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   586   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   587   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   588   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   589   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   590   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   591 \end{tabular}
       
   592 \end{center}
       
   593 
       
   594 \noindent This definition is by recursion on the ``shape'' of regular
       
   595 expressions and values. 
       
   596 The clauses basically do one thing--identifying the ``holes'' on 
       
   597 value to inject the character back into.
       
   598 For instance, in the last clause for injecting back to a value
       
   599 that would turn into a new star value that corresponds to a star,
       
   600 we know it must be a sequence value. And we know that the first 
       
   601 value of that sequence corresponds to the child regex of the star
       
   602 with the first character being chopped off--an iteration of the star
       
   603 that had just been unfolded. This value is followed by the already
       
   604 matched star iterations we collected before. So we inject the character 
       
   605 back to the first value and form a new value with this new iteration
       
   606 being added to the previous list of iterations, all under the $\Stars$
       
   607 top level.
       
   608 
       
   609 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
       
   610 we have a lexer with the following recursive definition:
       
   611 \begin{center}
       
   612 \begin{tabular}{lcr}
       
   613 $\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
       
   614 $\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$
       
   615 \end{tabular}
       
   616 \end{center}
       
   617  
       
   618 Pictorially, the algorithm is as follows:
   500 
   619 
   501 \begin{ceqn}
   620 \begin{ceqn}
   502 \begin{equation}\label{graph:2}
   621 \begin{equation}\label{graph:2}
   503 \begin{tikzcd}
   622 \begin{tikzcd}
   504 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   623 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   522 algorithm from the right to left. For the first value $v_n$, we call the
   641 algorithm from the right to left. For the first value $v_n$, we call the
   523 function $\textit{mkeps}$, which builds a POSIX lexical value
   642 function $\textit{mkeps}$, which builds a POSIX lexical value
   524 for how the empty string has been matched by the (nullable) regular
   643 for how the empty string has been matched by the (nullable) regular
   525 expression $r_n$. This function is defined as
   644 expression $r_n$. This function is defined as
   526 
   645 
   527 	\begin{center}
   646 
   528 		\begin{tabular}{lcl}
       
   529 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   530 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   531 			& \textit{if} $\nullable(r_{1})$\\ 
       
   532 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   533 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   534 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
   535 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   536 		\end{tabular}
       
   537 	\end{center}
       
   538 
       
   539 
       
   540 \noindent 
       
   541 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   542 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   543 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   544 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   545 matches $s$. The POSIX value is maintained throught out the process.
       
   546 For this Sulzmann and Lu defined a function that reverses
       
   547 the ``chopping off'' of characters during the derivative phase. The
       
   548 corresponding function is called \emph{injection}, written
       
   549 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   550 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   551 is a character ${c_{i-1}}$, the character we want to inject and the
       
   552 third argument is the value ${v_i}$, into which one wants to inject the
       
   553 character (it corresponds to the regular expression after the character
       
   554 has been chopped off). The result of this function is a new value. The
       
   555 definition of $\textit{inj}$ is as follows: 
       
   556 
       
   557 \begin{center}
       
   558 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   559   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   560   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   561   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   562   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   563   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   564   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   565   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   566 \end{tabular}
       
   567 \end{center}
       
   568 
       
   569 \noindent This definition is by recursion on the ``shape'' of regular
       
   570 expressions and values. 
       
   571 The clauses basically do one thing--identifying the ``holes'' on 
       
   572 value to inject the character back into.
       
   573 For instance, in the last clause for injecting back to a value
       
   574 that would turn into a new star value that corresponds to a star,
       
   575 we know it must be a sequence value. And we know that the first 
       
   576 value of that sequence corresponds to the child regex of the star
       
   577 with the first character being chopped off--an iteration of the star
       
   578 that had just been unfolded. This value is followed by the already
       
   579 matched star iterations we collected before. So we inject the character 
       
   580 back to the first value and form a new value with this new iteration
       
   581 being added to the previous list of iterations, all under the $Stars$
       
   582 top level.
       
   583 
   647 
   584 We have mentioned before that derivatives without simplification 
   648 We have mentioned before that derivatives without simplification 
   585 can get clumsy, and this is true for values as well--they reflect
   649 can get clumsy, and this is true for values as well--they reflect
   586 the regular expressions size by definition.
   650 the regular expressions size by definition.
   587 
   651 
   615 
   679 
   616 Sulzmann and Lu solved this problem by
   680 Sulzmann and Lu solved this problem by
   617 introducing additional informtaion to the 
   681 introducing additional informtaion to the 
   618 regular expressions called \emph{bitcodes}.
   682 regular expressions called \emph{bitcodes}.
   619 
   683 
   620 \subsection*{Bit-coded Algorithm}
   684 
   621 Bits and bitcodes (lists of bits) are defined as:
   685 
   622 
       
   623 \begin{center}
       
   624 		$b ::=   1 \mid  0 \qquad
       
   625 bs ::= [] \mid b::bs    
       
   626 $
       
   627 \end{center}
       
   628 
       
   629 \noindent
       
   630 The $1$ and $0$ are not in bold in order to avoid 
       
   631 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
       
   632 bit-lists) can be used to encode values (or potentially incomplete values) in a
       
   633 compact form. This can be straightforwardly seen in the following
       
   634 coding function from values to bitcodes: 
       
   635 
       
   636 \begin{center}
       
   637 \begin{tabular}{lcl}
       
   638   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
   639   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
   640   $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
       
   641   $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
       
   642   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
   643   $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
       
   644   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
       
   645                                                  code(\Stars\,vs)$
       
   646 \end{tabular}    
       
   647 \end{center} 
       
   648 
       
   649 \noindent
       
   650 Here $\textit{code}$ encodes a value into a bitcodes by converting
       
   651 $\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
       
   652 star iteration by $1$. The border where a local star terminates
       
   653 is marked by $0$. This coding is lossy, as it throws away the information about
       
   654 characters, and also does not encode the ``boundary'' between two
       
   655 sequence values. Moreover, with only the bitcode we cannot even tell
       
   656 whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
       
   657 reason for choosing this compact way of storing information is that the
       
   658 relatively small size of bits can be easily manipulated and ``moved
       
   659 around'' in a regular expression. In order to recover values, we will 
       
   660 need the corresponding regular expression as an extra information. This
       
   661 means the decoding function is defined as:
       
   662 
       
   663 
       
   664 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
   665 \begin{center}
       
   666 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   667   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
   668   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
   669   $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   670      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
   671        (\Left\,v, bs_1)$\\
       
   672   $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   673      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
   674        (\Right\,v, bs_1)$\\                           
       
   675   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
   676         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
   677   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
   678   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
   679   $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
   680   $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
       
   681          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   682   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
   683   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
   684   
       
   685   $\textit{decode}\,bs\,r$ & $\dn$ &
       
   686      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   687   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
   688        \textit{else}\;\textit{None}$                       
       
   689 \end{tabular}    
       
   690 \end{center}    
       
   691 %\end{definition}
       
   692 
       
   693 Sulzmann and Lu's integrated the bitcodes into regular expressions to
       
   694 create annotated regular expressions \cite{Sulzmann2014}.
       
   695 \emph{Annotated regular expressions} are defined by the following
       
   696 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
   697 
       
   698 \begin{center}
       
   699 \begin{tabular}{lcl}
       
   700   $\textit{a}$ & $::=$  & $\ZERO$\\
       
   701                   & $\mid$ & $_{bs}\ONE$\\
       
   702                   & $\mid$ & $_{bs}{\bf c}$\\
       
   703                   & $\mid$ & $_{bs}\sum\,as$\\
       
   704                   & $\mid$ & $_{bs}a_1\cdot a_2$\\
       
   705                   & $\mid$ & $_{bs}a^*$
       
   706 \end{tabular}    
       
   707 \end{center}  
       
   708 %(in \textit{ALTS})
       
   709 
       
   710 \noindent
       
   711 where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
       
   712 expressions and $as$ for a list of annotated regular expressions.
       
   713 The alternative constructor($\sum$) has been generalized to 
       
   714 accept a list of annotated regular expressions rather than just 2.
       
   715 We will show that these bitcodes encode information about
       
   716 the (POSIX) value that should be generated by the Sulzmann and Lu
       
   717 algorithm.
       
   718 
       
   719 
       
   720 To do lexing using annotated regular expressions, we shall first
       
   721 transform the usual (un-annotated) regular expressions into annotated
       
   722 regular expressions. This operation is called \emph{internalisation} and
       
   723 defined as follows:
       
   724 
       
   725 %\begin{definition}
       
   726 \begin{center}
       
   727 \begin{tabular}{lcl}
       
   728   $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
       
   729   $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
       
   730   $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
       
   731   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
   732   $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
       
   733   \textit{fuse}\,[1]\,r_2^\uparrow]$\\
       
   734   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
   735          $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
       
   736   $(r^*)^\uparrow$ & $\dn$ &
       
   737          $_{[]}(r^\uparrow)^*$\\
       
   738 \end{tabular}    
       
   739 \end{center}    
       
   740 %\end{definition}
       
   741 
       
   742 \noindent
       
   743 We use up arrows here to indicate that the basic un-annotated regular
       
   744 expressions are ``lifted up'' into something slightly more complex. In the
       
   745 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
   746 attach bits to the front of an annotated regular expression. Its
       
   747 definition is as follows:
       
   748 
       
   749 \begin{center}
       
   750 \begin{tabular}{lcl}
       
   751   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
       
   752   $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
       
   753      $_{bs @ bs'}\ONE$\\
       
   754   $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
       
   755      $_{bs@bs'}{\bf c}$\\
       
   756   $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
       
   757      $_{bs@bs'}\sum\textit{as}$\\
       
   758   $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
       
   759      $_{bs@bs'}a_1 \cdot a_2$\\
       
   760   $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
       
   761      $_{bs @ bs'}a^*$
       
   762 \end{tabular}    
       
   763 \end{center}  
       
   764 
       
   765 \noindent
       
   766 After internalising the regular expression, we perform successive
       
   767 derivative operations on the annotated regular expressions. This
       
   768 derivative operation is the same as what we had previously for the
       
   769 basic regular expressions, except that we beed to take care of
       
   770 the bitcodes:
       
   771 
       
   772 
       
   773 \iffalse
       
   774  %\begin{definition}{bder}
       
   775 \begin{center}
       
   776   \begin{tabular}{@{}lcl@{}}
       
   777   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   778   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   779   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
   780         $\textit{if}\;c=d\; \;\textit{then}\;
       
   781          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
   782   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
   783   $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
       
   784   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   785      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   786 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
   787 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
   788   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
   789   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
       
   790       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       
   791        (\textit{STAR}\,[]\,r)$
       
   792 \end{tabular}    
       
   793 \end{center}    
       
   794 %\end{definition}
       
   795 
       
   796 \begin{center}
       
   797   \begin{tabular}{@{}lcl@{}}
       
   798   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   799   $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   800   $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
       
   801         $\textit{if}\;c=d\; \;\textit{then}\;
       
   802          _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
       
   803   $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
       
   804   $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
       
   805   $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   806      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   807 					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
       
   808 					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
   809   & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
       
   810   $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
       
   811       $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       
   812        (_{bs}\textit{STAR}\,[]\,r)$
       
   813 \end{tabular}    
       
   814 \end{center}    
       
   815 %\end{definition}
       
   816 \fi
       
   817 
       
   818 \begin{center}
       
   819   \begin{tabular}{@{}lcl@{}}
       
   820   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   821   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   822   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
       
   823         $\textit{if}\;c=d\; \;\textit{then}\;
       
   824          _{bs}\ONE\;\textit{else}\;\ZERO$\\  
       
   825   $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
       
   826   $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
       
   827   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
   828      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   829 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
   830 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
   831   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
       
   832   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
       
   833       $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
       
   834        (_{[]}r^*))$
       
   835 \end{tabular}    
       
   836 \end{center}    
       
   837 
       
   838 %\end{definition}
       
   839 \noindent
       
   840 For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
       
   841 we need to unfold it into a sequence,
       
   842 and attach an additional bit $0$ to the front of $r \backslash c$
       
   843 to indicate one more star iteration. Also the sequence clause
       
   844 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
       
   845 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
       
   846 that it is for annotated regular expressions, therefore we omit the
       
   847 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
       
   848 $a_1$ matches the string prior to character $c$ (more on this later),
       
   849 then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
       
   850 \backslash c)$ will collapse the regular expression $a_1$(as it has
       
   851 already been fully matched) and store the parsing information at the
       
   852 head of the regular expression $a_2 \backslash c$ by fusing to it. The
       
   853 bitsequence $\textit{bs}$, which was initially attached to the
       
   854 first element of the sequence $a_1 \cdot a_2$, has
       
   855 now been elevated to the top-level of $\sum$, as this information will be
       
   856 needed whichever way the sequence is matched---no matter whether $c$ belongs
       
   857 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
       
   858 the lexing information, we complete the lexing by collecting the
       
   859 bitcodes using a generalised version of the $\textit{mkeps}$ function
       
   860 for annotated regular expressions, called $\textit{bmkeps}$:
       
   861 
       
   862 
       
   863 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
   864 \begin{center}
       
   865 \begin{tabular}{lcl}
       
   866   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
       
   867   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
       
   868      $\textit{if}\;\textit{bnullable}\,a$\\
       
   869   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
   870   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
       
   871   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
       
   872      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
   873   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
       
   874      $bs \,@\, [0]$
       
   875 \end{tabular}    
       
   876 \end{center}    
       
   877 %\end{definition}
       
   878 
       
   879 \noindent
       
   880 This function completes the value information by travelling along the
       
   881 path of the regular expression that corresponds to a POSIX value and
       
   882 collecting all the bitcodes, and using $S$ to indicate the end of star
       
   883 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
   884 decode them, we get the value we expect. The corresponding lexing
       
   885 algorithm looks as follows:
       
   886 
       
   887 \begin{center}
       
   888 \begin{tabular}{lcl}
       
   889   $\textit{blexer}\;r\,s$ & $\dn$ &
       
   890       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
   891   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   892   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   893   & & $\;\;\textit{else}\;\textit{None}$
       
   894 \end{tabular}
       
   895 \end{center}
       
   896 
       
   897 \noindent
       
   898 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
   899 operation from characters to strings (just like the derivatives for un-annotated
       
   900 regular expressions).
       
   901 
       
   902 Now we introduce the simplifications, which is why we introduce the 
       
   903 bitcodes in the first place.
       
   904 
       
   905 \subsection*{Simplification Rules}
       
   906 
       
   907 This section introduces aggressive (in terms of size) simplification rules
       
   908 on annotated regular expressions
       
   909 to keep derivatives small. Such simplifications are promising
       
   910 as we have
       
   911 generated test data that show
       
   912 that a good tight bound can be achieved. We could only
       
   913 partially cover the search space as there are infinitely many regular
       
   914 expressions and strings. 
       
   915 
       
   916 One modification we introduced is to allow a list of annotated regular
       
   917 expressions in the $\sum$ constructor. This allows us to not just
       
   918 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
   919 also unnecessary ``copies'' of regular expressions (very similar to
       
   920 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
   921 modification is that we use simplification rules inspired by Antimirov's
       
   922 work on partial derivatives. They maintain the idea that only the first
       
   923 ``copy'' of a regular expression in an alternative contributes to the
       
   924 calculation of a POSIX value. All subsequent copies can be pruned away from
       
   925 the regular expression. A recursive definition of our  simplification function 
       
   926 that looks somewhat similar to our Scala code is given below:
       
   927 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
   928 %Is it $ALTS$ or $ALTS$?}\\
       
   929 
       
   930 \begin{center}
       
   931   \begin{tabular}{@{}lcl@{}}
       
   932    
       
   933   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
   934    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
   935    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
   936    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
   937    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
   938    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
       
   939 
       
   940   $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
       
   941   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
   942    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
       
   943    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
       
   944 
       
   945    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
   946 \end{tabular}    
       
   947 \end{center}    
       
   948 
       
   949 \noindent
       
   950 The simplification does a pattern matching on the regular expression.
       
   951 When it detected that the regular expression is an alternative or
       
   952 sequence, it will try to simplify its child regular expressions
       
   953 recursively and then see if one of the children turns into $\ZERO$ or
       
   954 $\ONE$, which might trigger further simplification at the current level.
       
   955 The most involved part is the $\sum$ clause, where we use two
       
   956 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
       
   957 alternatives and reduce as many duplicates as possible. Function
       
   958 $\textit{distinct}$  keeps the first occurring copy only and removes all later ones
       
   959 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
       
   960 Its recursive definition is given below:
       
   961 
       
   962  \begin{center}
       
   963   \begin{tabular}{@{}lcl@{}}
       
   964   $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
       
   965      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
   966   $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
       
   967     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
       
   968 \end{tabular}    
       
   969 \end{center}  
       
   970 
       
   971 \noindent
       
   972 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
   973 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
   974 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
       
   975 
       
   976 Having defined the $\simp$ function,
       
   977 we can use the previous notation of  natural
       
   978 extension from derivative w.r.t.~character to derivative
       
   979 w.r.t.~string:%\comment{simp in  the [] case?}
       
   980 
       
   981 \begin{center}
       
   982 \begin{tabular}{lcl}
       
   983 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
   984 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
   985 \end{tabular}
       
   986 \end{center}
       
   987 
       
   988 \noindent
       
   989 to obtain an optimised version of the algorithm:
       
   990 
       
   991  \begin{center}
       
   992 \begin{tabular}{lcl}
       
   993   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   994       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   995   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   996   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   997   & & $\;\;\textit{else}\;\textit{None}$
       
   998 \end{tabular}
       
   999 \end{center}
       
  1000 
       
  1001 \noindent
       
  1002 This algorithm keeps the regular expression size small, for example,
       
  1003 with this simplification our previous $(a + aa)^*$ example's 8000 nodes
       
  1004 will be reduced to just 6 and stays constant, no matter how long the
       
  1005 input string is.
       
  1006 
       
  1007 
       
  1008 
       
  1009 
       
  1010 
       
  1011 
       
  1012 
       
  1013 %-----------------------------------
       
  1014 %	SUBSECTION 1
       
  1015 %-----------------------------------
       
  1016 \section{Specifications of Certain Functions to be Used}
       
  1017 Here we give some functions' definitions, 
       
  1018 which we will use later.
       
  1019 \begin{center}
       
  1020 \begin{tabular}{ccc}
       
  1021 $\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
       
  1022 \end{tabular}
       
  1023 \end{center}
       
  1024 
   686 
  1025 
   687 
  1026 
   688 
  1027 	
   689