ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 640 bd1354127574
parent 639 80cc6dc4c98b
child 649 ef2b8abcbc55
equal deleted inserted replaced
639:80cc6dc4c98b 640:bd1354127574
    12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
    12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
    13 derivatives of
    13 derivatives of
    14 regular expressions. 
    14 regular expressions. 
    15 We have implemented their algorithm in Scala and Isabelle, 
    15 We have implemented their algorithm in Scala and Isabelle, 
    16 and found problems 
    16 and found problems 
    17 in their algorithm such as de-duplication not working properly and redundant
    17 in their algorithm, such as de-duplication not working properly and redundant
    18 fixpoint construction. 
    18 fixpoint construction. 
    19 \section{The Motivation Behind Using Bitcodes}
    19 \section{The Motivation Behind Using Bitcodes}
    20 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}:
    20 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}:
    21 \begin{center}
    21 \begin{center}
    22 \begin{tabular}{lcl}
    22 \begin{tabular}{lcl}
   111 
   111 
   112 \end{tikzpicture}
   112 \end{tikzpicture}
   113 \caption{Second derivative taken}
   113 \caption{Second derivative taken}
   114 \end{figure}
   114 \end{figure}
   115 \noindent
   115 \noindent
   116 As the number of derivative steps increase,
   116 As the number of derivative steps increases,
   117 the stack would increase:
   117 the stack would increase:
   118 
   118 
   119 \begin{figure}[H]
   119 \begin{figure}[H]
   120 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
   120 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
   121 %\draw (-6,-6) grid (6,6);
   121 %\draw (-6,-6) grid (6,6);
   325 The bitcodes are then carried with the regular
   325 The bitcodes are then carried with the regular
   326 expression, and augmented or moved around
   326 expression, and augmented or moved around
   327 as the lexing proceeds.
   327 as the lexing proceeds.
   328 It becomes unnecessary
   328 It becomes unnecessary
   329 to remember all the 
   329 to remember all the 
   330 intermediate expresssions, but only the most recent one
   330 intermediate expressions, but only the most recent one
   331 with this bit-carrying regular expression.
   331 with this bit-carrying regular expression.
   332 Annotated regular expressions 
   332 Annotated regular expressions 
   333 are defined as the following 
   333 are defined as the following 
   334 Isabelle datatype:\footnote{ We use subscript notation to indicate
   334 Isabelle datatype:\footnote{ We use subscript notation to indicate
   335 	that the bitcodes are auxiliary information that do not
   335 	that the bitcodes are auxiliary information that does not
   336 interfere with the structure of the regular expressions }
   336 interfere with the structure of the regular expressions }
   337 \begin{center}
   337 \begin{center}
   338 \begin{tabular}{lcl}
   338 \begin{tabular}{lcl}
   339   $\textit{a}$ & $::=$  & $\ZERO$\\
   339   $\textit{a}$ & $::=$  & $\ZERO$\\
   340                   & $\mid$ & $_{bs}\ONE$\\
   340                   & $\mid$ & $_{bs}\ONE$\\
   450 
   450 
   451 
   451 
   452 \end{tikzpicture}
   452 \end{tikzpicture}
   453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
   453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
   454 	for the annotated regular expressions in each derivative step.
   454 	for the annotated regular expressions in each derivative step.
   455 	The characters used in each derivative is written as $c_i$.
   455 	The characters used in each derivative are written as $c_i$.
   456 The size of the derivatives typically increase during each derivative step,
   456 The size of the derivatives typically increases during each derivative step.
   457 therefore we draw the larger circles to indicate that.
       
   458 The process starts with an internalise phase and concludes with a decoding phase.}
   457 The process starts with an internalise phase and concludes with a decoding phase.}
   459 \end{figure}
   458 \end{figure}
   460 \noindent
   459 \noindent
   461 The plain regular expressions
   460 The plain regular expressions
   462 are first ``lifted'' to an annotated regular expression,
   461 are first ``lifted'' to an annotated regular expression,
   538 \end{center}    
   537 \end{center}    
   539 \noindent
   538 \noindent
   540 The opposite of $\textit{internalise}$ is
   539 The opposite of $\textit{internalise}$ is
   541 $\erase$, where all bit-codes are removed,
   540 $\erase$, where all bit-codes are removed,
   542 and the alternative operator $\sum$ for annotated
   541 and the alternative operator $\sum$ for annotated
   543 regular expressions is transformed to the binary version 
   542 regular expressions is transformed into the binary version 
   544 in (plain) regular expressions. This can be defined as follows:
   543 in (plain) regular expressions. This can be defined as follows:
   545 \begin{center}\label{eraseDef}
   544 \begin{center}\label{eraseDef}
   546 	\begin{tabular}{lcl}
   545 	\begin{tabular}{lcl}
   547 		$\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
   546 		$\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
   548 		$( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
   547 		$( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
   646 %because at least one of $\decode'$'s parameters will go down in terms
   645 %because at least one of $\decode'$'s parameters will go down in terms
   647 %of size.\\
   646 %of size.\\
   648 The reverse operation of $\decode$ is $\code$.
   647 The reverse operation of $\decode$ is $\code$.
   649 This function encodes a value into a bitcode by converting
   648 This function encodes a value into a bitcode by converting
   650 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
   649 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
   651 star iteration by $S$. The border where a star iteration
   650 star iteration by $S$. $ Z$ marks the border where a star iteration
   652 terminates is marked by $Z$. 
   651 terminates. 
   653 This coding is lossy, as it throws away the information about
   652 This coding is lossy, as it throws away the information about
   654 characters, and does not encode the ``boundary'' between two
   653 characters, and does not encode the ``boundary'' between two
   655 sequence values. Moreover, with only the bitcode we cannot even tell
   654 sequence values. Moreover, with only the bitcode we cannot even tell
   656 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
   655 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
   657 but this will not be necessary in our correctness proof.
   656 but this will not be necessary for our correctness proof.
   658 \begin{center}
   657 \begin{center}
   659 \begin{tabular}{lcl}
   658 \begin{tabular}{lcl}
   660   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
   659   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
   661   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
   660   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
   662   $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
   661   $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
   676 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \]
   675 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \]
   677 \end{lemma}
   676 \end{lemma}
   678 \begin{proof}
   677 \begin{proof}
   679 By proving a more general version of the lemma, on $\decode'$:
   678 By proving a more general version of the lemma, on $\decode'$:
   680 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
   679 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
   681 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
   680 Then we obtain the property by setting $ds$ to be $[]$ and unfolding the $\decode$ definition.
   682 we obtain the property.
       
   683 \end{proof}
   681 \end{proof}
   684 \noindent
   682 \noindent
   685 Now we define the central part of Sulzmann and Lu's second lexing algorithm,
   683 Now we define the central part of Sulzmann and Lu's second lexing algorithm,
   686 the $\bder$ function (which stands for \emph{b}itcoded-derivative):
   684 the $\bder$ function (which stands for \emph{b}itcoded-derivative):
   687 \begin{center}
   685 \begin{center}
   854 plain regular expression $r$, and then builds successive derivatives
   852 plain regular expression $r$, and then builds successive derivatives
   855 with respect to the input string $s$, and
   853 with respect to the input string $s$, and
   856 then test whether the result is (b)nullable.
   854 then test whether the result is (b)nullable.
   857 If yes, then extract the bitcodes from the nullable expression,
   855 If yes, then extract the bitcodes from the nullable expression,
   858 and decodes the bitcodes into a lexical value.
   856 and decodes the bitcodes into a lexical value.
   859 If there does not exists a match between $r$ and $s$ the lexer
   857 If there does not exist a match between $r$ and $s$, the lexer
   860 outputs $\None$ indicating a mismatch.
   858 outputs $\None$ indicating a mismatch.
   861 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
   859 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
   862 \begin{property}
   860 \begin{property}
   863 $\blexer \;r \; s = \lexer \; r \; s$.
   861 $\blexer \;r \; s = \lexer \; r \; s$.
   864 \end{property}
   862 \end{property}
  1066 We will elaborate on this, with the help of
  1064 We will elaborate on this, with the help of
  1067 some helper functions such as $\retrieve$ and
  1065 some helper functions such as $\retrieve$ and
  1068 $\flex$.
  1066 $\flex$.
  1069 \subsection{Specifications of Some Helper Functions}
  1067 \subsection{Specifications of Some Helper Functions}
  1070 The functions we introduce will give a more detailed view into 
  1068 The functions we introduce will give a more detailed view into 
  1071 the lexing process, which is not be possible
  1069 the lexing process, which is not possible
  1072 using $\lexer$ or $\blexer$ alone.
  1070 using $\lexer$ or $\blexer$ alone.
  1073 \subsubsection{$\textit{Retrieve}$}
  1071 \subsubsection{$\textit{Retrieve}$}
  1074 The first function we shall introduce is $\retrieve$.
  1072 The first function we shall introduce is $\retrieve$.
  1075 Sulzmann and Lu gave its definition, and
  1073 Sulzmann and Lu gave its definition, and
  1076 Ausaf and Urban found it
  1074 Ausaf and Urban found it
  1230 provided that the corresponding values are used respectively:
  1228 provided that the corresponding values are used respectively:
  1231 \begin{lemma}\label{retrieveStepwise}
  1229 \begin{lemma}\label{retrieveStepwise}
  1232 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
  1230 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
  1233 \end{lemma}
  1231 \end{lemma}
  1234 \begin{proof}
  1232 \begin{proof}
  1235 	By induction on $r$, generalising over $v$.
  1233 	We do an induction on $r$, generalising over $v$.
  1236 	The induction principle in our formalisation
  1234 	The induction principle in our formalisation
  1237 	is function $\erase$'s cases.
  1235 	is function $\erase$'s cases.
  1238 \end{proof}
  1236 \end{proof}
  1239 \noindent
  1237 \noindent
  1240 Before we move on to the next
  1238 Before we move on to the next
  1260 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
  1258 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
  1261 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
  1259 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
  1262 \end{tabular}
  1260 \end{tabular}
  1263 \end{center}
  1261 \end{center}
  1264 which accumulates the characters that need to be injected back, 
  1262 which accumulates the characters that need to be injected back, 
  1265 and does the injection in a stack-like manner (the last taken derivative is first injected).
  1263 and does the injection in a stack-like manner (the last character being chopped off in the derivatives phase is first injected).
  1266 The function
  1264 The function
  1267 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
  1265 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
  1268 $r$, the identity function $id$, the input string $s$ and the value
  1266 $r$, the identity function $id$, the input string $s$ and the value
  1269 $v_n= \mkeps \; (r\backslash s)$:
  1267 $v_n= \mkeps \; (r\backslash s)$:
  1270 \begin{lemma}
  1268 \begin{lemma}