thys2/Paper/Paper.thy
changeset 420 b66a4305749c
parent 418 41a2a3b63853
child 423 b7199d6c672d
equal deleted inserted replaced
419:6de6bc551a8b 420:b66a4305749c
    21   "bder_syn r c \<equiv> bder c r"  
    21   "bder_syn r c \<equiv> bder c r"  
    22 
    22 
    23 notation (latex output)
    23 notation (latex output)
    24   der_syn ("_\\_" [79, 1000] 76) and
    24   der_syn ("_\\_" [79, 1000] 76) and
    25   bder_syn ("_\\_" [79, 1000] 76) and
    25   bder_syn ("_\\_" [79, 1000] 76) and
       
    26   bders ("_\\_" [79, 1000] 76) and
       
    27   bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
    26 
    28 
    27   ZERO ("\<^bold>0" 81) and 
    29   ZERO ("\<^bold>0" 81) and 
    28   ONE ("\<^bold>1" 81) and 
    30   ONE ("\<^bold>1" 81) and 
    29   CH ("_" [1000] 80) and
    31   CH ("_" [1000] 80) and
    30   ALT ("_ + _" [77,77] 78) and
    32   ALT ("_ + _" [77,77] 78) and
    36   val.Left ("Left _" [79] 78) and
    38   val.Left ("Left _" [79] 78) and
    37   val.Right ("Right _" [1000] 78) and
    39   val.Right ("Right _" [1000] 78) and
    38   val.Seq ("Seq _ _" [79,79] 78) and
    40   val.Seq ("Seq _ _" [79,79] 78) and
    39   val.Stars ("Stars _" [79] 78) and
    41   val.Stars ("Stars _" [79] 78) and
    40 
    42 
       
    43   Prf ("\<turnstile> _ : _" [75,75] 75) and  
    41   Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
    44   Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
    42 
    45 
    43   flat ("|_|" [75] 74) and
    46   flat ("|_|" [75] 74) and
    44   flats ("|_|" [72] 74) and
    47   flats ("|_|" [72] 74) and
    45   injval ("inj _ _ _" [79,77,79] 76) and 
    48   injval ("inj _ _ _" [79,77,79] 76) and 
    81 expressions have sparked quite a bit of interest in the functional
    84 expressions have sparked quite a bit of interest in the functional
    82 programming and theorem prover communities.  The beauty of
    85 programming and theorem prover communities.  The beauty of
    83 Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
    86 Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
    84 expressible in any functional language, and easily definable and
    87 expressible in any functional language, and easily definable and
    85 reasoned about in theorem provers---the definitions just consist of
    88 reasoned about in theorem provers---the definitions just consist of
    86 inductive datatypes and simple recursive functions. A mechanised
    89 inductive datatypes and simple recursive functions.  Derivatives of a
    87 correctness proof of Brzozowski's matcher in for example HOL4 has been
    90 regular expression, written @{term "der c r"}, give a simple solution
    88 mentioned by Owens and Slind~\cite{Owens2008}. Another one in
    91 to the problem of matching a string @{term s} with a regular
    89 Isabelle/HOL is part of the work by Krauss and Nipkow
    92 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
    90 \cite{Krauss2011}.  And another one in Coq is given by Coquand and
    93 succession) all the characters of the string matches the empty string,
    91 Siles \cite{Coquand2012}.
    94 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
    92 
    95 of a mechanised correctness proof of Brzozowski's matcher in HOL4 by
    93 
    96 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
    94 The notion of derivatives
    97 of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
    95 \cite{Brzozowski1964}, written @{term "der c r"}, of a regular
    98 in Coq is given by Coquand and Siles \cite{Coquand2012}.
    96 expression give a simple solution to the problem of matching a string
    99 
    97 @{term s} with a regular expression @{term r}: if the derivative of
   100 There are two difficulties with derivative-based matchers and also
    98 @{term r} w.r.t.\ (in succession) all the characters of the string
   101 lexers: First, Brzozowski's original matcher only generates a yes/no
    99 matches the empty string, then @{term r} matches @{term s} (and {\em
   102 answer for whether a regular expression matches a string or not.
   100 vice versa}). The derivative has the property (which may almost be
   103 Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by
   101 regarded as its specification) that, for every string @{term s} and
   104 cleverly extending Brzozowski's matching algorithm to POSIX
   102 regular expression @{term r} and character @{term c}, one has @{term
   105 lexing. This extended version generates additional information on
   103 "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
   106 \emph{how} a regular expression matches a string. They achieve this by
       
   107 
       
   108 
       
   109 
       
   110 
       
   111 The second problem is that Brzozowski's derivatives can 
       
   112 grow to arbitrarily big sizes. For example if we start with the
       
   113 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
       
   114 successive derivatives according to the character $a$, we end up with
       
   115 a sequence of ever-growing derivatives like 
       
   116 
       
   117 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
       
   118 \begin{center}
       
   119 \begin{tabular}{rll}
       
   120 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   121 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   122 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
       
   123 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   124 & $\ll$ & \ldots
       
   125 \end{tabular}
       
   126 \end{center}
       
   127  
       
   128 \noindent where after around 35 steps we run out of memory on a
       
   129 typical computer (we define the precise details of the derivative
       
   130 operation later).  Clearly, the notation involving $\ZERO$s and
       
   131 $\ONE$s already suggests simplification rules that can be applied to
       
   132 regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$,
       
   133 $\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r
       
   134 \Rightarrow r$. While such simple-minded reductions have been proved
       
   135 in our earlier work to preserve the correctness of Sulzmann and Lu's
       
   136 algorithm, they unfortunately do \emph{not} help with limiting the
       
   137 gowth of the derivatives shown above: yes, the growth is slowed, but the
       
   138 derivatives can still grow beyond any finite bound.
       
   139 
       
   140 
       
   141 
       
   142 Sulzmann and Lu introduce a
       
   143 bitcoded version of their lexing algorithm. They make some claims
       
   144 about the correctness and speed of this version, but do
       
   145 not provide any supporting proof arguments, not even
       
   146 ``pencil-and-paper'' arguments. They wrote about their bitcoded
       
   147 ``incremental parsing method'' (that is the algorithm to be studied in this
       
   148 section):
       
   149 
       
   150 \begin{quote}\it
       
   151   ``Correctness Claim: We further claim that the incremental parsing
       
   152   method [..] in combination with the simplification steps [..]
       
   153   yields POSIX parse trees. We have tested this claim
       
   154   extensively [..] but yet
       
   155   have to work out all proof details.''
       
   156 \end{quote}  
       
   157 
       
   158 
       
   159 
   104 
   160 
   105 
   161 
   106 If a regular expression matches a string, then in general there is more
   162 If a regular expression matches a string, then in general there is more
   107 than one way of how the string is matched. There are two commonly used
   163 than one way of how the string is matched. There are two commonly used
   108 disambiguation strategies to generate a unique answer: one is called
   164 disambiguation strategies to generate a unique answer: one is called
   115 first case corresponds to GREEDY matching, which first matches with the
   171 first case corresponds to GREEDY matching, which first matches with the
   116 left-most symbol and only matches the next symbol in case of a mismatch
   172 left-most symbol and only matches the next symbol in case of a mismatch
   117 (this is greedy in the sense of preferring instant gratification to
   173 (this is greedy in the sense of preferring instant gratification to
   118 delayed repletion). The second case is POSIX matching, which prefers the
   174 delayed repletion). The second case is POSIX matching, which prefers the
   119 longest match.
   175 longest match.
       
   176 
       
   177 
       
   178 The derivative has the property (which may almost be
       
   179 regarded as its specification) that, for every string @{term s} and
       
   180 regular expression @{term r} and character @{term c}, one has @{term
       
   181 "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
       
   182 
       
   183 
   120 
   184 
   121 
   185 
   122 \begin{center}
   186 \begin{center}
   123 \begin{tabular}{cc}
   187 \begin{tabular}{cc}
   124   \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   188   \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   186 *}
   250 *}
   187 
   251 
   188 section {* Background *}
   252 section {* Background *}
   189 
   253 
   190 text {*
   254 text {*
       
   255   In our Isabelle/HOL formalisation strings are lists of characters with
       
   256   the empty string being represented by the empty list, written $[]$,
       
   257   and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string
       
   258   concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual
       
   259   bracket notation for lists also for strings; for example a string
       
   260   consisting of just a single character $c$ is written $[c]$.   
       
   261   Our egular expressions are defined as usual as the elements of the following inductive
       
   262   datatype:
       
   263 
       
   264   \begin{center}
       
   265   @{text "r :="}
       
   266   @{const "ZERO"} $\mid$
       
   267   @{const "ONE"} $\mid$
       
   268   @{term "CH c"} $\mid$
       
   269   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
       
   270   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
       
   271   @{term "STAR r"} 
       
   272   \end{center}
       
   273 
       
   274   \noindent where @{const ZERO} stands for the regular expression that does
       
   275   not match any string, @{const ONE} for the regular expression that matches
       
   276   only the empty string and @{term c} for matching a character literal. The
       
   277   language of a regular expression, written $L$, is defined as usual
       
   278   (see for example \cite{AusafDyckhoffUrban2016}).
       
   279 
       
   280   Central to Brzozowski's regular expression matcher are two functions
       
   281   called $\nullable$ and \emph{derivative}. The latter is written
       
   282   $r\backslash c$ for the derivative of the regular expression $r$
       
   283   w.r.t.~the character $c$. Both functions are defined by recursion over
       
   284   regular expressions.  
       
   285 
       
   286 
       
   287 	\begin{center}
       
   288 		\begin{tabular}{lcl}
       
   289 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   290 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   291 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   292 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   293 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   294 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   295 		\end{tabular}
       
   296 	\end{center}
       
   297 
       
   298   \noindent
       
   299   The derivative function takes a regular expression, say $r$ and a
       
   300   character, say $c$, as input and returns the derivative regular
       
   301   expression.
       
   302 
       
   303 	\begin{center}
       
   304 		\begin{tabular}{lcl}
       
   305 			$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   306 			$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   307 			$d \backslash c$     & $\dn$ & 
       
   308 			$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   309 			$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   310 			$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
       
   311 			&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   312 			&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   313 			$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   314 		\end{tabular}
       
   315 	\end{center}
       
   316 
       
   317 
       
   318   Sulzmann and Lu presented two lexing algorithms in their paper from 2014
       
   319   \cite{Sulzmann2014}. This first algorithm consists of two phases: first a
       
   320   matching phase (which is Brzozowski's algorithm) and then a value
       
   321   construction phase. The values encode \emph{how} a regular expression
       
   322   matches a string. \emph{Values} are defined as the inductive datatype
       
   323 
       
   324   \begin{center}
       
   325   @{text "v :="}
       
   326   @{const "Void"} $\mid$
       
   327   @{term "val.Char c"} $\mid$
       
   328   @{term "Left v"} $\mid$
       
   329   @{term "Right v"} $\mid$
       
   330   @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
       
   331   @{term "Stars vs"} 
       
   332   \end{center}  
       
   333 
       
   334   \noindent where we use @{term vs} to stand for a list of
       
   335   values.
       
   336 
       
   337 
       
   338 
       
   339 
       
   340   \noindent Sulzmann and Lu also define inductively an inhabitation relation
       
   341   that associates values to regular expressions:
       
   342 
       
   343   \begin{center}
       
   344   \begin{tabular}{c}
       
   345   \\[-8mm]
       
   346   @{thm[mode=Axiom] Prf.intros(4)} \qquad
       
   347   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
       
   348   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
       
   349   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
       
   350   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm]
       
   351   %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
       
   352   @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]}
       
   353   \end{tabular}
       
   354   \end{center}
       
   355 
       
   356   \noindent Note that no values are associated with the regular expression
       
   357   @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular
       
   358   expression correspond to the language of a regular expression, namely
       
   359 
       
   360   \begin{proposition}
       
   361   @{thm L_flat_Prf}
       
   362   \end{proposition}
       
   363 
   191   Sulzmann-Lu algorithm with inj. State that POSIX rules.
   364   Sulzmann-Lu algorithm with inj. State that POSIX rules.
   192   metion slg is correct.
   365   metion slg is correct.
   193 
   366 
   194 
   367 
   195   \begin{figure}[t]
   368   \begin{figure}[t]
   638 
   811 
   639 section {* Simplification *}
   812 section {* Simplification *}
   640 
   813 
   641 text {*
   814 text {*
   642 
   815 
       
   816      Derivatives as calculated by Brzozowski’s method are usually more
       
   817      complex regular expressions than the initial one; the result is
       
   818      that the derivative-based matching and lexing algorithms are
       
   819      often abysmally slow.
       
   820 
       
   821      However, as Sulzmann and
       
   822      Lu wrote, various optimisations are possible, such as the
       
   823      simplifications of 0 + r,r + 0,1 · r and r · 1 to r. These
       
   824      simplifications can speed up the algorithms considerably.
       
   825 
       
   826 
   643      \begin{lemma}
   827      \begin{lemma}
   644      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
   828      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
   645      \end{lemma}
   829      \end{lemma}
   646 
   830 
   647 
   831