thys/Paper/Paper.thy
changeset 110 267afb7fb700
parent 109 2c38f10643ae
child 111 289728193164

Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
character~@{text c}, and showed that it gave a simple solution to the
problem of matching a string @{term s} with a regular expression @{term r}:
if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
the string matches the empty string, then @{term r} matches @{term s}
(and {\em vice versa}). The derivative has the property (which may be
regarded as its specification) that, for every string @{term s} and regular
expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
derivatives is that they are neatly expressible in any functional language,
approach (of which some of the proofs are not published in
\cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive
(and algorithm-independent) definition of what we call being a {\em POSIX
value} for a regular expression @{term r} and a string @{term s}; we show
that the algorithm computes such a value and that such a value is unique.
Proofs are both done by hand and checked in Isabelle/HOL. The
experience of doing our proofs has been that this mechanical checking was
absolutely essential: this subject area has hidden snares. This was also
noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.

If a regular expression matches a string, then in general there is more
than one way in which the string can be matched. There are two commonly used
disambiguation strategies to generate a unique answer: one is called GREEDY
matching \cite{Frisch2004} and the other is POSIX
matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) xy)"}}.
Either the string can be matched in two `iterations' by the single-letter
regular expressions @{term x} and @{term y}, or directly in one
iteration by @{term xy}. The first case corresponds to GREEDY matching,
which first matches with the left-most symbol and only matches the next
symbol in case of a mismatch (this is greedy in the sense of preferring
instant gratification to delayed repletion). The second case is POSIX
matching, which prefers the longest match.
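
\noindent To see the two possibilities concretely, one can simply enumerate
all the ways @{text xy} decomposes into pieces matched by one of the three
alternatives. The following is a small sketch in Haskell (it is not part of
the formalisation; the function @{text splits} and its setup are ours):

\begin{verbatim}
import Data.List (isPrefixOf)

-- Enumerate all ways a string can be chopped into pieces, each piece
-- being one of the given (non-empty) alternatives.
splits :: [String] -> String -> [[String]]
splits _    [] = [[]]
splits alts s  = [ p : rest | p <- alts, p `isPrefixOf` s
                            , rest <- splits alts (drop (length p) s) ]

-- splits ["x","y","xy"] "xy"  ==  [["x","y"],["xy"]]
\end{verbatim}

\noindent The first decomposition (two iterations, @{term x} then @{term y})
is the one preferred by GREEDY matching; the second (a single iteration
matching @{term xy}) is the POSIX answer.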

In the context of lexing, where an input string needs to be separated into a
sequence of tokens, POSIX is the more natural disambiguation strategy for
what programmers consider basic syntactic building blocks in their programs.
These building blocks are often specified by some regular expressions, say @{text
"r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising
keywords and identifiers, respectively. There are two underlying rules
behind tokenising a string in a POSIX fashion:

\begin{itemize} 
\item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}

The longest initial substring matched by any regular expression is taken as
the next token.

\item[$\bullet$] \underline{Priority Rule:}

For a particular longest initial substring, the first regular expression
that can match determines the token.
\end{itemize}

\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising
keywords such as @{text "if"}, @{text "then"} and so on; and @{text
"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed
by characters or numbers). Then we can form the regular expression @{text
"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
first case we obtain by the longest match rule a single identifier token,
not a keyword followed by an identifier. In the second case we obtain by
rule priority a keyword token, not an identifier token, even though @{text
"r\<^bsub>id\<^esub>"} also matches.\bigskip
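
\noindent To make the interplay of the two rules concrete, here is a small
sketch in Haskell (again ours, not part of the formalisation); for brevity
the two token classes are recognised by simple predicates standing in for
@{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"}:

\begin{verbatim}
import Data.Char (isAlpha, isAlphaNum)

data Token = Key String | Id String deriving Show

-- stand-ins for matching against r_key and r_id
isKey, isId :: String -> Bool
isKey s     = s `elem` ["if", "then", "else"]
isId (c:cs) = isAlpha c && all isAlphaNum cs
isId []     = False

-- one lexing step: try the longest prefixes first (Longest Match Rule);
-- if both classes match that prefix, the keyword wins (Priority Rule)
nextToken :: String -> Maybe (Token, String)
nextToken s =
  case [ (p, rest) | n <- [length s, length s - 1 .. 1]
                   , let (p, rest) = splitAt n s
                   , isKey p || isId p ] of
    []            -> Nothing
    ((p, rest):_) -> Just (if isKey p then Key p else Id p, rest)
\end{verbatim}

\noindent With these stand-ins, @{text nextToken} applied to @{text "iffoo"}
returns a single identifier token, while applied to @{text "if"} it returns
a keyword token, matching the two cases discussed above.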

\noindent\textcolor{green}{Not Done Yet}


\medskip\noindent

text {* \noindent Strings in Isabelle/HOL are lists of characters with
  the empty string being represented by the empty list, written @{term
  "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}.
  Often we use the usual bracket notation for strings; for example a
  string consisting of a single character is written @{term "[c]"}.  By
  using the type @{type char} for characters we have a supply of
  finitely many characters roughly corresponding to the ASCII
  character set.  Regular expressions are defined as usual as the
  following inductive datatype:

  \begin{center}
  @{text "r :="}
  @{const "ZERO"} $\mid$
  @{const "ONE"} $\mid$
  @{term "CHAR c"} $\mid$
  @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
  @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
  @{term "STAR r"} 
  \end{center}
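
  \noindent For readers who prefer to experiment outside Isabelle, the same
  six constructors can be transcribed, for instance, into a Haskell datatype
  (a sketch; the Haskell constructor names are ours):

\begin{verbatim}
data Rexp = Zero              -- matches no string
          | One               -- matches only the empty string
          | Chr Char          -- matches a single character
          | Alt Rexp Rexp     -- alternative
          | Seq Rexp Rexp     -- sequence
          | Star Rexp         -- iteration (star)
          deriving Show
\end{verbatim}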

  \noindent where @{const ZERO} stands for the regular expression that
  does not match any string and @{const ONE} for the regular expression
  that matches only the empty string. The language of a regular expression
  is again defined routinely by the recursive function @{term L} with the
  clauses:

  \begin{center}
  \begin{tabular}{rcl}
  @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
  @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
  @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
  @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
  \end{tabular}
  \end{center}

  \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
  concatenation of two languages. We use the star-notation for regular
  expressions and sets of strings (in the last clause). The star on sets is
  defined inductively as usual by two clauses: the empty string is in the
  star of a language; and if @{term "s\<^sub>1"} is in a language and @{term
  "s\<^sub>2"} is in the star of this language, then @{term "s\<^sub>1 @
  s\<^sub>2"} is also in the star of this language.
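
  \noindent Spelled out in conventional notation, these two clauses read
  (writing @{text "@"} for string concatenation):

  \begin{center}
  $[] \in A^\star$ \hspace{12mm}
  $\textrm{if}\ s_1 \in A\ \textrm{and}\ s_2 \in A^\star\ \textrm{then}\ s_1\,@\,s_2 \in A^\star$
  \end{center}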

  \emph{Semantic derivatives} of sets of strings are defined as

  \begin{center}
  \begin{tabular}{lcl}
  @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
  @{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\
  \end{tabular}
  \end{center}

  \noindent where the second definition lifts the notion of semantic
  derivatives from characters to strings.
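
  \noindent Assuming the standard formulation of these definitions (the
  antiquotations above print the actual Isabelle text), they amount to
  $\textit{Der}\;c\;A \dn \{s \mid c\,\#\,s \in A\}$ and
  $\textit{Ders}\;s\;A \dn \{s' \mid s\,@\,s' \in A\}$.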
  @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
  @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
  @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
  @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
  @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
  @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
  \end{tabular}
  \end{center}
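
  \noindent For experimentation, the derivative, its lifting to strings and
  the resulting matcher can be transcribed into Haskell along the following
  lines (a sketch using the standard Brzozowski clauses and repeating the
  datatype from the earlier sketch; @{text nullable} tests whether a regular
  expression can match the empty string):

\begin{verbatim}
data Rexp = Zero | One | Chr Char | Alt Rexp Rexp
          | Seq Rexp Rexp | Star Rexp

-- can the regular expression match the empty string?
nullable :: Rexp -> Bool
nullable Zero        = False
nullable One         = True
nullable (Chr _)     = False
nullable (Alt r1 r2) = nullable r1 || nullable r2
nullable (Seq r1 r2) = nullable r1 && nullable r2
nullable (Star _)    = True

-- Brzozowski derivative w.r.t. a single character
der :: Char -> Rexp -> Rexp
der _ Zero        = Zero
der _ One         = Zero
der c (Chr d)     = if c == d then One else Zero
der c (Alt r1 r2) = Alt (der c r1) (der c r2)
der c (Seq r1 r2)
  | nullable r1   = Alt (Seq (der c r1) r2) (der c r2)
  | otherwise     = Seq (der c r1) r2
der c (Star r)    = Seq (der c r) (Star r)

-- derivative w.r.t. a string, and the resulting matcher
ders :: String -> Rexp -> Rexp
ders []     r = r
ders (c:cs) r = ders cs (der c r)

matches :: String -> Rexp -> Bool
matches s r = nullable (ders s r)
\end{verbatim}

  \noindent The matcher implements exactly the idea from the introduction: a
  string is accepted if the derivative of @{term r} w.r.t.\ all its
  characters can match the empty string.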

  It is a relatively easy exercise to prove that