thys/Paper/Paper.thy
changeset 109 2c38f10643ae
parent 108 73f7dc60c285
child 110 267afb7fb700
equal deleted inserted replaced
108:73f7dc60c285 109:2c38f10643ae
     9  "der_syn r c \<equiv> der c r"
     9  "der_syn r c \<equiv> der c r"
    10 
    10 
    11 notation (latex output)
    11 notation (latex output)
    12    If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
    12    If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
    13   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
    13   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
    14   ZERO ("\<^raw:\textrm{0}>" 78) and 
    14   ZERO ("\<^bold>0" 80) and 
    15   ONE ("\<^raw:\textrm{1}>" 78) and 
    15   ONE ("\<^bold>1" 80) and 
    16   CHAR ("_" [1000] 10) and
    16   CHAR ("_" [1000] 80) and
    17   ALT ("_ + _" [77,77] 78) and
    17   ALT ("_ + _" [77,77] 78) and
    18   SEQ ("_ \<cdot> _" [77,77] 78) and
    18   SEQ ("_ \<cdot> _" [77,77] 78) and
    19   STAR ("_\<^sup>\<star>" [1000] 78) and
    19   STAR ("_\<^sup>\<star>" [1000] 78) and
    20   val.Char ("Char _" [1000] 78) and
    20   val.Char ("Char _" [1000] 78) and
    21   val.Left ("Left _" [1000] 78) and
    21   val.Left ("Left _" [1000] 78) and
    22   val.Right ("Right _" [1000] 78) and
    22   val.Right ("Right _" [1000] 78) and
    23   L ("L'(_')" [10] 78) and
    23   L ("L'(_')" [10] 78) and
    24   der_syn ("_\\_" [1000, 1000] 78) and
    24   der_syn ("_\\_" [79, 1000] 76) and
    25   flat ("|_|" [70] 73) and
    25   flat ("|_|" [70] 73) and
    26   Sequ ("_ @ _" [78,77] 63) and
    26   Sequ ("_ @ _" [78,77] 63) and
    27   injval ("inj _ _ _" [1000,77,1000] 77) and 
    27   injval ("inj _ _ _" [1000,77,1000] 77) and 
    28   projval ("proj _ _ _" [1000,77,1000] 77) and 
    28   projval ("proj _ _ _" [1000,77,1000] 77) and 
    29   length ("len _" [78] 73) 
    29   length ("len _" [78] 73) 
    42 the string matches the empty string $\mts$, then @{term r} matches @{term s}
    42 the string matches the empty string $\mts$, then @{term r} matches @{term s}
    43 (and {\em vice versa}). The derivative has the property (which may be
    43 (and {\em vice versa}). The derivative has the property (which may be
    44 regarded as its specification) that, for every string @{term s} and regular
    44 regarded as its specification) that, for every string @{term s} and regular
    45 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
    45 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
    46 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    46 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    47 derivatives is that they are neatly expressible in any functional language, and
    47 derivatives is that they are neatly expressible in any functional language,
    48 very easy to be defined and reasoned about in a theorem prover---the
    48 and easy to define and reason about in theorem provers---the definitions
    49 definitions consist just of inductive datatypes and recursive functions. A
    49 just consist of inductive datatypes and simple recursive functions. A
    50 completely formalised proof of this matcher has for example been given in
    50 completely formalised correctness proof of this matcher has, for example,
    51 \cite{Owens2008}.
    51 been given in HOL4~\cite{Owens2008}.
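
For concreteness, here is a minimal Isabelle/HOL sketch of such a matcher. It uses the constructor names from the notation section above; the definitions are the standard ones and may differ in detail from those used later in this paper.

  datatype rexp = ZERO | ONE | CHAR char
                | SEQ rexp rexp | ALT rexp rexp | STAR rexp

  fun nullable :: "rexp \<Rightarrow> bool" where
    "nullable ZERO = False"
  | "nullable ONE = True"
  | "nullable (CHAR c) = False"
  | "nullable (ALT r1 r2) = (nullable r1 \<or> nullable r2)"
  | "nullable (SEQ r1 r2) = (nullable r1 \<and> nullable r2)"
  | "nullable (STAR r) = True"

  fun der :: "char \<Rightarrow> rexp \<Rightarrow> rexp" where
    "der c ZERO = ZERO"
  | "der c ONE = ZERO"
  | "der c (CHAR d) = (if c = d then ONE else ZERO)"
  | "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
  | "der c (SEQ r1 r2) =
       (if nullable r1 then ALT (SEQ (der c r1) r2) (der c r2)
        else SEQ (der c r1) r2)"
  | "der c (STAR r) = SEQ (der c r) (STAR r)"

  fun ders :: "string \<Rightarrow> rexp \<Rightarrow> rexp" where
    "ders [] r = r"
  | "ders (c # s) r = ders s (der c r)"

  (* the matcher: a string s is accepted by r iff the iterated
     derivative of r by s can match the empty string *)
  definition matches :: "rexp \<Rightarrow> string \<Rightarrow> bool" where
    "matches r s \<equiv> nullable (ders s r)"
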
    52 
    52 
    53 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    53 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    54 answer for a string being matched by a regular expression. Sulzmann and Lu
    54 answer for whether a string is matched by a regular expression.
    55 \cite{Sulzmann2014} extended this matcher to allow generation not just of a
    55 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    56 YES/NO answer but of an actual matching, called a [lexical] {\em value}.
    56 generation not just of a YES/NO answer but of an actual matching, called a
    57 They give a still very simple algorithm to calculate a value that appears to
    57 [lexical] {\em value}. They give a simple algorithm to calculate a value
    58 be the value associated with POSIX lexing (posix) %%\cite{posix}. The
    58 that appears to be the value associated with POSIX lexing
    59 challenge then is to specify that value, in an algorithm-independent
    59 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
    60 fashion, and to show that Sulzmann and Lu's derivative-based algorithm does
    60 value, in an algorithm-independent fashion, and to show that Sulzmann and
    61 indeed calculate a value that is correct according to the specification.
    61 Lu's derivative-based algorithm does indeed calculate a value that is
    62 
    62 correct according to the specification.
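
A sketch of how this value-producing extension can be rendered in Isabelle/HOL is given below. It anticipates the value constructors used later in this paper (val.Char, val.Left and val.Right appear in the notation section above; Void, Seq and Stars are assumed here), and the exact definitions may differ from the ones given in later sections: mkeps chooses a value for a nullable regular expression, and injval (written inj in the paper) reverses one derivative step.

  datatype val = Void | Char char | Seq val val
               | Left val | Right val | Stars "val list"

  (* a value witnessing that a nullable r matches the empty string *)
  fun mkeps :: "rexp \<Rightarrow> val" where
    "mkeps ONE = Void"
  | "mkeps (SEQ r1 r2) = Seq (mkeps r1) (mkeps r2)"
  | "mkeps (ALT r1 r2) =
       (if nullable r1 then Left (mkeps r1) else Right (mkeps r2))"
  | "mkeps (STAR r) = Stars []"

  (* injval transforms a value for (der c r) into a value for r *)
  fun injval :: "rexp \<Rightarrow> char \<Rightarrow> val \<Rightarrow> val" where
    "injval (CHAR d) c Void = Char d"
  | "injval (ALT r1 r2) c (Left v1) = Left (injval r1 c v1)"
  | "injval (ALT r1 r2) c (Right v2) = Right (injval r2 c v2)"
  | "injval (SEQ r1 r2) c (Seq v1 v2) = Seq (injval r1 c v1) v2"
  | "injval (SEQ r1 r2) c (Left (Seq v1 v2)) = Seq (injval r1 c v1) v2"
  | "injval (SEQ r1 r2) c (Right v2) = Seq (mkeps r1) (injval r2 c v2)"
  | "injval (STAR r) c (Seq v (Stars vs)) = Stars (injval r c v # vs)"

  (* the lexer: build derivatives on the way down the string,
     inject each character back into the value on the way up *)
  fun lexer :: "rexp \<Rightarrow> string \<Rightarrow> val option" where
    "lexer r [] = (if nullable r then Some (mkeps r) else None)"
  | "lexer r (c # s) = (case lexer (der c r) s of
         None \<Rightarrow> None
       | Some v \<Rightarrow> Some (injval r c v))"
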
    63 Inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY
    63 
    64 regular expression matching algorithm, the answer given in
    64 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
    65 \cite{Sulzmann2014} is to define a relation (called an ``Order Relation'')
    65 relation (called an ``Order Relation'') on the set of values of @{term r},
    66 on the set of values of @{term r}, and to show that (once a string to be
    66 and to show that (once a string to be matched is chosen) there is a maximum
    67 matched is chosen) there is a maximum element and that it is computed by the
    67 element and that it is computed by their derivative-based algorithm. This
    68 derivative-based algorithm. Beginning with our observations that, without
    68 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
    69 evidence that it is transitive, it cannot be called an ``order relation'',
    69 GREEDY regular expression matching algorithm. Beginning with our
    70 and that the relation is called a ``total order'' despite being evidently
    70 observations that, without evidence that it is transitive, it cannot be
    71 not total\footnote{\textcolor{green}{Why is it not total?}}, we identify
    71 called an ``order relation'', and that the relation is called a ``total
    72 problems with this approach (of which some of the proofs are not published
    72 order'' despite being evidently not total\footnote{\textcolor{green}{We
    73 in \cite{Sulzmann2014}); perhaps more importantly, we give a simple
    73 should give an argument as footnote}}, we identify problems with this
    74 inductive (and algorithm-independent) definition of what we call being a
    74 approach (of which some of the proofs are not published in
    75 {\em POSIX value} for a regular expression @{term r} and a string @{term s};
    75 \cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive
    76 we show that the algorithm computes such a value and that such a value is
    76 (and algorithm-independent) definition of what we call being a {\em POSIX
    77 unique. Proofs are both done by hand and checked in {\em Isabelle}
    77 value} for a regular expression @{term r} and a string @{term s}; we show
    78 %\cite{isabelle}. Our experience of doing our proofs has been that this
    78 that the algorithm computes such a value and that such a value is unique.
    79 mechanical checking was absolutely essential: this subject area has hidden
    79 Proofs are both done by hand and checked in {\em Isabelle/HOL}. The
    80 snares. This was also noted by Kuklewicz \cite{Kuklewicz} who found that
    80 experience of doing our proofs has been that this mechanical checking was
    81 nearly all POSIX matching implementations are ``buggy'' \cite[Page
    81 absolutely essential: this subject area has hidden snares. This was also
    82 203]{Sulzmann2014}.
    82 noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
    83 
    83 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
    84 \textcolor{green}{Say something about POSIX lexing}
    84 
       
    85 If a regular expression matches a string, then in general there are more
       
    86 than one possibility of how the string is matched. There are two commonly
       
    87 used disambiguation strategies to generate a unique answer: one is called
       
    88 GREEDY matching \cite{Frisch2004} and the other is POSIX
       
    89 matching~\cite{Kuklewicz,Sulzmann2014}. For example, consider the string
       
    90 @{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}.
       
    91 Either the string can be matched in two `iterations' by the
       
    92 single-letter regular expressions @{term x} and @{term y}, or directly in one
       
    93 iteration by @{term xy}. The first case corresponds to GREEDY matching,
       
    94 which first matches with the left-most symbol and only matches the next
       
    95 symbol in case of a mismatch. The second case is POSIX matching, which
       
    96 prefers the longest match. 
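
In terms of values (anticipating the constructors from later sections, with Stars assumed for the iterations of a star, and writing Char x for the value of the single-letter regular expression @{term x}), the two matchings correspond to:

  GREEDY:  Stars [Left (Left (Char x)), Left (Right (Char y))]   (two iterations)
  POSIX:   Stars [Right (Seq (Char x) (Char y))]                 (one iteration)

where the nesting of Left and Right follows the bracketing of @{term "STAR (ALT (ALT x y) xy)"}.
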
       
    97 
       
    98 In the context of lexing, where an input string is separated into a sequence
       
    99 of tokens, POSIX is the more natural disambiguation strategy for what
       
   100 programmers consider basic syntactic building blocks in their programs.
       
   101 There are two underlying rules behind tokenising using POSIX matching:
       
   102 
       
   103 \begin{itemize} 
       
   104 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
       
   105 
       
   106 The longest initial substring matched by any regular expression is taken as
       
   107 the next token.\smallskip
       
   108 
       
   109 \item[$\bullet$] \underline{Rule Priority:}
       
   110 
       
   111 For a particular longest initial substring, the first regular expression
       
   112 that can match determines the token.
       
   113 \end{itemize}
       
   114  
       
   115 \noindent Consider for example a regular expression @{text
       
   116 "r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"},
       
   117 @{text "then"} and so on; and another regular expression @{text
       
   118 "r\<^bsub>id\<^esub>"} that can recognise identifiers (such as a single
       
   119 character followed by characters or numbers). Then we can form the regular
       
   120 expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and
       
   121 use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and
       
   122 @{text "if"}. In the first case we obtain by the longest match rule a
       
   123 single identifier token, not a keyword and identifier. In the second case we
       
   124 obtain by rule priority a keyword token, not an identifier token.
       
   125 \bigskip
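
In terms of values this can be made concrete (again with the assumed Stars constructor): for the input @{text "if"} rule priority yields a value of the form Stars [Left v], that is a keyword token, whereas for @{text "iffoo"} the longest match rule yields a value of the form Stars [Right v'], that is a single identifier token.
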
       
   126 
       
   127 \noindent\textcolor{green}{Not Done Yet}
       
   128 
       
   129 
       
   130 \medskip\noindent
       
   131 {\bf Contributions:}
       
   132 
       
   133 Derivatives as calculated by Brzozowski's method are usually more complex
       
   134 regular expressions than the initial one; various optimisations are
       
   135 possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
       
   136 ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
       
   137 advantages of having a simple specification and correctness proof is that
       
   138 the latter can be refined to accommodate such optimisations while the
       
   139 correctness proof remains simple.
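
Such a simplification phase can be sketched as follows; the name simp and its exact shape are illustrative and not necessarily the refined algorithm of this paper.

  (* apply the four simplifications bottom-up: ZERO is removed
     from alternatives, ONE is removed from sequences *)
  fun simp :: "rexp \<Rightarrow> rexp" where
    "simp (ALT r1 r2) =
       (let r1' = simp r1; r2' = simp r2 in
        if r1' = ZERO then r2'
        else if r2' = ZERO then r1'
        else ALT r1' r2')"
  | "simp (SEQ r1 r2) =
       (let r1' = simp r1; r2' = simp r2 in
        if r1' = ONE then r2'
        else if r2' = ONE then r1'
        else SEQ r1' r2')"
  | "simp r = r"

In a value-producing lexer such rewrites would also have to be reflected on the value side; this is the kind of refinement the correctness proof has to support.
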
    85 
   140 
    86 An extended version of \cite{Sulzmann2014} is available at the website
   141 An extended version of \cite{Sulzmann2014} is available at the website
    87 of its first author; this includes some ``proofs'', claimed in
   142 of its first author; this includes some ``proofs'', claimed in
    88 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
   143 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
    89 final form, we make no comment thereon, preferring to give general
   144 final form, we make no comment thereon, preferring to give general
    90 reasons for our belief that the approach of \cite{Sulzmann2014} is
   145 reasons for our belief that the approach of \cite{Sulzmann2014} is
    91 problematic rather than to discuss details of unpublished work.
   146 problematic rather than to discuss details of unpublished work.
    92 
   147 
    93 Derivatives as calculated by Brzozowski's method are usually more
       
    94 complex regular expressions than the initial one; various optimisations
       
    95 are possible, such as the simplifications of @{term "ALT ZERO r"},
       
    96 @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
       
    97 @{term r}. One of the advantages of having a simple specification and
       
    98 correctness proof is that the latter can be refined to accommodate such
       
    99 optimisations while the correctness proof remains simple.
       
   100 
       
   101 
       
   102 Sulzmann and Lu \cite{Sulzmann2014}
       
   103 \cite{Brzozowski1964}
       
   104 
       
   105 there are two commonly used
       
   106 disambiguation strategies to create a unique matching tree:
       
   107 one is called \emph{greedy} matching \cite{Frisch2004} and the
       
   108 other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}.
       
   109 For the latter there are two rough rules:  
       
   110 
       
   111 \begin{itemize}
       
   112 \item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The
       
   113       longest initial substring matched by any regular
       
   114       expression is taken as the next token.
       
   115 
       
   116 \item Rule Priority:\smallskip\\ For a particular longest initial
       
   117       substring, the first regular expression that can match
       
   118       determines the token.
       
   119 \end{itemize}
       
   120 
       
   121 \noindent In the context of lexing, POSIX is the more
       
   122 interesting disambiguation strategy as it produces longest
       
   123 matches, which is necessary for tokenising programs. For
       
   124 example the string \textit{iffoo} should not be matched as the keyword
       
   125 \textit{if} followed by a rest, but as the single string \textit{iffoo},
       
   126 which might be a variable name in a program. As another
       
   127 example consider the string $xy$ and the regular expression
       
   128 \mbox{$(x + y + xy)^*$}. Either the input string can be
       
   129 matched in two `iterations' by the single-letter regular
       
   130 expressions $x$ and $y$, or directly in one iteration by $xy$.
       
   131 The first case corresponds to greedy matching, which first
       
   132 matches with the left-most symbol and only matches the next
       
   133 symbol in case of a mismatch. The second case is POSIX
       
   134 matching, which prefers the longest match. In case more than
       
   135 one longest match exists, only then does it prefer the
       
   136 left-most match. While POSIX matching seems natural, it turns
       
   137 out to be much more subtle than greedy matching in terms of
       
   138 implementations and in terms of proving properties about it.
       
   139 If POSIX matching is implemented using automata, then one has
       
   140 to follow transitions (according to the input string) until
       
   141 one finds an accepting state, record this state and look for
       
   142 further transitions which might lead to another accepting state
       
   143 that represents a longer input initial substring to be
       
   144 matched. Only if none can be found is the last accepting
       
   145 state returned.
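
The same longest-match search can be expressed with derivatives instead of automata. The following Isabelle/HOL sketch (the names lmatch and longest_prefix are hypothetical) returns the longest initial substring of the input matched by the regular expression, falling back to the most recent `accepting' point only when no longer match exists:

  (* longest-match search: try to extend the match first; only if
     the extension fails, fall back to the current prefix acc,
     provided the regular expression can match it (nullable) *)
  fun lmatch :: "rexp \<Rightarrow> string \<Rightarrow> string \<Rightarrow> string option" where
    "lmatch r acc [] = (if nullable r then Some acc else None)"
  | "lmatch r acc (c # s) =
       (case lmatch (der c r) (acc @ [c]) s of
          Some p \<Rightarrow> Some p
        | None \<Rightarrow> (if nullable r then Some acc else None))"

  (* the longest prefix of s matched by r, if any *)
  definition longest_prefix :: "rexp \<Rightarrow> string \<Rightarrow> string option" where
    "longest_prefix r s \<equiv> lmatch r [] s"
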
       
   146 
       
   147 
       
   148 Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX
       
   149 regular expression matching. They write that it is known to be
       
   150 a non-trivial problem and that nearly all POSIX matching
       
   151 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
       
   152 For this they cite a study by Kuklewicz \cite{Kuklewicz}. My
       
   153 current work is about formalising the proofs in the paper by
       
   154 Sulzmann and Lu. Specifically, they propose in this paper a
       
   155 POSIX matching algorithm and give some details of a
       
   156 correctness proof for this algorithm inside the paper and some
       
   157 more details in an appendix. This correctness proof is
       
   158 unformalised, meaning it is just a ``pencil-and-paper'' proof,
       
   159 not done in a theorem prover, though the paper and presumably
       
   160 the proof have been peer-reviewed. Unfortunately their proof
       
   161 does not give enough details for it to be
       
   162 straightforwardly implemented in a theorem prover, say
       
   163 Isabelle. In fact, the purported proof they outline does not
       
   164 work in central places. We discovered this when filling in
       
   165 many gaps and attempting to formalise the proof in Isabelle. 
       
   166 
       
   167 
       
   168 \medskip\noindent
       
   169 {\bf Contributions:}
       
   170 
   148 
   171 *}
   149 *}
   172 
   150 
   173 section {* Preliminaries *}
   151 section {* Preliminaries *}
   174 
   152