thys/Paper/Paper.thy
changeset 107 6adda4a667b1
parent 105 80218dddbb15
child 108 73f7dc60c285
equal deleted inserted replaced
106:489dfa0d7ec9 107:6adda4a667b1
     5 
     5 
     6 declare [[show_question_marks = false]]
     6 declare [[show_question_marks = false]]
     7 
     7 
     8 notation (latex output)
     8 notation (latex output)
     9    If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
     9    If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
    10   Cons ("_::_" [78,77] 73) and
    10   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
       
    11   ZERO ("\<^raw:\textrm{0}>" 78) and 
       
    12   ONE ("\<^raw:\textrm{1}>" 78) and 
       
    13   CHAR ("_" [1000] 10) and
       
    14   ALT ("_ + _" [1000,1000] 78) and
       
    15   SEQ ("_ \<cdot> _" [1000,1000] 78) and
       
    16   STAR ("_\<^sup>\<star>" [1000] 78) and
    11   val.Char ("Char _" [1000] 78) and
    17   val.Char ("Char _" [1000] 78) and
    12   val.Left ("Left _" [1000] 78) and
    18   val.Left ("Left _" [1000] 78) and
    13   val.Right ("Right _" [1000] 78) and
    19   val.Right ("Right _" [1000] 78) and
    14   L ("L _" [1000] 0) and
    20   L ("L _" [1000] 0) and
    15   flat ("|_|" [70] 73) and
    21   flat ("|_|" [70] 73) and
    21 (*>*)
    27 (*>*)
    22 
    28 
    23 section {* Introduction *}
    29 section {* Introduction *}
    24 
    30 
    25 text {*
    31 text {*
    26 
    32   
    27   \noindent
    33 
       
    34 Sulzmann and Lu \cite{Sulzmann2014}
       
    35 
       
    36 there are two commonly used
       
    37 disambiguation strategies to create a unique matching tree:
       
    38 one is called \emph{greedy} matching \cite{Frisch2004} and the
       
    39 other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}.
       
    40 For the latter there are two rough rules:  
       
    41 
       
    42 \begin{itemize}
       
    43 \item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The
       
    44       longest initial substring matched by any regular
       
    45       expression is taken as next token.
       
    46 
       
    47 \item Rule Priority:\smallskip\\ For a particular longest initial
       
    48       substring, the first regular expression that can match
       
    49       determines the token.
       
    50 \end{itemize}
       
    51 
       
    52 \noindent In the context of lexing, POSIX is the more
       
    53 interesting disambiguation strategy as it produces longest
       
    54 matches, which is necessary for tokenising programs. For
       
    55 example the string \textit{iffoo} should not match the keyword
       
    56 \textit{if} and the rest, but as one string \textit{iffoo},
       
    57 which might be a variable name in a program. As another
       
    58 example consider the string $xy$ and the regular expression
       
    59 \mbox{$(x + y + xy)^*$}. Either the input string can be
       
    60 matched in two `iterations' by the single letter-regular
       
    61 expressions $x$ and $y$, or directly in one iteration by $xy$.
       
    62 The first case corresponds to greedy matching, which first
       
    63 matches with the left-most symbol and only matches the next
       
    64 symbol in case of a mismatch. The second case is POSIX
       
    65 matching, which prefers the longest match. In case more than
       
    66 one (longest) matches exist, only then it prefers the
       
    67 left-most match. While POSIX matching seems natural, it turns
       
    68 out to be much more subtle than greedy matching in terms of
       
    69 implementations and in terms of proving properties about it.
       
    70 If POSIX matching is implemented using automata, then one has
       
    71 to follow transitions (according to the input string) until
       
    72 one finds an accepting state, record this state and look for
       
    73 further transition which might lead to another accepting state
       
    74 that represents a longer input initial substring to be
       
    75 matched. Only if none can be found, the last accepting state
       
    76 is returned.
       
    77 
       
    78 
       
    79 Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX
       
    80 regular expression matching. They write that it is known to be
       
    81 to be a non-trivial problem and nearly all POSIX matching
       
    82 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
       
    83 For this they cite a study by Kuklewicz \cite{Kuklewicz}. My
       
    84 current work is about formalising the proofs in the paper by
       
    85 Sulzmann and Lu. Specifically, they propose in this paper a
       
    86 POSIX matching algorithm and give some details of a
       
    87 correctness proof for this algorithm inside the paper and some
       
    88 more details in an appendix. This correctness proof is
       
    89 unformalised, meaning it is just a ``pencil-and-paper'' proof,
       
    90 not done in a theorem prover. Though, the paper and presumably
       
    91 the proof have been peer-reviewed. Unfortunately their proof
       
    92 does not give enough details such that it can be
       
    93 straightforwardly implemented in a theorem prover, say
       
    94 Isabelle. In fact, the purported proof they outline does not
       
    95 work in central places. We discovered this when filling in
       
    96 many gaps and attempting to formalise the proof in Isabelle. 
       
    97 
       
    98 
       
    99 
       
   100 {\bf Contributions:}
       
   101 
       
   102 *}
       
   103 
       
   104 section {* Preliminaries *}
       
   105 
       
   106 text {* \noindent Strings in Isabelle/HOL are lists of characters with
       
   107   the empty string being represented by the empty list, written @{term
       
   108   "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By
       
   109   using the type char we have a supply of finitely many characters
       
   110   roughly corresponding to the ASCII character set.
    28   Regular exprtessions
   111   Regular exprtessions
    29 
   112 
    30   \begin{center}
   113   \begin{center}
    31   @{text "r :="}
   114   @{text "r :="}
    32   @{const "NULL"} $\mid$
   115   @{const "ZERO"} $\mid$
    33   @{const "EMPTY"} $\mid$
   116   @{const "ONE"} $\mid$
    34   @{term "CHAR c"} $\mid$
   117   @{term "CHAR c"} $\mid$
    35   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
   118   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
    36   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
   119   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
    37   @{term "STAR r"} 
   120   @{term "STAR r"} 
    38   \end{center}
   121   \end{center}
       
   122 
       
   123 *}
       
   124 
       
   125 section {* POSIX Regular Expression Matching *}
       
   126 
       
   127 section {* The Argument by Sulzmmann and Lu *}
       
   128 
       
   129 section {* Conclusion *}
       
   130 
       
   131 text {*
       
   132 
       
   133   Nipkow lexer from 2000
       
   134 
       
   135 *}
       
   136 
       
   137 
       
   138 text {*
       
   139 
       
   140 
       
   141 
    39 
   142 
    40   \noindent
   143   \noindent
    41   Values
   144   Values
    42 
   145 
    43   \begin{center}
   146   \begin{center}
   214   Values and non-problematic values
   317   Values and non-problematic values
   215 
   318 
   216   \begin{center}
   319   \begin{center}
   217   \begin{tabular}{c}
   320   \begin{tabular}{c}
   218   @{thm Values_def}\medskip\\
   321   @{thm Values_def}\medskip\\
   219   @{thm NValues_def}
       
   220   \end{tabular}
   322   \end{tabular}
   221   \end{center}
   323   \end{center}
   222 
   324 
   223 
   325 
   224   \noindent
   326   \noindent
   236 
   338 
   237   @{thm[mode=IfThen] mkeps_nullable}
   339   @{thm[mode=IfThen] mkeps_nullable}
   238 
   340 
   239   @{thm[mode=IfThen] mkeps_flat}
   341   @{thm[mode=IfThen] mkeps_flat}
   240 
   342 
   241   @{thm[mode=IfThen] v3}
   343   @{thm[mode=IfThen] Prf_injval}
   242 
   344 
   243   @{thm[mode=IfThen] v4}
   345   @{thm[mode=IfThen] Prf_injval_flat}
   244   
   346   
   245   @{thm[mode=IfThen] PMatch_mkeps}
   347   @{thm[mode=IfThen] PMatch_mkeps}
   246   
   348   
   247   @{thm[mode=IfThen] PMatch1(2)}
   349   @{thm[mode=IfThen] PMatch1(2)}
   248 
   350