thys/Paper/Paper.thy
changeset 107 6adda4a667b1
parent 105 80218dddbb15
child 108 73f7dc60c285
--- a/thys/Paper/Paper.thy	Thu Feb 25 20:13:41 2016 +0000
+++ b/thys/Paper/Paper.thy	Sun Feb 28 14:01:12 2016 +0000
@@ -7,7 +7,13 @@
 
 notation (latex output)
    If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
-  Cons ("_::_" [78,77] 73) and
+  Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
+  ZERO ("\<^raw:\textrm{0}>" 78) and 
+  ONE ("\<^raw:\textrm{1}>" 78) and 
+  CHAR ("_" [1000] 10) and
+  ALT ("_ + _" [1000,1000] 78) and
+  SEQ ("_ \<cdot> _" [1000,1000] 78) and
+  STAR ("_\<^sup>\<star>" [1000] 78) and
   val.Char ("Char _" [1000] 78) and
   val.Left ("Left _" [1000] 78) and
   val.Right ("Right _" [1000] 78) and
@@ -23,20 +29,117 @@
 section {* Introduction *}
 
 text {*
+  
 
-  \noindent
+As Sulzmann and Lu \cite{Sulzmann2014} point out, there are two
+commonly used disambiguation strategies to create a unique matching
+tree: one is called \emph{greedy} matching \cite{Frisch2004} and the
+other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}.
+For the latter there are two rough rules:
+
+\begin{itemize}
+\item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The
+      longest initial substring matched by any regular
+      expression is taken as next token.
+
+\item Rule Priority:\smallskip\\ For a particular longest initial
+      substring, the first regular expression that can match
+      determines the token (see the short sketch after this list).
+\end{itemize}
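+
+\noindent To make these two rules more concrete, the following Haskell
+fragment is a small, purely illustrative sketch (it is not taken from
+Sulzmann and Lu's paper): every candidate lexer is a token name paired
+with a hypothetical function returning the length of the longest prefix
+of the input it matches; the longest such length wins, and ties are
+broken by the order of the lexers.
+
+\begin{verbatim}
+import Data.Maybe (mapMaybe)
+
+-- A lexer is a token name together with an assumed function that yields
+-- the length of the longest prefix of the input it matches, if any.
+type Lexer = (String, String -> Maybe Int)
+
+-- Pick the next token: maximal length first (Longest Match Rule); among
+-- equally long matches the first lexer in the list wins (Rule Priority).
+nextToken :: [Lexer] -> String -> Maybe (String, Int)
+nextToken lexers input =
+  case mapMaybe try lexers of
+    []         -> Nothing
+    candidates ->
+      let best = maximum (map snd candidates)
+      in  Just (head [c | c@(_, n) <- candidates, n == best])
+  where
+    try (name, longestPrefix) = fmap ((,) name) (longestPrefix input)
+\end{verbatim}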
+
+\noindent In the context of lexing, POSIX is the more
+interesting disambiguation strategy as it produces longest
+matches, which is necessary for tokenising programs. For
+example the string \textit{iffoo} should not be split into the
+keyword \textit{if} and the rest, but matched as the single
+string \textit{iffoo}, which might be a variable name in a
+program. As another example consider the string $xy$ and the
+regular expression \mbox{$(x + y + xy)^*$}. Either the input
+string can be matched in two `iterations' by the single-letter
+regular expressions $x$ and $y$, or directly in one iteration
+by $xy$. The first case corresponds to greedy matching, which
+first matches with the left-most alternative and only tries the
+next alternative in case of a mismatch. The second case is
+POSIX matching, which prefers the longest match; only if more
+than one longest match exists does it prefer the left-most one.
+While POSIX matching seems natural, it turns out to be much
+more subtle than greedy matching in terms of implementations
+and in terms of proving properties about it. If POSIX matching
+is implemented using automata, then one has to follow
+transitions (according to the input string) until one finds an
+accepting state, record this state and look for further
+transitions which might lead to another accepting state
+representing a longer initial substring of the input. Only if
+no such state can be found is the last recorded accepting state
+returned.
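+
+\noindent A rough Haskell sketch of this procedure is given below; it is
+only meant to illustrate the prose above, and the transition function
+\textit{step} and the predicate \textit{accepting} are assumed
+parameters rather than part of any formalisation in this paper.
+
+\begin{verbatim}
+-- Follow transitions as long as possible, remember the length of the
+-- last prefix that ended in an accepting state, and fall back to it
+-- once no further transition exists.
+longestAcceptedPrefix :: (q -> Char -> Maybe q)  -- transition function
+                      -> (q -> Bool)             -- accepting states
+                      -> q                       -- start state
+                      -> String                  -- input
+                      -> Maybe Int               -- length of longest match
+longestAcceptedPrefix step accepting start input =
+  go start 0 (if accepting start then Just 0 else Nothing) input
+  where
+    go _ _ best []     = best
+    go q n best (c:cs) =
+      case step q c of
+        Nothing -> best        -- stuck: return the last recorded accept
+        Just q' ->
+          let n'    = n + 1
+              best' = if accepting q' then Just n' else best
+          in  go q' n' best' cs
+\end{verbatim}
+
+\noindent On the example above, run with an automaton for
+\mbox{$(x + y + xy)^*$} on the input $xy$, this loop passes accepting
+states after the prefixes of length $0$, $1$ and $2$ and therefore
+returns $2$, the longest match.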
+
+
+Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX
+regular expression matching. They write that it is known to be
+a non-trivial problem and that nearly all POSIX matching
+implementations are ``buggy'' \cite[Page 203]{Sulzmann2014},
+citing a study by Kuklewicz \cite{Kuklewicz}. Our current work
+is about formalising the proofs in the paper by Sulzmann and
+Lu. Specifically, they propose in their paper a POSIX matching
+algorithm and give some details of a correctness proof for this
+algorithm in the paper itself and some more details in an
+appendix. This correctness proof is unformalised, meaning it is
+a ``pencil-and-paper'' proof, not one carried out in a theorem
+prover; the paper, and presumably the proof, have nevertheless
+been peer-reviewed. Unfortunately their proof does not give
+enough detail for it to be straightforwardly formalised in a
+theorem prover, say Isabelle. In fact, the purported proof they
+outline does not work in central places. We discovered this
+when filling in many gaps and attempting to formalise the proof
+in Isabelle.
+
+
+
+{\bf Contributions:}
+
+*}
+
+section {* Preliminaries *}
+
+text {* \noindent Strings in Isabelle/HOL are lists of characters with
+  the empty string being represented by the empty list, written @{term
+  "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By
+  using the type char we have a supply of finitely many characters
+  roughly corresponding to the ASCII character set.
  Regular expressions
 
   \begin{center}
   @{text "r :="}
-  @{const "NULL"} $\mid$
-  @{const "EMPTY"} $\mid$
+  @{const "ZERO"} $\mid$
+  @{const "ONE"} $\mid$
   @{term "CHAR c"} $\mid$
   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
   @{term "STAR r"} 
   \end{center}
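+
+\noindent For illustration only, this grammar corresponds to the
+following hypothetical Haskell datatype (in the formalisation the
+regular expressions are defined as an Isabelle/HOL datatype with the
+constructors shown above):
+
+\begin{verbatim}
+-- Illustrative Haskell counterpart of the grammar above; not part of
+-- the Isabelle formalisation.
+data Rexp = Zero              -- matches no string
+          | One               -- matches only the empty string
+          | Chr Char          -- matches a single character
+          | Alt Rexp Rexp     -- alternative   r1 + r2
+          | Seq Rexp Rexp     -- sequence      r1 . r2
+          | Star Rexp         -- Kleene star   r*
+          deriving (Show, Eq)
+\end{verbatim}
+
+\noindent For example, the regular expression \mbox{$(x + y + xy)^*$}
+from the introduction would be written
+\mbox{Star (Alt (Chr 'x') (Alt (Chr 'y') (Seq (Chr 'x') (Chr 'y'))))}.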
 
+*}
+
+section {* POSIX Regular Expression Matching *}
+
+section {* The Argument by Sulzmann and Lu *}
+
+section {* Conclusion *}
+
+text {*
+
+  Nipkow lexer from 2000
+
+*}
+
+
+text {*
+
+
+
+
   \noindent
   Values
 
@@ -216,7 +319,6 @@
   \begin{center}
   \begin{tabular}{c}
   @{thm Values_def}\medskip\\
-  @{thm NValues_def}
   \end{tabular}
   \end{center}
 
@@ -238,9 +340,9 @@
 
   @{thm[mode=IfThen] mkeps_flat}
 
-  @{thm[mode=IfThen] v3}
+  @{thm[mode=IfThen] Prf_injval}
 
-  @{thm[mode=IfThen] v4}
+  @{thm[mode=IfThen] Prf_injval_flat}
   
   @{thm[mode=IfThen] PMatch_mkeps}