--- a/thys/Paper/Paper.thy Thu Feb 25 20:13:41 2016 +0000
+++ b/thys/Paper/Paper.thy Sun Feb 28 14:01:12 2016 +0000
@@ -7,7 +7,13 @@
notation (latex output)
If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
- Cons ("_::_" [78,77] 73) and
+ Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
+ ZERO ("\<^raw:\textrm{0}>" 78) and
+ ONE ("\<^raw:\textrm{1}>" 78) and
+ CHAR ("_" [1000] 10) and
+ ALT ("_ + _" [1000,1000] 78) and
+ SEQ ("_ \<cdot> _" [1000,1000] 78) and
+ STAR ("_\<^sup>\<star>" [1000] 78) and
val.Char ("Char _" [1000] 78) and
val.Left ("Left _" [1000] 78) and
val.Right ("Right _" [1000] 78) and
@@ -23,20 +29,117 @@
section {* Introduction *}
text {*
+
- \noindent
+Sulzmann and Lu \cite{Sulzmann2014}
+
+there are two commonly used
+disambiguation strategies to create a unique matching tree:
+one is called \emph{greedy} matching \cite{Frisch2004} and the
+other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}.
+For the latter there are two rough rules:
+
+\begin{itemize}
+\item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The
+ longest initial substring matched by any regular
+ expression is taken as next token.
+
+\item Rule Priority:\smallskip\\ For a particular longest initial
+ substring, the first regular expression that can match
+ determines the token.
+\end{itemize}
+
+\noindent In the context of lexing, POSIX is the more
+interesting disambiguation strategy as it produces longest
+matches, which is necessary for tokenising programs. For
+example the string \textit{iffoo} should not match the keyword
+\textit{if} and the rest, but as one string \textit{iffoo},
+which might be a variable name in a program. As another
+example consider the string $xy$ and the regular expression
+\mbox{$(x + y + xy)^*$}. Either the input string can be
+matched in two `iterations' by the single letter-regular
+expressions $x$ and $y$, or directly in one iteration by $xy$.
+The first case corresponds to greedy matching, which first
+matches with the left-most symbol and only matches the next
+symbol in case of a mismatch. The second case is POSIX
+matching, which prefers the longest match. In case more than
+one (longest) matches exist, only then it prefers the
+left-most match. While POSIX matching seems natural, it turns
+out to be much more subtle than greedy matching in terms of
+implementations and in terms of proving properties about it.
+If POSIX matching is implemented using automata, then one has
+to follow transitions (according to the input string) until
+one finds an accepting state, record this state and look for
+further transition which might lead to another accepting state
+that represents a longer input initial substring to be
+matched. Only if none can be found, the last accepting state
+is returned.
+
+
+Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX
+regular expression matching. They write that it is known to be
+to be a non-trivial problem and nearly all POSIX matching
+implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
+For this they cite a study by Kuklewicz \cite{Kuklewicz}. My
+current work is about formalising the proofs in the paper by
+Sulzmann and Lu. Specifically, they propose in this paper a
+POSIX matching algorithm and give some details of a
+correctness proof for this algorithm inside the paper and some
+more details in an appendix. This correctness proof is
+unformalised, meaning it is just a ``pencil-and-paper'' proof,
+not done in a theorem prover. Though, the paper and presumably
+the proof have been peer-reviewed. Unfortunately their proof
+does not give enough details such that it can be
+straightforwardly implemented in a theorem prover, say
+Isabelle. In fact, the purported proof they outline does not
+work in central places. We discovered this when filling in
+many gaps and attempting to formalise the proof in Isabelle.
+
+
+
+{\bf Contributions:}
+
+*}
+
+section {* Preliminaries *}
+
+text {* \noindent Strings in Isabelle/HOL are lists of characters with
+ the empty string being represented by the empty list, written @{term
+ "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By
+ using the type char we have a supply of finitely many characters
+ roughly corresponding to the ASCII character set.
Regular exprtessions
\begin{center}
@{text "r :="}
- @{const "NULL"} $\mid$
- @{const "EMPTY"} $\mid$
+ @{const "ZERO"} $\mid$
+ @{const "ONE"} $\mid$
@{term "CHAR c"} $\mid$
@{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
@{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
@{term "STAR r"}
\end{center}
+*}
+
+section {* POSIX Regular Expression Matching *}
+
+section {* The Argument by Sulzmmann and Lu *}
+
+section {* Conclusion *}
+
+text {*
+
+ Nipkow lexer from 2000
+
+*}
+
+
+text {*
+
+
+
+
\noindent
Values
@@ -216,7 +319,6 @@
\begin{center}
\begin{tabular}{c}
@{thm Values_def}\medskip\\
- @{thm NValues_def}
\end{tabular}
\end{center}
@@ -238,9 +340,9 @@
@{thm[mode=IfThen] mkeps_flat}
- @{thm[mode=IfThen] v3}
+ @{thm[mode=IfThen] Prf_injval}
- @{thm[mode=IfThen] v4}
+ @{thm[mode=IfThen] Prf_injval_flat}
@{thm[mode=IfThen] PMatch_mkeps}