diff -r 489dfa0d7ec9 -r 6adda4a667b1 thys/Paper/Paper.thy --- a/thys/Paper/Paper.thy Thu Feb 25 20:13:41 2016 +0000 +++ b/thys/Paper/Paper.thy Sun Feb 28 14:01:12 2016 +0000 @@ -7,7 +7,13 @@ notation (latex output) If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and - Cons ("_::_" [78,77] 73) and + Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and + ZERO ("\<^raw:\textrm{0}>" 78) and + ONE ("\<^raw:\textrm{1}>" 78) and + CHAR ("_" [1000] 10) and + ALT ("_ + _" [1000,1000] 78) and + SEQ ("_ \ _" [1000,1000] 78) and + STAR ("_\<^sup>\" [1000] 78) and val.Char ("Char _" [1000] 78) and val.Left ("Left _" [1000] 78) and val.Right ("Right _" [1000] 78) and @@ -23,20 +29,117 @@ section {* Introduction *} text {* + - \noindent +Sulzmann and Lu \cite{Sulzmann2014} + +there are two commonly used +disambiguation strategies to create a unique matching tree: +one is called \emph{greedy} matching \cite{Frisch2004} and the +other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}. +For the latter there are two rough rules: + +\begin{itemize} +\item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The + longest initial substring matched by any regular + expression is taken as next token. + +\item Rule Priority:\smallskip\\ For a particular longest initial + substring, the first regular expression that can match + determines the token. +\end{itemize} + +\noindent In the context of lexing, POSIX is the more +interesting disambiguation strategy as it produces longest +matches, which is necessary for tokenising programs. For +example the string \textit{iffoo} should not match the keyword +\textit{if} and the rest, but as one string \textit{iffoo}, +which might be a variable name in a program. As another +example consider the string $xy$ and the regular expression +\mbox{$(x + y + xy)^*$}. Either the input string can be +matched in two `iterations' by the single letter-regular +expressions $x$ and $y$, or directly in one iteration by $xy$. +The first case corresponds to greedy matching, which first +matches with the left-most symbol and only matches the next +symbol in case of a mismatch. The second case is POSIX +matching, which prefers the longest match. In case more than +one (longest) matches exist, only then it prefers the +left-most match. While POSIX matching seems natural, it turns +out to be much more subtle than greedy matching in terms of +implementations and in terms of proving properties about it. +If POSIX matching is implemented using automata, then one has +to follow transitions (according to the input string) until +one finds an accepting state, record this state and look for +further transition which might lead to another accepting state +that represents a longer input initial substring to be +matched. Only if none can be found, the last accepting state +is returned. + + +Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX +regular expression matching. They write that it is known to be +to be a non-trivial problem and nearly all POSIX matching +implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. +For this they cite a study by Kuklewicz \cite{Kuklewicz}. My +current work is about formalising the proofs in the paper by +Sulzmann and Lu. Specifically, they propose in this paper a +POSIX matching algorithm and give some details of a +correctness proof for this algorithm inside the paper and some +more details in an appendix. This correctness proof is +unformalised, meaning it is just a ``pencil-and-paper'' proof, +not done in a theorem prover. Though, the paper and presumably +the proof have been peer-reviewed. Unfortunately their proof +does not give enough details such that it can be +straightforwardly implemented in a theorem prover, say +Isabelle. In fact, the purported proof they outline does not +work in central places. We discovered this when filling in +many gaps and attempting to formalise the proof in Isabelle. + + + +{\bf Contributions:} + +*} + +section {* Preliminaries *} + +text {* \noindent Strings in Isabelle/HOL are lists of characters with + the empty string being represented by the empty list, written @{term + "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By + using the type char we have a supply of finitely many characters + roughly corresponding to the ASCII character set. Regular exprtessions \begin{center} @{text "r :="} - @{const "NULL"} $\mid$ - @{const "EMPTY"} $\mid$ + @{const "ZERO"} $\mid$ + @{const "ONE"} $\mid$ @{term "CHAR c"} $\mid$ @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ @{term "STAR r"} \end{center} +*} + +section {* POSIX Regular Expression Matching *} + +section {* The Argument by Sulzmmann and Lu *} + +section {* Conclusion *} + +text {* + + Nipkow lexer from 2000 + +*} + + +text {* + + + + \noindent Values @@ -216,7 +319,6 @@ \begin{center} \begin{tabular}{c} @{thm Values_def}\medskip\\ - @{thm NValues_def} \end{tabular} \end{center} @@ -238,9 +340,9 @@ @{thm[mode=IfThen] mkeps_flat} - @{thm[mode=IfThen] v3} + @{thm[mode=IfThen] Prf_injval} - @{thm[mode=IfThen] v4} + @{thm[mode=IfThen] Prf_injval_flat} @{thm[mode=IfThen] PMatch_mkeps}