lexing: comparison thys/Paper/Paper.thy

equal deleted inserted replaced

-:489dfa0d7ec9
+:6adda4a667b1
 declare [[show_question_marks = false]]
 notation (latex output)
 If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
-Cons ("_::_" [78,77] 73) and
+Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
+ZERO ("\<^raw:\textrm{0}>" 78) and
+ONE ("\<^raw:\textrm{1}>" 78) and
+CHAR ("_" [1000] 10) and
+ALT ("_ + _" [1000,1000] 78) and
+SEQ ("_ \<cdot> _" [1000,1000] 78) and
+STAR ("_\<^sup>\<star>" [1000] 78) and
 val.Char ("Char _" [1000] 78) and
 val.Left ("Left _" [1000] 78) and
 val.Right ("Right _" [1000] 78) and
 L ("L _" [1000] 0) and
 flat ("|_|" [70] 73) and
 (*>*)
 section {* Introduction *}
 text {*
-\noindent
+Sulzmann and Lu \cite{Sulzmann2014}
+there are two commonly used
+disambiguation strategies to create a unique matching tree:
+one is called \emph{greedy} matching \cite{Frisch2004} and the
+other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}.
+For the latter there are two rough rules:
+\begin{itemize}
+\item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The
+longest initial substring matched by any regular
+expression is taken as next token.
+\item Rule Priority:\smallskip\\ For a particular longest initial
+substring, the first regular expression that can match
+determines the token.
+\end{itemize}
+\noindent In the context of lexing, POSIX is the more
+interesting disambiguation strategy as it produces longest
+matches, which is necessary for tokenising programs. For
+example the string \textit{iffoo} should not match the keyword
+\textit{if} and the rest, but as one string \textit{iffoo},
+which might be a variable name in a program. As another
+example consider the string $xy$ and the regular expression
+\mbox{$(x + y + xy)^*$}. Either the input string can be
+matched in two `iterations' by the single letter-regular
+expressions $x$ and $y$, or directly in one iteration by $xy$.
+The first case corresponds to greedy matching, which first
+matches with the left-most symbol and only matches the next
+symbol in case of a mismatch. The second case is POSIX
+matching, which prefers the longest match. In case more than
+one (longest) matches exist, only then it prefers the
+left-most match. While POSIX matching seems natural, it turns
+out to be much more subtle than greedy matching in terms of
+implementations and in terms of proving properties about it.
+If POSIX matching is implemented using automata, then one has
+to follow transitions (according to the input string) until
+one finds an accepting state, record this state and look for
+further transition which might lead to another accepting state
+that represents a longer input initial substring to be
+matched. Only if none can be found, the last accepting state
+is returned.
+Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX
+regular expression matching. They write that it is known to be
+to be a non-trivial problem and nearly all POSIX matching
+implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
+For this they cite a study by Kuklewicz \cite{Kuklewicz}. My
+current work is about formalising the proofs in the paper by
+Sulzmann and Lu. Specifically, they propose in this paper a
+POSIX matching algorithm and give some details of a
+correctness proof for this algorithm inside the paper and some
+more details in an appendix. This correctness proof is
+unformalised, meaning it is just a ``pencil-and-paper'' proof,
+not done in a theorem prover. Though, the paper and presumably
+the proof have been peer-reviewed. Unfortunately their proof
+does not give enough details such that it can be
+straightforwardly implemented in a theorem prover, say
+Isabelle. In fact, the purported proof they outline does not
+work in central places. We discovered this when filling in
+many gaps and attempting to formalise the proof in Isabelle.
+{\bf Contributions:}
+*}
+section {* Preliminaries *}
+text {* \noindent Strings in Isabelle/HOL are lists of characters with
+the empty string being represented by the empty list, written @{term
+"[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By
+using the type char we have a supply of finitely many characters
+roughly corresponding to the ASCII character set.
 Regular exprtessions
 \begin{center}
 @{text "r :="}
-@{const "NULL"} $\mid$
+@{const "ZERO"} $\mid$
-@{const "EMPTY"} $\mid$
+@{const "ONE"} $\mid$
 @{term "CHAR c"} $\mid$
 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
 @{term "STAR r"}
 \end{center}
+*}
+section {* POSIX Regular Expression Matching *}
+section {* The Argument by Sulzmmann and Lu *}
+section {* Conclusion *}
+text {*
+Nipkow lexer from 2000
+*}
+text {*
 \noindent
 Values
 \begin{center}
 Values and non-problematic values
 \begin{center}
 \begin{tabular}{c}
 @{thm Values_def}\medskip\\
-@{thm NValues_def}
 \end{tabular}
 \end{center}
 \noindent
 @{thm[mode=IfThen] mkeps_nullable}
 @{thm[mode=IfThen] mkeps_flat}
-@{thm[mode=IfThen] v3}
+@{thm[mode=IfThen] Prf_injval}
-@{thm[mode=IfThen] v4}
+@{thm[mode=IfThen] Prf_injval_flat}
 @{thm[mode=IfThen] PMatch_mkeps}
 @{thm[mode=IfThen] PMatch1(2)}

changeset 107	6adda4a667b1
parent 105	80218dddbb15
child 108	73f7dc60c285