(*<*)theory Paperimports "../ReStar" "~~/src/HOL/Library/LaTeXsugar"begindeclare [[show_question_marks = false]]abbreviation "der_syn r c \<equiv> der c r"abbreviation "ders_syn r s \<equiv> ders s r"notation (latex output) If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,73] 73) and ZERO ("\<^bold>0" 78) and ONE ("\<^bold>1" 78) and CHAR ("_" [1000] 80) and ALT ("_ + _" [77,77] 78) and SEQ ("_ \<cdot> _" [77,77] 78) and STAR ("_\<^sup>\<star>" [1000] 78) and val.Void ("'(')" 79) and val.Char ("Char _" [1000] 79) and val.Left ("Left _" [79] 78) and val.Right ("Right _" [79] 78) and val.Seq ("Seq _ _" [79,79] 78) and val.Stars ("Stars _" [79] 78) and L ("L'(_')" [10] 78) and der_syn ("_\\_" [79, 1000] 76) and ders_syn ("_\\_" [79, 1000] 76) and flat ("|_|" [75] 74) and Sequ ("_ @ _" [78,77] 63) and injval ("inj _ _ _" [79,77,79] 76) and mkeps ("mkeps _" [79] 76) and projval ("proj _ _ _" [1000,77,1000] 77) and length ("len _" [78] 73) and matcher ("lexer _ _" [78,78] 77) and Prf ("_ : _" [75,75] 75) and PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75) (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)definition "match r s \<equiv> nullable (ders s r)"(*>*)section {* Introduction *}text {*Brzozowski \cite{Brzozowski1964} introduced the notion of the {\emderivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ acharacter~@{text c}, and showed that it gave a simple solution to theproblem of matching a string @{term s} with a regular expression @{term r}:if the derivative of @{term r} w.r.t.\ (in succession) all the characters ofthe string matches the empty string, then @{term r} matches @{term s} (and{\em vice versa}). The derivative has the property (which may almost beregarded as its specification) that, for every string @{term s} and regularexpression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} ifand only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski'sderivatives is that they are neatly expressible in any functional language,and easily definable and reasoned about in theorem provers---the definitionsjust consist of inductive datatypes and simple recursive functions. Acompletely formalised correctness proof of this matcher in for example HOL4has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is partof the work in \cite{Krauss2011}.One limitation of Brzozowski's matcher is that it only generates a YES/NOanswer for whether a string is being matched by a regular expression.Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allowgeneration not just of a YES/NO answer but of an actual matching, called a[lexical] {\em value}. They give a simple algorithm to calculate a valuethat appears to be the value associated with POSIX matching\cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify thatvalue, in an algorithm-independent fashion, and to show that Sulzmann andLu's derivative-based algorithm does indeed calculate a value that iscorrect according to the specification.The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define arelation (called an ``order relation'') on the set of values of @{term r},and to show that (once a string to be matched is chosen) there is a maximumelement and that it is computed by their derivative-based algorithm. Thisproof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on aGREEDY regular expression matching algorithm. Beginning with ourobservations that, without evidence that it is transitive, it cannot becalled an ``order relation'', and that the relation is called a ``totalorder'' despite being evidently not total\footnote{The relation @{text"\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on thevalues for the regular expression @{term r}; but it only holds between@{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} havethe same flattening (underlying string). So a counterexample to totality isgiven by taking two values @{term v} and @{term "v'"} for @{term r} thathave different flattenings (see Section~\ref{posixsec}). A differentrelation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}with flattening @{term s} is definable by the same approach, and is indeedtotal; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, weidentify problems with this approach (of which some of the proofs are notpublished in \cite{Sulzmann2014}); perhaps more importantly, we give asimple inductive (and algorithm-independent) definition of what we callbeing a {\em POSIX value} for a regular expression @{term r} and a string@{term s}; we show that the algorithm computes such a value and that such avalue is unique. Proofs are both done by hand and checked in Isabelle/HOL.The experience of doing our proofs has been that this mechanical checkingwas absolutely essential: this subject area has hidden snares. This was alsonoted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matchingimplementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.If a regular expression matches a string, then in general there is more thanone way of how the string is matched. There are two commonly useddisambiguation strategies to generate a unique answer: one is called GREEDYmatching \cite{Frisch2004} and the other is POSIXmatching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example considerthe string @{term xy} and the regular expression \mbox{@{term "STAR (ALT(ALT x y) xy)"}}. Either the string can be matched in two `iterations' bythe single letter-regular expressions @{term x} and @{term y}, or directlyin one iteration by @{term xy}. The first case corresponds to GREEDYmatching, which first matches with the left-most symbol and only matches thenext symbol in case of a mismatch (this is greedy in the sense of preferringinstant gratification to delayed repletion). The second case is POSIXmatching, which prefers the longest match.In the context of lexing, where an input string needs to be split up into asequence of tokens, POSIX is the more natural disambiguation strategy forwhat programmers consider basic syntactic building blocks in their programs.These building blocks are often specified by some regular expressions, say@{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords andidentifiers, respectively. There are two underlying (informal) rules behindtokenising a string in a POSIX fashion:\begin{itemize} \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}The longest initial substring matched by any regular expression is taken asnext token.\smallskip\item[$\bullet$] \underline{Priority Rule:}For a particular longest initial substring, the first regular expressionthat can match determines the token.\end{itemize}\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognisingidentifiers (say, a single character followed by characters or numbers).Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and usePOSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}.For @{text "iffoo"} we obtain by the longest match rule a single identifiertoken, not a keyword followed by an identifier. For @{text "if"} we obtain bythe priority rule a keyword token, not an identifier token---even if @{text"r\<^bsub>id\<^esub>"} matches also.\bigskip\noindent {\bf Contributions:} (NOT DONE YET) We have implemented inIsabelle/HOL the derivative-based regular expression matching algorithm asdescribed by Sulzmann and Lu \cite{Sulzmann2014}. We have proved thecorrectness of this algorithm according to our specification of what a POSIXvalue is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informalcorrectness proof: but to us it contains unfillable gaps.informal correctness proof given in \cite{Sulzmann2014} is in finalform\footnote{} and to us contains unfillable gaps.Our specification of a POSIX value consists of a simple inductive definitionthat given a string and a regular expression uniquely determines this value.Derivatives as calculated by Brzozowski's method are usually more complexregular expressions than the initial one; various optimisations arepossible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT rZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of theadvantages of having a simple specification and correctness proof is thatthe latter can be refined to allow for such optimisations and simplecorrectness proof.An extended version of \cite{Sulzmann2014} is available at the website ofits first author; this includes some ``proofs'', claimed in\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not infinal form, we make no comment thereon, preferring to give general reasonsfor our belief that the approach of \cite{Sulzmann2014} is problematicrather than to discuss details of unpublished work.*}section {* Preliminaries *}text {* \noindent Strings in Isabelle/HOL are lists of characters with theempty string being represented by the empty list, written @{term "[]"}, andlist-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usualbracket notation for lists also for strings; for example a string consistingof just a single character @{term c} is written @{term "[c]"}. By using thetype @{type char} for characters we have a supply of finitely manycharacters roughly corresponding to the ASCII character set. Regularexpressions are defined as usual as the elements of the following inductivedatatype: \begin{center} @{text "r :="} @{const "ZERO"} $\mid$ @{const "ONE"} $\mid$ @{term "CHAR c"} $\mid$ @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ @{term "STAR r"} \end{center} \noindent where @{const ZERO} stands for the regular expression that does not match any string, @{const ONE} for the regular expression that matches only the empty string and @{term c} for matching a character literal. The language of a regular expression is also defined as usual by the recursive function @{term L} with the clauses: \begin{center} \begin{tabular}{l@ {\hspace{5mm}}rcl} (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\ (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\ (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\ (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ \end{tabular} \end{center} \noindent In clause (4) we use the operation @{term "DUMMY ;; DUMMY"} for the concatenation of two languages (it is also list-append for strings). We use the star-notation for regular expressions and for languages (in the last clause above). The star for languages is defined inductively by two clauses: @{text "(i)"} the empty string being in the star of a language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. It will also be convenient to use the following notion of a \emph{semantic derivative} (or \emph{left quotient}) of a language defined as: \begin{center} \begin{tabular}{lcl} @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ \end{tabular} \end{center} \noindent For semantic derivatives we have the following equations (for example mechanically proved in \cite{Krauss2011}): \begin{equation}\label{SemDer} \begin{array}{lcl} @{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\ @{thm (lhs) Der_empty} & \dn & @{thm (rhs) Der_empty}\\ @{thm (lhs) Der_char} & \dn & @{thm (rhs) Der_char}\\ @{thm (lhs) Der_union} & \dn & @{thm (rhs) Der_union}\\ @{thm (lhs) Der_Sequ} & \dn & @{thm (rhs) Der_Sequ}\\ @{thm (lhs) Der_star} & \dn & @{thm (rhs) Der_star} \end{array} \end{equation} \noindent \emph{\Brz's derivatives} of regular expressions \cite{Brzozowski1964} can be easily defined by two recursive functions: the first is from regular expressions to booleans (implementing a test when a regular expression can match the empty string), and the second takes a regular expression and a character to a (derivative) regular expression: \begin{center} \begin{tabular}{lcl} @{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\ @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\ @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\ @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\ @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)} \end{tabular} \end{center} \noindent We may extend this definition to give derivatives w.r.t.~strings: \begin{center} \begin{tabular}{lcl} @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\ @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\ \end{tabular} \end{center} \noindent Given the equations in \eqref{SemDer}, it is a relatively easy exercise in mechanical reasoning to establish that \begin{proposition}\mbox{}\\ \begin{tabular}{ll} @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if @{thm (rhs) nullable_correctness}, and \\ @{text "(2)"} & @{thm[mode=IfThen] der_correctness}. \end{tabular} \end{proposition} \noindent With this in place it is also very routine to prove that the regular expression matcher defined as \begin{center} @{thm match_def} \end{center} \noindent gives a positive answer if and only if @{term "s \<in> L r"}. Consequently, this regular expression matching algorithm satisfies the usual specification for regular expression matching. While the matcher above calculates a provably correct YES/NO answer for whether a regular expression matches a string or not, the novel idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another phase to this algorithm in order to calculate a [lexical] value. We will explain the details next.*}section {* POSIX Regular Expression Matching\label{posixsec} *}text {* The clever idea in \cite{Sulzmann2014} is to introduce values for encoding \emph{how} a regular expression matches a string and then define a function on values that mirrors (but inverts) the construction of the derivative on regular expressions. \emph{Values} are defined as the inductive datatype \begin{center} @{text "v :="} @{const "Void"} $\mid$ @{term "val.Char c"} $\mid$ @{term "Left v"} $\mid$ @{term "Right v"} $\mid$ @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ @{term "Stars vs"} \end{center} \noindent where we use @{term vs} to stand for a list of values. (This is similar to the approach taken by Frisch and Cardelli for GREEDY matching \cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX matching). The string underlying a value can be calculated by the @{const flat} function, written @{term "flat DUMMY"} and defined as: \begin{center} \begin{tabular}{lcl} @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\ @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\ @{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\ @{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}\\ @{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\ @{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\ @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\ \end{tabular} \end{center} \noindent Sulzmann and Lu also define inductively an inhabitation relation that associates values to regular expressions: \begin{center} \begin{tabular}{c} @{thm[mode=Axiom] Prf.intros(4)} \qquad @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\ @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\ @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\ \end{tabular} \end{center} \noindent Note that no values are associated with the regular expression @{term ZERO}, and that the only value associated with the regular expression @{term ONE} is @{term Void}, pronounced (if one must) as @{text "Void"}. It is routine to establish how values ``inhabiting'' a regular expression correspond to the language of a regular expression, namely \begin{proposition} @{thm L_flat_Prf} \end{proposition} In general there is more than one value associated with a regular expression. In case of POSIX matching the problem is to calculate the unique value that satisfies the (informal) POSIX rules from the Introduction. Graphically the POSIX value calculation algorithm by Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} where the path from the left to the right involving @{term derivatives}/@{const nullable} is the first phase of the algorithm (calculating successive \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to left, the second phase. This picture shows the steps required when a regular expression, say @{text "r\<^sub>1"}, matches the string @{term "[a,b,c]"}. We first build the three derivatives (according to @{term a}, @{term b} and @{term c}). We then use @{const nullable} to find out whether the resulting derivative regular expression @{term "r\<^sub>4"} can match the empty string. If yes, we call the function @{const mkeps} that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can match the empty string (taking into account the POSIX rules in case there are several ways). This functions is defined by the clauses:\begin{figure}[t]\begin{center}\begin{tikzpicture}[scale=2,node distance=1.3cm, every node/.style={minimum size=7mm}]\node (r1) {@{term "r\<^sub>1"}};\node (r2) [right=of r1]{@{term "r\<^sub>2"}};\draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};\node (r3) [right=of r2]{@{term "r\<^sub>3"}};\draw[->,line width=1mm](r2)--(r3) node[above,midway] {@{term "der b DUMMY"}};\node (r4) [right=of r3]{@{term "r\<^sub>4"}};\draw[->,line width=1mm](r3)--(r4) node[above,midway] {@{term "der c DUMMY"}};\draw (r4) node[anchor=west] {\;\raisebox{3mm}{@{term nullable}}};\node (v4) [below=of r4]{@{term "v\<^sub>4"}};\draw[->,line width=1mm](r4) -- (v4);\node (v3) [left=of v4] {@{term "v\<^sub>3"}};\draw[->,line width=1mm](v4)--(v3) node[below,midway] {@{text "inj r\<^sub>3 c"}};\node (v2) [left=of v3]{@{term "v\<^sub>2"}};\draw[->,line width=1mm](v3)--(v2) node[below,midway] {@{text "inj r\<^sub>2 b"}};\node (v1) [left=of v2] {@{term "v\<^sub>1"}};\draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{text "inj r\<^sub>1 a"}};\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};\end{tikzpicture}\end{center}\caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014},matching the string @{term "[a,b,c]"}. The first phase (the arrows from left to right) is \Brz's matcher building succesive derivatives. If the last regular expression is @{term nullable}, then the functions of the second phase are called (the top-down and right-to-left arrows): first @{term mkeps} calculates a value witnessinghow the empty string has been recognised by @{term "r\<^sub>4"}. Afterthat the function @{term inj} `injects back' the characters of the string intothe values.\label{Sulz}}\end{figure} \begin{center} \begin{tabular}{lcl} @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ \end{tabular} \end{center} \noindent Note that this function needs only to be partially defined, namely only for regular expressions that are nullable. In case @{const nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term "r\<^sub>1"} and an error is raised instead. Note also how this function makes some subtle choices leading to a POSIX value: for example if an alternative regular expression, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can match the empty string and furthermore @{term "r\<^sub>1"} can match the empty string, then we return a @{text Left}-value. The @{text Right}-value will only be returned if @{term "r\<^sub>1"} cannot match the empty string. The most interesting idea from Sulzmann and Lu \cite{Sulzmann2014} is the construction of a value for how @{term "r\<^sub>1"} can match the string @{term "[a,b,c]"} from the value how the last derivative, @{term "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and Lu achieve this by stepwise ``injecting back'' the characters into the values thus inverting the operation of building derivatives on the level of values. The corresponding function, called @{term inj}, takes three arguments, a regular expression, a character and a value. For example in the first (or right-most) @{term inj}-step in Fig~\ref{Sulz} the regular expression @{term "r\<^sub>3"}, the character @{term c} from the last derivative step and @{term "v\<^sub>4"}, which is the value corresponding to the derivative regular expression @{term "r\<^sub>4"}. The result is the new value @{term "v\<^sub>3"}. The final result of the algorithm is the value @{term "v\<^sub>1"} corresponding to the input regular expression. The @{term inj} function is by recursion on the regular expressions and by analysing the shape of values (corresponding to the derivative regular expressions). \begin{center} \begin{tabular}{l@ {\hspace{5mm}}lcl} (1) & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ (2) & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ (3) & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ (4) & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ (5) & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ (6) & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ (7) & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\ \end{tabular} \end{center} \noindent To better understand what is going on in this definition it might be instructive to look first at the three sequence cases (clauses (4)--(6)). In each case we need to construct an ``injected value'' for @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term "Seq DUMMY DUMMY"}. Recall the clause of the @{text derivative}-function for sequence regular expressions: \begin{center} @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} \end{center} \noindent Consider first the else-branch where the derivative is @{term "SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore be the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches the left-hand side in clause (4) of @{term inj}. In the if-branch the derivative is an alternative, namely @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}. This means we either have to consider a @{text Left}- or @{text Right}-value. In case of the @{text Left}-value we know further it must be a value for a sequence regular expression. Therefore the pattern we match in the clause (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"}, while in (6) it is just @{term "Right v\<^sub>2"}. One more interesting point is in the right-hand side of clause (6): since in this case the regular expression @{text "r\<^sub>1"} does not ``contribute'' to matching the string, that means it only matches the empty string, we need to call @{const mkeps} in order to construct a value for how @{term "r\<^sub>1"} can match this empty string. A similar argument applies for why we can expect in the left-hand side of clause (7) that the value is of the form @{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ r (STAR r)"}. Finally, the reason for why we can ignore the second argument in clause (1) of @{term inj} is that it will only ever be called in cases where @{term "c=d"}, but the usual linearity restrictions in patterns do not allow is to build this constraint explicitly into our function definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs) injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]}, but our deviation is harmless.} The idea of @{term inj} to ``inject back'' a character into a value can be made precise by the first part of the following lemma; the second part shows that the underlying string of an @{const mkeps}-value is always the empty string. \begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat} \begin{tabular}{ll} (1) & @{thm[mode=IfThen] Prf_injval_flat}\\ (2) & @{thm[mode=IfThen] mkeps_flat} \end{tabular} \end{lemma} \begin{proof} Both properties are by routine inductions: the first one, for example, by an induction over the definition of @{term derivatives}; the second by induction on @{term r}. There are no interesting cases.\qed \end{proof} Having defined the @{const mkeps} and @{text inj} function we can extend \Brz's matcher so that a [lexical] value is constructed (assuming the regular expression matches the string). The clauses of the lexer are \begin{center} \begin{tabular}{lcl} @{thm (lhs) matcher.simps(1)} & $\dn$ & @{thm (rhs) matcher.simps(1)}\\ @{thm (lhs) matcher.simps(2)} & $\dn$ & @{text "case"} @{term "matcher (der c r) s"} @{text of}\\ & & \phantom{$|$} @{term "None"} @{text "\<Rightarrow>"} @{term None}\\ & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} \end{tabular} \end{center} \noindent If the regular expression does not match the string, @{const None} is returned, indicating an error is raised. If the regular expression does match the string, then @{const Some} value is returned. One important virtue of this algorithm is that it can be implemented with ease in a functional programming language and also in Isabelle/HOL. In the remaining part of this section we prove that this algorithm is correct. The well-known idea of POSIX matching is informally defined by the longest match and priority rule; as correctly argued in \cite{Sulzmann2014}, this needs formal specification. Sulzmann and Lu define a \emph{dominance} relation\footnote{Sulzmann and Lu call it an ordering relation, but without giving evidence that it is transitive.} between values and argue that there is a maximum value, as given by the derivative-based algorithm. In contrast, we shall introduce a simple inductive definition that specifies directly what a \emph{POSIX value} is, incorporating the POSIX-specific choices into the side-conditions of our rules. Our definition is inspired by the matching relation given in \cite{Vansummeren2006}. The relation we define is ternary and written as \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and values. \begin{center} \begin{tabular}{c} @{thm[mode=Axiom] PMatch.intros(1)}@{text "P"}@{term "ONE"} \qquad @{thm[mode=Axiom] PMatch.intros(2)}@{text "P"}@{term "c"}\bigskip\\ @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}@{text "P+L"}\qquad @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}@{text "P+R"}\bigskip\\ $\mprset{flushleft} \inferrule {@{thm (prem 1) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad @{thm (prem 2) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ @{thm (prem 3) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} {@{thm (concl) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$@{text "PS"}\\ @{thm[mode=Axiom] PMatch.intros(7)}@{text "P[]"}\bigskip\\ $\mprset{flushleft} \inferrule {@{thm (prem 1) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad @{thm (prem 2) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad @{thm (prem 3) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ @{thm (prem 4) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} {@{thm (concl) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$@{text "P\<star>"} \end{tabular} \end{center} \noindent We claim that this relation captures the idea behind the two informal POSIX rules shown in the Introduction: Consider for example the rules @{text "P+L"} and @{text "P+R"} where the POSIX value for an alternative regular expression is specified---it is always a @{text "Left"}-value, \emph{except} when the string to be matched is not in the language of @{term "r\<^sub>1"}; only then it is a @{text Right}-value (see the side-condition in @{text "P+R"}). Interesting is also the rule for sequence regular expressions (@{text "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} respectively. Consider now the third premise and note that the POSIX value of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there \emph{exists} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"} can be split up into a non-empty @{term "s\<^sub>3"} and @{term "s\<^sub>4"}. Moreover the longer @{term "s\<^sub>1 @ s\<^sub>3"} can be matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would not be the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. A similar condition is imposed onto the POSIX value in the @{text "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value @{term v} cannot be flatten to the empty string. In effect, we require that in each ``iteration'' of the star, some parts of the string need to be ``nibbled'' away; only in case of the empty string weBy accept @{term "Stars []"} as the POSIX value. We can prove that given a string @{term s} and regular expression @{term r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow> v"}. \begin{theorem} @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]} \end{theorem} \begin{proof} By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and a case analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}.\qed \end{proof} \begin{lemma} @{thm[mode=IfThen] PMatch_mkeps} \end{lemma} \begin{proof} By routine induction on @{term r}.\qed \end{proof} \begin{lemma} @{thm[mode=IfThen] PMatch2_roy_version} \end{lemma} \begin{theorem}\mbox{}\smallskip\\ \begin{tabular}{ll} (1) & @{thm (lhs) lex_correct1a} if and only if @{thm (rhs) lex_correct1a}\\ (2) & @{thm (lhs) lex_correct3a} if and only if @{thm (rhs) lex_correct3a}\\ \end{tabular} \end{theorem}*}section {* The Argument by Sulzmmann and Lu *}section {* Conclusion *}text {* Nipkow lexer from 2000*}text {* \noindent We have also introduced a slightly restricted version of this relation where the last rule is restricted so that @{term "flat v \<noteq> []"}. This relation for \emph{non-problematic} is written @{term "\<Turnstile> v : r"}. \bigskip \noindent \noindent Our version of Sulzmann's ordering relation \begin{center} \begin{tabular}{c} @{thm[mode=Rule] ValOrd.intros(2)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>1'" "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'"]} \qquad @{thm[mode=Rule] ValOrd.intros(1)[of "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'" "v\<^sub>1" "r\<^sub>1"]}\medskip\\ @{thm[mode=Rule] ValOrd.intros(3)[of "v\<^sub>1" "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]} \qquad @{thm[mode=Rule] ValOrd.intros(4)[of "v\<^sub>2" "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\medskip\\ @{thm[mode=Rule] ValOrd.intros(5)[of "v\<^sub>2" "r\<^sub>2" "v\<^sub>2'" "r\<^sub>1"]} \qquad @{thm[mode=Rule] ValOrd.intros(6)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>1'" "r\<^sub>2"]} \medskip\\ @{thm[mode=Axiom] ValOrd.intros(7)}\qquad @{thm[mode=Axiom] ValOrd.intros(8)[of "c"]}\medskip\\ @{thm[mode=Rule] ValOrd.intros(9)[of "v" "vs" "r"]}\qquad @{thm[mode=Rule] ValOrd.intros(10)[of "v" "vs" "r"]}\medskip\\ @{thm[mode=Rule] ValOrd.intros(11)[of "v\<^sub>1" "r" "v\<^sub>2" "vs\<^sub>1" "vs\<^sub>2"]}\medskip\\ @{thm[mode=Rule] ValOrd.intros(12)[of "vs\<^sub>1" "r" "vs\<^sub>2" "v"]}\qquad @{thm[mode=Axiom] ValOrd.intros(13)[of "r"]}\medskip\\ \end{tabular} \end{center} \noindent A prefix of a string s \begin{center} \begin{tabular}{c} @{thm prefix_def[of "s\<^sub>1" "s\<^sub>2"]} \end{tabular} \end{center} \noindent Values and non-problematic values \begin{center} \begin{tabular}{c} @{thm Values_def}\medskip\\ \end{tabular} \end{center} \noindent The point is that for a given @{text s} and @{text r} there are only finitely many non-problematic values.*} text {* \noindent Some lemmas we have proved:\bigskip @{thm L_flat_Prf} @{thm L_flat_NPrf} @{thm[mode=IfThen] mkeps_nullable} @{thm[mode=IfThen] mkeps_flat} @{thm[mode=IfThen] Prf_injval} @{thm[mode=IfThen] Prf_injval_flat} @{thm[mode=IfThen] PMatch_mkeps} @{thm[mode=IfThen] PMatch1(2)} @{thm[mode=IfThen] PMatch1N} @{thm[mode=IfThen] PMatch_determ(1)[of "s" "r" "v\<^sub>1" "v\<^sub>2"]} \medskip \noindent This is the main theorem that lets us prove that the algorithm is correct according to @{term "s \<in> r \<rightarrow> v"}: @{thm[mode=IfThen] PMatch2} \mbox{}\bigskip \noindent {\bf Proof} The proof is by induction on the definition of @{const der}. Other inductions would go through as well. The interesting case is for @{term "SEQ r\<^sub>1 r\<^sub>2"}. First we analyse the case where @{term "nullable r\<^sub>1"}. We have by induction hypothesis \[ \begin{array}{l} (IH1)\quad @{text "\<forall>s v."} \text{\;if\;} @{term "s \<in> der c r\<^sub>1 \<rightarrow> v"} \text{\;then\;} @{term "(c # s) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v"}\\ (IH2)\quad @{text "\<forall>s v."} \text{\;if\;} @{term "s \<in> der c r\<^sub>2 \<rightarrow> v"} \text{\;then\;} @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> injval r\<^sub>2 c v"} \end{array} \] \noindent and have \[ @{term "s \<in> ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2) \<rightarrow> v"} \] \noindent There are two cases what @{term v} can be: (1) @{term "Left v'"} and (2) @{term "Right v'"}. \begin{itemize} \item[(1)] We know @{term "s \<in> SEQ (der c r\<^sub>1) r\<^sub>2 \<rightarrow> v'"} holds, from which we can infer that there are @{text "s\<^sub>1"}, @{term "s\<^sub>2"}, @{text "v\<^sub>1"}, @{term "v\<^sub>2"} with \[ @{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} \qquad\text{and}\qquad @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} \] and also \[ @{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"} \] \noindent and have to prove \[ @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"} \] \noindent The two requirements @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} and @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} can be proved by the induction hypothese (IH1) and the fact above. \noindent This leaves to prove \[ @{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"} \] \noindent which holds because @{term "(c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 "} implies @{term "s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) "} \item[(2)] This case is similar. \end{itemize} \noindent The final case is that @{term " \<not> nullable r\<^sub>1"} holds. This case again similar to the cases above.*}text {* %\noindent %{\bf Acknowledgements:} %We are grateful for the comments we received from anonymous %referees. \bibliographystyle{plain} \bibliography{root} \section{Roy's Rules} \newcommand{\abs}[1]{\mid\!\! #1\!\! \mid} %%\newcommand{\mts}{\textit{``''} \newcommand{\tl}{\ \triangleleft\ } $$\inferrule[]{Void \tl \epsilon}{} \quad\quad \inferrule[]{Char\ c \tl Lit\ c}{} $$ $$\inferrule {v_1 \tl r_1} {Left\ v_1 \tl r_1 + r_2} \quad\quad \inferrule[] { v_2 \tl r_2 \\ \abs{v_2}\ \not\in\ L(r_1)} {Right\ v_2 \tl r_1 + r_2} $$ $$ \inferrule {v_1 \tl r_1\\ v_2 \tl r_2\\ s \in\ L(r_1\backslash\! \abs{v_1}) \ \land\ \abs{v_2}\!\backslash s\ \epsilon\ L(r_2) \ \Rightarrow\ s = [] } {(v_1, v_2) \tl r_1 \cdot r_2} $$ $$\inferrule { v \tl r \\ vs \tl r^* \\ \abs{v}\ \not=\ []} { (v :: vs) \tl r^* } \quad\quad \inferrule{} { [] \tl r^* } $$*}(*<*)end(*>*)