lexing: comparison thys/Paper/Paper.thy

-:03ca57e3f199
+:23e68b81a908
 $\infer{ Stars\,vs_{1} \posix_{r^{\star}} Stars\,vs_{2}}
 {Stars\,(v :: vs_{1}) \posix_{r^{\star}} Stars\,(v :: vs_{2})}(K4)$
 \end{tabular}
 \end{center}
+\noindent
+The idea behind the rules $(A1)$ and $(A2)$ is that a $Left$-value is
+bigger than a $Right$-value, if the underlying string of the $Left$-value is
+bigger than or equal to the underlying string of the $Right$-value. The order
+is reversed if the $Right$-value can match a longer string than the $Left$-value.
+In this way the POSIX value is supposed to be the biggest value
+for a given string and regular expression.
 Sulzmann and Lu explicitly refer to the paper \cite{Frisch2004} by Frisch
-and Cardelli from where they have taken their main idea for their
+and Cardelli from where they have taken the idea for their correctness
-correctness proof of the POSIX value algorithm. Frisch and Cardelli
+proof of the POSIX value algorithm. Frisch and Cardelli introduced a
-introduced an ordering, written $\greedy$, for values and they show that
+similar ordering for GREEDY matching and they show that their greedy
-their greedy matching algorithm always produces a maximal element
+matching algorithm always produces a maximal element according to this
-according to this ordering (from all possible solutions). The only
+ordering (from all possible solutions). The only difference between their
-difference between their greedy ordering and the ``ordering'' by Sulzmann
+GREEDY ordering and the ``ordering'' by Sulzmann and Lu is that GREEDY, if
-and Lu is that GREEDY always prefers a $Left$-value over a $Right$-value.
+possible, always prefers a $Left$-value over a $Right$-value, no matter
-What is interesting for our purposes is that the properties reflexivity,
+what the underlying string is. This seems like only a very minor
-totality and transitivity for this GREEDY ordering can be proved
+difference, but it leads to drastic consequences in terms of what
-relatively easily by induction.
+properties both orderings enjoy. What is interesting for our purposes is
+that the properties reflexivity, totality and transitivity for this GREEDY
-These properties of GREEDY, however, do not transfer to POSIX by
+ordering can be proved relatively easily by induction.
-Sulzmann and Lu. To start with, transitivity does not hold anymore in the
-``normal'' formulation, that is:
+These properties of GREEDY, however, do not transfer to the POSIX
+``ordering'' by Sulzmann and Lu. To start with, $v \geq_r v'$ is not
-\begin{property}
+defined inductively, but as $v = v'$ or $v >_r v' \wedge |v| = |v'|$. This
+means that $v >_r v'$ does not imply $v \geq_r v'$. Moreover, transitivity
+does not hold in the ``usual'' formulation, for example:
+\begin{proposition}
 Suppose $v_1 : r$, $v_2 : r$ and $v_3 : r$.
 If $v_1 \posix_r v_2$ and $v_2 \posix_r v_3$
 then $v_1 \posix_r v_3$.
-\end{property}
+\end{proposition}
 \noindent If formulated like this, then there are various counter examples:
-Suppose $r$ is $a + ((a + a)(a + \textbf{0}))$ then the $v_1$, $v_2$ and $v_3$
+For example, let $r$ be $a + ((a + a)\cdot (a + \textbf{0}))$; then the $v_1$,
-below are values of $r$:
+$v_2$ and $v_3$ below are values of $r$:
 \begin{center}
 \begin{tabular}{lcl}
 $v_1$ & $=$ & $Left(Char\;a)$\\
 $v_2$ & $=$ & $Right((Left(Char\;a), Right(Void)))$\\
 example actually $v_3 \posix_r v_1$!
 Sulzmann and Lu ``fix'' this problem by weakening the
 transitivity property. They require in addition that the
 underlying strings are of the same length. This excludes the
-counter example above and any counter-example we could find
+counter-example above and any counter-example we were able
-with our implementation. Thus the transitivity lemma in
+to find (by hand and by machine). Thus the transitivity
-\cite{Sulzmann2014} is:
+lemma in \cite{Sulzmann2014} should be formulated as:
 \begin{property}
 Suppose $v_1 : r$, $v_2 : r$ and
 $v_3 : r$, and also $|v_1|=|v_2|=|v_3|$.\\
 If $v_1 \posix_r v_2$ and $v_2 \posix_r v_3$
 then $v_1 \posix_r v_3$.
 \end{property}
-\noindent While we agree with Sulzmann and Lu that this
+\noindent While we agree with Sulzmann and Lu that this property probably
-property probably holds, proving it seems not so
+holds, proving it seems not so straightforward: although one begins with the
-straightforward. Sulzmann and Lu do not give an explicit proof
+assumption that the values have the same flattening, this
+cannot be maintained as one descends into the induction. This is
+a problem that occurs in a number of places in their proof.
+Sulzmann and Lu do not give an explicit proof
 of the transitivity property, but give a closely related
 property about the existence of maximal elements. They state
 that this can be verified by an induction on $r$. We disagree
 with this, as we shall show next in the case of transitivity.
 The case where the reasoning breaks down is the sequence case,
-say $r_1\,r_2$. The induction hypotheses in this case
+say $r_1\cdot r_2$. The induction hypotheses in this case
 are
 \begin{center}
 \begin{tabular}{@ {}cc@ {}}
 \begin{tabular}{@ {}ll@ {}}
 \noindent
 We can assume that
 \begin{equation}
-Seq\,v_{1l}\, v_{1r}) \posix_{r_1\,r_2} Seq\,v_{2l}\, v_{2r})
+Seq\,v_{1l}\, v_{1r} \posix_{r_1\cdot r_2} Seq\,v_{2l}\, v_{2r}
 \qquad\textrm{and}\qquad
-Seq\,v_{2l}\, v_{2r}) \posix_{r_1\,r_2} Seq\,v_{3l}\, v_{3r})
+Seq\,v_{2l}\, v_{2r} \posix_{r_1\cdot r_2} Seq\,v_{3l}\, v_{3r}
 \label{assms}
 \end{equation}
 \noindent
 hold, and furthermore that the values have equal length,
 \noindent
 We need to show that
 \[
-(v_{1l}, v_{1r}) \posix_{r_1\,r_2} (v_{3l}, v_{3r})
+Seq\,v_{1l}\, v_{1r} \posix_{r_1\cdot r_2} Seq\, v_{3l}\, v_{3r}
 \]
 \noindent holds. We can proceed by analysing how the
 assumptions in \eqref{assms} have arisen. There are four
 cases. Let us assume we are in the case where
 v_{1l} \posix_{r_1} v_{2l}
 \qquad\textrm{and}\qquad
 v_{2l} \posix_{r_1} v_{3l}
 \]
-\noindent and also know the corresponding typing judgements.
+\noindent and also know the corresponding inhabitation judgements.
 This is exactly a case where we would like to apply the
 induction hypothesis IH~$r_1$. But we cannot! We still need to
 show that $|v_{1l}| = |v_{2l}|$ and $|v_{2l}| = |v_{3l}|$. We
 know from \eqref{lens} that the lengths of the sequence values
 are equal, but from this we cannot infer anything about the
 \qquad\textrm{and}\qquad
 |v_{1r}| \not= |v_{2r}|
 \]
 \noindent but still \eqref{lens} will hold. Now we are stuck,
-since the IH does not apply.
+since the IH does not apply. As noted above, this problem, where the
+induction hypothesis does not apply, arises in several places in
+the proof of Sulzmann and Lu, not just when proving transitivity.
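+\noindent
+To illustrate the obstacle, here is a small sketch. It assumes (this is
+not spelled out above) that the underlying string of a sequence value
+$Seq\,v_l\,v_r$ is $|v_l|$ followed by $|v_r|$. The strings are
+hypothetical and chosen only for illustration. Take
+\[
+|v_{1l}| = aa,\quad |v_{1r}| = b
+\qquad\textrm{and}\qquad
+|v_{2l}| = a,\quad |v_{2r}| = ab
+\]
+\noindent
+Under this assumption both $Seq\,v_{1l}\,v_{1r}$ and $Seq\,v_{2l}\,v_{2r}$
+flatten to $aab$, so the sequence values have the same underlying string,
+while $|v_{1l}| \not= |v_{2l}|$ and $|v_{1r}| \not= |v_{2r}|$. The side
+condition needed for IH~$r_1$ can therefore not be derived from the side
+condition of the sequence case alone.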
 *}
 section {* Conclusion *}
 text {*
-Nipkow lexer from 2000
+We have implemented the POSIX value calculation algorithm introduced in
-\noindent
+\cite{Sulzmann2014}. Our implementation is nearly identical to the
-We have also introduced a slightly restricted version of this relation
+original and all modifications are harmless (like our char-clause for
-where the last rule is restricted so that @{term "flat v \<noteq> []"}.
+@{text inj}). We have proved this algorithm to be correct, but correct
-\bigskip
+according to our own specification of what POSIX values are. Our
+specification (inspired by work in \cite{Vansummeren2006}) appears to be
+much simpler than the one in \cite{Sulzmann2014} and our proofs are nearly always
-*}
+straightforward. We have attempted to formalise the original proof
+by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
+unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
-text {*
+already acknowledge some small problems, but our experience suggests
+there are more serious problems.
+Closely related to our work is an automata-based lexer formalised by
+Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
+initial substrings, but Nipkow's algorithm is not completely
+computational. The algorithm by Sulzmann and Lu, in contrast, can be
+implemented easily in functional languages. A bespoke lexer for the
+Imp-Language is formalised in Coq as part of the Software Foundations book
+\cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
+do not generalise easily to more advanced features.
+Our formalisation is available from
+\begin{center}
+\url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}
+\end{center}
 %\noindent
 %{\bf Acknowledgements:}
 %We are grateful for the comments we received from anonymous
 %referees.
 \bibliographystyle{plain}
 \bibliography{root}
 *}

changeset 133	23e68b81a908
parent 132	03ca57e3f199
child 134	2f043f8be9a9