--- a/csupp.tex Mon Sep 05 20:59:50 2011 +0000
+++ b/csupp.tex Tue Sep 06 02:35:10 2011 +0000
@@ -34,86 +34,89 @@
\begin{multicols}{2}
\section*{Background}
+
\noindent
-Parsing is the act of transforming plain text into some
-structure that can be analyzed by computers for further processing.
-One might think that parsing has been studied to death, and after
-\emph{yacc} and \emph{lex} no new results can be obtained in this area.
-However recent developments and novel approaches make it increasingly clear,
-that this is not true anymore.
+Parsers transform plain text into some abstract structure that can be analyzed by
+computers for further processing. One might think that parsers have been
+studied to death, and after \emph{yacc} and \emph{lex} no new results can be
+obtained in this area. However, recent developments and novel approaches make
+it increasingly clear that this is no longer true~\cite{Might11}. And
+there is a practical need for new results: for example, the future HTML5
+standard abandons a well-defined grammar specification in favour of a bespoke
+parser given as pseudocode.
-We propose to on parsers from a certification point of view. Increasingly,
-parsers are part of certified compilers, like \mbox{\emph{CompCert}}, which
-are guaranteed to be correct and bug-free. Such certified compilers are
-crucial in areas where software just cannot fail. However, so far the parsers
-of these compilers have been left out of the certification. This is because
-parsing algorithms are often ad hoc and their semantics is not clearly
-specified. Unfortunately, this means parsers can harbour errors that
-potentially invalidate the whole certification and correctness of the
-compiler. In this project, we like to change that with the help of theorem
-provers.
+This work targets parsers from a certification point of view. Increasingly,
+parsers are part of certified compilers, like
+\mbox{\emph{CompCert}}~\cite{Leroy09}, which are guaranteed to be
+bug-free. Such certified compilers are crucial in areas where software just
+cannot fail. However, so far the parsers of these compilers have been left out
+of the certification. This is because parsing algorithms are often ad hoc and
+their semantics is not clearly specified. This means, unfortunately, that parsers
+can harbour errors that potentially invalidate the whole certification and
+correctness of the compiler. In this project, we aim to change this situation
+with the help of the theorem prover Isabelle/HOL.
-Only in the last few years, theorem provers have become good enough for
+Only in the last few years, theorem provers have become powerful enough for
establishing the correctness of some standard lexing and parsing
algorithms. For this, the algorithms still need to be formulated in a way so
that it is easy to reason about them. In our earlier work about lexing and
regular languages, we showed that this precludes well-known algorithms based on
-automata. However we showed also that regular languages can be formulated and
+automata~\cite{WuZhangUrban11}. However we showed also that regular languages can be formulated and
reasoned about entirely in terms of regular expressions, which can be easily
represented in theorem provers. This work uses the device of derivatives of
regular expressions. We would like to extend this device to parsers and grammars.
The aim is to come up with elegant and practically useful parsing algorithms
-whose correctness can be certified in a theorem prover.
+whose correctness can be certified in Isabelle/HOL.
\section*{Proposed Work}
-A recent development in parsing is Parsing Expression Grammars (PEG), which
-are an extension of the weel-known Context Free Grammars
-(CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as
-negation and conjunction, on the right-hand sides of grammar rules, as well as
-priority orderings. With these extensions, PEG parsing becomes much
-more powerful. For example disambiguation, formerly expressed by semantic
+A recent development in the field of parsing is Parsing Expression Grammars
+(PEGs)~\cite{Ford04}, which
+are an extension of the well-known Context Free Grammars
+(CFGs). This extension introduces new regular operators, such as
+negation and conjunction, on the right-hand side of grammar rules, as well as
+priority orderings for rules. With these extensions, PEG parsing becomes much
+more powerful and more useful in practice. For example, disambiguation, formerly expressed by semantic
filters, can now be expressed directly using grammar rules.
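As a rough illustration of the PEG operators mentioned above, the following sketch implements ordered choice and the not-predicate as plain recursive functions (all names here are our own, illustrative ones, not taken from any existing library):

```python
# Minimal sketch of PEG-style combinators. A parser takes the input
# string and a position, and returns the new position or None on failure.

def lit(c):
    # match a single literal character
    def p(s, i):
        return i + 1 if i < len(s) and s[i] == c else None
    return p

def seq(*ps):
    # match the given parsers one after another
    def p(s, i):
        for q in ps:
            i = q(s, i)
            if i is None:
                return None
        return i
    return p

def choice(*ps):
    # ordered choice: the FIRST alternative that succeeds wins,
    # so PEGs are unambiguous by construction (unlike CFG alternation)
    def p(s, i):
        for q in ps:
            j = q(s, i)
            if j is not None:
                return j
        return None
    return p

def not_(q):
    # negation (not-predicate): succeeds, consuming no input, iff q fails
    def p(s, i):
        return i if q(s, i) is None else None
    return p

# 'a' not followed by 'b', then any lowercase letter
ab = seq(lit('a'),
         not_(lit('b')),
         choice(*[lit(chr(k)) for k in range(ord('a'), ord('z') + 1)]))
print(ab("ac", 0))   # position 2: success
print(ab("ab", 0))   # None: the not-predicate blocks 'b'
```

The not-predicate is what lets a PEG express "anything except ..." directly in a rule, which is the kind of disambiguation that previously required semantic filters.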
-However, there is serious disadvantage of PEG for applications: is does not
-support grammrs involving left recursion~\cite{Ford02b}. Although a new PEG
-parsing algorithm has been proposed that can deal with left
-recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even
-in ``paper-and-pencil'' form. One aim of this research is to solve this sorry
-state-of-affairs by either certifying this algorithm or inventing a new
-one. For this we will first formalize a fixed point semantics of PEG, based on
-which an efficient, certified parsing algorithm can be given given. For this
-we take as starting point the paper~\cite{Ford04a}, which does not treat
-left-recursion, but gives an operational semantics for PEG parsing. For the
-semantics, it seems plausible that we can adapt work on Boolean
-Grammars~\cite{Okhotin/04a}, which are similar to PEGs, and for which the
-paper ~\cite{journals/iandc/KountouriotisNR09} gives a semantics to negation
-operators, but not to Kleene's star operation.
+However, there is a serious limitation of PEGs, which affects potential
+applications: grammars involving left recursion are not
+allowed~\cite{Ford02}. Although one PEG parsing algorithm has already been
+proposed that can deal with left recursion~\cite{WarthDM08}, there
+is no correctness proof for it, not even a ``pencil-and-paper'' one. One aim of this
+research is to remedy this sorry state of affairs by either certifying this
+algorithm or inventing a new one. To this end we will first devise a
+fixed-point semantics of PEGs, against which we can certify a parser. As a
+starting point for this semantics we take the paper~\cite{Ford04}, which does not
+treat left-recursion, but gives an operational semantics for PEG
+parsing. There are also good indications that we can adapt work on Boolean
+Grammars~\cite{Okhotin04}, which are similar to PEGs, and for which the
+paper~\cite{KountouriotisNR09} gives a fixed-point semantics
+to negation operators, but not to the Kleene star.
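The left-recursion problem can be made concrete with a small sketch (rule and function names are illustrative): a naive recursive-descent reading of the left-recursive rule E <- E '+' 'n' / 'n' calls itself at the same input position without consuming anything, so it never terminates.

```python
# Naive recursive-descent reading of the left-recursive PEG rule
#   E <- E '+' 'n' / 'n'
# The first alternative re-enters E at the SAME position, so no input
# is ever consumed and the recursion diverges. A depth guard stands in
# for the stack overflow a real implementation would hit.

def E(s, i, depth=0):
    if depth > 50:
        raise RecursionError("left recursion: no progress at position %d" % i)
    j = E(s, i, depth + 1)          # first alternative: E '+' 'n'
    if j is not None and s[j:j + 2] == "+n":
        return j + 2
    return i + 1 if s[i:i + 1] == "n" else None   # second alternative: 'n'

try:
    E("n+n", 0)
except RecursionError as e:
    print(e)
```

The algorithm of~\cite{WarthDM08} works around exactly this divergence by memoising seed results for left-recursive calls; it is this mechanism that currently lacks any correctness proof.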
-For the parsing algorithm, we might also be able to draw inspiration from
-parsers based on Cocke-Younger-Kasami (CYK)
-algorithms~\cite{journals/iandc/KountouriotisNR09} and
-Early~\cite{Earley70,AycHor02} parsers. The defect CYK algorithms is that the
-original grammar specification needs to be transformed into a normal
-form. This transformation may lead to grammar explosion and inefficient
-parsing. We will investigate whether this transformation can be avoided.
-Early style parsers, which have recently been certified by Ridge [???],
-need to be extended to PEG parsing in order to be helpful for us.
+For the parsing algorithm, we might be able to draw inspiration from parsers
+based on the classic Cocke-Younger-Kasami (CYK)
+algorithms~\cite{KountouriotisNR09} and
+Earley~\cite{AycHor02, Earley70} parsers. The defect of CYK algorithms, however,
+is that the grammar specifications given by the user need to be transformed
+into a normal form. This transformation may potentially lead to rule explosion
+and hence inefficient parsing. We will investigate whether this transformation
+can be avoided. The problem with Earley-style parsers is that they need to be
+extended to PEG parsing in order to be helpful for us.
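The normal-form requirement of CYK can be seen in a minimal recognizer sketch (the grammar encoding is our own illustration): the table construction below only works because every rule is assumed to be binary (A -> B C) or terminal (A -> 'a'), i.e. already in Chomsky normal form, which is exactly the transformation step we propose to investigate avoiding.

```python
# Minimal CYK recognizer sketch for a grammar in Chomsky normal form.
# table[i][j] holds the set of nonterminals deriving word[i..j].

def cyk(word, terminal_rules, binary_rules, start):
    n = len(word)
    table = [[set() for _ in range(n)] for _ in range(n)]
    # base case: terminal rules A -> 'a'
    for i, c in enumerate(word):
        table[i][i] = {A for (A, a) in terminal_rules if a == c}
    # inductive case: binary rules A -> B C over all split points k
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for (A, B, C) in binary_rules:
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]

# Tiny CNF grammar: S -> A B,  A -> 'a',  B -> 'b'
term = [("A", "a"), ("B", "b")]
bins = [("S", "A", "B")]
print(cyk("ab", term, bins, "S"))   # True
print(cyk("ba", term, bins, "S"))   # False
```

Transforming an arbitrary grammar into this binary form is what can blow up the number of rules and hence degrade parsing performance.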
+
Finally, we want to investigate whether derivatives of regular expressions
-~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}
-can be extended to parsing. Lexing based on derivatives gives rise to very
-elegant regular expression matchers that can be certified in a theorem prover
-with ease. We will study whether the idea of taking a derivative of a regular
-expression can be extended to rules in grammars. The problem that needs to be
-overcome again arises from possible left recursion in parsing.
-
-
-\newpage
+~\cite{Brzozowski64,Might11,OwensReppyTuron09}
+can be generalised to PEG parsing. In earlier work, we showed that lexing based on
+derivatives gives rise to very elegant regular expression matchers that can be
+certified in a theorem prover with ease. We will study whether the idea of
+taking a derivative of a regular expression can be extended to rules in
+grammars. The problem that needs to be overcome is again how to deal with possible
+left-recursion in grammar rules.
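The derivative device can be sketched for plain regular expressions as follows (the datatype and function names are illustrative, not taken from our formalisation): the derivative of r with respect to a character c is the expression matching the rest of any word r matches that starts with c, and matching reduces to repeated derivatives plus a final nullability test.

```python
# Brzozowski-derivative matcher sketch for a small regular-expression
# datatype: Empty, Eps (empty string), Chr, Seq, Alt, Star.

class Empty: pass
class Eps: pass
class Chr:
    def __init__(self, c): self.c = c
class Seq:
    def __init__(self, r, s): self.r, self.s = r, s
class Alt:
    def __init__(self, r, s): self.r, self.s = r, s
class Star:
    def __init__(self, r): self.r = r

def nullable(r):
    # does r match the empty string?
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, Seq): return nullable(r.r) and nullable(r.s)
    if isinstance(r, Alt): return nullable(r.r) or nullable(r.s)
    return False

def deriv(r, c):
    # derivative of r with respect to the character c
    if isinstance(r, Chr):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.r, c), deriv(r.s, c))
    if isinstance(r, Seq):
        d = Seq(deriv(r.r, c), r.s)
        return Alt(d, deriv(r.s, c)) if nullable(r.r) else d
    if isinstance(r, Star):
        return Seq(deriv(r.r, c), r)
    return Empty()   # Empty and Eps have no non-trivial derivative

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (ab)* matches "abab" but not "aba"
r = Star(Seq(Chr('a'), Chr('b')))
print(matches(r, "abab"))  # True
print(matches(r, "aba"))   # False
```

Each clause of deriv is a simple structural recursion, which is why proofs about derivative-based matchers go through so smoothly in a theorem prover; the open question for us is what the analogous "derivative of a grammar rule" should be when rules may be left-recursive.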
-\small
+
\bibliography{Journal/document/root}
\bibliographystyle{abbrv}
\end{multicols}