regexp: comparison csupp.tex

equal deleted inserted replaced

-:68d48522ea9a
+:093e45c44d91
 \thispagestyle{empty}
 \mbox{}\\[-5mm]
 \begin{multicols}{2}
 \section*{Background}
 \noindent
-Parsing is the act of transforming plain text into some
+Parsers transform plain text into some abstract structure that can be analyzed by
-structure that can be analyzed by computers for further processing.
+computers for further processing.  One might think that parsers have been
-One might think that parsing has been studied to death, and after
+studied to death, and after \emph{yacc} and \emph{lex} no new results can be
-\emph{yacc} and \emph{lex} no new results can be obtained in this area.
+obtained in this area.  However recent developments and novel approaches make
-However recent developments and novel approaches make it increasingly clear,
+it increasingly clear, that this is not true anymore~\cite{Might11}. And
-that this is not true anymore.
+there is a practical need for new results: for example the future HTML 5
+Standard abandons a well-defined grammar specification, in favour of a bespoke
+parser given as pseudo code.
-We propose to on parsers from a certification point of view. Increasingly,
+This work targets parsers from a certification point of view. Increasingly,
-parsers are part of certified compilers, like \mbox{\emph{CompCert}}, which
+parsers are part of certified compilers, like
-are guaranteed to be correct and bug-free. Such certified compilers are
+\mbox{\emph{CompCert}}~\cite{Leroy09}, which are guaranteed to be
-crucial in areas where software just cannot fail. However, so far the parsers
+bug-free. Such certified compilers are crucial in areas where software just
-of these compilers have been left out of the certification.  This is because
+cannot fail. However, so far the parsers of these compilers have been left out
-parsing algorithms are often ad hoc and their semantics is not clearly
+of the certification.  This is because parsing algorithms are often ad hoc and
-specified. Unfortunately, this means parsers can harbour errors that
+their semantics is not clearly specified. This means, unfortunately, parsers
-potentially invalidate the whole certification and correctness of the
+can harbour errors that potentially invalidate the whole certification and
-compiler. In this project, we like to change that with the help of theorem
+correctness of the compiler. In this project, we aim to change this situation
-provers.
+with the help of the theorem prover Isabelle/HOL.
-Only in the last few years, theorem provers have become good enough for
+Only in the last few years, theorem provers have become powerful enough for
 establishing the correctness of some standard lexing and parsing
 algorithms. For this, the algorithms still need to be formulated in way so
 that it is easy to reason about them. In our earlier work about lexing and
 regular languages, we showed that this precludes well-known algorithms based
-automata. However we showed also that regular languages can be formulated and
+automata~\cite{WuZhangUrban11}. However we showed also that regular languages can be formulated and
 reasoned about entirely in terms regular expressions, which can be easily
 represented in theorem provers. This work uses the device of derivatives of
 regular expressions. We like to extend this device to parsers and grammars.
 The aim is to come up with elegant and practical useful parsing algorithms
-whose correctness can be certified in a theorem prover.
+whose correctness can be certified in Isabelle/HOL.
 \section*{Proposed Work}
-A recent development in parsing is Parsing Expression Grammars (PEG), which
+A recent development in the field of parsing are Parsing Expression Grammars
-are an extension of the weel-known Context Free Grammars
+(PEGs)~\cite{Ford04}, which
-(CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as
+are an extension of the well-known Context Free Grammars
-negation and conjunction, on the right-hand sides of grammar rules, as well as
+(CFGs). This extension introduces new regular operators, such as
-priority orderings. With these extensions, PEG parsing becomes much
+negation and conjunction, on the right-hand side of grammar rules, as well as
-more powerful. For example disambiguation, formerly expressed by semantic
+priority orderings for rules. With these extensions, PEG parsing becomes much
+more powerful and more useful in practise. For example disambiguation, formerly expressed by semantic
 filters, can now be expressed directly using grammar rules.
-However, there is serious disadvantage of PEG for applications: is does not
+However, there is a serious limitation of PEGs, which affects potential
-support grammrs involving left recursion~\cite{Ford02b}. Although a new PEG
+applications: grammars involving left recursion are not
-parsing algorithm has been proposed that can deal with left
+allowed~\cite{Ford02}. Although one PEG parsing algorithm has already been
-recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even
+proposed that can deal with left recursion~\cite{WarthDM08}, there
-in ``paper-and-pencil'' form. One aim of this research is to solve this sorry
+is no correctness proof for it, not even one with ``pencil and paper''. One aim of this
-state-of-affairs by either certifying this algorithm or inventing a new
+research is to solve this sorry state-of-affairs by either certifying this
-one. For this we will first formalize a fixed point semantics of PEG, based on
+algorithm or inventing a new one. For this we will first devise a
-which an efficient, certified parsing algorithm can be given given. For this
+fixed-point semantics of PEGs, against which we can certify a parser. For this
-we take as starting point the paper~\cite{Ford04a}, which does not treat
+semantics we take as starting point the paper~\cite{Ford04}, which does not
-left-recursion, but gives an operational semantics for PEG parsing. For the
+treat left-recursion, but gives an operational semantics for PEG
-semantics, it seems plausible that we can adapt work on Boolean
+parsing. There are also good indications that we can adapt work on Boolean
-Grammars~\cite{Okhotin/04a}, which are similar to PEGs, and for which the
+Grammars~\cite{Okhotin04}, which are similar to PEGs, and for which the
-paper ~\cite{journals/iandc/KountouriotisNR09} gives a semantics to negation
+paper~\cite{KountouriotisNR09} gives a fixed-point semantics
-operators, but not to Kleene's star operation.
+to negation operators, but not to the Kleene star.
-For the parsing algorithm, we might also be able to draw inspiration from
+For the parsing algorithm, we might be able to draw inspiration from parsers
-parsers based on Cocke-Younger-Kasami (CYK)
+based on the classic Cocke-Younger-Kasami (CYK)
-algorithms~\cite{journals/iandc/KountouriotisNR09} and
+algorithms~\cite{KountouriotisNR09} and
-Early~\cite{Earley70,AycHor02} parsers. The defect CYK algorithms is that the
+Early~\cite{AycHor02, Earley70} parsers. The defect of CYK algorithms, however,
-original grammar specification needs to be transformed into a normal
+is that the grammar specifications given by the user need to be transformed
-form. This transformation may lead to grammar explosion and inefficient
+into a normal form. This transformation may potentially lead to rule explosion
-parsing. We will investigate whether this transformation can be avoided.
+and hence inefficient parsing. We will investigate whether this transformation
-Early style parsers, which have recently been certified by Ridge [???],
+can be avoided.  The problem with Early-style parsers is that they need to be
-need to be extended to PEG parsing in order to be helpful for us.
+extended to PEG parsing in order to be helpful for us.
 Finally, we want to investigate whether derivatives of regular expressions
-~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}
+~\cite{Brzozowski64,Might11,OwensReppyTuron09}
-can be extended to parsing. Lexing based on derivatives gives rise to very
+can be generalised to PEG parsing. In earlier work, we showed that lexing based on
-elegant regular expression matchers that can be certified in a theorem prover
+derivatives gives rise to very elegant regular expression matchers that can be
-with ease.  We will study whether the idea of taking a derivative of a regular
+certified in a theorem prover with ease.  We will study whether the idea of
-expression can be extended to rules in grammars. The problem that needs to be
+taking a derivative of a regular expression can be extended to rules in
-overcome again arises from possible left recursion in parsing.
+grammars. The problem that needs to be overcome is again how to deal with possible
+left-recursion in grammar rules.
-\newpage
-\small
 \bibliography{Journal/document/root}
 \bibliographystyle{abbrv}
 \end{multicols}
 %  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

changeset 242	093e45c44d91
parent 236	2d4f1334b5ca
child 243	86a4182c73e7