# HG changeset patch # User urbanc # Date 1315228178 0 # Node ID 2d4f1334b5caa22ab90d438097d13c375a8214da # Parent a7ddcad0a0239cfb99fcb182721fd92ce497478c shortened diff -r a7ddcad0a023 -r 2d4f1334b5ca csupp.pdf Binary file csupp.pdf has changed diff -r a7ddcad0a023 -r 2d4f1334b5ca csupp.tex --- a/csupp.tex Mon Sep 05 12:40:22 2011 +0000 +++ b/csupp.tex Mon Sep 05 13:09:38 2011 +0000 @@ -39,123 +39,78 @@ structure that can be analyzed by computers for further processing. One might think that parsing has been studied to death, and after \emph{yacc} and \emph{lex} no new results can be obtained in this area. -However recent results and novel approaches make it increasingly clear, +However recent developments and novel approaches make it increasingly clear, that this is not true anymore. -We propose to approach the subject of parsing from a certification point -of view. Increasingly, parsers are part of certified compilers, like \mbox{\emph{CompCert}}, -which are guaranteed to be correct and bug-free. Such certified compilers are -crucial in areas where software just cannot fail. However, so far the -parsers of these compilers have been left out of the certification. -This is because parsing algorithms are often ad hoc and their semantics -is not clearly specified. Unfortunately, this means parsers can harbour -errors that potentially invalidate the whole certification and correctness -of the compiler. In this project, we like to change that. +We propose to on parsers from a certification point of view. Increasingly, +parsers are part of certified compilers, like \mbox{\emph{CompCert}}, which +are guaranteed to be correct and bug-free. Such certified compilers are +crucial in areas where software just cannot fail. However, so far the parsers +of these compilers have been left out of the certification. This is because +parsing algorithms are often ad hoc and their semantics is not clearly +specified. Unfortunately, this means parsers can harbour errors that +potentially invalidate the whole certification and correctness of the +compiler. In this project, we like to change that with the help of theorem +provers. -Only in the last few years, theorem provers have become good enough -for establishing the correctness of some standard lexing and -parsing algorithms. For this, the algorithms still need to be formulated -in way so that it is easy to reason about them. In our earlier work -about lexing and regular languages, the authors showed that this -precludes well-known algorithms based automata. However we showed -that regular -languages can be formulated and reasoned about entirely in terms -regular expressions, which can be easily represented in theorem -provers. This work uses the device of derivatives of regular -expressions. We like to extend this device to parsers and grammars. +Only in the last few years, theorem provers have become good enough for +establishing the correctness of some standard lexing and parsing +algorithms. For this, the algorithms still need to be formulated in way so +that it is easy to reason about them. In our earlier work about lexing and +regular languages, we showed that this precludes well-known algorithms based +automata. However we showed also that regular languages can be formulated and +reasoned about entirely in terms regular expressions, which can be easily +represented in theorem provers. This work uses the device of derivatives of +regular expressions. We like to extend this device to parsers and grammars. The aim is to come up with elegant and practical useful parsing algorithms -whose correctness can be certified in a -theorem prover. +whose correctness can be certified in a theorem prover. + \section*{Proposed Work} A recent development in parsing is Parsing Expression Grammars (PEG), which -are an extension of the standard Context Free Grammars +are an extension of the weel-known Context Free Grammars (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as negation and conjunction, on the right-hand sides of grammar rules, as well as -priority orderings on rules. With these extensions, PEG parsing becomes much +priority orderings. With these extensions, PEG parsing becomes much more powerful. For example disambiguation, formerly expressed by semantic -filters, can now be expressed directly using grammar rules. This means a -simpler and more systematic treatment of ambiguity and more concise grammar -specifications for programming languages. - -However, a serious disadvantage of PEG is that it does not allow left -recursion, because parsing algorithms for PEG~\cite{Ford02b} can not deal with -left recursions. Although a new PEG parsing algorithm has been proposed -that can deal with left recursion~\cite{conf/pepm/WarthDM08}, there is no -correctness proof, not even in ``paper-and-pencil'' form. One aim of this -research is to solve this sorry state-of-affairs by either certifying this -algorithm or inventing a new one. For this we will first formalize a fixed -point semantics of PEG, based on which an efficient, certified parsing -algorithm can be given given. +filters, can now be expressed directly using grammar rules. -There are several existing works we can draw upon: -\begin{enumerate} -\item The works on PEG. - \begin {enumerate} - \item An operation semantics for PEG has already been given - in~\cite{Ford04a}, but it is not adequate to deal with left recursions. But this - work gives at least a precise description of what the original PEG meant - for. This will serve an a basis to show the conservativeness of - the fixed point semantics we are - going to develop. - - \item The new algorithm~\cite{conf/pepm/WarthDM08} which claimed to be able - to deal with left recursions. Although there is no correctness proof yet, this - may provide some useful inspirations for our new algorithm design. - \end{enumerate} +However, there is serious disadvantage of PEG for applications: is does not +support grammrs involving left recursion~\cite{Ford02b}. Although a new PEG +parsing algorithm has been proposed that can deal with left +recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even +in ``paper-and-pencil'' form. One aim of this research is to solve this sorry +state-of-affairs by either certifying this algorithm or inventing a new +one. For this we will first formalize a fixed point semantics of PEG, based on +which an efficient, certified parsing algorithm can be given given. For this +we take as starting point the paper~\cite{Ford04a}, which does not treat +left-recursion, but gives an operational semantics for PEG parsing. For the +semantics, it seems plausible that we can adapt work on Boolean +Grammars~\cite{Okhotin/04a}, which are similar to PEGs, and for which the +paper ~\cite{journals/iandc/KountouriotisNR09} gives a semantics to negation +operators, but not to Kleene's star operation. -\item The works on Boolean Grammars~\cite{Okhotin/04a}. Boolean Grammar is - very closely related to PEG, because it also contains negative and conjunctive - grammars. The main differences are: First, Boolean Grammar has no ordering on - productions; Second: Boolean Grammar does not contain STAR operator. There are - two works about Boolean Grammar which might be useful for this research: - - \begin{enumerate} - \item A fixed point semantics for Boolean - Grammar~\cite{journals/iandc/KountouriotisNR09}. The idea to define the - semantics of negative and conjunctive operators is certainly what we can - borrow. Therefore, this work gives the basis on which we can add in production - ordering and STAR operator. - - \item A parsing algorithm for Boolean Grammar based on CYK - parsing~\cite{journals/iandc/KountouriotisNR09}. The draw back of CYK parsing - is that: the original grammar specification needs to be transformed into a - normal form. This transformation may lead to grammar explosion and is - undesirable. One aim of this research is to see whether this transformation - can be avoided. For this purpose, other parsing style may provide useful inspirations, for example: +For the parsing algorithm, we might also be able to draw inspiration from +parsers based on Cocke-Younger-Kasami (CYK) +algorithms~\cite{journals/iandc/KountouriotisNR09} and +Early~\cite{Earley70,AycHor02} parsers. The defect CYK algorithms is that the +original grammar specification needs to be transformed into a normal +form. This transformation may lead to grammar explosion and inefficient +parsing. We will investigate whether this transformation can be avoided. +Early style parsers, which have recently been certified by Ridge [???], +need to be extended to PEG parsing in order to be helpful for us. - \begin{enumerate} - \item Derivative - Parsing~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}. Christian - Urban has used derivative methods to establish the correctness of a regular - expression matcher, as well the the finite partition property of regular - expression~\cite{WuZhangUrban11}. There are well founded envisage that the - derivative methods may provide the foundation to the new parsing algorithms of PEG. - - \item Early parsing~\cite{Earley70,AycHor02}. It is a refinement of CYK - parsing which does not require the transformation to normal forms, and - therefore provide one possible direction to adapt the current CYK based - parsing algorithm of Boolean Grammar for PEG grammar. +Finally, we want to investigate whether derivatives of regular expressions +~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023} +can be extended to parsing. Lexing based on derivatives gives rise to very +elegant regular expression matchers that can be certified in a theorem prover +with ease. We will study whether the idea of taking a derivative of a regular +expression can be extended to rules in grammars. The problem that needs to be +overcome again arises from possible left recursion in parsing. - \item The new parsing algorithm proposed by Tom Ridge[???]. Recently, - T. Ridge has proposed and certified an combinator style parsing algorithm for - CFG, which borrows some ideas from Early parsing. The proposed algorithm is - very simple and elegant. We are going to strive for a parsing algorithm as elegant as this one. - \end{enumerate} - Which of the above possibilities will finally get into our final solutions - is an interesting point about this current research. - - \end{enumerate} -\end{enumerate} - -Based on these works, we are quite confident that our idea may lead to some concrete results. - -\mbox{}\\[15cm] - -\noindent - +\newpage \small