regexp: csupp.tex@44c4450152e3

\documentclass{article}
\usepackage{a4wide,ot1patch}
\usepackage[latin1]{inputenc}
\usepackage{multicol}
\usepackage{charter}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{fancyheadings}

\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{11mm}
\addtolength{\columnsep}{3mm}
\addtolength{\textheight}{8mm}
\addtolength{\topmargin}{-7.5mm}

\pagestyle{fancyplain}
\lhead[\fancyplain{}{A}]{\fancyplain{}{}}
\rhead[\fancyplain{}{C}]{\fancyplain{}{}}
\renewcommand{\headrulewidth}{0pt}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}


\begin{center}
\begin{tabular}{c}
\\[-5mm]
\LARGE\bf Novel Certified Parsers\\[-10mm]
\mbox{}
\end{tabular}
\end{center}
\thispagestyle{empty}
\mbox{}\\[-5mm]

\begin{multicols}{2}
\section*{Background}

\noindent
Parsers transform plain text into some abstract structure that can be analyzed by
computers for further processing.  One might think that parsers have been
studied to death, and after \emph{yacc} and \emph{lex} no new results can be
obtained in this area.  However recent developments and novel approaches make
it increasingly clear, that this is not true anymore~\cite{Might11}. And
there is a real practical need for new results: for example the future HTML5 
Standard abandons a well-defined grammar specification, in favour of a bespoke
parser given as pseudo code. Proving any property about this parser is nearly 
impossible.

This work targets parsers from a certification point of view. Increasingly,
parsers are part of certified compilers, like
\mbox{\emph{CompCert}}~\cite{Leroy09}, which are guaranteed to be
bug-free. Such certified compilers are crucial in areas where software just
cannot fail. However, so far the parsers of these compilers have been left out
of the certification.  This is because parsing algorithms are often ad hoc and
their semantics is not clearly specified. This means, unfortunately, parsers
can harbour errors that potentially invalidate the whole certification and
correctness of the compiler. In this project, we aim to change this situation
with the help of the theorem prover Isabelle/HOL.

Only in the last few years, theorem provers have become powerful enough for
establishing the correctness of some standard lexing and parsing
algorithms. For this, the algorithms still need to be formulated in a way so
that it is easy to reason about them. In our earlier work about lexing and
regular languages, we showed that this precludes well-known algorithms based
automata~\cite{WuZhangUrban11}. However we showed also that regular languages can be formulated and
reasoned about entirely in terms regular expressions, which can be easily
represented in theorem provers. This work uses the device of derivatives of
regular expressions. We like to extend this device to parsers and grammars.
The aim is to come up with elegant and practical useful parsing algorithms
whose correctness can be certified in Isabelle/HOL.


\section*{Proposed Work}

A recent development in the field of parsing are Parsing Expression Grammars
(PEGs)~\cite{Ford04}, which
are an extension of the well-known Context Free Grammars
(CFGs). This extension introduces new regular operators, such as
negation and conjunction, on the right-hand side of grammar rules, as well as
priority orderings for rules. With these extensions, PEG parsing becomes much
more powerful and more useful in practice. For example disambiguation, formerly expressed by semantic
filters, can now be expressed directly using grammar rules. 

However, there is a serious limitation of PEGs, which affects potential
applications: grammars involving left recursion are not
allowed~\cite{Ford02}. Although one PEG parsing algorithm has already been
proposed that can deal with left recursion~\cite{WarthDM08}, there
is no correctness proof for it, not even one with ``pencil and paper''. One aim of this
research is to solve this sorry state-of-affairs by either certifying this
algorithm or inventing a new one. For this we will first devise a
fixed-point semantics of PEGs, against which we can certify a parser. For this
semantics we take as starting point the paper~\cite{Ford04}, which does not
treat left-recursion, but gives an operational semantics for PEG
parsing. There are also good indications that we can adapt work on Boolean
Grammars~\cite{Okhotin04}, which are similar to PEGs and for which the
paper~\cite{KountouriotisNR09} gives a fixed-point semantics 
for negation operators, but not to the Kleene star.

For our parsing algorithm, we might be able to build upon
the classic Cocke-Younger-Kasami (CYK)
algorithms~\cite{KountouriotisNR09} and
Early~\cite{AycHor02, Earley70} parsers. The defect of CYK algorithms, however,
is that the grammar specifications given by the user need to be transformed
into a normal form. This transformation may potentially lead to rule explosion
and hence inefficient parsing. We will investigate whether this transformation
can be avoided.  The problem with Early-style parsers is that they need to be 
extended to PEG parsing in order to be helpful for us.


Finally, we want to investigate whether derivatives of regular expressions
~\cite{Brzozowski64,Might11,OwensReppyTuron09}
can be generalised to PEG parsing. In earlier work, we showed that lexing based on
derivatives gives rise to very elegant regular expression matchers that can be
certified in a theorem prover with ease.  We will study whether the idea of
taking a derivative of a regular expression can be extended to rules in
grammars. The problem that needs to be addressed is again how to deal with 
left-recursive grammar rules.



\bibliography{Journal/document/root}
\bibliographystyle{abbrv}
\end{multicols}

%  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  \noindent {\bf Objectives:} The overall goals of the project are as follows:
%  \begin{itemize}
%  \item To solve the POPLmark challenge.

%  \item To complete and greatly improve the existing implementation of the
%    nominal datatype package.
%  \item To explore the strengths of this package by proving the
%    safety of SML.
%  \item To provide a basis for extracting programs from safety proofs.

%  \item To make the nominal datatype package usable for teaching
%    students about the lambda-calculus and the theory of programming
%    languages. \smallskip
%  \end{itemize}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% TeX-command-default: "PdfLaTeX"
%%% TeX-view-style: (("." "kpdf %s.pdf"))
%%% End:
author	Christian Urban <christian dot urban at kcl dot ac dot uk>
	Fri, 01 Mar 2013 17:13:32 +0000 (2013-03-01)
changeset 375	44c4450152e3
parent 258	1abf8586ee6b
permissions	-rwxr-xr-x