\documentclass{article}
\usepackage{a4wide,ot1patch}
\usepackage[latin1]{inputenc}
\usepackage{multicol}
\usepackage{charter}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{fancyheadings}
\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{11mm}
\addtolength{\columnsep}{3mm}
\addtolength{\textheight}{8mm}
\addtolength{\topmargin}{-7.5mm}
\pagestyle{fancyplain}
\lhead[\fancyplain{}{A}]{\fancyplain{}{}}
\rhead[\fancyplain{}{C}]{\fancyplain{}{}}
\renewcommand{\headrulewidth}{0pt}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{center}
\begin{tabular}{c}
\\[-5mm]
\LARGE\bf Certified Parsing\\[-10mm]
\mbox{}
\end{tabular}
\end{center}
\thispagestyle{empty}
\mbox{}\\[-5mm]

\begin{multicols}{2}
\section*{Background}

\noindent
Parsing is the act of transforming plain text into a structure that can be
analysed by computers for further processing. One might think that parsing
has been studied to death and that, after \emph{yacc} and \emph{lex}, no
new results can be obtained in this area. However, recent results and
novel approaches make it increasingly clear that this is not the case.

We propose to approach the subject of parsing from a certification point
of view. Increasingly, parsers are part of certified compilers, like
\mbox{\emph{CompCert}}, which are guaranteed to be correct and bug-free.
Such certified compilers are crucial in areas where software simply cannot
fail. So far, however, the parsers of these compilers have been left out
of the certification, because parsing algorithms are often ad hoc and
their semantics is not clearly specified. Unfortunately, this means
parsers can harbour errors that potentially invalidate the whole
certification and correctness of the compiler. In this project, we aim to
change that.

Only in the last few years have theorem provers become good enough for
establishing the correctness of some standard lexing and parsing
algorithms. For this, the algorithms still need to be formulated in a way
that makes it easy to reason about them. In our earlier work on lexing and
regular languages we showed that this precludes the well-known algorithms
based on automata. We also showed, however, that regular languages can be
formulated and reasoned about entirely in terms of regular expressions,
which can be easily represented in theorem provers. This work uses the
device of derivatives of regular expressions. We intend to extend this
device to parsers and grammars. The aim is to come up with elegant and
practically useful parsing algorithms whose correctness can be certified
in a theorem prover.

\section*{Proposed Work}

A recent development in parsing is Parsing Expression Grammars (PEGs),
which are an extension of standard Context-Free Grammars
(CFGs)~\cite{Ford04a}. The extension introduces new regular operators,
such as negation and conjunction, on the right-hand sides of grammar
rules, as well as priority orderings on rules. With these extensions, PEG
parsing becomes much more powerful. For example, disambiguation, formerly
expressed by semantic filters, can now be expressed directly using grammar
rules. This means a simpler and more systematic treatment of ambiguity and
more concise grammar specifications for programming languages.

However, a serious disadvantage of PEGs is that they do not allow left
recursion, because existing parsing algorithms for PEGs~\cite{Ford02b}
cannot deal with left-recursive rules.
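To see why left recursion is problematic, consider the following minimal
sketch of a PEG-style recursive descent parser, written here in Haskell
purely for illustration; the function names and the representation of
parsers as functions are our own choices and not part of any of the cited
algorithms. For the left-recursive rule \texttt{E <- E '+' 'n' / 'n'} the
first alternative re-invokes the parser for \texttt{E} on the same input
before any character has been consumed, so the parser diverges.

{\small
\begin{verbatim}
-- A PEG parser returns the remaining
-- input on success, Nothing on failure.
type Parser = String -> Maybe String

-- Ordered choice: try p, fall back to q.
(</>) :: Parser -> Parser -> Parser
(p </> q) s = case p s of
                Just r  -> Just r
                Nothing -> q s

-- Sequencing: run p, then q on the rest.
(.>) :: Parser -> Parser -> Parser
(p .> q) s = p s >>= q

-- Match one literal character.
lit :: Char -> Parser
lit c (x:xs) | x == c = Just xs
lit _ _               = Nothing

-- Left-recursive rule E <- E '+' 'n' / 'n':
-- the first alternative calls parseE again
-- on the same input, so e.g. parseE "n+n"
-- never terminates.
parseE :: Parser
parseE = (parseE .> lit '+' .> lit 'n')
           </> lit 'n'
\end{verbatim}}

\noindent
A certified PEG parsing algorithm therefore has to either rule out such
grammars or, preferably, give them a well-defined meaning.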
Although a new PEG parsing algorithm has been proposed that can deal with
left recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof,
not even in ``paper-and-pencil'' form. One aim of this research is to
remedy this sorry state of affairs by either certifying this algorithm or
inventing a new one. For this we will first formalise a fixed point
semantics of PEGs, on the basis of which an efficient, certified parsing
algorithm can be given.

There are several existing lines of work we can draw upon:

\begin{enumerate}
\item Work on PEGs.
  \begin{enumerate}
  \item An operational semantics for PEGs has already been given
    in~\cite{Ford04a}, but it cannot deal with left recursion.
    Nevertheless, this work gives a precise description of what the
    original PEG formalism is meant to express. It will serve as a basis
    for showing that the fixed point semantics we are going to develop is
    conservative with respect to it.
  \item The new algorithm of~\cite{conf/pepm/WarthDM08}, which is claimed
    to be able to deal with left recursion. Although there is no
    correctness proof yet, it may provide useful inspiration for the
    design of our new algorithm.
  \end{enumerate}
\item Work on Boolean Grammars~\cite{Okhotin/04a}. Boolean Grammars are
  very closely related to PEGs, because they also contain negation and
  conjunction operators. The main differences are that Boolean Grammars
  have no ordering on productions and do not contain a star operator.
  Two results about Boolean Grammars might be useful for this research:
  \begin{enumerate}
  \item A fixed point semantics for Boolean
    Grammars~\cite{journals/iandc/KountouriotisNR09}. We can certainly
    borrow the way in which the semantics of the negation and conjunction
    operators is defined. This work therefore gives us a basis to which
    production ordering and the star operator can be added.
  \item A parsing algorithm for Boolean Grammars based on CYK
    parsing~\cite{journals/iandc/KountouriotisNR09}. The drawback of CYK
    parsing is that the original grammar needs to be transformed into a
    normal form. This transformation can lead to an explosion in grammar
    size and is undesirable. One aim of this research is to see whether
    the transformation can be avoided. For this purpose, other parsing
    styles may provide useful inspiration, for example:
    \begin{enumerate}
    \item Derivative
      parsing~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}.
      Christian Urban has used derivative methods to establish the
      correctness of a regular expression matcher, as well as the finite
      partition property of regular expressions~\cite{WuZhangUrban11}.
      There is well-founded hope that derivative methods may provide the
      foundation for new parsing algorithms for PEGs (a small
      illustration of the derivative idea is sketched below).
    \item Earley parsing~\cite{Earley70,AycHor02}. It is a refinement of
      CYK parsing that does not require the transformation into a normal
      form, and therefore provides one possible direction for adapting
      the current CYK-based parsing algorithm for Boolean Grammars to
      PEGs.
    \item The new parsing algorithm proposed by Tom Ridge~[???].
      Recently, Ridge has proposed and certified a combinator-style
      parsing algorithm for CFGs, which borrows some ideas from Earley
      parsing. The algorithm is very simple and elegant. We will strive
      for a parsing algorithm that is as elegant as this one.
    \end{enumerate}
    Which of these possibilities will make it into our final solution is
    an interesting aspect of this research.
  \end{enumerate}
\end{enumerate}
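To give a flavour of the derivative approach mentioned above, the
following is a minimal sketch of a Brzozowski-style regular expression
matcher, again in Haskell and again only illustrative: the certified
matcher of~\cite{WuZhangUrban11} is formulated inside a theorem prover,
and extending this scheme from regular expressions to full PEGs, with
rule ordering, negation and conjunction, is precisely the open problem
this proposal addresses.

{\small
\begin{verbatim}
-- Regular expressions.
data Re = Zero          -- empty language
        | One           -- empty string
        | Chr Char      -- single character
        | Alt Re Re     -- r1 + r2
        | Seq Re Re     -- r1 r2
        | Star Re       -- r*

-- Does r match the empty string?
nullable :: Re -> Bool
nullable Zero      = False
nullable One       = True
nullable (Chr _)   = False
nullable (Alt r s) = nullable r || nullable s
nullable (Seq r s) = nullable r && nullable s
nullable (Star _)  = True

-- Brzozowski derivative: der c r matches w
-- exactly when r matches c:w.
der :: Char -> Re -> Re
der _ Zero      = Zero
der _ One       = Zero
der c (Chr d)   = if c == d then One else Zero
der c (Alt r s) = Alt (der c r) (der c s)
der c (Seq r s)
  | nullable r  = Alt (Seq (der c r) s)
                      (der c s)
  | otherwise   = Seq (der c r) s
der c (Star r)  = Seq (der c r) (Star r)

-- Match by repeated differentiation.
matches :: Re -> String -> Bool
matches r s = nullable (foldl (flip der) r s)
\end{verbatim}}

\noindent
For example, \texttt{matches (Seq (Chr 'a') (Star (Chr 'b'))) "abb"}
evaluates to \texttt{True}. The appeal of this style for certification is
that \texttt{nullable} and \texttt{der} are simple structural recursions,
which are straightforward to represent and reason about in a theorem
prover.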
Based on the works listed above, we are confident that our approach will
lead to concrete results.

\mbox{}\\[15cm]
\noindent
\small
\bibliography{Journal/document/root}
\bibliographystyle{abbrv}

\end{multicols}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% TeX-command-default: "PdfLaTeX"
%%% TeX-view-style: (("." "kpdf %s.pdf"))
%%% End: