% author: Christian Urban <christian dot urban at kcl dot ac dot uk>
% date: Fri, 27 Sep 2013 09:20:58 +0100
\documentclass{article}
\usepackage{a4wide,ot1patch}
\usepackage[latin1]{inputenc}
\usepackage{multicol}
\usepackage{charter}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{fancyheadings}

\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{11mm}
\addtolength{\columnsep}{3mm}
\addtolength{\textheight}{8mm}
\addtolength{\topmargin}{-7.5mm}

\pagestyle{fancyplain}
\lhead[\fancyplain{}{A}]{\fancyplain{}{}}
\rhead[\fancyplain{}{C}]{\fancyplain{}{}}
\renewcommand{\headrulewidth}{0pt}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

\begin{center}
\begin{tabular}{c}
\\[-5mm]
\LARGE\bf Novel Certified Parsers\\[-10mm]
\mbox{}
\end{tabular}
\end{center}

\thispagestyle{empty}
\mbox{}\\[-5mm]

\begin{multicols}{2}
\section*{Background}

\noindent
Parsers transform plain text into an abstract structure that can be analysed
by computers for further processing. One might think that parsers have been
studied to death, and that after \emph{yacc} and \emph{lex} no new results can
be obtained in this area. However, recent developments and novel approaches
make it increasingly clear that this is no longer true~\cite{Might11}. And
there is a real practical need for new results: for example, the upcoming
HTML5 standard abandons a well-defined grammar specification in favour of a
bespoke parser given as pseudo code. Proving any property about this parser is
nearly impossible.

This work targets parsers from a certification point of view. Increasingly,
parsers are part of certified compilers, like
\mbox{\emph{CompCert}}~\cite{Leroy09}, which are guaranteed to be
bug-free. Such certified compilers are crucial in areas where software just
cannot fail. However, so far the parsers of these compilers have been left out
of the certification. This is because parsing algorithms are often ad hoc and
their semantics is not clearly specified. This means, unfortunately, that
parsers can harbour errors that potentially invalidate the whole certification
and correctness of the compiler. In this project, we aim to change this
situation with the help of the theorem prover Isabelle/HOL.

Only in the last few years have theorem provers become powerful enough for
establishing the correctness of some standard lexing and parsing
algorithms. For this, the algorithms still need to be formulated in a way that
makes it easy to reason about them. In our earlier work on lexing and regular
languages, we showed that this precludes the well-known algorithms based on
automata~\cite{WuZhangUrban11}. However, we also showed that regular languages
can be formulated and reasoned about entirely in terms of regular expressions,
which can be easily represented in theorem provers. This work uses the device
of derivatives of regular expressions. We would like to extend this device to
parsers and grammars. The aim is to come up with elegant and practically
useful parsing algorithms whose correctness can be certified in Isabelle/HOL.
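To give a flavour of the device, a regular expression matcher based on Brzozowski derivatives fits in a few lines of structural recursion; the following Python sketch is our illustration, not the Isabelle/HOL formalisation (all names are ours):

```python
# Brzozowski derivatives: r matches s iff the derivative of r with
# respect to each character of s, taken in turn, is nullable.

class Zero: pass                      # matches no string
class One: pass                       # matches only the empty string
class Chr:
    def __init__(self, c): self.c = c
class Alt:
    def __init__(self, r1, r2): self.r1, self.r2 = r1, r2
class Seq:
    def __init__(self, r1, r2): self.r1, self.r2 = r1, r2
class Star:
    def __init__(self, r): self.r = r

def nullable(r):
    # does r match the empty string?
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)
    return False

def deriv(r, c):
    # the strings matched by r after stripping a leading character c
    if isinstance(r, Chr): return One() if r.c == c else Zero()
    if isinstance(r, Alt): return Alt(deriv(r.r1, c), deriv(r.r2, c))
    if isinstance(r, Seq):
        d = Seq(deriv(r.r1, c), r.r2)
        return Alt(d, deriv(r.r2, c)) if nullable(r.r1) else d
    if isinstance(r, Star): return Seq(deriv(r.r, c), r)
    return Zero()                     # Zero and One

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

With `r = Star(Seq(Chr('a'), Chr('b')))`, the string "abab" is accepted while "aba" is not. It is precisely this shape, two small structurally recursive functions and no automata, that makes reasoning in a theorem prover tractable.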

\section*{Proposed Work}

A recent development in the field of parsing is Parsing Expression Grammars
(PEGs)~\cite{Ford04}, which are an extension of the well-known Context Free
Grammars (CFGs). This extension introduces new regular operators, such as
negation and conjunction, on the right-hand side of grammar rules, as well as
priority orderings for rules. With these extensions, PEG parsing becomes much
more powerful and more useful in practice. For example, disambiguation,
formerly expressed by semantic filters, can now be expressed directly using
grammar rules.
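The new operators can be pictured as parsing combinators; the following much simplified Python sketch shows prioritised choice and negation (it is our illustration, not one of the certified algorithms we aim for; all names are ours):

```python
# Simplified PEG-style recognisers: a parser maps (input, position) to
# the next position on success, or to None on failure.

def chr_(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None

def seq(p, q):
    def parse(s, i):
        j = p(s, i)
        return q(s, j) if j is not None else None
    return parse

def choice(p, q):
    # prioritised choice: q is only tried if p fails
    def parse(s, i):
        j = p(s, i)
        return j if j is not None else q(s, i)
    return parse

def neg(p):
    # negation: succeeds without consuming input iff p fails
    return lambda s, i: i if p(s, i) is None else None

# 'a' not followed by 'b'
a_not_b = seq(chr_('a'), neg(chr_('b')))
```

Here `a_not_b` accepts at "ac" but fails at "ab"; such negative lookahead is exactly what plain CFGs cannot express directly.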

However, there is a serious limitation of PEGs, which affects potential
applications: grammars involving left recursion are not
allowed~\cite{Ford02}. Although one PEG parsing algorithm has already been
proposed that can deal with left recursion~\cite{WarthDM08}, there is no
correctness proof for it, not even one with ``pencil and paper''. One aim of
this research is to remedy this sorry state of affairs by either certifying
this algorithm or inventing a new one. For this we will first devise a
fixed-point semantics of PEGs, against which we can certify a parser. For this
semantics we take as starting point the paper~\cite{Ford04}, which does not
treat left recursion, but gives an operational semantics for PEG
parsing. There are also good indications that we can adapt work on Boolean
Grammars~\cite{Okhotin04}, which are similar to PEGs and for which the
paper~\cite{KountouriotisNR09} gives a fixed-point semantics for the negation
operators, but not for the Kleene star.
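Why left recursion is problematic can be seen in a deliberately naive recursive-descent reading of a left-recursive rule such as E <- E '+' 'n' / 'n' (a Python sketch of our own; the rule and names are illustrative): the parser re-invokes itself at the same input position without consuming anything, so it never terminates.

```python
# Naive recursive-descent reading of the left-recursive PEG rule
#   E <- E '+' 'n' / 'n'
# The first alternative calls E again at the same position, so no input
# is ever consumed and the recursion cannot terminate.

def E(s, i):
    j = E(s, i)                      # left-recursive call
    if j is not None and s[j:j + 2] == '+n':
        return j + 2
    return i + 1 if s[i:i + 1] == 'n' else None

def diverges(s):
    try:
        E(s, 0)
        return False
    except RecursionError:
        return True
```

A fixed-point semantics assigns a meaning to such rules as the limit of an iteration, which is what a certified algorithm for left-recursive PEGs would have to compute.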

For our parsing algorithm, we might be able to build upon the classic
Cocke-Younger-Kasami (CYK) algorithm~\cite{KountouriotisNR09} and
Earley~\cite{AycHor02, Earley70} parsers. The defect of CYK algorithms,
however, is that the grammar specifications given by the user need to be
transformed into a normal form. This transformation may potentially lead to
rule explosion and hence to inefficient parsing. We will investigate whether
this transformation can be avoided. The problem with Earley-style parsers is
that they need to be extended to PEG parsing in order to be helpful for us.
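The normal-form issue shows up already in a textbook CYK recogniser, which only accepts grammars in Chomsky normal form (an illustrative Python sketch; the grammar encoding and all names are ours):

```python
# Textbook CYK recognition for a grammar in Chomsky normal form.
# unary:  rules A -> a   encoded as pairs   (A, a)
# binary: rules A -> B C encoded as triples (A, B, C)

def cyk(word, unary, binary, start):
    n = len(word)
    # table[i][span] = set of nonterminals deriving word[i:i+span]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, a in enumerate(word):
        table[i][1] = {A for (A, b) in unary if b == a}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                for (A, B, C) in binary:
                    if B in table[i][split] and C in table[i + split][span - split]:
                        table[i][span].add(A)
    return n > 0 and start in table[0][n]

# Tiny grammar: S -> A B, A -> 'a', B -> 'b'
unary = [('A', 'a'), ('B', 'b')]
binary = [('S', 'A', 'B')]
```

Every user grammar must first be squeezed into this unary/binary shape, which is where the rule explosion mentioned above comes from.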
|
Finally, we want to investigate whether derivatives of regular
expressions~\cite{Brzozowski64,Might11,OwensReppyTuron09} can be generalised
to PEG parsing. In earlier work, we showed that lexing based on derivatives
gives rise to very elegant regular expression matchers that can be certified
in a theorem prover with ease. We will study whether the idea of taking the
derivative of a regular expression can be extended to rules in grammars. The
problem that needs to be addressed is again how to deal with left-recursive
grammar rules.
|
\bibliography{Journal/document/root}
\bibliographystyle{abbrv}
\end{multicols}

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \noindent {\bf Objectives:} The overall goals of the project are as follows:
% \begin{itemize}
% \item To solve the POPLmark challenge.
% \item To complete and greatly improve the existing implementation of the
%   nominal datatype package.
% \item To explore the strengths of this package by proving the
%   safety of SML.
% \item To provide a basis for extracting programs from safety proofs.
% \item To make the nominal datatype package usable for teaching
%   students about the lambda-calculus and the theory of programming
%   languages. \smallskip
% \end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}
|
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% TeX-command-default: "PdfLaTeX"
%%% TeX-view-style: (("." "kpdf %s.pdf"))
%%% End: