author | urbanc |
Wed, 14 Sep 2011 13:39:03 +0000 | |
changeset 253 | bcef7669f55a |
parent 244 | a9598a206c41 |
child 258 | 1abf8586ee6b |
permissions | -rwxr-xr-x |
225 | 1 |
\documentclass{article} |
2 |
\usepackage{a4wide,ot1patch} |
|
3 |
\usepackage[latin1]{inputenc} |
|
4 |
\usepackage{multicol} |
|
5 |
\usepackage{charter} |
|
6 |
\usepackage{amsmath,amssymb,amsthm} |
|
7 |
\usepackage{fancyheadings} |
|
8 |
||
9 |
\addtolength{\oddsidemargin}{-6mm} |
|
10 |
\addtolength{\evensidemargin}{-6mm} |
|
11 |
\addtolength{\textwidth}{11mm} |
|
12 |
\addtolength{\columnsep}{3mm} |
|
13 |
\addtolength{\textheight}{8mm} |
|
14 |
\addtolength{\topmargin}{-7.5mm} |
|
15 |
||
16 |
\pagestyle{fancyplain} |
|
17 |
\lhead[\fancyplain{}{A}]{\fancyplain{}{}} |
|
18 |
\rhead[\fancyplain{}{C}]{\fancyplain{}{}} |
|
19 |
\renewcommand{\headrulewidth}{0pt} |
|
20 |
||
21 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
22 |
\begin{document} |
|
23 |
||
24 |
||
25 |
\begin{center} |
|
26 |
\begin{tabular}{c} |
|
27 |
\\[-5mm] |
|
243 | 28 |
\LARGE\bf Novel Certified Parsers\\[-10mm] |
229 | 29 |
\mbox{} |
225 | 30 |
\end{tabular} |
31 |
\end{center} |
|
32 |
\thispagestyle{empty} |
|
228 | 33 |
\mbox{}\\[-5mm] |
225 | 34 |
|
35 |
\begin{multicols}{2} |
|
228 | 36 |
\section*{Background} |
242 | 37 |
|
225 | 38 |
\noindent |
242 | 39 |
Parsers transform plain text into some abstract structure that can be analyzed by |
40 |
computers for further processing. One might think that parsers have been |
|
41 |
studied to death, and after \emph{yacc} and \emph{lex} no new results can be |
|
42 |
obtained in this area. However recent developments and novel approaches make |
|
43 |
it increasingly clear, that this is not true anymore~\cite{Might11}. And |
|
244 | 44 |
there is a real practical need for new results: for example the future HTML5 |
242 | 45 |
Standard abandons a well-defined grammar specification, in favour of a bespoke |
46 |
parser given as pseudo code. |
|
225 | 47 |
|
242 | 48 |
This work targets parsers from a certification point of view. Increasingly, |
49 |
parsers are part of certified compilers, like |
|
50 |
\mbox{\emph{CompCert}}~\cite{Leroy09}, which are guaranteed to be |
|
51 |
bug-free. Such certified compilers are crucial in areas where software just |
|
52 |
cannot fail. However, so far the parsers of these compilers have been left out |
|
53 |
of the certification. This is because parsing algorithms are often ad hoc and |
|
54 |
their semantics is not clearly specified. This means, unfortunately, parsers |
|
55 |
can harbour errors that potentially invalidate the whole certification and |
|
56 |
correctness of the compiler. In this project, we aim to change this situation |
|
57 |
with the help of the theorem prover Isabelle/HOL. |
|
225 | 58 |
|
242 | 59 |
Only in the last few years, theorem provers have become powerful enough for |
236 | 60 |
establishing the correctness of some standard lexing and parsing |
243 | 61 |
algorithms. For this, the algorithms still need to be formulated in a way so |
236 | 62 |
that it is easy to reason about them. In our earlier work about lexing and |
63 |
regular languages, we showed that this precludes well-known algorithms based |
|
242 | 64 |
automata~\cite{WuZhangUrban11}. However we showed also that regular languages can be formulated and |
236 | 65 |
reasoned about entirely in terms regular expressions, which can be easily |
66 |
represented in theorem provers. This work uses the device of derivatives of |
|
67 |
regular expressions. We like to extend this device to parsers and grammars. |
|
234 | 68 |
The aim is to come up with elegant and practical useful parsing algorithms |
242 | 69 |
whose correctness can be certified in Isabelle/HOL. |
236 | 70 |
|
228 | 71 |
|
229 | 72 |
\section*{Proposed Work} |
73 |
||
242 | 74 |
A recent development in the field of parsing are Parsing Expression Grammars |
75 |
(PEGs)~\cite{Ford04}, which |
|
76 |
are an extension of the well-known Context Free Grammars |
|
77 |
(CFGs). This extension introduces new regular operators, such as |
|
78 |
negation and conjunction, on the right-hand side of grammar rules, as well as |
|
79 |
priority orderings for rules. With these extensions, PEG parsing becomes much |
|
80 |
more powerful and more useful in practise. For example disambiguation, formerly expressed by semantic |
|
236 | 81 |
filters, can now be expressed directly using grammar rules. |
230 | 82 |
|
242 | 83 |
However, there is a serious limitation of PEGs, which affects potential |
84 |
applications: grammars involving left recursion are not |
|
85 |
allowed~\cite{Ford02}. Although one PEG parsing algorithm has already been |
|
86 |
proposed that can deal with left recursion~\cite{WarthDM08}, there |
|
87 |
is no correctness proof for it, not even one with ``pencil and paper''. One aim of this |
|
88 |
research is to solve this sorry state-of-affairs by either certifying this |
|
89 |
algorithm or inventing a new one. For this we will first devise a |
|
90 |
fixed-point semantics of PEGs, against which we can certify a parser. For this |
|
91 |
semantics we take as starting point the paper~\cite{Ford04}, which does not |
|
92 |
treat left-recursion, but gives an operational semantics for PEG |
|
93 |
parsing. There are also good indications that we can adapt work on Boolean |
|
243 | 94 |
Grammars~\cite{Okhotin04}, which are similar to PEGs and for which the |
242 | 95 |
paper~\cite{KountouriotisNR09} gives a fixed-point semantics |
96 |
to negation operators, but not to the Kleene star. |
|
234 | 97 |
|
243 | 98 |
For the parsing algorithm, we might be able to build upon |
99 |
the classic Cocke-Younger-Kasami (CYK) |
|
242 | 100 |
algorithms~\cite{KountouriotisNR09} and |
101 |
Early~\cite{AycHor02, Earley70} parsers. The defect of CYK algorithms, however, |
|
102 |
is that the grammar specifications given by the user need to be transformed |
|
103 |
into a normal form. This transformation may potentially lead to rule explosion |
|
104 |
and hence inefficient parsing. We will investigate whether this transformation |
|
105 |
can be avoided. The problem with Early-style parsers is that they need to be |
|
106 |
extended to PEG parsing in order to be helpful for us. |
|
107 |
||
234 | 108 |
|
236 | 109 |
Finally, we want to investigate whether derivatives of regular expressions |
242 | 110 |
~\cite{Brzozowski64,Might11,OwensReppyTuron09} |
111 |
can be generalised to PEG parsing. In earlier work, we showed that lexing based on |
|
112 |
derivatives gives rise to very elegant regular expression matchers that can be |
|
113 |
certified in a theorem prover with ease. We will study whether the idea of |
|
114 |
taking a derivative of a regular expression can be extended to rules in |
|
243 | 115 |
grammars. The problem that needs to be addressed is again how to deal with |
116 |
left-recursive grammar rules. |
|
229 | 117 |
|
225 | 118 |
|
242 | 119 |
|
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
120 |
\bibliography{Journal/document/root} |
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
121 |
\bibliographystyle{abbrv} |
225 | 122 |
\end{multicols} |
123 |
||
124 |
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
125 |
% \noindent {\bf Objectives:} The overall goals of the project are as follows: |
|
126 |
% \begin{itemize} |
|
127 |
% \item To solve the POPLmark challenge. |
|
128 |
||
129 |
% \item To complete and greatly improve the existing implementation of the |
|
130 |
% nominal datatype package. |
|
131 |
% \item To explore the strengths of this package by proving the |
|
132 |
% safety of SML. |
|
133 |
% \item To provide a basis for extracting programs from safety proofs. |
|
134 |
||
229 | 135 |
% \item To make the nominal datatype package usable for teaching |
225 | 136 |
% students about the lambda-calculus and the theory of programming |
137 |
% languages. \smallskip |
|
138 |
% \end{itemize} |
|
139 |
||
140 |
||
141 |
||
142 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
143 |
\end{document} |
|
144 |
||
229 | 145 |
%%% Local Variables: |
225 | 146 |
%%% mode: latex |
147 |
%%% TeX-master: t |
|
148 |
%%% TeX-command-default: "PdfLaTeX" |
|
149 |
%%% TeX-view-style: (("." "kpdf %s.pdf")) |
|
229 | 150 |
%%% End: |