\documentclass{article}
\usepackage{a4wide,ot1patch}
\usepackage[latin1]{inputenc}
\usepackage{multicol}
\usepackage{charter}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{fancyheadings}

\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{11mm}
\addtolength{\columnsep}{3mm}
\addtolength{\textheight}{8mm}
\addtolength{\topmargin}{-7.5mm}

\pagestyle{fancyplain}
\lhead[\fancyplain{}{A}]{\fancyplain{}{}}
\rhead[\fancyplain{}{C}]{\fancyplain{}{}}
\renewcommand{\headrulewidth}{0pt}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

\begin{center}
\begin{tabular}{c}
\\[-5mm]
\LARGE\bf Certified Parsing\\[-10mm]
\mbox{}
\end{tabular}
\end{center}
\thispagestyle{empty}
\mbox{}\\[-5mm]

\begin{multicols}{2}
\section*{Background}
\noindent
Parsing is the act of transforming plain text into some
structure that can be analyzed by computers for further processing.
One might think that parsing has been studied to death and that after
\emph{yacc} and \emph{lex} no new results can be obtained in this area.
However, recent results and novel approaches make it increasingly clear
that this is no longer true.

We propose to approach the subject of parsing from a certification point
of view. Increasingly, parsers are part of certified compilers, like
\mbox{\emph{CompCert}}, which are guaranteed to be correct and bug-free.
Such certified compilers are crucial in areas where software just cannot
fail. However, so far the parsers of these compilers have been left out of
the certification. This is because parsing algorithms are often ad hoc and
their semantics is not clearly specified. Unfortunately, this means parsers
can harbour errors that potentially invalidate the whole certification and
correctness of the compiler. In this project, we would like to change that.

Only in the last few years have theorem provers become good enough
for establishing the correctness of some standard lexing and
parsing algorithms. For this, the algorithms still need to be formulated
in a way that makes it easy to reason about them. In our earlier work
on lexing and regular languages, we showed that this
precludes well-known algorithms based on automata. However, we showed
that regular languages can be formulated and reasoned about entirely in
terms of regular expressions, which can be easily represented in theorem
provers. That work uses the device of derivatives of regular
expressions. We would like to extend this device to parsers and grammars.
The aim is to come up with elegant and practically useful parsing algorithms
whose correctness can be certified in a theorem prover.

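To give a flavour of this device: the derivative of a regular expression $r$
with respect to a character $c$ is a regular expression for the tails of all
strings matched by $r$ that start with $c$. The following clauses are a
standard formulation (the notation is ours and is only meant as a sketch):
%
\begin{align*}
\mathit{der}\;c\;\varnothing     &= \varnothing\\
\mathit{der}\;c\;\varepsilon     &= \varnothing\\
\mathit{der}\;c\;d               &= \text{if } c = d \text{ then } \varepsilon \text{ else } \varnothing\\
\mathit{der}\;c\;(r_1 + r_2)     &= \mathit{der}\;c\;r_1 + \mathit{der}\;c\;r_2\\
\mathit{der}\;c\;(r_1 \cdot r_2) &= (\mathit{der}\;c\;r_1)\cdot r_2 \,+\,
   \text{if } \mathit{nullable}\;r_1 \text{ then } \mathit{der}\;c\;r_2 \text{ else } \varnothing\\
\mathit{der}\;c\;(r^\star)       &= (\mathit{der}\;c\;r)\cdot r^\star
\end{align*}
%
where $\mathit{nullable}\;r$ holds whenever $r$ can match the empty string.
A string $c_1\ldots c_n$ is matched by $r$ precisely when the expression
obtained by successively taking derivatives with respect to
$c_1,\ldots,c_n$ is nullable. Each clause is a simple structural recursion,
which is what makes this formulation so convenient in a theorem prover.
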
\section*{Proposed Work}

A recent development in parsing is Parsing Expression Grammars (PEGs), which
are an extension of the standard Context-Free Grammars
(CFGs)~\cite{Ford04a}. The extension introduces new regular operators, such as
negation and conjunction, on the right-hand sides of grammar rules, as well as
priority orderings on rules. With these extensions, PEG parsing becomes much
more powerful. For example, disambiguation, formerly expressed by semantic
filters, can now be expressed directly in the grammar rules. This means a
simpler and more systematic treatment of ambiguity and more concise grammar
specifications for programming languages.

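As a small, purely illustrative example (the concrete syntax below is ours),
the classic dangling-\texttt{else} ambiguity can be resolved in a PEG simply
by the order of the alternatives:
%
\begin{align*}
\mathit{Stmt} \;\leftarrow\;\; & \texttt{if}\;\mathit{Expr}\;\texttt{then}\;\mathit{Stmt}\;\texttt{else}\;\mathit{Stmt}\\
\;/\;\; & \texttt{if}\;\mathit{Expr}\;\texttt{then}\;\mathit{Stmt}\\
\;/\;\; & \ldots
\end{align*}
%
Since the ordered choice $/$ tries the alternatives from top to bottom, an
\texttt{else} always attaches to the nearest \texttt{if}. In a plain CFG
both readings would be derivable and a separate disambiguation filter would
be needed.
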
However, a serious disadvantage of PEGs is that they do not allow left
recursion, because the existing parsing algorithms for PEGs~\cite{Ford02b}
cannot deal with left recursion. Although a new PEG parsing algorithm has
been proposed that can deal with left recursion~\cite{conf/pepm/WarthDM08},
there is no correctness proof for it, not even in ``paper-and-pencil'' form.
One aim of this research is to remedy this sorry state of affairs by either
certifying this algorithm or inventing a new one. For this we will first
formalize a fixed-point semantics of PEGs, on the basis of which an
efficient, certified parsing algorithm can be given.

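To see why left recursion is problematic for PEG parsers, consider the
purely illustrative rule
%
\begin{align*}
\mathit{Expr} \;\leftarrow\; \mathit{Expr}\;\texttt{-}\;\mathit{Num} \;/\; \mathit{Num}
\end{align*}
%
A PEG parser tries the alternatives from left to right and expands
nonterminals by recursive descent: in order to parse $\mathit{Expr}$ it
first has to parse $\mathit{Expr}$ again, without having consumed any
input, and therefore fails to terminate. The usual workaround of rewriting
such rules into right-recursive form changes the associativity of the
operators and obscures the intended grammar.
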
There are several existing works we can draw upon:
\begin{enumerate}
\item Existing work on PEGs:
\begin{enumerate}
\item An operational semantics for PEGs has already been given
in~\cite{Ford04a}, but it is not adequate for dealing with left recursion.
However, this work at least gives a precise description of what the original
PEGs are meant to do. It will serve as a basis for showing the
conservativeness of the fixed-point semantics we are going to develop.

\item The new algorithm~\cite{conf/pepm/WarthDM08}, which is claimed to be
able to deal with left recursion. Although there is no correctness proof
yet, it may provide useful inspiration for the design of our new algorithm.
\end{enumerate}

\item The work on Boolean Grammars~\cite{Okhotin/04a}. Boolean Grammars are
very closely related to PEGs, because they also allow negation and
conjunction in grammar rules. The main differences are: first, Boolean
Grammars have no ordering on productions; second, Boolean Grammars do not
contain a star operator. There are two works on Boolean Grammars that might
be useful for this research:

\begin{enumerate}
\item A fixed-point semantics for Boolean
Grammars~\cite{journals/iandc/KountouriotisNR09}. The way the semantics of
the negation and conjunction operators is defined there is certainly
something we can borrow. This work therefore provides the basis on which we
can add production ordering and a star operator.

\item A parsing algorithm for Boolean Grammars based on CYK
parsing~\cite{journals/iandc/KountouriotisNR09}. The drawback of CYK parsing
is that the original grammar specification needs to be transformed into a
normal form. This transformation may lead to an explosion in the size of the
grammar and is undesirable (a small illustration of the normal-form issue is
given after this list). One aim of this research is to see whether this
transformation can be avoided. For this purpose, other parsing styles may
provide useful inspiration, for example:

\begin{enumerate}
\item Derivative
parsing~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}. Christian
Urban has used derivative methods to establish the correctness of a regular
expression matcher, as well as the finite partition property of regular
expressions~\cite{WuZhangUrban11}. There is well-founded hope that
derivative methods may provide the foundation for new parsing algorithms for
PEGs.

\item Earley parsing~\cite{Earley70,AycHor02}. It is a refinement of CYK
parsing that does not require the transformation into normal form, and
therefore provides one possible direction for adapting the current CYK-based
parsing algorithm for Boolean Grammars to PEGs.

\item The new parsing algorithm proposed by Tom Ridge[???]. Recently,
T.~Ridge has proposed and certified a combinator-style parsing algorithm for
CFGs, which borrows some ideas from Earley parsing. The proposed algorithm
is very simple and elegant. We will strive for a parsing algorithm that is
as elegant as this one.
\end{enumerate}

Which of the above possibilities will finally make it into our solution
is an interesting question of this research.

\end{enumerate}
\end{enumerate}
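
To illustrate the normal-form issue mentioned in the CYK item above, recall
that in Chomsky normal form (the normal form required for Boolean Grammars
is, to our knowledge, similar in spirit) every production must have the
shape $A \rightarrow B\,C$ or $A \rightarrow a$. Even the tiny grammar
$S \rightarrow a\,S\,b \mid a\,b$ has to be rewritten with auxiliary
nonterminals, for instance as $S \rightarrow A\,X \mid A\,B$,
$X \rightarrow S\,B$, $A \rightarrow a$, $B \rightarrow b$. For realistic
programming-language grammars this rewriting increases the number of
productions considerably and obscures the correspondence between the
grammar and the intended language specification.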

Based on this existing work, we are quite confident that our ideas will lead
to concrete results.

\mbox{}\\[15cm]

\noindent

\small
\bibliography{Journal/document/root}
\bibliographystyle{abbrv}
\end{multicols}

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \noindent {\bf Objectives:} The overall goals of the project are as follows:
% \begin{itemize}
% \item To solve the POPLmark challenge.

% \item To complete and greatly improve the existing implementation of the
% nominal datatype package.
% \item To explore the strengths of this package by proving the
% safety of SML.
% \item To provide a basis for extracting programs from safety proofs.

% \item To make the nominal datatype package usable for teaching
% students about the lambda-calculus and the theory of programming
% languages. \smallskip
% \end{itemize}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% TeX-command-default: "PdfLaTeX"
%%% TeX-view-style: (("." "kpdf %s.pdf"))
%%% End: