author | urbanc |
Mon, 05 Sep 2011 12:33:26 +0000 | |
changeset 234 | eeadb4e51d74 |
parent 232 | 114064363ef0 |
child 235 | a7ddcad0a023 |
permissions | -rwxr-xr-x |
225 | 1 |
\documentclass{article} |
2 |
\usepackage{a4wide,ot1patch} |
|
3 |
\usepackage[latin1]{inputenc} |
|
4 |
\usepackage{multicol} |
|
5 |
\usepackage{charter} |
|
6 |
\usepackage{amsmath,amssymb,amsthm} |
|
7 |
\usepackage{fancyheadings} |
|
8 |
||
9 |
\addtolength{\oddsidemargin}{-6mm} |
|
10 |
\addtolength{\evensidemargin}{-6mm} |
|
11 |
\addtolength{\textwidth}{11mm} |
|
12 |
\addtolength{\columnsep}{3mm} |
|
13 |
\addtolength{\textheight}{8mm} |
|
14 |
\addtolength{\topmargin}{-7.5mm} |
|
15 |
||
16 |
\pagestyle{fancyplain} |
|
17 |
\lhead[\fancyplain{}{A}]{\fancyplain{}{}} |
|
18 |
\rhead[\fancyplain{}{C}]{\fancyplain{}{}} |
|
19 |
\renewcommand{\headrulewidth}{0pt} |
|
20 |
||
21 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
22 |
\begin{document} |
|
23 |
||
24 |
||
25 |
\begin{center} |
|
26 |
\begin{tabular}{c} |
|
27 |
\\[-5mm] |
|
28 |
\LARGE\bf Certified Parsing\\[-10mm] |
|
229 | 29 |
\mbox{} |
225 | 30 |
\end{tabular} |
31 |
\end{center} |
|
32 |
\thispagestyle{empty} |
|
228 | 33 |
\mbox{}\\[-5mm] |
225 | 34 |
|
35 |
\begin{multicols}{2} |
|
228 | 36 |
\section*{Background} |
225 | 37 |
\noindent |
38 |
Parsing is the act of transforming plain text into some |
|
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
39 |
structure that can be analyzed by computers for further processing. |
225 | 40 |
One might think that parsing has been studied to death and after |
229 | 41 |
\emph{yacc} and \emph{lex} no new results can be obtained in this area. |
225 | 42 |
However recent results and novel approaches make it increasingly clear, |
43 |
that this is not true anymore. |
|
44 |
||
229 | 45 |
We propose to approach the subject of parsing from a certification point |
234 | 46 |
of view. Increasingly, parsers are part of certified compilers, like \mbox{\emph{CompCert}}, |
229 | 47 |
which are guaranteed to be correct and bug-free. Such certified compilers are |
228 | 48 |
crucial in areas where software just cannot fail. However, so far the |
225 | 49 |
parsers of these compilers have been left out of the certification. |
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
50 |
This is because parsing algorithms are often ad hoc and their semantics |
229 | 51 |
is not clearly specified. Unfortunately, this means parsers can harbour |
52 |
errors that potentially invalidate the whole certification and correctness |
|
225 | 53 |
of the compiler. In this project, we like to change that. |
54 |
||
55 |
Only in the last few years, theorem provers have become good enough |
|
56 |
for establishing the correctness of some standard lexing and |
|
234 | 57 |
parsing algorithms. For this, the algorithms still need to be formulated |
58 |
in way so that it is easy to reason about them. In our earlier work |
|
229 | 59 |
about lexing and regular languages, the authors showed that this |
234 | 60 |
precludes well-known algorithms based automata. However we showed |
61 |
that regular |
|
229 | 62 |
languages can be formulated and reasoned about entirely in terms |
225 | 63 |
regular expressions, which can be easily represented in theorem |
229 | 64 |
provers. This work uses the device of derivatives of regular |
227 | 65 |
expressions. We like to extend this device to parsers and grammars. |
234 | 66 |
The aim is to come up with elegant and practical useful parsing algorithms |
67 |
whose correctness can be certified in a |
|
228 | 68 |
theorem prover. |
69 |
||
229 | 70 |
\section*{Proposed Work} |
71 |
||
234 | 72 |
A recent development in parsing is Parsing Expression Grammars (PEG), which |
73 |
are an extension of the standard Context Free Grammars |
|
74 |
(CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as |
|
75 |
negation and conjunction, on the right-hand sides of grammar rules, as well as |
|
76 |
priority orderings on rules. With these extensions, PEG parsing becomes much |
|
77 |
more powerful. For example disambiguation, formerly expressed by semantic |
|
78 |
filters, can now be expressed directly using grammar rules. This means a |
|
79 |
simpler and more systematic treatment of ambiguity and more concise grammar |
|
80 |
specifications for programming languages. |
|
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
81 |
|
234 | 82 |
However, a serious disadvantage of PEG is that it does not allow left |
83 |
recursion, because parsing algorithms for PEG~\cite{Ford02b} can not deal with |
|
84 |
left recursions. Although a new PEG parsing algorithm has been proposed |
|
85 |
that can deal with left recursion~\cite{conf/pepm/WarthDM08}, there is no |
|
86 |
correctness proof, not even in ``paper-and-pencil'' form. One aim of this |
|
87 |
research is to solve this sorry state-of-affairs by either certifying this |
|
88 |
algorithm or inventing a new one. For this we will first formalize a fixed |
|
89 |
point semantics of PEG, based on which an efficient, certified parsing |
|
90 |
algorithm can be given given. |
|
230 | 91 |
|
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
92 |
There are several existing works we can draw upon: |
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
93 |
\begin{enumerate} |
234 | 94 |
\item The works on PEG. |
95 |
\begin {enumerate} |
|
96 |
\item An operation semantics for PEG has already been given |
|
97 |
in~\cite{Ford04a}, but it is not adequate to deal with left recursions. But this |
|
98 |
work gives at least a precise description of what the original PEG meant |
|
99 |
for. This will serve an a basis to show the conservativeness of |
|
100 |
the fixed point semantics we are |
|
101 |
going to develop. |
|
102 |
||
103 |
\item The new algorithm~\cite{conf/pepm/WarthDM08} which claimed to be able |
|
104 |
to deal with left recursions. Although there is no correctness proof yet, this |
|
105 |
may provide some useful inspirations for our new algorithm design. |
|
106 |
\end{enumerate} |
|
107 |
||
108 |
\item The works on Boolean Grammars~\cite{Okhotin/04a}. Boolean Grammar is |
|
109 |
very closely related to PEG, because it also contains negative and conjunctive |
|
110 |
grammars. The main differences are: First, Boolean Grammar has no ordering on |
|
111 |
productions; Second: Boolean Grammar does not contain STAR operator. There are |
|
112 |
two works about Boolean Grammar which might be useful for this research: |
|
113 |
||
114 |
\begin{enumerate} |
|
115 |
\item A fixed point semantics for Boolean |
|
116 |
Grammar~\cite{journals/iandc/KountouriotisNR09}. The idea to define the |
|
117 |
semantics of negative and conjunctive operators is certainly what we can |
|
118 |
borrow. Therefore, this work gives the basis on which we can add in production |
|
119 |
ordering and STAR operator. |
|
120 |
||
121 |
\item A parsing algorithm for Boolean Grammar based on CYK |
|
122 |
parsing~\cite{journals/iandc/KountouriotisNR09}. The draw back of CYK parsing |
|
123 |
is that: the original grammar specification needs to be transformed into a |
|
124 |
normal form. This transformation may lead to grammar explosion and is |
|
125 |
undesirable. One aim of this research is to see whether this transformation |
|
126 |
can be avoided. For this purpose, other parsing style may provide useful inspirations, for example: |
|
127 |
||
128 |
\begin{enumerate} |
|
129 |
\item Derivative |
|
130 |
Parsing~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}. Christian |
|
131 |
Urban has used derivative methods to establish the correctness of a regular |
|
132 |
expression matcher, as well the the finite partition property of regular |
|
133 |
expression~\cite{WuZhangUrban11}. There are well founded envisage that the |
|
134 |
derivative methods may provide the foundation to the new parsing algorithms of PEG. |
|
135 |
||
136 |
\item Early parsing~\cite{Earley70,AycHor02}. It is a refinement of CYK |
|
137 |
parsing which does not require the transformation to normal forms, and |
|
138 |
therefore provide one possible direction to adapt the current CYK based |
|
139 |
parsing algorithm of Boolean Grammar for PEG grammar. |
|
140 |
||
141 |
\item The new parsing algorithm proposed by Tom Ridge[???]. Recently, |
|
142 |
T. Ridge has proposed and certified an combinator style parsing algorithm for |
|
143 |
CFG, which borrows some ideas from Early parsing. The proposed algorithm is |
|
144 |
very simple and elegant. We are going to strive for a parsing algorithm as elegant as this one. |
|
145 |
\end{enumerate} |
|
146 |
||
147 |
Which of the above possibilities will finally get into our final solutions |
|
148 |
is an interesting point about this current research. |
|
149 |
||
150 |
\end{enumerate} |
|
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
151 |
\end{enumerate} |
234 | 152 |
|
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
153 |
Based on these works, we are quite confident that our idea may lead to some concrete results. |
229 | 154 |
|
225 | 155 |
\mbox{}\\[15cm] |
156 |
||
229 | 157 |
\noindent |
158 |
||
159 |
||
225 | 160 |
|
232
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
161 |
\small |
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
162 |
\bibliography{Journal/document/root} |
114064363ef0
Proposal paragraphs by Xingyuan completed (with references added).
zhang
parents:
230
diff
changeset
|
163 |
\bibliographystyle{abbrv} |
225 | 164 |
\end{multicols} |
165 |
||
166 |
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
167 |
% \noindent {\bf Objectives:} The overall goals of the project are as follows: |
|
168 |
% \begin{itemize} |
|
169 |
% \item To solve the POPLmark challenge. |
|
170 |
||
171 |
% \item To complete and greatly improve the existing implementation of the |
|
172 |
% nominal datatype package. |
|
173 |
% \item To explore the strengths of this package by proving the |
|
174 |
% safety of SML. |
|
175 |
% \item To provide a basis for extracting programs from safety proofs. |
|
176 |
||
229 | 177 |
% \item To make the nominal datatype package usable for teaching |
225 | 178 |
% students about the lambda-calculus and the theory of programming |
179 |
% languages. \smallskip |
|
180 |
% \end{itemize} |
|
181 |
||
182 |
||
183 |
||
184 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
|
185 |
\end{document} |
|
186 |
||
229 | 187 |
%%% Local Variables: |
225 | 188 |
%%% mode: latex |
189 |
%%% TeX-master: t |
|
190 |
%%% TeX-command-default: "PdfLaTeX" |
|
191 |
%%% TeX-view-style: (("." "kpdf %s.pdf")) |
|
229 | 192 |
%%% End: |