32 \thispagestyle{empty} |
32 \thispagestyle{empty} |
33 \mbox{}\\[-5mm] |
33 \mbox{}\\[-5mm] |
34 |
34 |
35 \begin{multicols}{2} |
35 \begin{multicols}{2} |
36 \section*{Background} |
36 \section*{Background} |
|
37 |
37 \noindent |
38 \noindent |
38 Parsing is the act of transforming plain text into some |
39 Parsers transform plain text into some abstract structure that can be analyzed by |
39 structure that can be analyzed by computers for further processing. |
40 computers for further processing. One might think that parsers have been |
40 One might think that parsing has been studied to death, and after |
41 studied to death, and after \emph{yacc} and \emph{lex} no new results can be |
41 \emph{yacc} and \emph{lex} no new results can be obtained in this area. |
42 obtained in this area. However recent developments and novel approaches make |
42 However recent developments and novel approaches make it increasingly clear, |
43 it increasingly clear, that this is not true anymore~\cite{Might11}. And |
43 that this is not true anymore. |
44 there is a practical need for new results: for example the future HTML 5 |
|
45 Standard abandons a well-defined grammar specification, in favour of a bespoke |
|
46 parser given as pseudo code. |
44 |
47 |
45 We propose to on parsers from a certification point of view. Increasingly, |
48 This work targets parsers from a certification point of view. Increasingly, |
46 parsers are part of certified compilers, like \mbox{\emph{CompCert}}, which |
49 parsers are part of certified compilers, like |
47 are guaranteed to be correct and bug-free. Such certified compilers are |
50 \mbox{\emph{CompCert}}~\cite{Leroy09}, which are guaranteed to be |
48 crucial in areas where software just cannot fail. However, so far the parsers |
51 bug-free. Such certified compilers are crucial in areas where software just |
49 of these compilers have been left out of the certification. This is because |
52 cannot fail. However, so far the parsers of these compilers have been left out |
50 parsing algorithms are often ad hoc and their semantics is not clearly |
53 of the certification. This is because parsing algorithms are often ad hoc and |
51 specified. Unfortunately, this means parsers can harbour errors that |
54 their semantics is not clearly specified. This means, unfortunately, parsers |
52 potentially invalidate the whole certification and correctness of the |
55 can harbour errors that potentially invalidate the whole certification and |
53 compiler. In this project, we like to change that with the help of theorem |
56 correctness of the compiler. In this project, we aim to change this situation |
54 provers. |
57 with the help of the theorem prover Isabelle/HOL. |
55 |
58 |
56 Only in the last few years, theorem provers have become good enough for |
59 Only in the last few years, theorem provers have become powerful enough for |
57 establishing the correctness of some standard lexing and parsing |
60 establishing the correctness of some standard lexing and parsing |
58 algorithms. For this, the algorithms still need to be formulated in way so |
61 algorithms. For this, the algorithms still need to be formulated in way so |
59 that it is easy to reason about them. In our earlier work about lexing and |
62 that it is easy to reason about them. In our earlier work about lexing and |
60 regular languages, we showed that this precludes well-known algorithms based |
63 regular languages, we showed that this precludes well-known algorithms based |
61 automata. However we showed also that regular languages can be formulated and |
64 automata~\cite{WuZhangUrban11}. However we showed also that regular languages can be formulated and |
62 reasoned about entirely in terms regular expressions, which can be easily |
65 reasoned about entirely in terms regular expressions, which can be easily |
63 represented in theorem provers. This work uses the device of derivatives of |
66 represented in theorem provers. This work uses the device of derivatives of |
64 regular expressions. We like to extend this device to parsers and grammars. |
67 regular expressions. We like to extend this device to parsers and grammars. |
65 The aim is to come up with elegant and practical useful parsing algorithms |
68 The aim is to come up with elegant and practical useful parsing algorithms |
66 whose correctness can be certified in a theorem prover. |
69 whose correctness can be certified in Isabelle/HOL. |
67 |
70 |
68 |
71 |
69 \section*{Proposed Work} |
72 \section*{Proposed Work} |
70 |
73 |
71 A recent development in parsing is Parsing Expression Grammars (PEG), which |
74 A recent development in the field of parsing are Parsing Expression Grammars |
72 are an extension of the weel-known Context Free Grammars |
75 (PEGs)~\cite{Ford04}, which |
73 (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as |
76 are an extension of the well-known Context Free Grammars |
74 negation and conjunction, on the right-hand sides of grammar rules, as well as |
77 (CFGs). This extension introduces new regular operators, such as |
75 priority orderings. With these extensions, PEG parsing becomes much |
78 negation and conjunction, on the right-hand side of grammar rules, as well as |
76 more powerful. For example disambiguation, formerly expressed by semantic |
79 priority orderings for rules. With these extensions, PEG parsing becomes much |
|
80 more powerful and more useful in practise. For example disambiguation, formerly expressed by semantic |
77 filters, can now be expressed directly using grammar rules. |
81 filters, can now be expressed directly using grammar rules. |
78 |
82 |
79 However, there is serious disadvantage of PEG for applications: is does not |
83 However, there is a serious limitation of PEGs, which affects potential |
80 support grammrs involving left recursion~\cite{Ford02b}. Although a new PEG |
84 applications: grammars involving left recursion are not |
81 parsing algorithm has been proposed that can deal with left |
85 allowed~\cite{Ford02}. Although one PEG parsing algorithm has already been |
82 recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even |
86 proposed that can deal with left recursion~\cite{WarthDM08}, there |
83 in ``paper-and-pencil'' form. One aim of this research is to solve this sorry |
87 is no correctness proof for it, not even one with ``pencil and paper''. One aim of this |
84 state-of-affairs by either certifying this algorithm or inventing a new |
88 research is to solve this sorry state-of-affairs by either certifying this |
85 one. For this we will first formalize a fixed point semantics of PEG, based on |
89 algorithm or inventing a new one. For this we will first devise a |
86 which an efficient, certified parsing algorithm can be given given. For this |
90 fixed-point semantics of PEGs, against which we can certify a parser. For this |
87 we take as starting point the paper~\cite{Ford04a}, which does not treat |
91 semantics we take as starting point the paper~\cite{Ford04}, which does not |
88 left-recursion, but gives an operational semantics for PEG parsing. For the |
92 treat left-recursion, but gives an operational semantics for PEG |
89 semantics, it seems plausible that we can adapt work on Boolean |
93 parsing. There are also good indications that we can adapt work on Boolean |
90 Grammars~\cite{Okhotin/04a}, which are similar to PEGs, and for which the |
94 Grammars~\cite{Okhotin04}, which are similar to PEGs, and for which the |
91 paper ~\cite{journals/iandc/KountouriotisNR09} gives a semantics to negation |
95 paper~\cite{KountouriotisNR09} gives a fixed-point semantics |
92 operators, but not to Kleene's star operation. |
96 to negation operators, but not to the Kleene star. |
93 |
97 |
94 For the parsing algorithm, we might also be able to draw inspiration from |
98 For the parsing algorithm, we might be able to draw inspiration from parsers |
95 parsers based on Cocke-Younger-Kasami (CYK) |
99 based on the classic Cocke-Younger-Kasami (CYK) |
96 algorithms~\cite{journals/iandc/KountouriotisNR09} and |
100 algorithms~\cite{KountouriotisNR09} and |
97 Early~\cite{Earley70,AycHor02} parsers. The defect CYK algorithms is that the |
101 Early~\cite{AycHor02, Earley70} parsers. The defect of CYK algorithms, however, |
98 original grammar specification needs to be transformed into a normal |
102 is that the grammar specifications given by the user need to be transformed |
99 form. This transformation may lead to grammar explosion and inefficient |
103 into a normal form. This transformation may potentially lead to rule explosion |
100 parsing. We will investigate whether this transformation can be avoided. |
104 and hence inefficient parsing. We will investigate whether this transformation |
101 Early style parsers, which have recently been certified by Ridge [???], |
105 can be avoided. The problem with Early-style parsers is that they need to be |
102 need to be extended to PEG parsing in order to be helpful for us. |
106 extended to PEG parsing in order to be helpful for us. |
|
107 |
103 |
108 |
104 Finally, we want to investigate whether derivatives of regular expressions |
109 Finally, we want to investigate whether derivatives of regular expressions |
105 ~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023} |
110 ~\cite{Brzozowski64,Might11,OwensReppyTuron09} |
106 can be extended to parsing. Lexing based on derivatives gives rise to very |
111 can be generalised to PEG parsing. In earlier work, we showed that lexing based on |
107 elegant regular expression matchers that can be certified in a theorem prover |
112 derivatives gives rise to very elegant regular expression matchers that can be |
108 with ease. We will study whether the idea of taking a derivative of a regular |
113 certified in a theorem prover with ease. We will study whether the idea of |
109 expression can be extended to rules in grammars. The problem that needs to be |
114 taking a derivative of a regular expression can be extended to rules in |
110 overcome again arises from possible left recursion in parsing. |
115 grammars. The problem that needs to be overcome is again how to deal with possible |
111 |
116 left-recursion in grammar rules. |
112 |
|
113 \newpage |
|
114 |
117 |
115 |
118 |
116 \small |
119 |
117 \bibliography{Journal/document/root} |
120 \bibliography{Journal/document/root} |
118 \bibliographystyle{abbrv} |
121 \bibliographystyle{abbrv} |
119 \end{multicols} |
122 \end{multicols} |
120 |
123 |
121 % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
124 % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |