37 \noindent |
37 \noindent |
38 Parsing is the act of transforming plain text into some |
38 Parsing is the act of transforming plain text into some |
39 structure that can be analyzed by computers for further processing. |
39 structure that can be analyzed by computers for further processing. |
40 One might think that parsing has been studied to death, and after |
40 One might think that parsing has been studied to death, and after |
41 \emph{yacc} and \emph{lex} no new results can be obtained in this area. |
41 \emph{yacc} and \emph{lex} no new results can be obtained in this area. |
42 However recent results and novel approaches make it increasingly clear, |
42 However recent developments and novel approaches make it increasingly clear, |
43 that this is not true anymore. |
43 that this is not true anymore. |
44 |
44 |
45 We propose to approach the subject of parsing from a certification point |
45 We propose to on parsers from a certification point of view. Increasingly, |
46 of view. Increasingly, parsers are part of certified compilers, like \mbox{\emph{CompCert}}, |
46 parsers are part of certified compilers, like \mbox{\emph{CompCert}}, which |
47 which are guaranteed to be correct and bug-free. Such certified compilers are |
47 are guaranteed to be correct and bug-free. Such certified compilers are |
48 crucial in areas where software just cannot fail. However, so far the |
48 crucial in areas where software just cannot fail. However, so far the parsers |
49 parsers of these compilers have been left out of the certification. |
49 of these compilers have been left out of the certification. This is because |
50 This is because parsing algorithms are often ad hoc and their semantics |
50 parsing algorithms are often ad hoc and their semantics is not clearly |
51 is not clearly specified. Unfortunately, this means parsers can harbour |
51 specified. Unfortunately, this means parsers can harbour errors that |
52 errors that potentially invalidate the whole certification and correctness |
52 potentially invalidate the whole certification and correctness of the |
53 of the compiler. In this project, we like to change that. |
53 compiler. In this project, we like to change that with the help of theorem |
|
54 provers. |
54 |
55 |
55 Only in the last few years, theorem provers have become good enough |
56 Only in the last few years, theorem provers have become good enough for |
56 for establishing the correctness of some standard lexing and |
57 establishing the correctness of some standard lexing and parsing |
57 parsing algorithms. For this, the algorithms still need to be formulated |
58 algorithms. For this, the algorithms still need to be formulated in way so |
58 in way so that it is easy to reason about them. In our earlier work |
59 that it is easy to reason about them. In our earlier work about lexing and |
59 about lexing and regular languages, the authors showed that this |
60 regular languages, we showed that this precludes well-known algorithms based |
60 precludes well-known algorithms based automata. However we showed |
61 automata. However we showed also that regular languages can be formulated and |
61 that regular |
62 reasoned about entirely in terms regular expressions, which can be easily |
62 languages can be formulated and reasoned about entirely in terms |
63 represented in theorem provers. This work uses the device of derivatives of |
63 regular expressions, which can be easily represented in theorem |
64 regular expressions. We like to extend this device to parsers and grammars. |
64 provers. This work uses the device of derivatives of regular |
|
65 expressions. We like to extend this device to parsers and grammars. |
|
66 The aim is to come up with elegant and practical useful parsing algorithms |
65 The aim is to come up with elegant and practical useful parsing algorithms |
67 whose correctness can be certified in a |
66 whose correctness can be certified in a theorem prover. |
68 theorem prover. |
67 |
69 |
68 |
70 \section*{Proposed Work} |
69 \section*{Proposed Work} |
71 |
70 |
72 A recent development in parsing is Parsing Expression Grammars (PEG), which |
71 A recent development in parsing is Parsing Expression Grammars (PEG), which |
73 are an extension of the standard Context Free Grammars |
72 are an extension of the weel-known Context Free Grammars |
74 (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as |
73 (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as |
75 negation and conjunction, on the right-hand sides of grammar rules, as well as |
74 negation and conjunction, on the right-hand sides of grammar rules, as well as |
76 priority orderings on rules. With these extensions, PEG parsing becomes much |
75 priority orderings. With these extensions, PEG parsing becomes much |
77 more powerful. For example disambiguation, formerly expressed by semantic |
76 more powerful. For example disambiguation, formerly expressed by semantic |
78 filters, can now be expressed directly using grammar rules. This means a |
77 filters, can now be expressed directly using grammar rules. |
79 simpler and more systematic treatment of ambiguity and more concise grammar |
|
80 specifications for programming languages. |
|
81 |
78 |
82 However, a serious disadvantage of PEG is that it does not allow left |
79 However, there is serious disadvantage of PEG for applications: is does not |
83 recursion, because parsing algorithms for PEG~\cite{Ford02b} can not deal with |
80 support grammrs involving left recursion~\cite{Ford02b}. Although a new PEG |
84 left recursions. Although a new PEG parsing algorithm has been proposed |
81 parsing algorithm has been proposed that can deal with left |
85 that can deal with left recursion~\cite{conf/pepm/WarthDM08}, there is no |
82 recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even |
86 correctness proof, not even in ``paper-and-pencil'' form. One aim of this |
83 in ``paper-and-pencil'' form. One aim of this research is to solve this sorry |
87 research is to solve this sorry state-of-affairs by either certifying this |
84 state-of-affairs by either certifying this algorithm or inventing a new |
88 algorithm or inventing a new one. For this we will first formalize a fixed |
85 one. For this we will first formalize a fixed point semantics of PEG, based on |
89 point semantics of PEG, based on which an efficient, certified parsing |
86 which an efficient, certified parsing algorithm can be given given. For this |
90 algorithm can be given given. |
87 we take as starting point the paper~\cite{Ford04a}, which does not treat |
|
88 left-recursion, but gives an operational semantics for PEG parsing. For the |
|
89 semantics, it seems plausible that we can adapt work on Boolean |
|
90 Grammars~\cite{Okhotin/04a}, which are similar to PEGs, and for which the |
|
91 paper ~\cite{journals/iandc/KountouriotisNR09} gives a semantics to negation |
|
92 operators, but not to Kleene's star operation. |
91 |
93 |
92 There are several existing works we can draw upon: |
94 For the parsing algorithm, we might also be able to draw inspiration from |
93 \begin{enumerate} |
95 parsers based on Cocke-Younger-Kasami (CYK) |
94 \item The works on PEG. |
96 algorithms~\cite{journals/iandc/KountouriotisNR09} and |
95 \begin {enumerate} |
97 Early~\cite{Earley70,AycHor02} parsers. The defect CYK algorithms is that the |
96 \item An operation semantics for PEG has already been given |
98 original grammar specification needs to be transformed into a normal |
97 in~\cite{Ford04a}, but it is not adequate to deal with left recursions. But this |
99 form. This transformation may lead to grammar explosion and inefficient |
98 work gives at least a precise description of what the original PEG meant |
100 parsing. We will investigate whether this transformation can be avoided. |
99 for. This will serve an a basis to show the conservativeness of |
101 Early style parsers, which have recently been certified by Ridge [???], |
100 the fixed point semantics we are |
102 need to be extended to PEG parsing in order to be helpful for us. |
101 going to develop. |
103 |
|
104 Finally, we want to investigate whether derivatives of regular expressions |
|
105 ~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023} |
|
106 can be extended to parsing. Lexing based on derivatives gives rise to very |
|
107 elegant regular expression matchers that can be certified in a theorem prover |
|
108 with ease. We will study whether the idea of taking a derivative of a regular |
|
109 expression can be extended to rules in grammars. The problem that needs to be |
|
110 overcome again arises from possible left recursion in parsing. |
|
111 |
102 |
112 |
103 \item The new algorithm~\cite{conf/pepm/WarthDM08} which claimed to be able |
113 \newpage |
104 to deal with left recursions. Although there is no correctness proof yet, this |
|
105 may provide some useful inspirations for our new algorithm design. |
|
106 \end{enumerate} |
|
107 |
|
108 \item The works on Boolean Grammars~\cite{Okhotin/04a}. Boolean Grammar is |
|
109 very closely related to PEG, because it also contains negative and conjunctive |
|
110 grammars. The main differences are: First, Boolean Grammar has no ordering on |
|
111 productions; Second: Boolean Grammar does not contain STAR operator. There are |
|
112 two works about Boolean Grammar which might be useful for this research: |
|
113 |
|
114 \begin{enumerate} |
|
115 \item A fixed point semantics for Boolean |
|
116 Grammar~\cite{journals/iandc/KountouriotisNR09}. The idea to define the |
|
117 semantics of negative and conjunctive operators is certainly what we can |
|
118 borrow. Therefore, this work gives the basis on which we can add in production |
|
119 ordering and STAR operator. |
|
120 |
|
121 \item A parsing algorithm for Boolean Grammar based on CYK |
|
122 parsing~\cite{journals/iandc/KountouriotisNR09}. The draw back of CYK parsing |
|
123 is that: the original grammar specification needs to be transformed into a |
|
124 normal form. This transformation may lead to grammar explosion and is |
|
125 undesirable. One aim of this research is to see whether this transformation |
|
126 can be avoided. For this purpose, other parsing style may provide useful inspirations, for example: |
|
127 |
|
128 \begin{enumerate} |
|
129 \item Derivative |
|
130 Parsing~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}. Christian |
|
131 Urban has used derivative methods to establish the correctness of a regular |
|
132 expression matcher, as well the the finite partition property of regular |
|
133 expression~\cite{WuZhangUrban11}. There are well founded envisage that the |
|
134 derivative methods may provide the foundation to the new parsing algorithms of PEG. |
|
135 |
|
136 \item Early parsing~\cite{Earley70,AycHor02}. It is a refinement of CYK |
|
137 parsing which does not require the transformation to normal forms, and |
|
138 therefore provide one possible direction to adapt the current CYK based |
|
139 parsing algorithm of Boolean Grammar for PEG grammar. |
|
140 |
|
141 \item The new parsing algorithm proposed by Tom Ridge[???]. Recently, |
|
142 T. Ridge has proposed and certified an combinator style parsing algorithm for |
|
143 CFG, which borrows some ideas from Early parsing. The proposed algorithm is |
|
144 very simple and elegant. We are going to strive for a parsing algorithm as elegant as this one. |
|
145 \end{enumerate} |
|
146 |
|
147 Which of the above possibilities will finally get into our final solutions |
|
148 is an interesting point about this current research. |
|
149 |
|
150 \end{enumerate} |
|
151 \end{enumerate} |
|
152 |
|
153 Based on these works, we are quite confident that our idea may lead to some concrete results. |
|
154 |
|
155 \mbox{}\\[15cm] |
|
156 |
|
157 \noindent |
|
158 |
|
159 |
114 |
160 |
115 |
161 \small |
116 \small |
162 \bibliography{Journal/document/root} |
117 \bibliography{Journal/document/root} |
163 \bibliographystyle{abbrv} |
118 \bibliographystyle{abbrv} |