csupp.tex
changeset 242 093e45c44d91
parent 236 2d4f1334b5ca
child 243 86a4182c73e7
equal deleted inserted replaced
241:68d48522ea9a 242:093e45c44d91
    32 \thispagestyle{empty}
    32 \thispagestyle{empty}
    33 \mbox{}\\[-5mm]
    33 \mbox{}\\[-5mm]
    34 
    34 
    35 \begin{multicols}{2}
    35 \begin{multicols}{2}
    36 \section*{Background}
    36 \section*{Background}
       
    37 
    37 \noindent
    38 \noindent
    38 Parsing is the act of transforming plain text into some
    39 Parsers transform plain text into some abstract structure that can be analyzed by
    39 structure that can be analyzed by computers for further processing.
    40 computers for further processing.  One might think that parsers have been
    40 One might think that parsing has been studied to death, and after
    41 studied to death, and after \emph{yacc} and \emph{lex} no new results can be
    41 \emph{yacc} and \emph{lex} no new results can be obtained in this area.
    42 obtained in this area.  However recent developments and novel approaches make
    42 However recent developments and novel approaches make it increasingly clear,
    43 it increasingly clear, that this is not true anymore~\cite{Might11}. And
    43 that this is not true anymore.
    44 there is a practical need for new results: for example the future HTML 5 
       
    45 Standard abandons a well-defined grammar specification, in favour of a bespoke
       
    46 parser given as pseudo code.
    44 
    47 
    45 We propose to on parsers from a certification point of view. Increasingly,
    48 This work targets parsers from a certification point of view. Increasingly,
    46 parsers are part of certified compilers, like \mbox{\emph{CompCert}}, which
    49 parsers are part of certified compilers, like
    47 are guaranteed to be correct and bug-free. Such certified compilers are
    50 \mbox{\emph{CompCert}}~\cite{Leroy09}, which are guaranteed to be
    48 crucial in areas where software just cannot fail. However, so far the parsers
    51 bug-free. Such certified compilers are crucial in areas where software just
    49 of these compilers have been left out of the certification.  This is because
    52 cannot fail. However, so far the parsers of these compilers have been left out
    50 parsing algorithms are often ad hoc and their semantics is not clearly
    53 of the certification.  This is because parsing algorithms are often ad hoc and
    51 specified. Unfortunately, this means parsers can harbour errors that
    54 their semantics is not clearly specified. This means, unfortunately, parsers
    52 potentially invalidate the whole certification and correctness of the
    55 can harbour errors that potentially invalidate the whole certification and
    53 compiler. In this project, we like to change that with the help of theorem
    56 correctness of the compiler. In this project, we aim to change this situation
    54 provers.
    57 with the help of the theorem prover Isabelle/HOL.
    55 
    58 
    56 Only in the last few years, theorem provers have become good enough for
    59 Only in the last few years, theorem provers have become powerful enough for
    57 establishing the correctness of some standard lexing and parsing
    60 establishing the correctness of some standard lexing and parsing
    58 algorithms. For this, the algorithms still need to be formulated in way so
    61 algorithms. For this, the algorithms still need to be formulated in way so
    59 that it is easy to reason about them. In our earlier work about lexing and
    62 that it is easy to reason about them. In our earlier work about lexing and
    60 regular languages, we showed that this precludes well-known algorithms based
    63 regular languages, we showed that this precludes well-known algorithms based
    61 automata. However we showed also that regular languages can be formulated and
    64 automata~\cite{WuZhangUrban11}. However we showed also that regular languages can be formulated and
    62 reasoned about entirely in terms regular expressions, which can be easily
    65 reasoned about entirely in terms regular expressions, which can be easily
    63 represented in theorem provers. This work uses the device of derivatives of
    66 represented in theorem provers. This work uses the device of derivatives of
    64 regular expressions. We like to extend this device to parsers and grammars.
    67 regular expressions. We like to extend this device to parsers and grammars.
    65 The aim is to come up with elegant and practical useful parsing algorithms
    68 The aim is to come up with elegant and practical useful parsing algorithms
    66 whose correctness can be certified in a theorem prover.
    69 whose correctness can be certified in Isabelle/HOL.
    67 
    70 
    68 
    71 
    69 \section*{Proposed Work}
    72 \section*{Proposed Work}
    70 
    73 
    71 A recent development in parsing is Parsing Expression Grammars (PEG), which
    74 A recent development in the field of parsing are Parsing Expression Grammars
    72 are an extension of the weel-known Context Free Grammars
    75 (PEGs)~\cite{Ford04}, which
    73 (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as
    76 are an extension of the well-known Context Free Grammars
    74 negation and conjunction, on the right-hand sides of grammar rules, as well as
    77 (CFGs). This extension introduces new regular operators, such as
    75 priority orderings. With these extensions, PEG parsing becomes much
    78 negation and conjunction, on the right-hand side of grammar rules, as well as
    76 more powerful. For example disambiguation, formerly expressed by semantic
    79 priority orderings for rules. With these extensions, PEG parsing becomes much
       
    80 more powerful and more useful in practise. For example disambiguation, formerly expressed by semantic
    77 filters, can now be expressed directly using grammar rules. 
    81 filters, can now be expressed directly using grammar rules. 
    78 
    82 
    79 However, there is serious disadvantage of PEG for applications: is does not
    83 However, there is a serious limitation of PEGs, which affects potential
    80 support grammrs involving left recursion~\cite{Ford02b}. Although a new PEG
    84 applications: grammars involving left recursion are not
    81 parsing algorithm has been proposed that can deal with left
    85 allowed~\cite{Ford02}. Although one PEG parsing algorithm has already been
    82 recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even
    86 proposed that can deal with left recursion~\cite{WarthDM08}, there
    83 in ``paper-and-pencil'' form. One aim of this research is to solve this sorry
    87 is no correctness proof for it, not even one with ``pencil and paper''. One aim of this
    84 state-of-affairs by either certifying this algorithm or inventing a new
    88 research is to solve this sorry state-of-affairs by either certifying this
    85 one. For this we will first formalize a fixed point semantics of PEG, based on
    89 algorithm or inventing a new one. For this we will first devise a
    86 which an efficient, certified parsing algorithm can be given given. For this
    90 fixed-point semantics of PEGs, against which we can certify a parser. For this
    87 we take as starting point the paper~\cite{Ford04a}, which does not treat
    91 semantics we take as starting point the paper~\cite{Ford04}, which does not
    88 left-recursion, but gives an operational semantics for PEG parsing. For the
    92 treat left-recursion, but gives an operational semantics for PEG
    89 semantics, it seems plausible that we can adapt work on Boolean
    93 parsing. There are also good indications that we can adapt work on Boolean
    90 Grammars~\cite{Okhotin/04a}, which are similar to PEGs, and for which the
    94 Grammars~\cite{Okhotin04}, which are similar to PEGs, and for which the
    91 paper ~\cite{journals/iandc/KountouriotisNR09} gives a semantics to negation
    95 paper~\cite{KountouriotisNR09} gives a fixed-point semantics 
    92 operators, but not to Kleene's star operation.
    96 to negation operators, but not to the Kleene star.
    93 
    97 
    94 For the parsing algorithm, we might also be able to draw inspiration from
    98 For the parsing algorithm, we might be able to draw inspiration from parsers
    95 parsers based on Cocke-Younger-Kasami (CYK)
    99 based on the classic Cocke-Younger-Kasami (CYK)
    96 algorithms~\cite{journals/iandc/KountouriotisNR09} and
   100 algorithms~\cite{KountouriotisNR09} and
    97 Early~\cite{Earley70,AycHor02} parsers. The defect CYK algorithms is that the
   101 Early~\cite{AycHor02, Earley70} parsers. The defect of CYK algorithms, however,
    98 original grammar specification needs to be transformed into a normal
   102 is that the grammar specifications given by the user need to be transformed
    99 form. This transformation may lead to grammar explosion and inefficient
   103 into a normal form. This transformation may potentially lead to rule explosion
   100 parsing. We will investigate whether this transformation can be avoided.
   104 and hence inefficient parsing. We will investigate whether this transformation
   101 Early style parsers, which have recently been certified by Ridge [???], 
   105 can be avoided.  The problem with Early-style parsers is that they need to be 
   102 need to be extended to PEG parsing in order to be helpful for us.
   106 extended to PEG parsing in order to be helpful for us.
       
   107 
   103 
   108 
   104 Finally, we want to investigate whether derivatives of regular expressions
   109 Finally, we want to investigate whether derivatives of regular expressions
   105 ~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}
   110 ~\cite{Brzozowski64,Might11,OwensReppyTuron09}
   106 can be extended to parsing. Lexing based on derivatives gives rise to very
   111 can be generalised to PEG parsing. In earlier work, we showed that lexing based on
   107 elegant regular expression matchers that can be certified in a theorem prover 
   112 derivatives gives rise to very elegant regular expression matchers that can be
   108 with ease.  We will study whether the idea of taking a derivative of a regular
   113 certified in a theorem prover with ease.  We will study whether the idea of
   109 expression can be extended to rules in grammars. The problem that needs to be
   114 taking a derivative of a regular expression can be extended to rules in
   110 overcome again arises from possible left recursion in parsing. 
   115 grammars. The problem that needs to be overcome is again how to deal with possible
   111 
   116 left-recursion in grammar rules.
   112     
       
   113 \newpage
       
   114 
   117 
   115 
   118 
   116 \small
   119 
   117 \bibliography{Journal/document/root}
   120 \bibliography{Journal/document/root}
   118 \bibliographystyle{abbrv}
   121 \bibliographystyle{abbrv}
   119 \end{multicols}
   122 \end{multicols}
   120 
   123 
   121 %  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   124 %  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%