csupp.tex
changeset 234 eeadb4e51d74
parent 232 114064363ef0
child 235 a7ddcad0a023
equal deleted inserted replaced
233:e2dc11e12e0b 234:eeadb4e51d74
    41 \emph{yacc} and \emph{lex} no new results can be obtained in this area.
    41 \emph{yacc} and \emph{lex} no new results can be obtained in this area.
    42 However recent results and novel approaches make it increasingly clear,
    42 However recent results and novel approaches make it increasingly clear,
    43 that this is not true anymore.
    43 that this is not true anymore.
    44 
    44 
    45 We propose to approach the subject of parsing from a certification point
    45 We propose to approach the subject of parsing from a certification point
    46 of view. Parsers are increasingly part of certified compilers, like \mbox{\emph{CompCert}},
    46 of view. Increasingly, parsers are part of certified compilers, like \mbox{\emph{CompCert}},
    47 which are guaranteed to be correct and bug-free. Such certified compilers are
    47 which are guaranteed to be correct and bug-free. Such certified compilers are
    48 crucial in areas where software just cannot fail. However, so far the
    48 crucial in areas where software just cannot fail. However, so far the
    49 parsers of these compilers have been left out of the certification.
    49 parsers of these compilers have been left out of the certification.
    50 This is because parsing algorithms are often ad hoc and their semantics
    50 This is because parsing algorithms are often ad hoc and their semantics
    51 is not clearly specified. Unfortunately, this means parsers can harbour
    51 is not clearly specified. Unfortunately, this means parsers can harbour
    52 errors that potentially invalidate the whole certification and correctness
    52 errors that potentially invalidate the whole certification and correctness
    53 of the compiler. In this project, we like to change that.
    53 of the compiler. In this project, we like to change that.
    54 
    54 
    55 Only in the last few years, theorem provers have become good enough
    55 Only in the last few years, theorem provers have become good enough
    56 for establishing the correctness of some standard lexing and
    56 for establishing the correctness of some standard lexing and
    57 parsing algorithms. For this, the algorithms need to be formulated
    57 parsing algorithms. For this, the algorithms still need to be formulated
    58 in way so that it is easy to reason about them. In earlier work
    58 in way so that it is easy to reason about them. In our earlier work
    59 about lexing and regular languages, the authors showed that this
    59 about lexing and regular languages, the authors showed that this
    60 precludes well-known algorithms working over graphs. However regular
    60 precludes well-known algorithms based automata. However we showed 
       
    61 that regular
    61 languages can be formulated and reasoned about entirely in terms
    62 languages can be formulated and reasoned about entirely in terms
    62 regular expressions, which can be easily represented in theorem
    63 regular expressions, which can be easily represented in theorem
    63 provers. This work uses the device of derivatives of regular
    64 provers. This work uses the device of derivatives of regular
    64 expressions. We like to extend this device to parsers and grammars.
    65 expressions. We like to extend this device to parsers and grammars.
    65 The aim is to come up with elegant and useful parsing algorithms
    66 The aim is to come up with elegant and practical useful parsing algorithms
    66 whose correctness and the absence of bugs can be certified in a
    67 whose correctness can be certified in a
    67 theorem prover.
    68 theorem prover.
    68 
    69 
    69 \section*{Proposed Work}
    70 \section*{Proposed Work}
    70 
    71 
    71 One new development in formal grammar is the introduction of Parsing Expression Grammar (PEG) as an extension of the standard Context Free Grammar (CFG)\cite{Ford04a}. The extension introduces new regular operators such as negation and conjunction to the right hand side of productions, as well as well as an priority ordering on productions. With these extensions, PEG becomes more powerful such that disambiguation formerly expressed using semantic filters can now be expressed directly using production expressions. This means a simpler and more systematic treatment of ambiguity and more concise grammar specification for programming languages.
    72 A recent development in parsing is Parsing Expression Grammars (PEG), which
       
    73 are an extension of the standard Context Free Grammars
       
    74 (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as
       
    75 negation and conjunction, on the right-hand sides of grammar rules, as well as
       
    76 priority orderings on rules. With these extensions, PEG parsing becomes much
       
    77 more powerful. For example disambiguation, formerly expressed by semantic
       
    78 filters, can now be expressed directly using grammar rules. This means a
       
    79 simpler and more systematic treatment of ambiguity and more concise grammar
       
    80 specifications for programming languages.
    72 
    81 
    73 However, one disadvantage of PEG is that it does not allow left recursion in grammar specification, because the accompanying algorithms of PEG\cite{Ford02b} can not deal with left recursions. Although some authors claimed new PEG parsing algorithm for left recursion\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even in paper-and-pencil form. One aim of this research is to formalize a fixed point semantics of PEG, based on which an efficient, certified parsing algorithm is given.
    82 However, a serious disadvantage of PEG is that it does not allow left
       
    83 recursion, because parsing algorithms for PEG~\cite{Ford02b} can not deal with
       
    84 left recursions. Although a new PEG parsing algorithm has been proposed
       
    85 that can deal with left recursion~\cite{conf/pepm/WarthDM08}, there is no
       
    86 correctness proof, not even in ``paper-and-pencil'' form. One aim of this
       
    87 research is to solve this sorry state-of-affairs by either certifying this
       
    88 algorithm or inventing a new one. For this we will first formalize a fixed
       
    89 point semantics of PEG, based on which an efficient, certified parsing
       
    90 algorithm can be given given.
    74 
    91 
    75 There are several existing works we can draw upon:
    92 There are several existing works we can draw upon:
    76 \begin{enumerate}
    93 \begin{enumerate}
    77     \item The works on PEG.
    94 \item The works on PEG.
    78         \begin {enumerate}
    95   \begin {enumerate}
    79             \item An operation semantics for PEG has already been given in \cite{Ford04a}, but it is not adequate to deal with left recursions. But this work gives at least a precise description of what the original PEG  meant for. This will serve an a basis to show the conservativeness of the fixed point semantics we are going to develop.
    96   \item An operation semantics for PEG has already been given
    80             \item The new algorithm\cite{conf/pepm/WarthDM08} which claimed to be able to deal with left recursions. Although there is no correctness proof yet, this may provide some useful inspirations for our new algorithm design.
    97     in~\cite{Ford04a}, but it is not adequate to deal with left recursions. But this
    81         \end{enumerate}
    98     work gives at least a precise description of what the original PEG meant
    82     \item The works on Boolean Grammar\cite{Okhotin/04a}. Boolean Grammar is very closely related to PEG, because it also contains negative and conjunctive grammars. The main differences are: First, Boolean Grammar has no ordering on productions; Second: Boolean Grammar does not contain STAR operator. There are two works about Boolean Grammar which might be useful for this research:
    99     for. This will serve an a basis to show the conservativeness of
    83         \begin{enumerate}
   100     the fixed point semantics we are
    84             \item A fixed point semantics for Boolean Grammar\cite{journals/iandc/KountouriotisNR09}. The idea to define the semantics of negative and conjunctive operators is certainly what we can borrow. Therefore, this work gives the basis on which we can add in production ordering and STAR operator.
   101     going to develop.
    85             \item A parsing algorithm for Boolean Grammar based on CYK parsing\cite{journals/iandc/KountouriotisNR09}. The draw back of CYK parsing is that: the original grammar specification needs to be transformed into a normal form. This transformation may lead to grammar explosion and is undesirable. One aim of this research is to see whether this transformation can be avoided. For this purpose, other parsing style may provide useful inspirations, for example:
   102     
    86                     \begin{enumerate}
   103   \item The new algorithm~\cite{conf/pepm/WarthDM08} which claimed to be able
    87                         \item Derivative Parsing\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}. Christian Urban has used derivative methods to establish the correctness of a regular expression matcher, as well the the finite partition property of regular expression\cite{WuZhangUrban11}.  There are well founded envisage that the derivative methods may provide the foundation to the new parsing algorithms of PEG.
   104     to deal with left recursions. Although there is no correctness proof yet, this
    88                         \item Early parsing\cite{Earley70,AycHor02}. It is a refinement of CYK parsing which does not require the transformation to normal forms, and therefore provide one possible direction to adapt the current CYK based parsing algorithm of Boolean Grammar for PEG grammar.
   105     may provide some useful inspirations for our new algorithm design.
    89                         \item The new parsing algorithm proposed by Tom Ridge[???]. Recently, T. Ridge has proposed and certified an combinator style parsing algorithm for CFG, which borrows some ideas from Early parsing. The proposed algorithm is very simple and elegant. We are going to strive for a parsing algorithm as elegant as this one.
   106   \end{enumerate}
    90                     \end{enumerate}
   107 
    91                 Which of the above possibilities will finally get into our final solutions is an interesting point about this current research.
   108 \item The works on Boolean Grammars~\cite{Okhotin/04a}. Boolean Grammar is
    92         \end{enumerate}
   109   very closely related to PEG, because it also contains negative and conjunctive
       
   110   grammars. The main differences are: First, Boolean Grammar has no ordering on
       
   111   productions; Second: Boolean Grammar does not contain STAR operator. There are
       
   112   two works about Boolean Grammar which might be useful for this research:
       
   113   
       
   114   \begin{enumerate}
       
   115   \item A fixed point semantics for Boolean
       
   116     Grammar~\cite{journals/iandc/KountouriotisNR09}. The idea to define the
       
   117     semantics of negative and conjunctive operators is certainly what we can
       
   118     borrow. Therefore, this work gives the basis on which we can add in production
       
   119     ordering and STAR operator.
       
   120 
       
   121   \item A parsing algorithm for Boolean Grammar based on CYK
       
   122     parsing~\cite{journals/iandc/KountouriotisNR09}. The draw back of CYK parsing
       
   123     is that: the original grammar specification needs to be transformed into a
       
   124     normal form. This transformation may lead to grammar explosion and is
       
   125     undesirable. One aim of this research is to see whether this transformation
       
   126     can be avoided. For this purpose, other parsing style may provide useful inspirations, for example:
       
   127 
       
   128     \begin{enumerate}
       
   129     \item Derivative
       
   130       Parsing~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}. Christian
       
   131       Urban has used derivative methods to establish the correctness of a regular
       
   132       expression matcher, as well the the finite partition property of regular
       
   133       expression~\cite{WuZhangUrban11}.  There are well founded envisage that the
       
   134       derivative methods may provide the foundation to the new parsing algorithms of PEG.
       
   135 
       
   136     \item Early parsing~\cite{Earley70,AycHor02}. It is a refinement of CYK
       
   137       parsing which does not require the transformation to normal forms, and
       
   138       therefore provide one possible direction to adapt the current CYK based
       
   139       parsing algorithm of Boolean Grammar for PEG grammar.
       
   140 
       
   141     \item The new parsing algorithm proposed by Tom Ridge[???]. Recently,
       
   142       T. Ridge has proposed and certified an combinator style parsing algorithm for
       
   143       CFG, which borrows some ideas from Early parsing. The proposed algorithm is
       
   144       very simple and elegant. We are going to strive for a parsing algorithm as elegant as this one.
       
   145     \end{enumerate}
       
   146     
       
   147     Which of the above possibilities will finally get into our final solutions
       
   148     is an interesting point about this current research.
       
   149 
       
   150   \end{enumerate}
    93 \end{enumerate}
   151 \end{enumerate}
       
   152 
    94 Based on these works, we are quite confident that our idea may lead to some concrete results.
   153 Based on these works, we are quite confident that our idea may lead to some concrete results.
    95 
   154 
    96 \mbox{}\\[15cm]
   155 \mbox{}\\[15cm]
    97 
   156 
    98 \noindent
   157 \noindent