csupp.tex
changeset 236 2d4f1334b5ca
parent 235 a7ddcad0a023
child 242 093e45c44d91
equal deleted inserted replaced
235:a7ddcad0a023 236:2d4f1334b5ca
    37 \noindent
    37 \noindent
    38 Parsing is the act of transforming plain text into some
    38 Parsing is the act of transforming plain text into some
    39 structure that can be analyzed by computers for further processing.
    39 structure that can be analyzed by computers for further processing.
    40 One might think that parsing has been studied to death, and after
    40 One might think that parsing has been studied to death, and after
    41 \emph{yacc} and \emph{lex} no new results can be obtained in this area.
    41 \emph{yacc} and \emph{lex} no new results can be obtained in this area.
    42 However recent results and novel approaches make it increasingly clear,
    42 However recent developments and novel approaches make it increasingly clear,
    43 that this is not true anymore.
    43 that this is not true anymore.
    44 
    44 
    45 We propose to approach the subject of parsing from a certification point
    45 We propose to on parsers from a certification point of view. Increasingly,
    46 of view. Increasingly, parsers are part of certified compilers, like \mbox{\emph{CompCert}},
    46 parsers are part of certified compilers, like \mbox{\emph{CompCert}}, which
    47 which are guaranteed to be correct and bug-free. Such certified compilers are
    47 are guaranteed to be correct and bug-free. Such certified compilers are
    48 crucial in areas where software just cannot fail. However, so far the
    48 crucial in areas where software just cannot fail. However, so far the parsers
    49 parsers of these compilers have been left out of the certification.
    49 of these compilers have been left out of the certification.  This is because
    50 This is because parsing algorithms are often ad hoc and their semantics
    50 parsing algorithms are often ad hoc and their semantics is not clearly
    51 is not clearly specified. Unfortunately, this means parsers can harbour
    51 specified. Unfortunately, this means parsers can harbour errors that
    52 errors that potentially invalidate the whole certification and correctness
    52 potentially invalidate the whole certification and correctness of the
    53 of the compiler. In this project, we like to change that.
    53 compiler. In this project, we like to change that with the help of theorem
       
    54 provers.
    54 
    55 
    55 Only in the last few years, theorem provers have become good enough
    56 Only in the last few years, theorem provers have become good enough for
    56 for establishing the correctness of some standard lexing and
    57 establishing the correctness of some standard lexing and parsing
    57 parsing algorithms. For this, the algorithms still need to be formulated
    58 algorithms. For this, the algorithms still need to be formulated in way so
    58 in way so that it is easy to reason about them. In our earlier work
    59 that it is easy to reason about them. In our earlier work about lexing and
    59 about lexing and regular languages, the authors showed that this
    60 regular languages, we showed that this precludes well-known algorithms based
    60 precludes well-known algorithms based automata. However we showed 
    61 automata. However we showed also that regular languages can be formulated and
    61 that regular
    62 reasoned about entirely in terms regular expressions, which can be easily
    62 languages can be formulated and reasoned about entirely in terms
    63 represented in theorem provers. This work uses the device of derivatives of
    63 regular expressions, which can be easily represented in theorem
    64 regular expressions. We like to extend this device to parsers and grammars.
    64 provers. This work uses the device of derivatives of regular
       
    65 expressions. We like to extend this device to parsers and grammars.
       
    66 The aim is to come up with elegant and practical useful parsing algorithms
    65 The aim is to come up with elegant and practical useful parsing algorithms
    67 whose correctness can be certified in a
    66 whose correctness can be certified in a theorem prover.
    68 theorem prover.
    67 
    69 
    68 
    70 \section*{Proposed Work}
    69 \section*{Proposed Work}
    71 
    70 
    72 A recent development in parsing is Parsing Expression Grammars (PEG), which
    71 A recent development in parsing is Parsing Expression Grammars (PEG), which
    73 are an extension of the standard Context Free Grammars
    72 are an extension of the weel-known Context Free Grammars
    74 (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as
    73 (CFG)~\cite{Ford04a}. The extension introduces new regular operators, such as
    75 negation and conjunction, on the right-hand sides of grammar rules, as well as
    74 negation and conjunction, on the right-hand sides of grammar rules, as well as
    76 priority orderings on rules. With these extensions, PEG parsing becomes much
    75 priority orderings. With these extensions, PEG parsing becomes much
    77 more powerful. For example disambiguation, formerly expressed by semantic
    76 more powerful. For example disambiguation, formerly expressed by semantic
    78 filters, can now be expressed directly using grammar rules. This means a
    77 filters, can now be expressed directly using grammar rules. 
    79 simpler and more systematic treatment of ambiguity and more concise grammar
       
    80 specifications for programming languages.
       
    81 
    78 
    82 However, a serious disadvantage of PEG is that it does not allow left
    79 However, there is serious disadvantage of PEG for applications: is does not
    83 recursion, because parsing algorithms for PEG~\cite{Ford02b} can not deal with
    80 support grammrs involving left recursion~\cite{Ford02b}. Although a new PEG
    84 left recursions. Although a new PEG parsing algorithm has been proposed
    81 parsing algorithm has been proposed that can deal with left
    85 that can deal with left recursion~\cite{conf/pepm/WarthDM08}, there is no
    82 recursion~\cite{conf/pepm/WarthDM08}, there is no correctness proof, not even
    86 correctness proof, not even in ``paper-and-pencil'' form. One aim of this
    83 in ``paper-and-pencil'' form. One aim of this research is to solve this sorry
    87 research is to solve this sorry state-of-affairs by either certifying this
    84 state-of-affairs by either certifying this algorithm or inventing a new
    88 algorithm or inventing a new one. For this we will first formalize a fixed
    85 one. For this we will first formalize a fixed point semantics of PEG, based on
    89 point semantics of PEG, based on which an efficient, certified parsing
    86 which an efficient, certified parsing algorithm can be given given. For this
    90 algorithm can be given given.
    87 we take as starting point the paper~\cite{Ford04a}, which does not treat
       
    88 left-recursion, but gives an operational semantics for PEG parsing. For the
       
    89 semantics, it seems plausible that we can adapt work on Boolean
       
    90 Grammars~\cite{Okhotin/04a}, which are similar to PEGs, and for which the
       
    91 paper ~\cite{journals/iandc/KountouriotisNR09} gives a semantics to negation
       
    92 operators, but not to Kleene's star operation.
    91 
    93 
    92 There are several existing works we can draw upon:
    94 For the parsing algorithm, we might also be able to draw inspiration from
    93 \begin{enumerate}
    95 parsers based on Cocke-Younger-Kasami (CYK)
    94 \item The works on PEG.
    96 algorithms~\cite{journals/iandc/KountouriotisNR09} and
    95   \begin {enumerate}
    97 Early~\cite{Earley70,AycHor02} parsers. The defect CYK algorithms is that the
    96   \item An operation semantics for PEG has already been given
    98 original grammar specification needs to be transformed into a normal
    97     in~\cite{Ford04a}, but it is not adequate to deal with left recursions. But this
    99 form. This transformation may lead to grammar explosion and inefficient
    98     work gives at least a precise description of what the original PEG meant
   100 parsing. We will investigate whether this transformation can be avoided.
    99     for. This will serve an a basis to show the conservativeness of
   101 Early style parsers, which have recently been certified by Ridge [???], 
   100     the fixed point semantics we are
   102 need to be extended to PEG parsing in order to be helpful for us.
   101     going to develop.
   103 
       
   104 Finally, we want to investigate whether derivatives of regular expressions
       
   105 ~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}
       
   106 can be extended to parsing. Lexing based on derivatives gives rise to very
       
   107 elegant regular expression matchers that can be certified in a theorem prover 
       
   108 with ease.  We will study whether the idea of taking a derivative of a regular
       
   109 expression can be extended to rules in grammars. The problem that needs to be
       
   110 overcome again arises from possible left recursion in parsing. 
       
   111 
   102     
   112     
   103   \item The new algorithm~\cite{conf/pepm/WarthDM08} which claimed to be able
   113 \newpage
   104     to deal with left recursions. Although there is no correctness proof yet, this
       
   105     may provide some useful inspirations for our new algorithm design.
       
   106   \end{enumerate}
       
   107 
       
   108 \item The works on Boolean Grammars~\cite{Okhotin/04a}. Boolean Grammar is
       
   109   very closely related to PEG, because it also contains negative and conjunctive
       
   110   grammars. The main differences are: First, Boolean Grammar has no ordering on
       
   111   productions; Second: Boolean Grammar does not contain STAR operator. There are
       
   112   two works about Boolean Grammar which might be useful for this research:
       
   113   
       
   114   \begin{enumerate}
       
   115   \item A fixed point semantics for Boolean
       
   116     Grammar~\cite{journals/iandc/KountouriotisNR09}. The idea to define the
       
   117     semantics of negative and conjunctive operators is certainly what we can
       
   118     borrow. Therefore, this work gives the basis on which we can add in production
       
   119     ordering and STAR operator.
       
   120 
       
   121   \item A parsing algorithm for Boolean Grammar based on CYK
       
   122     parsing~\cite{journals/iandc/KountouriotisNR09}. The draw back of CYK parsing
       
   123     is that: the original grammar specification needs to be transformed into a
       
   124     normal form. This transformation may lead to grammar explosion and is
       
   125     undesirable. One aim of this research is to see whether this transformation
       
   126     can be avoided. For this purpose, other parsing style may provide useful inspirations, for example:
       
   127 
       
   128     \begin{enumerate}
       
   129     \item Derivative
       
   130       Parsing~\cite{Brzozowski64,Almeidaetal10,OwensReppyTuron09,journals/corr/abs-1010-5023}. Christian
       
   131       Urban has used derivative methods to establish the correctness of a regular
       
   132       expression matcher, as well the the finite partition property of regular
       
   133       expression~\cite{WuZhangUrban11}.  There are well founded envisage that the
       
   134       derivative methods may provide the foundation to the new parsing algorithms of PEG.
       
   135 
       
   136     \item Early parsing~\cite{Earley70,AycHor02}. It is a refinement of CYK
       
   137       parsing which does not require the transformation to normal forms, and
       
   138       therefore provide one possible direction to adapt the current CYK based
       
   139       parsing algorithm of Boolean Grammar for PEG grammar.
       
   140 
       
   141     \item The new parsing algorithm proposed by Tom Ridge[???]. Recently,
       
   142       T. Ridge has proposed and certified an combinator style parsing algorithm for
       
   143       CFG, which borrows some ideas from Early parsing. The proposed algorithm is
       
   144       very simple and elegant. We are going to strive for a parsing algorithm as elegant as this one.
       
   145     \end{enumerate}
       
   146     
       
   147     Which of the above possibilities will finally get into our final solutions
       
   148     is an interesting point about this current research.
       
   149 
       
   150   \end{enumerate}
       
   151 \end{enumerate}
       
   152 
       
   153 Based on these works, we are quite confident that our idea may lead to some concrete results.
       
   154 
       
   155 \mbox{}\\[15cm]
       
   156 
       
   157 \noindent
       
   158 
       
   159 
   114 
   160 
   115 
   161 \small
   116 \small
   162 \bibliography{Journal/document/root}
   117 \bibliography{Journal/document/root}
   163 \bibliographystyle{abbrv}
   118 \bibliographystyle{abbrv}