thys/Paper/Paper.thy
changeset 134 2f043f8be9a9
parent 133 23e68b81a908
child 135 fee5641c5994
equal deleted inserted replaced
133:23e68b81a908 134:2f043f8be9a9
   139 sequence of tokens, POSIX is the more natural disambiguation strategy for
   139 sequence of tokens, POSIX is the more natural disambiguation strategy for
   140 what programmers consider basic syntactic building blocks in their programs.
   140 what programmers consider basic syntactic building blocks in their programs.
   141 These building blocks are often specified by some regular expressions, say
   141 These building blocks are often specified by some regular expressions, say
   142 @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
   142 @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and
   143 identifiers, respectively. There are two underlying (informal) rules behind
   143 identifiers, respectively. There are two underlying (informal) rules behind
   144 tokenising a string in a POSIX fashion:
   144 tokenising a string in a POSIX fashion according to a collection of regular
       
   145 expressions:
   145 
   146 
   146 \begin{itemize} 
   147 \begin{itemize} 
   147 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
   148 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}
   148 
   149 
   149 The longest initial substring matched by any regular expression is taken as
   150 The longest initial substring matched by any regular expression is taken as
   164 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
   165 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
   165 the priority rule a keyword token, not an identifier token---even if @{text
   166 the priority rule a keyword token, not an identifier token---even if @{text
   166 "r\<^bsub>id\<^esub>"} matches also.\bigskip
   167 "r\<^bsub>id\<^esub>"} matches also.\bigskip
   167 
   168 
   168 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   169 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   169 derivative-based regular expression matching algorithm as described by
   170 derivative-based regular expression matching algorithm of
   170 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   171 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   171 algorithm according to our specification of what a POSIX value is. Sulzmann
   172 algorithm according to our specification of what a POSIX value is. Sulzmann
   172 and Lu sketch in \cite{Sulzmann2014} an informal correctness proof: but to
   173 and Lu sketch in \cite{Sulzmann2014} an informal correctness proof: but to
   173 us it contains unfillable gaps.\footnote{An extended version of
   174 us it contains unfillable gaps.\footnote{An extended version of
   174 \cite{Sulzmann2014} is available at the website of its first author; this
   175 \cite{Sulzmann2014} is available at the website of its first author; this
  1133   implemented easily in functional languages. A bespoke lexer for the
  1134   implemented easily in functional languages. A bespoke lexer for the
  1134   Imp-Language is formalised in Coq as part of the Software Foundations book
  1135   Imp-Language is formalised in Coq as part of the Software Foundations book
  1135   \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
  1136   \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they
  1136   do not generalise easily to more advanced features.
  1137   do not generalise easily to more advanced features.
  1137   Our formalisation is available from
  1138   Our formalisation is available from
  1138 
  1139   \url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}.
  1139   \begin{center}
       
  1140   \url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}
       
  1141   \end{center}
       
  1142 
  1140 
  1143   %\noindent
  1141   %\noindent
  1144   %{\bf Acknowledgements:}
  1142   %{\bf Acknowledgements:}
  1145   %We are grateful for the comments we received from anonymous
  1143   %We are grateful for the comments we received from anonymous
  1146   %referees.
  1144   %referees.