thys/Journal/Paper.thy
changeset 269 12772d537b71
parent 268 6746f5e1f1f8
child 270 462d893ecb3d
equal deleted inserted replaced
268:6746f5e1f1f8 269:12772d537b71
   195 
   195 
   196 One limitation of Brzozowski's matcher is that it only generates a
   196 One limitation of Brzozowski's matcher is that it only generates a
   197 YES/NO answer for whether a string is being matched by a regular
   197 YES/NO answer for whether a string is being matched by a regular
   198 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
   198 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
   199 to allow generation not just of a YES/NO answer but of an actual
   199 to allow generation not just of a YES/NO answer but of an actual
   200 matching, called a [lexical] {\em value}. They give a simple algorithm
   200 matching, called a [lexical] {\em value}. \marginpar{explain values;
   201 to calculate a value that appears to be the value associated with
   201 who introduced them} They give a simple algorithm to calculate a value
   202 POSIX matching.  The challenge then is to specify that value, in an
   202 that appears to be the value associated with POSIX matching.  The
   203 algorithm-independent fashion, and to show that Sulzmann and Lu's
   203 challenge then is to specify that value, in an algorithm-independent
   204 derivative-based algorithm does indeed calculate a value that is
   204 fashion, and to show that Sulzmann and Lu's derivative-based algorithm
   205 correct according to the specification.
   205 does indeed calculate a value that is correct according to the
       
   206 specification.
   206 
   207 
   207 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
   208 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
   208 relation (called an ``order relation'') on the set of values of @{term
   209 relation (called an ``order relation'') on the set of values of @{term
   209 r}, and to show that (once a string to be matched is chosen) there is
   210 r}, and to show that (once a string to be matched is chosen) there is
   210 a maximum element and that it is computed by their derivative-based
   211 a maximum element and that it is computed by their derivative-based
   478   only value associated with the regular expression @{term ONE} is
   479   only value associated with the regular expression @{term ONE} is
   479   @{term Void}.  It is routine to establish how values ``inhabiting''
   480   @{term Void}.  It is routine to establish how values ``inhabiting''
   480   a regular expression correspond to the language of a regular
   481   a regular expression correspond to the language of a regular
   481   expression, namely
   482   expression, namely
   482 
   483 
   483   \begin{proposition}
   484   \begin{proposition}\label{inhabs}
   484   @{thm L_flat_Prf}
   485   @{thm L_flat_Prf}
   485   \end{proposition}
   486   \end{proposition}
   486 
   487 
   487   \noindent
   488   \noindent
   488   Given a regular expression @{text r} and a string @{text s}, we define the 
   489   Given a regular expression @{text r} and a string @{text s}, we define the 
   933   yet. Unfortunately, we were not able to verify claims that their
   934   yet. Unfortunately, we were not able to verify claims that their
   934   ordering has properties such as being transitive or having maximal
   935   ordering has properties such as being transitive or having maximal
   935   elements.
   936   elements.
   936  
   937  
   937   Okui and Suzuki \cite{OkuiSuzuki2010,OkuiSuzukiTech} described
   938   Okui and Suzuki \cite{OkuiSuzuki2010,OkuiSuzukiTech} described
   938    another ordering of values, which they use to establish the correctness of
   939   another ordering of values, which they use to establish the
   939    their automata-based algorithm for POSIX matching.  Their ordering
   940   correctness of their automata-based algorithm for POSIX matching.
   940    resembles some aspects of the one given by Sulzmann and Lu, but
   941   Their ordering resembles some aspects of the one given by Sulzmann
   941    is quite different. To begin with, Okui and Suzuki identify POSIX
   942   and Lu, but is quite different. To begin with, Okui and Suzuki
   942    values as least elements in their ordering. A more substantial 
   943   identify POSIX values as least, rather than maximal, elements in
   943     difference is that the ordering by Okui
   944   their ordering. A more substantial difference is that the ordering
   944    and Suzuki uses \emph{positions} in order to identify and
   945   by Okui and Suzuki uses \emph{positions} in order to identify and
   945    compare subvalues, whereby positions are lists of natural
   946   compare subvalues. Positions are lists of natural numbers. This
   946    numbers. This allows them to quite naturally formalise the Longest
   947   allows them to quite naturally formalise the Longest Match and
   947    Match and Priority rules of the informal POSIX standard.  Consider
   948   Priority rules of the informal POSIX standard.  Consider for example
   948    for example the value @{term v} of the form @{term "Stars [Seq
   949   the value @{term v}
   949    (Char x) (Char y), Char z]"}, say.  At position @{text "[0,1]"} of
   950 
   950    this value is the subvalue @{text "Char y"} and at position @{text
   951   \begin{center}
   951    "[1]"} the subvalue @{term "Char z"}.  At the `root' position, or
   952   @{term "v == Stars [Seq (Char x) (Char y), Char z]"}
   952    empty list @{term "[]"}, is the whole value @{term v}. The
   953   \end{center}
   953    positions @{text "[0,1,0]"} and @{text "[2]"}, for example, are
   954 
   954    outside of @{text v}. If it exists, the subvalue of @{term v} at a
   955   \noindent
   955    position @{text p}, written @{term "at v p"}, can be recursively
   956   At position @{text "[0,1]"} of this value is the
   956    defined by
   957   subvalue @{text "Char y"} and at position @{text "[1]"} the
       
   958   subvalue @{term "Char z"}.  At the `root' position, or empty list
       
   959   @{term "[]"}, is the whole value @{term v}. The positions @{text
       
   960   "[0,1,0]"} and @{text "[2]"}, for example, are outside of @{text
       
   961   v}. If it exists, the subvalue of @{term v} at a position @{text
       
   962   p}, written @{term "at v p"}, can be recursively defined by
   957   
   963   
   958   \begin{center}
   964   \begin{center}
   959   \begin{tabular}{r@ {\hspace{0mm}}lcl}
   965   \begin{tabular}{r@ {\hspace{0mm}}lcl}
   960   @{term v} &  @{text "\<downharpoonleft>\<^bsub>[]\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(1)}\\
   966   @{term v} &  @{text "\<downharpoonleft>\<^bsub>[]\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(1)}\\
   961   @{term "Left v"} & @{text "\<downharpoonleft>\<^bsub>0::ps\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(2)}\\
   967   @{term "Left v"} & @{text "\<downharpoonleft>\<^bsub>0::ps\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(2)}\\
   968   @{thm (rhs) at.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2", simplified Suc_0_fold]} \\
   974   @{thm (rhs) at.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2", simplified Suc_0_fold]} \\
   969   @{term "Stars vs"} & @{text "\<downharpoonleft>\<^bsub>n::ps\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(6)}\\
   975   @{term "Stars vs"} & @{text "\<downharpoonleft>\<^bsub>n::ps\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(6)}\\
   970   \end{tabular} 
   976   \end{tabular} 
   971   \end{center}
   977   \end{center}
   972 
   978 
   973   \noindent We use Isabelle's notation @{term "vs ! n"} for the
   979   \noindent In the last clause we use Isabelle's notation @{term "vs ! n"} for the
   974   @{text n}th element in a list.  The set of positions inside a value @{text v},
   980   @{text n}th element in a list.  The set of positions inside a value @{text v},
   975   written @{term "Pos v"}, is given by the clauses
   981   written @{term "Pos v"}, is given by the clauses
   976 
   982 
   977   \begin{center}
   983   \begin{center}
   978   \begin{tabular}{lcl}
   984   \begin{tabular}{lcl}
  1115  The reasoning in the other cases is similar.\qed
  1121  The reasoning in the other cases is similar.\qed
  1116  \end{proof}
  1122  \end{proof}
  1117 
  1123 
  1118  \noindent We can show that @{term "DUMMY :\<sqsubseteq>val DUMMY"} is
  1124  \noindent We can show that @{term "DUMMY :\<sqsubseteq>val DUMMY"} is
  1119  a partial order.  Okui and Suzuki also show that it is a linear order
  1125  a partial order.  Okui and Suzuki also show that it is a linear order
  1120  for lexical values \cite{OkuiSuzuki2010}, but we have not done
  1126  for lexical values \cite{OkuiSuzuki2010} of a given regular
  1121  this. What we are going to show below is that for a given @{text r}
  1127  expression and given string, but we have not done this. It is not
  1122  and @{text s}, the ordering has a unique minimal element on the set
  1128  essential for our results. What we are going to show below is that
  1123  @{term "LV r s"} , which is the POSIX value we defined in the
  1129  for a given @{text r} and @{text s}, the ordering has a unique
  1124  previous section.
  1130  minimal element on the set @{term "LV r s"}, which is the POSIX value
       
  1131  we defined in the previous section.
  1125 
  1132 
  1126 
  1133 
  1127  Lemma 1
  1134  Lemma 1
  1128 
  1135 
  1129  @{thm [mode=IfThen] PosOrd_shorterE[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1136  @{thm [mode=IfThen] PosOrd_shorterE[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1143 
  1150 
  1144   \begin{theorem}
  1151   \begin{theorem}
  1145   @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1152   @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1146   \end{theorem}
  1153   \end{theorem}
  1147 
  1154 
  1148   \begin{proof}
  1155   \begin{proof} By induction on our POSIX rules. It is clear that
  1149   By induction on our POSIX rules. The two base cases are straightforward: for example 
  1156   @{text "v\<^sub>1"} and @{text "v\<^sub>2"} have the same underlying
  1150   for @{term "v\<^sub>1 = Void"}, we have that @{term "v\<^sub>2 \<in> LV ONE []"} must also 
  1157   string.
  1151   be of the form \mbox{@{term "v\<^sub>2 = Void"}}. Therfore we have @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.
  1158 
  1152   The inductive cases are as follows:
  1159   The two base cases are straightforward: for example for @{term
  1153 
  1160   "v\<^sub>1 = Void"}, we have that @{term "v\<^sub>2 \<in> LV ONE
  1154   \begin{itemize}
  1161   []"} must also be of the form \mbox{@{term "v\<^sub>2 =
  1155   \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) \<rightarrow> (Left w\<^sub>1)"}:
  1162   Void"}}. Therefore we have @{term "v\<^sub>1 :\<sqsubseteq>val
  1156   In this case @{term "v\<^sub>1 = Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is either 
  1163   v\<^sub>2"}.  The inductive cases are as follows:
  1157   of the form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the latter case we 
  1164 
  1158   can immediately conclude with @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} since a @{text Left}-value 
  1165   \begin{itemize} \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1
  1159   with the same underlying string @{text s} is always smaller or equal than a @{text Right}-value.
  1166   r\<^sub>2) \<rightarrow> (Left w\<^sub>1)"}: In this case @{term
  1160   In the former case we have @{term "w\<^sub>2 \<in> LV r\<^sub>1 s"} and can use the induction
  1167   "v\<^sub>1 = Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is
  1161   hypothesis to infer @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term "w\<^sub>1"}
  1168   either of the form @{term "Left w\<^sub>2"} or @{term "Right
  1162   and @{term "w\<^sub>2"} have the same underlying string @{text s}, we can conclude with 
  1169   w\<^sub>2"}. In the latter case we can immediately conclude with
  1163   @{term "Left w\<^sub>1 :\<sqsubseteq>val Left w\<^sub>2"}. 
  1170   @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} since a @{text
  1164 
  1171   Left}-value with the same underlying string @{text s} is always
  1165   \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) \<rightarrow> (Right w\<^sub>1)"}:
  1172   smaller or equal than a @{text Right}-value.  In the former case we
  1166   Similarly for the case where
  1173   have @{term "w\<^sub>2 \<in> LV r\<^sub>1 s"} and can use the
  1167   @{term "v\<^sub>1 = Right w\<^sub>1"}.
  1174   induction hypothesis to infer @{term "w\<^sub>1 :\<sqsubseteq>val
  1168 
  1175   w\<^sub>2"}. Because @{term "w\<^sub>1"} and @{term "w\<^sub>2"}
  1169   \item[$\bullet$]
  1176   have the same underlying string @{text s}, we can conclude with
       
  1177   @{term "Left w\<^sub>1 :\<sqsubseteq>val Left w\<^sub>2"}.\smallskip
       
  1178 
       
  1179   \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
       
  1180   \<rightarrow> (Right w\<^sub>1)"}: This case similar as the previous
       
  1181   case, except that we know that @{term "s \<notin> L
       
  1182   r\<^sub>1"}. This is needed when @{term "v\<^sub>2 = Left
       
  1183   w\<^sub>2"}: since \mbox{@{term "flat v\<^sub>2 = flat w\<^sub>2"}
       
  1184   @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 : r\<^sub>1"}, we
       
  1185   can derive a contradiction using Prop.~\ref{inhabs}. So also in this
       
  1186   case \mbox{@{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}}.\smallskip
       
  1187 
       
  1188   \item[$\bullet$]  Case @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ r\<^sub>1 r\<^sub>2)
       
  1189   \<rightarrow> (Seq w\<^sub>1 w\<^sub>2)"}: Assume @{term "v\<^sub>2 =
       
  1190   Seq (u\<^sub>1) (u\<^sub>2)"} with @{term "\<Turnstile> u\<^sub>1 : r\<^sub>1"}
       
  1191   and \mbox{@{term "\<Turnstile> u\<^sub>2 : r\<^sub>2"}}. We have
       
  1192   
  1170   \end{itemize}
  1193   \end{itemize}
  1171   \end{proof}
  1194   \end{proof}
  1172 
  1195 
  1173   Given a lexical value @{text "v\<^sub>1"}, say, in @{term "LV r s"} for which there does 
  1196   Given a lexical value @{text "v\<^sub>1"}, say, in @{term "LV r s"} for which there does 
  1174   not exists any smaller value in @{term "LV r s"}, then this value is our POSIX value:
  1197   not exists any smaller value in @{term "LV r s"}, then this value is our POSIX value: