thys3/Paper.thy
changeset 644 9f984ff20020
parent 643 9580bae0500d
equal deleted inserted replaced
643:9580bae0500d 644:9f984ff20020
   258   
   258   
   259   In our work here we also add to the usual ``basic'' regular
   259   In our work here we also add to the usual ``basic'' regular
   260   expressions the \emph{bounded} regular expression @{term "NTIMES r
   260   expressions the \emph{bounded} regular expression @{term "NTIMES r
   261   n"} where the @{term n} specifies that @{term r} should match
   261   n"} where the @{term n} specifies that @{term r} should match
   262   exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded
   262   exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded
   263   regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
   263   regular expressions @{text "r"}$^{\{..\textit{n}\}}$,
   264   and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
   264   @{text "r"}$^{\{\textit{n}..\}}$
       
   265   and @{text "r"}$^{\{\textit{n}..\textit{m}\}}$ which specify intervals for how many
   265   times @{text r} should match. The results presented in this paper
   266   times @{text r} should match. The results presented in this paper
   266   extend straightforwardly to them too. The importance of the bounded
   267   extend straightforwardly to them too. The importance of the bounded
   267   regular expressions is that they are often used in practical
   268   regular expressions is that they are often used in practical
   268   applications, such as Snort (a system for detecting network
   269   applications, such as Snort (a system for detecting network
   269   intrusions) and also in XML Schema definitions. According to Bj\"{o}rklund et
   270   intrusions) and also in XML Schema definitions. According to Bj\"{o}rklund et
   270   al~\cite{BjorklundMartensTimm2015}, bounded regular expressions 
   271   al~\cite{BjorklundMartensTimm2015}, bounded regular expressions 
   271   occur frequently in the latter and can have counters of up to
   272   occur frequently in the latter and can have counters of up to
   272   ten million.  The problem is that tools based on the classic notion
   273   ten million.  The problem is that tools based on the classic notion
   273   of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n}
   274   of automata need to expand @{text "r"}$^{\{\textit{n}\}}$ into @{text n}
   274   connected copies of the automaton for @{text r}. This leads to very
   275   connected copies of the automaton for @{text r}. This leads to very
   275   inefficient matching algorithms or algorithms that consume large
   276   inefficient matching algorithms or algorithms that consume large
   276   amounts of memory.  A classic example is the regular expression
   277   amounts of memory.  A classic example is the regular expression
   277   \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
   278   \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
   278   where the minimal DFA requires at least $2^{n + 1}$ states (see
   279   where the minimal DFA requires at least $2^{n + 1}$ states (see
   395   inhabitation relation that associates values to regular expressions. Our
   396   inhabitation relation that associates values to regular expressions. Our
   396   version of this relation is defined by the following six rules:
   397   version of this relation is defined by the following six rules:
   397   %
   398   %
   398   \begin{center}
   399   \begin{center}
   399   \begin{tabular}{@ {}l@ {}}
   400   \begin{tabular}{@ {}l@ {}}
   400   @{thm[mode=Axiom] Prf.intros(4)}\quad
   401   @{thm[mode=Axiom] Prf.intros(4)}\qquad
   401   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
   402   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
   402   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad
   403   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
   403   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
   404   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
   404   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
   405   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
   405   @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
   406   @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
   406   $\mprset{flushleft}\inferrule{
   407   $\mprset{flushleft}\inferrule{
   407   @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   408   @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   540   a value for how the last derivative can match the empty string. In case
   541   a value for how the last derivative can match the empty string. In case
   541   of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
   542   of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
   542   a list of exactly @{term n} copies, which is the length of the list we expect in this
   543   a list of exactly @{term n} copies, which is the length of the list we expect in this
   543   case.  The injection function\footnote{While the character argument @{text c} is not
   544   case.  The injection function\footnote{While the character argument @{text c} is not
   544   strictly necessary in the @{text inj}-function for the fragment of regular expressions we
   545   strictly necessary in the @{text inj}-function for the fragment of regular expressions we
   545   use in this paper, it is necessary for extended regular expressions, for example for the range regular expression of the form @{text "[a-z]"}.
   546   use in this paper, it is necessary for extended regular expressions. For example for the range regular expression of the form @{text "[a-z]"}.
   546   We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.}
   547   We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.}
   547   then calculates the corresponding value for each intermediate derivative until
   548   then calculates the corresponding value for each intermediate derivative until
   548   a value for the original regular expression is generated.
   549   a value for the original regular expression is generated.
   549   Graphically the algorithm by
   550   Graphically the algorithm by
   550   Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
   551   Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
  1567 @{term "Seq (Left (Char a)) (Left (Char b))"}
  1568 @{term "Seq (Left (Char a)) (Left (Char b))"}
  1568 where the @{text "Left"}-alternatives get priority. However, this violates
  1569 where the @{text "Left"}-alternatives get priority. However, this violates
  1569 the POSIX rules and we have not been able to
  1570 the POSIX rules and we have not been able to
  1570 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
  1571 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
  1571 
  1572 
  1572 Note that while Antimirov was able to give a bound on the \emph{size}
  1573 Note also that while Antimirov was able to give a bound on the \emph{size}
  1573 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1574 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1574 on the \emph{number} of derivatives, provided they are quotient via
  1575 on the \emph{number} of derivatives, provided they are quotient via
  1575 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
  1576 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
  1576 uses his derivatives for obtaining a DFA (it essentially bounds
  1577 uses his derivatives for obtaining a DFA (it essentially bounds
  1577 the number of states). However, this result does \emph{not}
  1578 the number of states). However, this result does \emph{not}