ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 606 99b530103464
parent 605 ed53ce26ecb6
child 607 e6fc9b72c0e3
equal deleted inserted replaced
605:ed53ce26ecb6 606:99b530103464
   582 more than half of the 
   582 more than half of the 
   583 XSDs they found have bounded regular expressions in them.
   583 XSDs they found have bounded regular expressions in them.
   584 Often the counters are quite large, the largest up to ten million. 
   584 Often the counters are quite large, the largest up to ten million. 
   585 An example XSD they gave
   585 An example XSD they gave
   586 was:
   586 was:
   587 \begin{verbatim}
   587 %\begin{verbatim}
   588 <sequence minOccurs="0" maxOccurs="65535">
   588 %<sequence minOccurs="0" maxOccurs="65535">
   589     <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   589 %  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   590     <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   590 %  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   591 </sequence>
   591 %</sequence>
   592 \end{verbatim}
   592 %\end{verbatim}
   593 This can be seen as the expression 
   593 This can be seen as the expression 
   594 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   594 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   595 regular expressions 
   595 regular expressions 
   596 satisfying certain constraints (such as 
   596 satisfying certain constraints (such as 
   597 satisfying the floating point number format).
   597 satisfying the floating point number format).
   602 for bounded regular expressions:
   602 for bounded regular expressions:
   603 For example, in the regular expression matching library in the Go
   603 For example, in the regular expression matching library in the Go
   604 language the regular expression $a^{1001}$ is not permitted, because no counter
   604 language the regular expression $a^{1001}$ is not permitted, because no counter
   605 can be above 1000, and in the built-in Rust regular expression library
   605 can be above 1000, and in the built-in Rust regular expression library
   606 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   606 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   607 for being too big. These problems can of course be solved in matching algorithms where 
   607 for being too big. 
       
   608 As Becchi and Crawley\cite{Becchi08}  have pointed out,
       
   609 the reason for these restrictions
       
   610 are that they simulate a non-deterministic finite
       
   611 automata (NFA) with a breadth-first search.
       
   612 This way the number of active states could
       
   613 be equal to the counter number.
       
   614 When the counters are large, 
       
   615 the memory requirement could become
       
   616 infeasible, and the pattern needs to be rejected straight away.
       
   617 \begin{figure}[H]
       
   618 \begin{center}
       
   619 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
       
   620  
       
   621     	\node (q0) [state, initial] {$0$};
       
   622 	\node (q1) [state, right = of q0] {$1$};
       
   623 	\node (q2) [state, right = of q1] {$2$};
       
   624 	\node (qdots) [right = of q2] {$\ldots$};
       
   625 	\node (qn) [state, right = of qdots] {$n$};
       
   626 	\node (qn1) [state, right = of qn] {$n+1$};
       
   627 	\node (qn2) [state, right = of qn1] {$n+2$};
       
   628 	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
       
   629  
       
   630 \path [-stealth, thick]
       
   631 	(q0) edge [loop above] node {a} ()
       
   632     (q0) edge node {a}   (q1) 
       
   633     (q1) edge node {.}   (q2)
       
   634     (q2) edge node {.}   (qdots)
       
   635     (qdots) edge node {.} (qn)
       
   636     (qn) edge node {.} (qn1)
       
   637     (qn1) edge node {b} (qn2)
       
   638     (qn2) edge node {$c$} (qn3);
       
   639 \end{tikzpicture}
       
   640 %\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
       
   641 %   \node[state,initial] (q_0)   {$0$}; 
       
   642 %   \node[state, ] (q_1) [right=of q_0] {$1$}; 
       
   643 %   \node[state, ] (q_2) [right=of q_1] {$2$}; 
       
   644 %   \node[state,
       
   645 %   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
       
   646 %    \path[->] 
       
   647 %    (q_0) edge  node {a} (q_1)
       
   648 %    	  edge [loop below] node {a,b} ()
       
   649 %    (q_1) edge  node  {a,b} (q_2)
       
   650 %    (q_2) edge  node  {a,b} (q_3);
       
   651 %\end{tikzpicture}
       
   652 \end{center}
       
   653 \caption{The example given by Becchi and Crawley
       
   654 	that NFA simulation can consume large
       
   655 	amounts of memory: $.^*a.^{\{n\}}bc$ matching
       
   656 	strings of the form $aaa\ldots aaaabc$.
       
   657 	When traversing in a breadth-first manner,
       
   658 all states from 0 till $n+1$ will become active.}
       
   659 \end{figure}
       
   660 %Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
       
   661 %type of $\mathit{NFA}$ simulation and guarantees a linear runtime
       
   662 %in terms of input string length.
       
   663 %TODO:try out these lexers
       
   664 These problems can of course be solved in matching algorithms where 
   608 automata go beyond the classic notion and for instance include explicit
   665 automata go beyond the classic notion and for instance include explicit
   609 counters \cite{Turo_ov__2020}.
   666 counters \cite{Turo_ov__2020}.
       
   667 These solutions can be quite effective,
       
   668 with the ability to process
       
   669 gigabytes of string input per second
       
   670 even with large counters \cite{Becchi08}.
       
   671 But formally reasoning about these automata can be challenging
       
   672 and un-intuitive.
       
   673 Therefore, correctness and runtime claims made about these solutions need to be
       
   674 taken with a grain of salt.
   610 
   675 
   611 In the work reported in \cite{CSL2022} and here, 
   676 In the work reported in \cite{CSL2022} and here, 
   612 we add better support using derivatives
   677 we add better support using derivatives
   613 for bounded regular expressions $r^{\{n\}}$.
   678 for bounded regular expressions $r^{\{n\}}$.
   614 The results
   679 The results
   627 %recursive functions and inductive datatypes.
   692 %recursive functions and inductive datatypes.
   628 Finally, bounded regular expressions do not destroy our finite
   693 Finally, bounded regular expressions do not destroy our finite
   629 boundedness property, which we shall prove later on.
   694 boundedness property, which we shall prove later on.
   630 
   695 
   631 
   696 
       
   697 
       
   698 
       
   699 
   632 \subsection{Back-References}
   700 \subsection{Back-References}
   633 %Some constructs can even raise the expressive
   701 The other way to simulate an $\mathit{NFA}$ for matching is choosing  
   634 %power to the non-regular realm, for example
   702 a single transition each time, keeping all the other options in 
   635 %the back-references.
   703 a queue or stack, and backtracking if that choice eventually 
   636 %bounded repetitions, as we have discussed in the 
   704 fails. This method, often called a  "depth-first-search", 
   637 %previous section.
   705 is efficient in a lot of cases, but could end up
   638 %This super-linear behaviour of the 
   706 with exponential run time.
   639 %regex matching engines we have?
   707 The backtracking method is employed in regex libraries
       
   708 that support \emph{back-references}, for example
       
   709 in Java and Python.
   640 The syntax and expressive power of regexes 
   710 The syntax and expressive power of regexes 
   641 can go quickly beyond the regular language hierarchy
   711 go beyond the regular language hierarchy
   642 with back-references.
   712 with back-references.
   643 %\section{Back-references and The Terminology Regex}
   713 %\section{Back-references and The Terminology Regex}
   644 
   714 
   645 %When one constructs an $\NFA$ out of a regular expression
   715 %When one constructs an $\NFA$ out of a regular expression
   646 %there is often very little to be done in the first phase, one simply 
   716 %there is often very little to be done in the first phase, one simply 
   650 %one by keeping track of all active states after consuming 
   720 %one by keeping track of all active states after consuming 
   651 %a character, and update that set of states iteratively.
   721 %a character, and update that set of states iteratively.
   652 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$
   722 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$
   653 %for a path terminating
   723 %for a path terminating
   654 %at an accepting state.
   724 %at an accepting state.
   655 Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
   725 
   656 type of $\mathit{NFA}$ simulation and guarantees a linear runtime
   726 
   657 in terms of input string length.
   727 
   658 %TODO:try out these lexers
   728 Given a regular expression like this (the sequence
   659 The other way to use $\mathit{NFA}$ for matching is choosing  
       
   660 a single transition each time, keeping all the other options in 
       
   661 a queue or stack, and backtracking if that choice eventually 
       
   662 fails. This method, often called a  "depth-first-search", 
       
   663 is efficient in a lot of cases, but could end up
       
   664 with exponential run time.\\
       
   665 %TODO:COMPARE java python lexer speed with Rust and Go
       
   666 The reason behind backtracking algorithms in languages like
       
   667 Java and Python is that they support back-references.
       
   668 \subsubsection{Back References}
       
   669 If we have a regular expression like this (the sequence
       
   670 operator is omitted for brevity):
   729 operator is omitted for brevity):
   671 \begin{center}
   730 \begin{center}
   672 	$r_1(r_2(r_3r_4))$
   731 	$r_1r_2r_3r_4$
   673 \end{center}
   732 \end{center}
   674 We could label sub-expressions of interest 
   733 one could label sub-expressions of interest 
   675 by parenthesizing them and giving 
   734 by parenthesizing them and giving 
   676 them a number by the order in which their opening parentheses appear.
   735 them a number by the order in which their opening parentheses appear.
   677 One possible way of parenthesizing and labelling is given below:
   736 One possible way of parenthesizing and labelling is given below:
   678 \begin{center}
   737 \begin{center}
   679 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
   738 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
   680 \end{center}
   739 \end{center}
   681 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled
   740 The sub-expressions
   682 by 1 to 4. $1$ would refer to the entire expression 
   741 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
   683 $(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc.
   742 by 1 to 4, and can be ``referred back'' by their respective numbers. 
   684 These sub-expressions are called "capturing groups".
   743 %These sub-expressions are called "capturing groups".
   685 We can use the following syntax to denote that we want a string just matched by a 
   744 To do so, we use the syntax $\backslash i$ 
   686 sub-expression (capturing group) to appear at a certain location again, 
   745 to denote that we want the sub-string 
   687 exactly as it was:
   746 of the input just matched by the i-th
       
   747 sub-expression to appear again, 
       
   748 exactly the same as it first appeared: 
   688 \begin{center}
   749 \begin{center}
   689 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
   750 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
   690 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
   751 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
   691 \end{center}
   752 \end{center}
   692 The backslash and number $i$ are used to denote such 
   753 %The backslash and number $i$ are the
   693 so-called "back-references".
   754 %so-called "back-references".
   694 Let $e$ be an expression made of regular expressions 
   755 %Let $e$ be an expression made of regular expressions 
   695 and back-references. $e$ contains the expression $e_i$
   756 %and back-references. $e$ contains the expression $e_i$
   696 as its $i$-th capturing group.
   757 %as its $i$-th capturing group.
   697 The semantics of back-reference can be recursively
   758 %The semantics of back-reference can be recursively
   698 written as:
   759 %written as:
       
   760 %\begin{center}
       
   761 %	\begin{tabular}{c}
       
   762 %		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
       
   763 %		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
       
   764 %	\end{tabular}
       
   765 %\end{center}
       
   766 A concrete example
       
   767 for back-references would be
   699 \begin{center}
   768 \begin{center}
   700 	\begin{tabular}{c}
   769 $((a|b|c|\ldots|z)^*)\backslash 1$,
   701 		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
       
   702 		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
       
   703 	\end{tabular}
       
   704 \end{center}
   770 \end{center}
   705 The concrete example
   771 which would match 
   706 $((a|b|c|\ldots|z)^*)\backslash 1$
   772 strings that can be split into two identical halves,
   707 would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\
   773 for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
   708 Back-reference is a construct in the "regex" standard
   774 Back-reference is a regex construct 
   709 that programmers found useful, but not exactly 
   775 that programmers found useful, but not exactly 
   710 regular any more.
   776 regular any more.
   711 In fact, that allows the regex construct to express 
   777 In fact, that allows the regex construct to express 
   712 languages that cannot be contained in context-free
   778 languages that cannot be contained in context-free
   713 languages either.
   779 languages either.