ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 531 c334f0b3ef52
parent 530 823d9b19d21c
equal deleted inserted replaced
530:823d9b19d21c 531:c334f0b3ef52
   351 consequently the bigger the size of the derivative the slower the
   351 consequently the bigger the size of the derivative the slower the
   352 algorithm. 
   352 algorithm. 
   353 
   353 
   354 Brzozowski was quick in finding that during this process a lot useless
   354 Brzozowski was quick in finding that during this process a lot useless
   355 $\ONE$s and $\ZERO$s are generated and therefore not optimal.
   355 $\ONE$s and $\ZERO$s are generated and therefore not optimal.
   356 He also introduced some "similarity rules" such
   356 He also introduced some "similarity rules", such
   357 as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
   357 as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
   358 different but language-equivalent sub-regexes to further decrease the size
   358 different but language-equivalent sub-regexes to further decrease the size
   359 of the intermediate regexes. 
   359 of the intermediate regexes. 
   360 
   360 
   361 More simplifications are possible, such as deleting duplicates
   361 More simplifications are possible, such as deleting duplicates
   366 a matcher with simpler regular expressions.
   366 a matcher with simpler regular expressions.
   367 
   367 
   368 If we want the size of derivatives in the algorithm to
   368 If we want the size of derivatives in the algorithm to
   369 stay even lower, we would need more aggressive simplifications.
   369 stay even lower, we would need more aggressive simplifications.
   370 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
   370 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
   371 deleting duplicates whenever possible. For example, the parentheses in
   371 delete duplicates whenever possible. For example, the parentheses in
   372 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
   372 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
   373 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
   373 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
   374 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   374 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   375 $a^*+a+\ONE$.  These more aggressive simplification rules are for
   375 $a^*+a+\ONE$.  These more aggressive simplification rules are for
   376  a very tight size bound, possibly as low
   376  a very tight size bound, possibly as low
   377   as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
   377   as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
   378 
   378 
   379 Building derivatives and then simplify them.
   379 Building derivatives and then simplifying them.
   380 So far so good. But what if we want to 
   380 So far, so good. But what if we want to 
   381 do lexing instead of just  getting a YES/NO answer?
   381 do lexing instead of just getting a YES/NO answer?
   382 This requires us to go back again to the world 
   382 This requires us to go back again to the world 
   383 without simplification first for a moment.
   383 without simplification first for a moment.
   384 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
   384 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
   385 elegant(arguably as beautiful as the original
   385 elegant(arguably as beautiful as the original
   386 derivatives definition) solution for this.
   386 derivatives definition) solution for this.
   420 	\end{tabular}
   420 	\end{tabular}
   421 \end{center}
   421 \end{center}
   422 
   422 
   423 \noindent
   423 \noindent
   424 
   424 
   425 Building on top of Sulzmann and Lu's attempt to formalize the 
   425 Building on top of Sulzmann and Lu's attempt to formalise the 
   426 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
   426 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
   427 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
   427 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
   428 POSIX matching as a ternary relation recursively defined in a
   428 POSIX matching as a ternary relation recursively defined in a
   429 natural deduction style.
   429 natural deduction style.
   430 With the formally-specified rules for what a POSIX matching is,
   430 With the formally-specified rules for what a POSIX matching is,
   438 One regular expression can have multiple lexical values. For example
   438 One regular expression can have multiple lexical values. For example
   439 for the regular expression $(a+b)^*$, it has a infinite list of
   439 for the regular expression $(a+b)^*$, it has a infinite list of
   440 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
   440 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
   441 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
   441 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
   442 $\ldots$, and vice versa.
   442 $\ldots$, and vice versa.
   443 Even for the regular expression matching a certain string, there could 
   443 Even for the regular expression matching a particular string, there could 
   444 still be more than one value corresponding to it.
   444 be more than one value corresponding to it.
   445 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
   445 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
   446 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   446 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   447 If we do not allow any empty iterations in its lexical values,
   447 If we do not allow any empty iterations in its lexical values,
   448 there will be $n - 1$ "splitting points" on $s$ we can choose to 
   448 there will be $n - 1$ "splitting points" on $s$ we can choose to 
   449 split or not so that each sub-string
   449 split or not so that each sub-string
   461 between the two $a^*$s.
   461 between the two $a^*$s.
   462 It is not surprising there are exponentially many lexical values
   462 It is not surprising there are exponentially many lexical values
   463 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
   463 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
   464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   465 
   465 
   466 A lexer aimed at keeping all the possible values will naturally 
   466 A lexer to keep all the possible values will naturally 
   467 have an exponential runtime on ambiguous regular expressions.
   467 have an exponential runtime on ambiguous regular expressions.
   468 Somehow one has to decide which lexical value to keep and
   468 Somehow one has to decide which lexical value to keep and
   469 output in a lexing algorithm.
   469 output in a lexing algorithm.
   470 In practice, we are usually 
   470 In practice, we are usually 
   471 interested about POSIX values, which by intuition always
   471 interested in POSIX values, which by intuition always
   472 \begin{itemize}
   472 \begin{itemize}
   473 \item
   473 \item
   474 match the leftmost regular expression when multiple options of matching
   474 match the leftmost regular expression when multiple options of matching
   475 are available  
   475 are available  
   476 \item 
   476 \item 
   511 
   511 
   512 
   512 
   513 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   513 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   514 algorithm by a second phase (the first phase being building successive
   514 algorithm by a second phase (the first phase being building successive
   515 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
   515 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
   516 is generated in case the regular expression matches  the string.
   516 is generated if the regular expression matches the string.
   517 How can we construct a value out of regular expressions and character
   517 How can we construct a value out of regular expressions and character
   518 sequences only?
   518 sequences only?
   519 Two functions are involved: $\inj$ and $\mkeps$.
   519 Two functions are involved: $\inj$ and $\mkeps$.
   520 The function $\mkeps$ constructs a value from the last
   520 The function $\mkeps$ constructs a value from the last
   521 one of all the successive derivatives:
   521 one of all the successive derivatives:
   543 		\end{tabular}
   543 		\end{tabular}
   544 	\end{center}
   544 	\end{center}
   545 
   545 
   546 
   546 
   547 \noindent 
   547 \noindent 
   548 We favour the left to match an empty string in case there is a choice.
   548 We favour the left to match an empty string if there is a choice.
   549 When there is a star for us to match the empty string,
   549 When there is a star for us to match the empty string,
   550 we simply give the $\Stars$ constructor an empty list, meaning
   550 we give the $\Stars$ constructor an empty list, meaning
   551 no iterations are taken.
   551 no iterations are taken.
   552 
   552 
   553 
   553 
   554 After the $\mkeps$-call, we inject back the characters one by one in order to build
   554 After the $\mkeps$-call, we inject back the characters one by one in order to build
   555 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
   555 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
   556 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   556 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   557 After injecting back $n$ characters, we get the lexical value for how $r_0$
   557 After injecting back $n$ characters, we get the lexical value for how $r_0$
   558 matches $s$. The POSIX value is maintained throught out the process.
   558 matches $s$. The POSIX value is maintained throughout the process.
   559 For this Sulzmann and Lu defined a function that reverses
   559 For this Sulzmann and Lu defined a function that reverses
   560 the ``chopping off'' of characters during the derivative phase. The
   560 the ``chopping off'' of characters during the derivative phase. The
   561 corresponding function is called \emph{injection}, written
   561 corresponding function is called \emph{injection}, written
   562 $\textit{inj}$; it takes three arguments: the first one is a regular
   562 $\textit{inj}$; it takes three arguments: the first one is a regular
   563 expression ${r_{i-1}}$, before the character is chopped off, the second
   563 expression ${r_{i-1}}$, before the character is chopped off, the second
   591 \end{tabular}
   591 \end{tabular}
   592 \end{center}
   592 \end{center}
   593 
   593 
   594 \noindent This definition is by recursion on the ``shape'' of regular
   594 \noindent This definition is by recursion on the ``shape'' of regular
   595 expressions and values. 
   595 expressions and values. 
   596 The clauses basically do one thing--identifying the ``holes'' on 
   596 The clauses do one thing--identifying the ``hole'' on a
   597 value to inject the character back into.
   597 value to inject the character back into.
   598 For instance, in the last clause for injecting back to a value
   598 For instance, in the last clause for injecting back to a value
   599 that would turn into a new star value that corresponds to a star,
   599 that would turn into a new star value that corresponds to a star,
   600 we know it must be a sequence value. And we know that the first 
   600 we know it must be a sequence value. And we know that the first 
   601 value of that sequence corresponds to the child regex of the star
   601 value of that sequence corresponds to the child regex of the star
   602 with the first character being chopped off--an iteration of the star
   602 with the first character being chopped off--an iteration of the star
   603 that had just been unfolded. This value is followed by the already
   603 that had just been unfolded. This value is followed by the already
   604 matched star iterations we collected before. So we inject the character 
   604 matched star iterations we collected before. So we inject the character 
   605 back to the first value and form a new value with this new iteration
   605 back to the first value and form a new value with this latest iteration
   606 being added to the previous list of iterations, all under the $\Stars$
   606 being added to the previous list of iterations, all under the $\Stars$
   607 top level.
   607 top level.
   608 
   608 
   609 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
   609 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
   610 we have a lexer with the following recursive definition:
   610 we have a lexer with the following recursive definition:
   633 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
   633 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
   634 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
   634 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
   635 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
   635 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
   636 the derivative $r_n$. We test whether this derivative is
   636 the derivative $r_n$. We test whether this derivative is
   637 $\textit{nullable}$ or not. If not, we know the string does not match
   637 $\textit{nullable}$ or not. If not, we know the string does not match
   638 $r$ and no value needs to be generated. If yes, we start building the
   638 $r$, and no value needs to be generated. If yes, we start building the
   639 values incrementally by \emph{injecting} back the characters into the
   639 values incrementally by \emph{injecting} back the characters into the
   640 earlier values $v_n, \ldots, v_0$. This is the second phase of the
   640 earlier values $v_n, \ldots, v_0$. This is the second phase of the
   641 algorithm from the right to left. For the first value $v_n$, we call the
   641 algorithm from right to left. For the first value $v_n$, we call the
   642 function $\textit{mkeps}$, which builds a POSIX lexical value
   642 function $\textit{mkeps}$, which builds a POSIX lexical value
   643 for how the empty string has been matched by the (nullable) regular
   643 for how the empty string has been matched by the (nullable) regular
   644 expression $r_n$. This function is defined as
   644 expression $r_n$. This function is defined as
   645 
   645 
   646 
   646 
   647 
   647 
   648 We have mentioned before that derivatives without simplification 
   648 We have mentioned before that derivatives without simplification 
   649 can get clumsy, and this is true for values as well--they reflect
   649 can get clumsy, and this is true for values as well--they reflect
   650 the regular expressions size by definition.
   650 the size of the regular expression by definition.
   651 
   651 
   652 One can introduce simplification on the regex and values, but have to
   652 One can introduce simplification on the regex and values but have to
   653 be careful in not breaking the correctness as the injection 
   653 be careful not to break the correctness, as the injection 
   654 function heavily relies on the structure of the regexes and values
   654 function heavily relies on the structure of the regexes and values
   655 being correct and match each other.
   655 being correct and matching each other.
   656 It can be achieved by recording some extra rectification functions
   656 It can be achieved by recording some extra rectification functions
   657 during the derivatives step, and applying these rectifications in 
   657 during the derivatives step, and applying these rectifications in 
   658 each run during the injection phase.
   658 each run during the injection phase.
   659 And we can prove that the POSIX value of how
   659 And we can prove that the POSIX value of how
   660 regular expressions match strings will not be affected---although is much harder
   660 regular expressions match strings will not be affected---although it is much harder
   661 to establish. 
   661 to establish. 
   662 Some initial results in this regard have been
   662 Some initial results in this regard have been
   663 obtained in \cite{AusafDyckhoffUrban2016}. 
   663 obtained in \cite{AusafDyckhoffUrban2016}. 
   664 
   664 
   665 
   665 
   666 
   666 
   667 %Brzozowski, after giving the derivatives and simplification,
   667 %Brzozowski, after giving the derivatives and simplification,
   668 %did not explore lexing with simplification or he may well be 
   668 %did not explore lexing with simplification, or he may well be 
   669 %stuck on an efficient simplificaiton with a proof.
   669 %stuck on an efficient simplification with proof.
   670 %He went on to explore the use of derivatives together with 
   670 %He went on to examine the use of derivatives together with 
   671 %automaton, and did not try lexing using derivatives.
   671 %automaton, and did not try lexing using products.
   672 
   672 
   673 We want to get rid of complex and fragile rectification of values.
   673 We want to get rid of the complex and fragile rectification of values.
   674 Can we not create those intermediate values $v_1,\ldots v_n$,
   674 Can we not create those intermediate values $v_1,\ldots v_n$,
   675 and get the lexing information that should be already there while
   675 and get the lexing information that should be already there while
   676 doing derivatives in one pass, without a second phase of injection?
   676 doing derivatives in one pass, without a second injection phase?
   677 In the meantime, can we make sure that simplifications
   677 In the meantime, can we make sure that simplifications
   678 are easily handled without breaking the correctness of the algorithm?
   678 are easily handled without breaking the correctness of the algorithm?
   679 
   679 
   680 Sulzmann and Lu solved this problem by
   680 Sulzmann and Lu solved this problem by
   681 introducing additional informtaion to the 
   681 introducing additional information to the 
   682 regular expressions called \emph{bitcodes}.
   682 regular expressions called \emph{bitcodes}.
   683 
   683 
   684 
   684 
   685 
   685 
   686 
   686