etnms/etnms.tex
changeset 118 c7825cfacc76
parent 117 0acf6b58236e
child 119 cc12352272f2
equal deleted inserted replaced
117:0acf6b58236e 118:c7825cfacc76
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\;\textit{then}\;
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
  & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
      $_{bs}\textit{SEQ}\;(\textit{fuse}\,[0]\,(a\,\backslash c))\,
       (_{[]}\textit{STAR}\,a)$
\end{tabular}    
\end{center}    
%\end{definition}
\fi
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
  $(_{bs}\oplus\;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\oplus\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;_{bs}\oplus\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
  & &$\phantom{\textit{then}\;_{bs}\oplus\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}\;((\textit{fuse}\,[0]\,(a\,\backslash c))\cdot
       (_{[]}a^*))$
\end{tabular}    
\end{center}    

%\end{definition}
\noindent
For instance, when we unfold $_{bs}a^*$ into a sequence, we need to
attach an additional bit $0$ to the front of $a \backslash c$ to
indicate that there is one more star iteration. The sequence clause is
also more subtle: when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; we therefore omit the
definition), the derivative is an alternative. Assume that
$\textit{bmkeps}$ correctly extracts the bitcode for how $a_1$ matches
the string prior to character $c$ (more on this later); then the right
branch of this alternative, which is
$\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2 \backslash c)$, collapses
the regular expression $a_1$ (as it has already been fully matched) and
stores the parsing information at the head of the regular expression
$a_2 \backslash c$ by fusing it to it. The bitsequence $\textit{bs}$,
which was initially attached to the first element of the sequence
$a_1 \cdot a_2$, has now been elevated to the top-level of $\oplus$, as
this information will be needed whichever way the sequence is
matched---no matter whether $c$ belongs to $a_1$ or $a_2$. After
building these derivatives and maintaining all the lexing information,
we complete the lexing by collecting the bitcodes using a generalised
version of the $\textit{mkeps}$ function for annotated regular
expressions, called $\textit{bmkeps}$:

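For concreteness, the clauses above can be transcribed into a small Python sketch (our own tagged-tuple representation, not the verified code of this project; the bit $Z$ is rendered as 0 and $S$ as 1):

```python
# A sketch of bitcoded derivatives. Annotated regexes are tagged tuples:
#   ('AZERO',), ('AONE', bs), ('ACHAR', bs, c),
#   ('AALTS', bs, rs), ('ASEQ', bs, a1, a2), ('ASTAR', bs, a)
# where bs is a list of bits, 0 standing for Z and 1 for S.

def fuse(bs, a):
    """Attach the bits bs to the front of a's bitsequence."""
    if a[0] == 'AZERO':
        return a
    return (a[0], bs + a[1]) + a[2:]

def bnullable(a):
    t = a[0]
    if t in ('AONE', 'ASTAR'):
        return True
    if t == 'AALTS':
        return any(bnullable(x) for x in a[2])
    if t == 'ASEQ':
        return bnullable(a[2]) and bnullable(a[3])
    return False                       # AZERO and ACHAR

def bmkeps(a):
    """Collect the bits for how a matches the empty string."""
    t = a[0]
    if t == 'AONE':
        return a[1]
    if t == 'AALTS':                   # first bnullable alternative
        return a[1] + bmkeps(next(x for x in a[2] if bnullable(x)))
    if t == 'ASEQ':
        return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if t == 'ASTAR':
        return a[1] + [1]              # S-bit: no more iterations

def bder(c, a):
    """The bitcoded derivative, following the table above."""
    t = a[0]
    if t == 'AZERO' or t == 'AONE':
        return ('AZERO',)
    if t == 'ACHAR':
        return ('AONE', a[1]) if a[2] == c else ('AZERO',)
    if t == 'AALTS':
        return ('AALTS', a[1], [bder(c, x) for x in a[2]])
    if t == 'ASEQ':
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):
            return ('AALTS', bs, [('ASEQ', [], bder(c, a1), a2),
                                  fuse(bmkeps(a1), bder(c, a2))])
        return ('ASEQ', bs, bder(c, a1), a2)
    if t == 'ASTAR':
        return ('ASEQ', a[1],
                fuse([0], bder(c, a[2])),   # Z-bit: one more iteration
                ('ASTAR', [], a[2]))

# one derivative step of a* w.r.t. 'a', then collect the bits
a_star = ('ASTAR', [], ('ACHAR', [], 'a'))
print(bmkeps(bder('a', a_star)))   # → [0, 1]
```

The result $[0,1]$ records one star iteration (the $Z$-bit) followed by the end-of-iterations marker (the $S$-bit).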


\noindent\rule[0.5ex]{\linewidth}{1pt}

This PhD-project is about regular expression matching and
lexing. Given the maturity of this topic, the reader might wonder:
surely, regular expressions must have already been studied to death?
What could possibly be \emph{not} known in this area? And surely all
implemented algorithms for regular expression matching are blindingly
fast?

Unfortunately these preconceptions are not supported by evidence: take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
below show:

       
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={JavaScript},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Python},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Java 8},  
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}    
\end{center}  
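\noindent The blow-up can also be reproduced directly with Python's backtracking \texttt{re} engine (a sketch for illustration only; absolute timings of course depend on the machine, so only the match results are checked):

```python
# Matching a^n against (a*)*b must fail (the final 'b' is missing),
# but the time a backtracking engine needs to *fail* roughly doubles
# with every additional 'a'.
import re
import time

pattern = re.compile(r'(a*)*b')

for n in [5, 10, 15, 18]:
    s = 'a' * n
    start = time.perf_counter()
    result = pattern.match(s)      # always None: no 'b' in the string
    elapsed = time.perf_counter() - start
    print(n, result, f'{elapsed:.4f}s')
```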
       

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should be linear in the size of the regular expression and
the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this super-linear behaviour.  But unfortunately, such regular
expressions are not just a few outliers. They are actually 
frequent enough to have a separate name created for
them---\emph{evil regular expressions}. In empirical work, Davis et al
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}. A static-analysis
approach that is both sound and complete exists \cite{17Bir}, but its running 
time on certain examples from the RegExLib and Snort regular expression
libraries is unacceptable. Therefore the problem of efficiency still remains.

       
This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed there must be some
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable. Another very
recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the
outage had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}

       
The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a few
research questions still unanswered and fast algorithms still need to be
developed (for example how to efficiently handle bounded repetitions,
negation and back-references).
%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?

       
There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}.  Consider for
example a regular expression $r_{key}$ for recognising keywords such as
\textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
for recognising identifiers (say, a single character followed by
characters or numbers). One can then form the compound regular
expression $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But
then how should the string \textit{iffoo} be tokenised?  It could be
tokenised as a keyword followed by an identifier, or the entire string
as a single identifier.  Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword?  While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, only relatively recently have precise definitions of what
POSIX matching actually means been formalised
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014},
but the corresponding correctness proof turned out to be  faulty
\cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
the longest initial substring. In the case of a tie, the initial
sub-match is chosen according to some priorities attached to the regular
expressions (e.g.~keywords have a higher priority than identifiers).
This sounds rather simple, but according to Grathwohl et al \cite[Page
36]{CrashCourse2014} this is not the case. They wrote:

       
\begin{quote}
\it{}``The POSIX strategy is more complicated than the greedy because of 
the dependence on information about the length of matched strings in the 
various subexpressions.''
\end{quote}

\noindent
This is also supported by evidence collected by Kuklewicz
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
matchers calculate incorrect results.

       
Our focus in this project is on an algorithm introduced by Sulzmann and
Lu in 2014 for regular expression matching according to the POSIX
strategy \cite{Sulzmann2014}. Their algorithm is based on an older
algorithm by Brzozowski from 1964, where he introduced the notion of
derivatives of regular expressions~\cite{Brzozowski1964}. We shall
briefly explain this algorithm next.

       
\section{The Algorithm by Brzozowski based on Derivatives of Regular
Expressions}

Suppose (basic) regular expressions are given by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]

\noindent
The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.

       
The ingenious contribution by Brzozowski is the notion of
\emph{derivatives} of regular expressions.  The idea behind this
notion is as follows: suppose a regular expression $r$ can match a
string of the form $c\!::\! s$ (that is a list of characters starting
with $c$), what does the regular expression look like that can match
just $s$? Brzozowski gave a neat answer to this question. He started
with the definition of $\textit{nullable}$:
\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
	\end{center}
This function simply tests whether the empty string is in $L(r)$.
       
He then defined
the following operation on regular expressions, written
$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):

\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
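\noindent These two definitions translate almost verbatim into a functional program; the following Python sketch (the tagged-tuple representation is our own choice, not from the text) implements $\textit{nullable}$ and the derivative:

```python
# Regular expressions as tagged tuples:
#   ('0',) = ZERO, ('1',) = ONE, ('c', ch) = character,
#   ('+', r1, r2), ('.', r1, r2), ('*', r)

def nullable(r):
    t = r[0]
    if t == '1' or t == '*':
        return True
    if t == '+':
        return nullable(r[1]) or nullable(r[2])
    if t == '.':
        return nullable(r[1]) and nullable(r[2])
    return False                      # ZERO and characters

def der(r, c):
    """The derivative of r w.r.t. the character c."""
    t = r[0]
    if t == '0' or t == '1':
        return ('0',)
    if t == 'c':
        return ('1',) if r[1] == c else ('0',)
    if t == '+':
        return ('+', der(r[1], c), der(r[2], c))
    if t == '.':
        d = ('.', der(r[1], c), r[2])
        return ('+', d, der(r[2], c)) if nullable(r[1]) else d
    if t == '*':
        return ('.', der(r[1], c), r)

ab = ('.', ('c', 'a'), ('c', 'b'))        # the regular expression a . b
print(nullable(der(der(ab, 'a'), 'b')))   # → True
```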
       

%Assuming the classic notion of a
%\emph{language} of a regular expression, written $L(\_)$, t

\noindent
The main property of the derivative operation is that

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

       
\noindent
For us the main advantage is that derivatives can be
straightforwardly implemented in any functional programming language,
and are easily definable and reasoned about in theorem provers---the
definitions just consist of inductive datatypes and simple recursive
functions. Moreover, the notion of derivatives can be easily
generalised to cover extended regular expression constructors such as
the not-regular expression, written $\neg\,r$, or bounded repetitions
(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
straightforwardly realised within the classic automata approach.
For the moment however, we focus only on the usual basic regular expressions.

       
Now if we want to find out whether a string $s$ matches with a regular
expression $r$, we can build the derivatives of $r$ w.r.t.\ (in succession)
all the characters of the string $s$. Finally, we test whether the
resulting regular expression can match the empty string.  If yes, then
$r$ matches $s$; otherwise it does not. To implement this idea
we can generalise the derivative operation to strings like this:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define the regular-expression matching algorithm as: 
\[
\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)
\]

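\noindent As a self-contained Python sketch (again with our own tagged-tuple representation, not from the text), the string derivative and the matcher are:

```python
# Regular expressions as tagged tuples:
#   ('0',) = ZERO, ('1',) = ONE, ('c', ch) = character,
#   ('+', r1, r2), ('.', r1, r2), ('*', r)

def nullable(r):
    t = r[0]
    if t == '1' or t == '*':
        return True
    if t == '+':
        return nullable(r[1]) or nullable(r[2])
    if t == '.':
        return nullable(r[1]) and nullable(r[2])
    return False                          # ZERO and characters

def der(r, c):
    t = r[0]
    if t == '0' or t == '1':
        return ('0',)
    if t == 'c':
        return ('1',) if r[1] == c else ('0',)
    if t == '+':
        return ('+', der(r[1], c), der(r[2], c))
    if t == '.':
        d = ('.', der(r[1], c), r[2])
        return ('+', d, der(r[2], c)) if nullable(r[1]) else d
    if t == '*':
        return ('.', der(r[1], c), r)

def ders(r, s):
    """Derivative w.r.t. a string: r \\ (c::s) = (r \\ c) \\ s."""
    for c in s:
        r = der(r, c)
    return r

def match(s, r):
    """match s r = nullable(r \\ s)."""
    return nullable(ders(r, s))

# the evil regular expression (a*)* . b from the introduction
evil = ('.', ('*', ('*', ('c', 'a'))), ('c', 'b'))
print(match('aaab', evil), match('aaaa', evil))   # → True False
```

Note that, unlike the backtracking engines discussed above, this matcher processes each character of the string exactly once (although without simplification the derivatives themselves can grow).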
       
\noindent
This algorithm can be pictured as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with  a regular expression  $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can  be
relatively  easily shown that this matcher is correct  (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

       
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is being matched by a regular
expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:

       
\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}

       
\noindent
No value  corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
$\Char$ to the character regular expression; $\Seq$ to the sequence
regular expression and so on. The idea of values is to encode a kind of
lexical value for how the sub-parts of a regular expression match the
sub-parts of a string. To see this, suppose a \emph{flatten} operation,
written $|v|$ for values. We can use this function to extract the
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
flatten, we can describe how values encode lexical values: $\Seq\,v_1\,
v_2$ encodes a tree with two children nodes that tells how the string
$|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches
the substring $|v_1|$ and, respectively, $r_2$ matches the substring
$|v_2|$. Exactly how these two are matched is contained in the children
nodes $v_1$ and $v_2$ of the parent $\textit{Seq}$. 

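\noindent The flatten operation can be sketched directly in Python (values as tagged tuples; the representation is our own choice, not from the text):

```python
# Values as tagged tuples, mirroring the grammar of values above:
#   ('Empty',), ('Char', c), ('Seq', v1, v2),
#   ('Left', v), ('Right', v), ('Stars', [v1, ..., vn])

def flatten(v):
    """The flatten operation |v|: the underlying string of a value."""
    t = v[0]
    if t == 'Empty':
        return ''
    if t == 'Char':
        return v[1]
    if t == 'Seq':
        return flatten(v[1]) + flatten(v[2])
    if t in ('Left', 'Right'):
        return flatten(v[1])
    if t == 'Stars':
        return ''.join(flatten(x) for x in v[1])

# |Seq (Char x) (Char y)| is the string "xy"
print(flatten(('Seq', ('Char', 'x'), ('Char', 'y'))))   # → xy
```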
       
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative of the right branch. This
suggests to record this matching as

\begin{center}
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
\end{center}

\noindent
where $\Stars \; [\ldots]$ records all the
iterations; and $\Left$, respectively $\Right$, which
alternative is used. The value for
matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
would look as follows

\begin{center}
$\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
\end{center}

\noindent
where $\Stars$ has only a single-element list for the single iteration
and $\Seq$ indicates that $xy$ is matched by a sequence regular
expression.

       
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
is generated in case the regular expression matches  the string. 
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}

       
\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
to the characters $c_0$, $c_1$  until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

       
  2136 	\begin{center}
       
  2137 		\begin{tabular}{lcl}
       
  2138 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
  2139 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
  2140 			& \textit{if} $\nullable(r_{1})$\\ 
       
  2141 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
  2142 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
  2143 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
  2144 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
  2145 		\end{tabular}
       
  2146 	\end{center}
       


\noindent There are no cases for $\ZERO$ and $c$, since
these regular expressions cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on for determining which value is calculated.
       

After the $\mkeps$-call, we inject back the characters one by one in
order to build the lexical value $v_i$ for how the regular expression
$r_i$ matches the string $s_i$ ($s_i = c_i \ldots c_{n-1}$) from the
previous lexical value $v_{i+1}$. After injecting back $n$ characters,
we get the lexical value for how $r_0$ matches $s$. For this Sulzmann
and Lu defined a function that reverses the ``chopping off'' of
characters during the derivative phase. The corresponding function is
called \emph{injection}, written $\textit{inj}$; it takes three
arguments: the first one is a regular expression ${r_{i-1}}$, before the
character is chopped off; the second is a character ${c_{i-1}}$, the
character we want to inject; and the third argument is the value
${v_i}$, into which one wants to inject the character (it corresponds to
the regular expression after the character has been chopped off). The
result of this function is a new value. The definition of
$\textit{inj}$ is as follows:
       

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
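To see $\textit{inj}$ and the two phases working together, here is a Python sketch (the same hypothetical tuple encoding as before) that combines $\textit{der}$, $\mkeps$ and $\textit{inj}$ into Sulzmann and Lu's two-phase lexer:

```python
# Two-phase lexer sketch: derivatives forward, injections backward.
# Tuple encodings are my own, mirroring the definitions in the text.

def nullable(r):
    op = r[0]
    if op in ('ZERO', 'CHAR'): return False
    if op in ('ONE', 'STAR'):  return True
    if op == 'ALT': return nullable(r[1]) or nullable(r[2])
    if op == 'SEQ': return nullable(r[1]) and nullable(r[2])

def mkeps(r):
    op = r[0]
    if op == 'ONE':  return ('Empty',)
    if op == 'ALT':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if op == 'SEQ':  return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if op == 'STAR': return ('Stars', [])

def der(c, r):
    op = r[0]
    if op in ('ZERO', 'ONE'): return ('ZERO',)
    if op == 'CHAR': return ('ONE',) if r[1] == c else ('ZERO',)
    if op == 'ALT':  return ('ALT', der(c, r[1]), der(c, r[2]))
    if op == 'SEQ':
        if nullable(r[1]):  # the splitting clause
            return ('ALT', ('SEQ', der(c, r[1]), r[2]), der(c, r[2]))
        return ('SEQ', der(c, r[1]), r[2])
    if op == 'STAR': return ('SEQ', der(c, r[1]), r)

def inj(r, c, v):
    op = r[0]
    if op == 'CHAR': return ('Char', c)
    if op == 'ALT':
        if v[0] == 'Left':  return ('Left', inj(r[1], c, v[1]))
        return ('Right', inj(r[2], c, v[1]))
    if op == 'SEQ':
        if v[0] == 'Seq':  return ('Seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left':  # v = Left(Seq(v1, v2))
            return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
        return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))  # v = Right(v')
    if op == 'STAR':  # v = Seq(v', Stars vs)
        return ('Stars', [inj(r[1], c, v[1])] + v[2][1])

def lex(r, s):
    """Phase 1: derivatives; Phase 2: mkeps followed by injections."""
    rs = [r]
    for c in s:
        rs.append(der(c, rs[-1]))
    if not nullable(rs[-1]):
        return None
    v = mkeps(rs[-1])
    for i in reversed(range(len(s))):
        v = inj(rs[i], s[i], v)
    return v
```

For example, `lex(('SEQ', ('CHAR', 'a'), ('CHAR', 'b')), "ab")` produces the value `('Seq', ('Char', 'a'), ('Char', 'b'))`.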
       

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values. To understand this definition better, consider
the situation when we build the derivative of the regular expression
$r_{i-1}$. For this we chop off a character from $r_{i-1}$ to form
$r_i$. This leaves a ``hole'' in $r_i$ and its corresponding value
$v_i$. To calculate $v_{i-1}$, we need to locate where that hole is and
fill it. We can find this location by comparing $r_{i-1}$ and $v_i$. For
instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$ is of
shape $\Left(\Seq(v_1,v_2))$, we know immediately that
%
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]

\noindent
because if $r_a$ were not nullable, then
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\]

\noindent
and the value $v_i$ would be of shape $\Seq(\ldots)$, contradicting the
fact that $v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since
$v_i$ is of shape $\Left(\ldots)$ instead of $\Right(\ldots)$, we know
that the left branch of
\[ (r_a \cdot r_b)\backslash c =
\underline{(r_a\backslash c) \cdot r_b} \,+\, r_b\backslash c\]
(underlined) is taken instead of the right one. This means $c$ is
chopped off from $r_a$ rather than $r_b$. We have therefore found out
that the hole will be in $r_a$. So we recursively call
$\inj\,r_a\,c\,v_1$ to fill that hole in $v_1$. After injection, the
value $v_{i-1}$ for $r_{i-1} = r_a \cdot r_b$ should be
$\Seq\,(\inj\,r_a\,c\,v_1)\,v_2$. Other clauses can be understood in a
similar way.
       

%\comment{Other word: insight?}
The following example gives an insight into $\textit{inj}$'s effect and
how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it
against the string $abc$ (when $abc$ is written as a regular expression,
the standard way of expressing it is $a \cdot (b \cdot c)$, but we
usually omit the parentheses and dots here for better readability). This
algorithm returns a POSIX value, which means it will produce the longest
matching. Consequently, it matches the string $abc$ in one star
iteration, using the longest alternative $abc$ in the sub-expression (we
shall use $r$ to denote this sub-expression for conciseness):

\[((((a+b)+ab)+c)+\underbrace{abc}_r)\] 
       

\noindent
Before $\textit{inj}$ is called, our lexer first builds derivatives
using the string $abc$ (we simplified some regular expressions like
$\ZERO \cdot b$ to $\ZERO$ for conciseness; we also omit parentheses if
they are clear from the context):

%Similarly, we allow
%$\textit{ALT}$ to take a list of regular expressions as an argument
%instead of just 2 operands to reduce the nested depth of
%$\textit{ALT}$

       
\begin{center}
\begin{tabular}{lcl}
$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
      & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$\\
      & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^*) + $\\ 
      &                              & $\phantom{r_3 = (} ((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* )$
\end{tabular}
\end{center}
       

\noindent
Since $r_3$ is nullable, we can call $\textit{mkeps}$ to construct a
lexical value for how $r_3$ matched the empty string. This function
gives the following value $v_3$:

\begin{center}
$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
\end{center}

\noindent
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of
$r_3$ (underlined):
       

\begin{center}
	\begin{tabular}{l@{\hspace{2mm}}l}
    & $\big(\underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} 
    \;+\; (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*\big)$ \smallskip\\
    $+$ & $\big((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*
    \;+\; (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* \big)$
  	\end{tabular}
 \end{center}
       

\noindent
Note that the leftmost location of the term $(\ZERO+\ZERO+\ZERO + \ZERO
+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*$ (which corresponds to the
initial sub-match $abc$) allows $\textit{mkeps}$ to pick it up because
$\textit{mkeps}$ is defined to always choose the left one when it is
nullable. In the case of this example, $abc$ is preferred over $a$ or
$ab$. This $\Left(\Left(\ldots))$ location is generated by two
applications of the splitting clause

\begin{center}
     $(r_1 \cdot r_2)\backslash c  = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c
     \qquad (\textit{when} \; r_1 \; \textit{is nullable}).$
\end{center}

\noindent
By this clause, we put $(r_1 \backslash c) \cdot r_2$ at the
\textit{front} and $r_2 \backslash c$ at the \textit{back}. This allows
$\textit{mkeps}$ to always pick up among two matches the one with the
longer initial sub-match. Removing the outside $\Left(\Left(\ldots))$,
the inside sub-value

\begin{center}
 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
\end{center}
       

\noindent
tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO +
\ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a
sequence of two nullable regular expressions. The first one is an
alternative, in which we take the rightmost branch---whose language
contains the empty string. The second nullable regular expression is a
Kleene star. $\Stars$ tells us how it generates the nullable regular
expression: by zero iterations to form $\ONE$. Now $\textit{inj}$
injects characters back and incrementally builds a lexical value based
on $v_3$. Using the value $v_3$, the character $c$, and the regular
expression $r_2$, we can recover how $r_2$ matched the string $[c]$:
$\textit{inj} \; r_2 \; c \; v_3$ gives us

\begin{center}
 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
\end{center}
       

\noindent
which tells us how $r_2$ matched $[c]$. After this we inject back the
character $b$, and get

\begin{center}
$v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$
\end{center}

\noindent
for how

\begin{center}
$r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
\end{center}

\noindent
matched the string $bc$ before it split into two substrings. Finally,
after injecting the character $a$ back into $v_1$, we get the lexical
value tree

\begin{center}
$v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
\end{center}

\noindent
for how $r^*$ matched $abc$. This completes the algorithm.

       
%We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
Readers might have noticed that the lexical value information is
actually already available when doing derivatives. For example,
immediately after the operation $\backslash a$ we know that if we want
to match a string that starts with $a$, we can take the initial match to
be

\begin{center}
\begin{enumerate}
    \item[1)] just $a$ or
    \item[2)] the string $ab$ or 
    \item[3)] the string $abc$.
\end{enumerate}
\end{center}
       

\noindent
In order to differentiate between these choices, we just need to
remember their positions---$a$ is on the left, $ab$ is in the middle,
and $abc$ is on the right. Which of these alternatives is chosen later
does not affect their relative position because the algorithm does not
change this order. If this parsing information can be determined and
does not change because of later derivatives, there is no point in
traversing this information twice. This leads to an optimisation: if we
store the information for lexical values inside the regular expression,
update it when we do a derivative on them, and collect the information
when finished with derivatives by calling $\textit{mkeps}$ to decide
which branch is POSIX, then we can generate the lexical value in one
pass, instead of doing the $n$ injections. This leads to Sulzmann and
Lu's novel idea of using bitcodes in derivatives.
       

In the next section, we shall focus on the bitcoded algorithm and the
process of simplification of regular expressions. This is needed in
order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann and
Lu's, algorithms. This is where the PhD-project aims to advance the
state-of-the-art.
       


\section{Simplification of Regular Expressions}

Using bitcodes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielson for
efficient regular expression lexing using DFAs~\cite{nielson11bcre}.
Sulzmann and Lu took this idea of bitcodes a step further by integrating
bitcodes into derivatives. The reason why we want to use bitcodes in
this project is that we want to introduce more aggressive simplification
rules in order to keep the size of derivatives small throughout. This is
because the main drawback of building successive derivatives according
to Brzozowski's definition is that they can grow very quickly in size.
This is mainly due to the fact that the derivative operation often
generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result,
if implemented naively, both algorithms by Brzozowski and by Sulzmann
and Lu are excruciatingly slow. For example, when starting with the
regular expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\textit{der}$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.
       

Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been obtained in
\cite{AusafDyckhoffUrban2016}.
       

Unfortunately, the simplification rules outlined above are not
sufficient to prevent a size explosion in all cases. We believe a
tighter bound can be achieved that prevents an explosion in \emph{all}
cases. Such a tighter bound is suggested by work of Antimirov, who
proved that (partial) derivatives can be bounded by the number of
characters contained in the initial regular expression
\cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
expressions as follows:
       

\begin{center}
\begin{tabular}{lcl}
 $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$\\
 $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\
 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{  \ONE   \}  \; \textit{else} \; \emptyset$ \\ 
  $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \;  r_2$ \\
   $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; \textit{nullable} \; r_1 $\\
     & & $\textit{then} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \}  \cup \textit{pder} \; c \; r_2 \;$\\
     & & $\textit{else} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \} $ \\ 
     $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in \textit{pder} \; c \; r   \}  $ \\  
 \end{tabular}
 \end{center}
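The clauses can be transcribed directly; this Python sketch (my own tuple encoding) also illustrates the contrast with ordinary derivatives on the $(a+aa)^*$ example from above, whose partial-derivative set stays tiny:

```python
# Partial derivatives return *sets* of regular expressions and stay
# small where ordinary derivatives blow up.

def nullable(r):
    op = r[0]
    if op in ('ZERO', 'CHAR'): return False
    if op in ('ONE', 'STAR'):  return True
    if op == 'ALT': return nullable(r[1]) or nullable(r[2])
    if op == 'SEQ': return nullable(r[1]) and nullable(r[2])

def pder(c, r):
    op = r[0]
    if op in ('ZERO', 'ONE'): return set()
    if op == 'CHAR': return {('ONE',)} if r[1] == c else set()
    if op == 'ALT':  return pder(c, r[1]) | pder(c, r[2])
    if op == 'SEQ':
        left = {('SEQ', p, r[2]) for p in pder(c, r[1])}
        return left | pder(c, r[2]) if nullable(r[1]) else left
    if op == 'STAR': return {('SEQ', p, r) for p in pder(c, r[1])}

def pders(s, rs):
    """Extend pder to strings, applied to a set of regular expressions."""
    for c in s:
        rs = set().union(*(pder(c, r) for r in rs))
    return rs

a = ('CHAR', 'a')
r = ('STAR', ('ALT', a, ('SEQ', a, a)))   # (a + aa)*
```

After twelve characters `a`, `pders("a" * 12, {r})` has settled to a fixed point of just two expressions, in stark contrast to the more than 8000 tree nodes of the ordinary derivative.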
       

\noindent
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s child expressions or a
concatenation of them. Antimirov proved a tight bound on the sum of the
sizes of \emph{all} partial derivatives, no matter what the string looks
like. Roughly speaking, this size sum is at most cubic in the size of
the regular expression.
       

If we want the size of derivatives in Sulzmann and Lu's algorithm to
stay below this bound, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b \cdot c$ can be opened up to get $a\cdot c + b \cdot
c + b \cdot c$, and then simplified to just $a \cdot c + b \cdot c$.
Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to
just $a^*+a+\ONE$. Adding these more aggressive simplification rules
helps us to achieve the same size bound as that of the partial
derivatives.
       

In order to implement the idea of ``spilling out alternatives'' and to
make them compatible with the $\textit{inj}$-mechanism, we use
\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:

%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

%We believe, and have generated test
%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.

       

\begin{center}
		$b ::=   0 \mid  1 \qquad
bs ::= [] \mid b:bs    
$
\end{center}
       

\noindent
The $0$ and $1$ are arbitrary names for the bits and are not in bold in
order to avoid confusion with the regular expressions $\ZERO$ and
$\ONE$. Bitcodes (or bit-lists) can be used to encode values (or
incomplete values) in a compact form. This can be straightforwardly seen
in the following coding function from values to bitcodes:
       

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}
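The coding function is easy to transcribe; a Python sketch (tuple-encoded values as before) makes the lossiness visible:

```python
# code: values -> bitcodes. Characters are thrown away; only the
# Left/Right choices and the star boundaries are recorded.

def code(v):
    t = v[0]
    if t in ('Empty', 'Char'): return []
    if t == 'Left':  return [0] + code(v[1])
    if t == 'Right': return [1] + code(v[1])
    if t == 'Seq':   return code(v[1]) + code(v[2])
    if t == 'Stars':
        bs = []
        for u in v[1]:
            bs += [1] + code(u)   # 1 opens one more iteration
        return bs + [0]           # 0 closes the star
```

For instance, the value $v_0$ from the running example, `('Stars', [('Right', ('Seq', ('Char','a'), ('Seq', ('Char','b'), ('Char','c'))))])`, codes to just `[1, 1, 0]`.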
       

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, the start point of a non-empty
star iteration into $1$, and the border where a local star terminates
into $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
       


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\                           
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,(v\!::\!vs), bs_2)$\bigskip\\
  
  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$                       
\end{tabular}    
\end{center}    
%\end{definition}
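Guided by the regular expression, decoding reverses the coding function. Here is a Python sketch with a round-trip check ($\textit{code}$ is repeated so the snippet is self-contained; encodings are mine):

```python
# decode uses the regular expression to re-interpret a bitcode as a
# value; decode1 returns the value plus the unconsumed bits.

def code(v):
    t = v[0]
    if t in ('Empty', 'Char'): return []
    if t == 'Left':  return [0] + code(v[1])
    if t == 'Right': return [1] + code(v[1])
    if t == 'Seq':   return code(v[1]) + code(v[2])
    if t == 'Stars':
        bs = []
        for u in v[1]:
            bs += [1] + code(u)
        return bs + [0]

def decode1(bs, r):
    op = r[0]
    if op == 'ONE':  return ('Empty',), bs
    if op == 'CHAR': return ('Char', r[1]), bs
    if op == 'ALT':
        if bs[0] == 0:
            v, rest = decode1(bs[1:], r[1]); return ('Left', v), rest
        v, rest = decode1(bs[1:], r[2]); return ('Right', v), rest
    if op == 'SEQ':
        v1, bs1 = decode1(bs, r[1])
        v2, bs2 = decode1(bs1, r[2])
        return ('Seq', v1, v2), bs2
    if op == 'STAR':
        if bs[0] == 0: return ('Stars', []), bs[1:]
        v, bs1 = decode1(bs[1:], r[1])
        stars, bs2 = decode1(bs1, r)      # decode the remaining iterations
        return ('Stars', [v] + stars[1]), bs2

def decode(bs, r):
    """Succeed only if the whole bitcode is consumed (Some v / None)."""
    v, rest = decode1(bs, r)
    return v if rest == [] else None
```

A round trip `decode(code(v), r)` recovers any value `v` for `r`, confirming that the regular expression supplies exactly the information the bitcode lacks.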
       

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\oplus \textit{as}$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}    
\end{center}  
%(in \textit{ALTS})

\noindent
where $\textit{bs}$ stands for bitcodes, $a$ for annotated regular
expressions and $\textit{as}$ for a list of annotated regular
expressions. The alternative constructor ($\oplus$) has been generalised
to accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about the (POSIX)
value that should be generated by the Sulzmann and Lu algorithm.
       


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
is defined as follows:
       

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}\textit{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\oplus\,[(\textit{fuse}\,[0]\,r_1^\uparrow),\,
  (\textit{fuse}\,[1]\,r_2^\uparrow)]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}    
\end{center}    
%\end{definition}
       

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:
       

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
  $\textit{fuse}\;bs\,(_{bs'}\textit{ONE})$ & $\dn$ &
     $_{bs\,@\,bs'}\textit{ONE}$\\
  $\textit{fuse}\;bs\,(_{bs'}\textit{CHAR}\,c)$ & $\dn$ &
     $_{bs\,@\,bs'}\textit{CHAR}\,c$\\
  $\textit{fuse}\;bs\,(_{bs'}\textit{ALTS}\,\textit{as})$ & $\dn$ &
     $_{bs\,@\,bs'}\textit{ALTS}\,\textit{as}$\\
  $\textit{fuse}\;bs\,(_{bs'}\textit{SEQ}\,a_1\,a_2)$ & $\dn$ &
     $_{bs\,@\,bs'}\textit{SEQ}\,a_1\,a_2$\\
  $\textit{fuse}\;bs\,(_{bs'}\textit{STAR}\,a)$ & $\dn$ &
     $_{bs\,@\,bs'}\textit{STAR}\,a$
\end{tabular}    
\end{center}  
       

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:
       

 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
      $_{bs}\textit{SEQ}\;(\textit{fuse}\,[1]\,(a\,\backslash c))\,
       (_{[]}\textit{STAR}\,a)$
\end{tabular}    
\end{center}    
%\end{definition}
       

\noindent
For instance, when we unfold $_{bs}\textit{STAR}\;a$ into a sequence, we
need to attach an additional bit $1$ to the front of $a \backslash c$ to
indicate that there is one more star iteration. The $\textit{SEQ}$
clause is more subtle in the case where $a_1$ is $\textit{bnullable}$
(here \textit{bnullable} is exactly the same as $\textit{nullable}$,
except that it applies to annotated regular expressions, so we omit the
definition). Assume that $\textit{bmkeps}$ correctly extracts the
bitcode for how $a_1$ matches the string prior to the character $c$
(more on this later); then the right branch of the $\textit{ALTS}$,
which is $\textit{fuse}\;(\textit{bmkeps}\,a_1)\;(a_2 \backslash c)$,
will collapse the regular expression $a_1$ (as it has already been fully
matched) and store the parsing information at the head of the regular
expression $a_2 \backslash c$ by fusing to it. The bitsequence
$\textit{bs}$, which was initially attached to the head of the
$\textit{SEQ}$, has now been elevated to the top-level of the
$\textit{ALTS}$, as this information will be needed whichever way the
$\textit{SEQ}$ is matched---no matter whether $c$ belongs to $a_1$ or
$a_2$. After building these derivatives and maintaining all the lexing
information, we complete the lexing by collecting the bitcodes using a
generalised version of the $\textit{mkeps}$ function for annotated
regular expressions, called $\textit{bmkeps}$:
       


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\textit{ONE})$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\textit{ALTS}\;(a::\textit{as}))$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\textit{ALTS}\;\textit{as})$\\
  $\textit{bmkeps}\,(_{bs}\textit{SEQ}\;a_1\,a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}\textit{STAR}\;a)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}    
\end{center}    
%\end{definition}
       

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
       

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
       

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for
un-annotated regular expressions).
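To make the data flow concrete, the definitions above can be transliterated
into a small executable sketch, given here in Python rather than in our Scala
code. The tuple encoding of annotated regular expressions and the string
labels \texttt{"Z"}/\texttt{"S"} for bits are choices of this sketch, and
$\textit{decode}$ is omitted, so $\textit{blexer}$ returns the raw bitcode:

```python
# Sketch: annotated regular expressions as tuples
#   ("ZERO",), ("ONE", bs), ("CHAR", bs, c), ("ALTS", bs, [a1, a2, ...]),
#   ("SEQ", bs, a1, a2), ("STAR", bs, a)
# where bs is a list of bit labels "Z"/"S".

def fuse(bs, a):
    """Prepend the bitsequence bs to the annotation of a."""
    if a[0] == "ZERO":
        return a                      # ZERO carries no bits
    return (a[0], bs + a[1]) + a[2:]

def bnullable(a):
    tag = a[0]
    if tag in ("ONE", "STAR"):
        return True
    if tag == "ALTS":
        return any(bnullable(x) for x in a[2])
    if tag == "SEQ":
        return bnullable(a[2]) and bnullable(a[3])
    return False                      # ZERO and CHAR

def bmkeps(a):
    """Collect the bitcode along the POSIX path of a bnullable regex."""
    tag = a[0]
    if tag == "ONE":
        return a[1]
    if tag == "ALTS":
        first, rest = a[2][0], a[2][1:]
        if bnullable(first):
            return a[1] + bmkeps(first)
        return bmkeps(("ALTS", a[1], rest))
    if tag == "SEQ":
        return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    return a[1] + ["S"]               # STAR: mark the end of the iterations

def bder(a, c):
    """Derivative of an annotated regular expression w.r.t. character c."""
    tag = a[0]
    if tag in ("ZERO", "ONE"):
        return ("ZERO",)
    if tag == "CHAR":
        return ("ONE", a[1]) if a[2] == c else ("ZERO",)
    if tag == "ALTS":
        return ("ALTS", a[1], [bder(x, c) for x in a[2]])
    if tag == "SEQ":
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):
            return ("ALTS", bs, [("SEQ", [], bder(a1, c), a2),
                                 fuse(bmkeps(a1), bder(a2, c))])
        return ("SEQ", bs, bder(a1, c), a2)
    # STAR: one more iteration (marked "Z"), followed by the star itself
    return ("SEQ", a[1], fuse(["Z"], bder(a[2], c)), ("STAR", [], a[2]))

def blexer(a, s):
    """Derive by every character of s, then collect the bits (no decode)."""
    for c in s:
        a = bder(a, c)
    return bmkeps(a) if bnullable(a) else None

astar = ("STAR", [], ("CHAR", [], "a"))
print(blexer(astar, "aa"))            # ['Z', 'Z', 'S']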
       

The main point of the bitcodes and annotated regular expressions is that
we can apply rather aggressive (in terms of size) simplification rules
in order to keep derivatives small. We have developed such
``aggressive'' simplification rules and generated test data that show
that the expected bound can be achieved. Obviously we could only
partially cover the search space as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the \textit{ALTS} constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
       

\begin{center}
  \begin{tabular}{@{}lcl@{}}

  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $(\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match}$ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
   &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\

  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}(\textit{flatten}(\textit{map}\;\textit{simp}\;as)) \; \textit{match}$ \\
  &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;bs\;a$ \\
   &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\

   $\textit{simp} \; a$ & $\dn$ & $a \qquad \textit{otherwise}$
\end{tabular}
\end{center}
       

\noindent
The simplification function does a pattern match on the regular expression.
When it detects that the regular expression is an alternative or a sequence,
it first simplifies its children recursively and then checks whether one of
the children has turned into $\ZERO$ or $\ONE$, which might trigger further
simplification at the current level. The most involved part is the
$\textit{ALTS}$ clause, where we use two auxiliary functions
$\textit{flatten}$ and $\textit{distinct}$ to open up nested $\textit{ALTS}$
and reduce as many duplicates as possible. Function $\textit{distinct}$
keeps only the first occurrence of a regular expression and removes all later
duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}.
Its recursive definition is given below:
       

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; ((\textit{ALTS}\;bs\,as) :: as')$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as'$ \\
  $\textit{flatten} \; (\textit{ZERO} :: as')$ & $\dn$ & $\textit{flatten} \; as'$ \\
  $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
\end{tabular}
\end{center}
       

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming
flatten function, except that it also removes $\ZERO$s. Or in terms of
regular expressions, it removes parentheses, for example changing $a+(b+c)$
into $a+b+c$.

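As a concrete companion to the definitions of $\textit{simp}$,
$\textit{flatten}$ and $\textit{distinct}$, here is a Python sketch over the
same tuple encoding of annotated regular expressions as in the earlier
sketch. Note two choices of this sketch that are not fixed by the definitions
above: this $\textit{distinct}$ compares regular expressions together with
their bits, and the $\ONE$-on-the-right case of the $\textit{SEQ}$ clause is
omitted:

```python
# Sketch: annotated regular expressions as tuples ("ZERO",), ("ONE", bs),
# ("CHAR", bs, c), ("ALTS", bs, [..]), ("SEQ", bs, a1, a2), ("STAR", bs, a).

def fuse(bs, a):
    """Prepend the bitsequence bs to the annotation of a."""
    if a[0] == "ZERO":
        return a
    return (a[0], bs + a[1]) + a[2:]

def flatten(xs):
    """Open up nested ALTS (fusing their bits) and drop ZEROs."""
    if not xs:
        return []
    a, rest = xs[0], xs[1:]
    if a[0] == "ALTS":
        return [fuse(a[1], x) for x in a[2]] + flatten(rest)
    if a[0] == "ZERO":
        return flatten(rest)
    return [a] + flatten(rest)

def distinct(xs):
    """Keep only the first occurrence of each alternative."""
    seen, out = [], []
    for a in xs:
        if a not in seen:
            seen.append(a)
            out.append(a)
    return out

def simp(a):
    tag = a[0]
    if tag == "SEQ":
        bs, s1, s2 = a[1], simp(a[2]), simp(a[3])
        if s1[0] == "ZERO" or s2[0] == "ZERO":
            return ("ZERO",)             # r . ZERO = ZERO . r = ZERO
        if s1[0] == "ONE":               # ONE . r = r; the ONE's own bits
            return fuse(bs + s1[1], s2)  # are fused too, so none are lost
        return ("SEQ", bs, s1, s2)
    if tag == "ALTS":
        xs = distinct(flatten([simp(x) for x in a[2]]))
        if not xs:
            return ("ZERO",)
        if len(xs) == 1:                 # a single alternative remains
            return fuse(a[1], xs[0])
        return ("ALTS", a[1], xs)
    return a                             # otherwise unchanged

# (a + (ZERO + a)) simplifies to just a:
r = ("ALTS", [], [("CHAR", [], "a"),
                  ("ALTS", [], [("ZERO",), ("CHAR", [], "a")])])
print(simp(r))                           # ('CHAR', [], 'a')
```

An implementation may well let $\textit{distinct}$ compare the bit-erased
regular expressions instead, so that duplicates differing only in their bits
are also pruned; the version above is the simplest choice.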
       
Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from the derivative w.r.t.~a character to the derivative
w.r.t.~a string:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
       

\noindent
With this we obtain an optimised version of the algorithm:

 \begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
       

\noindent
This algorithm keeps the regular expression size small. For example, with
this simplification the 8000 nodes of our previous $(a + aa)^*$ example are
reduced to just 6, and the size stays constant no matter how long the input
string is.
       



\section{Current Work}

We are currently engaged in two tasks related to this algorithm. The first
task is to prove that our simplification rules do not affect the POSIX value
that should be generated by the algorithm according to the specification of
a POSIX value, and furthermore to obtain a much tighter bound on the sizes
of derivatives. The result would be that our algorithm is correct and faster
on all inputs. The original blow-up, as observed in JavaScript, Python and
Java, could then not happen in our algorithm. For this proof we use the
theorem prover Isabelle. Once completed, this result will advance the
state-of-the-art: Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014}
about the bitcoded ``incremental parsing method'' (that is the lexing
algorithm outlined in this section):
       

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method in Figure~5 in combination with the simplification steps in
  Figure~6 yields POSIX parse tree [our lexical values]. We have tested this claim
  extensively by using the method in Figure~3 as a reference but yet
  have to work out all proof details.''
\end{quote}
       

\noindent
We would like to settle this correctness claim. It is relatively
straightforward to establish that after one simplification step, the part of
a nullable derivative that corresponds to a POSIX value remains intact and
can still be collected; in other words, we can show that
       

\begin{center}
	$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a) \qquad \textit{provided $a$ is bnullable}$
\end{center}
       

\noindent
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in a
regular expression. The hard part of this proof is to establish that
       

\begin{center}
	$\textit{blexer\_simp}\;r\,s = \textit{blexer}\;r\,s$
\end{center}

\noindent That is, if we take the derivative of a regular expression $r$ and
then simplify it, and repeat this process until we exhaust the string, we get
a regular expression $r''$ ($\textit{LHS}$) that provides the POSIX matching
information, which is exactly the same as the result $r'$ ($\textit{RHS}$) of
the normal derivative algorithm that only takes derivatives repeatedly and
performs no simplification at all. This might seem at first glance very
unintuitive, as $r'$ could be exponentially larger than $r''$, but it can be
explained in the following way: we are pruning away the possible matches that
are not POSIX. Since there could be exponentially many non-POSIX matchings
and only one POSIX matching, it is understandable that our $r''$ can be a lot
smaller, while we can still provide the same POSIX value if there is one.
This is not as straightforward as the previous proposition, as the two
regular expressions $r'$ and $r''$ might have become very different. The
crucial point is to find the $\textit{POSIX}$ information of a regular
expression and how it is modified, augmented and propagated during
simplification, in parallel with the regular expression that has not been
simplified in the subsequent derivative operations. To aid this, we use the
helper function $\textit{retrieve}$ described by Sulzmann and Lu:
       
\begin{center}
\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
  $\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
  $\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
  $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Left\,v)$ & $\dn$ &
     $bs \,@\, \textit{retrieve}\,a\,v$\\
  $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Right\,v)$ & $\dn$ &
  $bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,[]\,as)\,v$\\
  $\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
     $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
  $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ &
     $bs \,@\, [\S]$\\
  $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
  \multicolumn{3}{l}{
     \hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\,
                    \textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}\\
\end{tabular}
\end{center}
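For illustration, here is a Python sketch of $\textit{retrieve}$ over the
tuple encoding of annotated regular expressions used in the earlier sketches,
with values encoded analogously. Both encodings, and the convention that a
value for the last remaining alternative is wrapped in a final
$\textit{Left}$, are choices of this sketch:

```python
# Values as tuples: ("Empty",), ("Char", c), ("Left", v), ("Right", v),
# ("Seq", v1, v2), ("Stars", [vs]); annotated regexes encoded as before.

def retrieve(a, v):
    """Assemble the bitcode for value v along the annotated regex a."""
    tag, vt = a[0], v[0]
    if tag == "ONE" and vt == "Empty":
        return a[1]
    if tag == "CHAR" and vt == "Char":
        return a[1]
    if tag == "ALTS" and vt == "Left":      # v sits in the first alternative
        return a[1] + retrieve(a[2][0], v[1])
    if tag == "ALTS" and vt == "Right":     # v sits in one of the later ones
        return a[1] + retrieve(("ALTS", [], a[2][1:]), v[1])
    if tag == "SEQ" and vt == "Seq":
        return a[1] + retrieve(a[2], v[1]) + retrieve(a[3], v[2])
    if tag == "STAR" and vt == "Stars":
        if not v[1]:
            return a[1] + ["S"]             # end of the star iterations
        w, ws = v[1][0], v[1][1:]
        return (a[1] + ["Z"] + retrieve(a[2], w)
                + retrieve(("STAR", [], a[2]), ("Stars", ws)))
    raise ValueError("value does not fit the regular expression")

# For (a)* and the value Stars [Char a, Char a] we obtain the same
# bitcode that bmkeps produces after deriving by "aa":
astar = ("STAR", [], ("CHAR", [], "a"))
v = ("Stars", [("Char", "a"), ("Char", "a")])
print(retrieve(astar, v))                   # ['Z', 'Z', 'S']
```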
       
This function assembles the bitcode using information from both the
derivative regular expression and the value. Sulzmann and Lu proposed this
function, but did not prove anything about it. Ausaf and Urban used it to
connect the bitcoded algorithm to the older algorithm by the following
equation:

 \begin{center}
	 $\textit{inj}\;r\;c\;v = \textit{decode}\;(\textit{retrieve}\;((r^\uparrow)\backslash c)\;v)\;r$
 \end{center}
       

\noindent
whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf and
Urban also used this fact to prove the correctness of the bitcoded algorithm
without simplification. Our purpose of using it, however, is to establish

\begin{center}
$\textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; (\textit{simp}\,a) \; v'.$
\end{center}
       
The idea is that using $v'$, a simplified version of $v$ that has gone
through the same simplification steps as $\textit{simp}(a)$, we are able to
extract the bitcode that gives the same parsing information as the
unsimplified one. However, we noticed that constructing such a $v'$ from $v$
is not so straightforward. The point is that we might then be able to
finally bridge the gap by proving
       

\begin{center}
$\textit{retrieve} \; (r^\uparrow \backslash s) \; v = \;\textit{retrieve} \;
(\textit{simp}(r^\uparrow) \backslash s) \; v'$
\end{center}

\noindent
and subsequently

\begin{center}
$\textit{retrieve} \; (r^\uparrow \backslash s) \; v\; = \; \textit{retrieve} \;
(r^\uparrow \backslash_{simp} \, s) \; v'$.
\end{center}

\noindent
The $\textit{LHS}$ of the above equation is the bitcode we want. This would
prove that our simplified version of the regular expression still contains
all the bitcodes needed. The task here is to find a way to compute the
correct $v'$.

       
The second task is to speed up the more aggressive simplification. Currently
it is slower than the original naive simplification by Ausaf and Urban (their
naive version can of course ``explode'' in some cases). It is therefore not
surprising that the speed is also much slower than that of the regular
expression engines in popular programming languages, such as Java and Python,
on most inputs on which they behave linearly. For example, simply rewriting
the example regular expression from the beginning of this report,
$(a^*)^*\,b$, into $a^*\,b$ eliminates the ambiguity in the matching and
makes the matching time linear with respect to the input string size. This
allows the DFA approach to become blindingly fast and to dwarf the speed of
our current implementation. For example, here is a comparison of the Java
regex engine and our implementation on this example.
       

\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n*1000$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=9,
    scaled ticks=true,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Bitcoded Algorithm},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n*1000$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=9,
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $a^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}
       


The Java regex engine can match strings of thousands of characters in a few
milliseconds, whereas our current algorithm gets excruciatingly slow on
inputs of this size. The running time is linear in theory, but this does not
appear to be the case in the actual implementation. So it needs to be
explored how to make our algorithm faster on all inputs. It could be the
recursive calls needed to manipulate the bits that are causing the slowdown.
A possible solution is to rewrite the recursive functions in tail-recursive
form. Another possibility would be to explore again the connection to DFAs
to speed up the algorithm on subcalls that are small enough. This is very
much work in progress.
       
\section{Conclusion}

In this PhD-project we are interested in fast algorithms for regular
expression matching. While this seems to be a ``settled'' area, in fact
interesting research questions are popping up as soon as one steps outside
the classic automata theory (for example in terms of what kind of regular
expressions are supported). The reason why it is interesting for us to look
at the derivative approach introduced by Brzozowski for regular expression
matching, and then much further developed by Sulzmann and Lu, is that
derivatives can elegantly deal with some of the regular expressions that are
of interest in ``real life''. This includes the not-regular expression,
written $\neg\,r$ (that is all strings that are not recognised by $r$), but
also bounded regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$.
There is also hope that the derivatives can provide another angle for how to
deal more efficiently with back-references, which are one of the reasons why
regular expression engines in JavaScript, Python and Java choose not to
implement the classic automata approach of transforming regular expressions
into NFAs and then DFAs---because we simply do not know how such
back-references can be represented by DFAs. We also plan to implement the
bitcoded algorithm in some imperative language like C to see if the
inefficiency of the Scala implementation is language-specific. To make this
research more comprehensive we also plan to contrast our (faster) version of
the bitcoded algorithm with the Symbolic Regex Matcher, RE2, the Rust Regex
Engine, and the static analysis approach, by implementing them in the same
language and then comparing their performance.

\bibliographystyle{plain}
\bibliography{root}


\end{document}