2005 \end{axis} |
2005 \end{axis} |
2006 \end{tikzpicture}\\ |
2006 \end{tikzpicture}\\ |
2007 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.} |
2007 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.} |
2008 \end{tabular} |
2008 \end{tabular} |
2009 \end{center} |
2009 \end{center} |
2010 |
|
2011 |
|
2012 |
|
2013 |
|
2014 |
|
2015 \noindent |
2010 \noindent |
2016 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
2011 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
2017 original size. |
2012 original size. |
2018 This suggests a link towrads "partial derivatives" |
2013 We will discuss improvements to this bound in the next chapter. |
2019 introduced by Antimirov \cite{Antimirov95}. |
|
2020 |
|
2021 \section{Antimirov's partial derivatives} |
|
2022 The idea behind Antimirov's partial derivatives |
|
2023 is to do derivatives in a similar way as suggested by Brzozowski, |
|
2024 but maintain a set of regular expressions instead of a single one: |
|
2025 |
|
2026 %TODO: antimirov proposition 3.1, needs completion |
|
2027 \begin{center} |
|
2028 \begin{tabular}{ccc} |
|
2029 $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ |
|
2030 $\partial_x(\ONE)$ & $=$ & $\phi$ |
|
2031 \end{tabular} |
|
2032 \end{center} |
|
2033 |
|
2034 Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together |
|
2035 using the alternatives constructor, Antimirov cleverly chose to put them into |
|
2036 a set instead. This breaks the terms in a derivative regular expression up, |
|
2037 allowing us to understand what are the "atomic" components of it. |
|
2038 For example, To compute what regular expression $x^*(xx + y)^*$'s |
|
2039 derivative against $x$ is made of, one can do a partial derivative |
|
2040 of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ |
|
2041 from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. |
|
2042 To get all the "atomic" components of a regular expression's possible derivatives, |
|
2043 there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes |
|
2044 whatever character is available at the head of the string inside the language of a |
|
2045 regular expression, and gives back the character and the derivative regular expression |
|
2046 as a pair (which he called "monomial"): |
|
2047 \begin{center} |
|
2048 \begin{tabular}{ccc} |
|
2049 $\lf(\ONE)$ & $=$ & $\phi$\\ |
|
2050 $\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ |
|
2051 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
2052 $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ |
|
2053 \end{tabular} |
|
2054 \end{center} |
|
2055 %TODO: completion |
|
2056 |
|
2057 There is a slight difference in the last three clauses compared |
|
2058 with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular |
|
2059 expression $r$ with every element inside $\textit{rset}$ to create a set of |
|
2060 sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates |
|
2061 on a set of monomials (which Antimirov called "linear form") and a regular |
|
2062 expression, and returns a linear form: |
|
2063 \begin{center} |
|
2064 \begin{tabular}{ccc} |
|
2065 $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ |
|
2066 $l \bigodot (\ONE)$ & $=$ & $l$\\ |
|
2067 $\phi \bigodot t$ & $=$ & $\phi$\\ |
|
2068 $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ |
|
2069 $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ |
|
2070 $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ |
|
2071 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
2072 $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ |
|
2073 \end{tabular} |
|
2074 \end{center} |
|
2075 %TODO: completion |
|
2076 |
|
2077 Some degree of simplification is applied when doing $\bigodot$, for example, |
|
2078 $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, |
|
2079 and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and |
|
2080 $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, |
|
2081 and so on. |
|
2082 |
|
2083 With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regex $r$ with |
|
2084 an iterative procedure: |
|
2085 \begin{center} |
|
2086 \begin{tabular}{llll} |
|
2087 $\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ |
|
2088 & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ |
|
2089 & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ |
|
2090 $\partial_{UNIV}(r)$ & $=$ & $\PD$ & |
|
2091 \end{tabular} |
|
2092 \end{center} |
|
2093 |
|
2094 |
|
2095 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
|
2096 |
|
2097 |
|
2098 However, if we introduce them in our |
|
2099 setting we would lose the POSIX property of our calculated values. |
|
2100 A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. |
|
2101 If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer |
|
2102 would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of |
|
2103 what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information |
|
2104 in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ |
|
2105 occurs, and apply them in the right order once we get |
|
2106 a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. |
|
2107 This is unlike the simplification we had before, where the rewriting rules |
|
2108 such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. |
|
2109 We will discuss better |
|
2110 bounds in the last section of this chapter.\\[-6.5mm] |
|
2111 |
|
2112 |
2014 |
2113 |
2015 |
2114 |
2016 |
2115 %---------------------------------------------------------------------------------------- |
2017 %---------------------------------------------------------------------------------------- |
2116 % SECTION ?? |
2018 % SECTION ?? |