|
1 \documentclass[runningheads]{lipics-v2021} |
|
2 \usepackage{times} |
|
3 \usepackage{isabelle} |
|
4 \usepackage{isabellesym} |
|
5 \usepackage{amsmath} |
|
6 \usepackage{amssymb} |
|
7 \usepackage{mathpartir} |
|
8 \usepackage{tikz} |
|
9 \usepackage{pgf} |
|
10 \usetikzlibrary{positioning} |
|
11 %\usepackage{pdfsetup} |
|
12 \usepackage{stmaryrd} |
|
13 %\usepackage{url} |
|
14 %\usepackage{color} |
|
15 %\usepackage[safe]{tipa} |
|
16 %\usepackage[sc]{mathpazo} |
|
17 %\usepackage{fontspec} |
|
18 %\setmainfont[Ligatures=TeX]{Palatino Linotype} |
|
19 |
|
20 |
|
21 |
|
22 |
|
23 \urlstyle{rm} |
|
24 \isabellestyle{it} |
|
25 \renewcommand{\isastyleminor}{\it}% |
|
26 \renewcommand{\isastyle}{\normalsize\it}% |
|
27 |
|
28 |
|
29 \def\dn{\,\stackrel{\mbox{\scriptsize def}}{=}\,} |
|
30 \renewcommand{\isasymequiv}{$\dn$} |
|
31 \renewcommand{\isasymemptyset}{$\varnothing$} |
|
32 \renewcommand{\isacharunderscore}{\mbox{$\_\!\_$}} |
|
33 \renewcommand{\isasymiota}{\makebox[0mm]{${}^{\prime}$}} |
|
34 \renewcommand{\isasymin}{\ensuremath{\,\in\,}} |
|
35 |
|
36 \addtolength{\oddsidemargin}{-1.5mm} |
|
37 \addtolength{\evensidemargin}{-1.5mm} |
|
38 \addtolength{\textwidth}{4mm} |
|
39 \addtolength{\textheight}{1.5mm} |
|
40 |
|
41 \def\lexer{\mathit{lexer}} |
|
42 \def\mkeps{\mathit{mkeps}} |
|
43 \def\inj{\mathit{inj}} |
|
44 \def\Empty{\mathit{Empty}} |
|
45 \def\Left{\mathit{Left}} |
|
46 \def\Right{\mathit{Right}} |
|
47 \def\Stars{\mathit{Stars}} |
|
48 \def\Char{\mathit{Char}} |
|
49 \def\Seq{\mathit{Seq}} |
|
50 \def\Der{\mathit{Der}} |
|
51 \def\nullable{\mathit{nullable}} |
|
52 \def\Z{\mathit{Z}} |
|
53 \def\S{\mathit{S}} |
|
54 \newcommand{\ZERO}{\mbox{\bf 0}} |
|
55 \newcommand{\ONE}{\mbox{\bf 1}} |
|
56 \def\rs{\mathit{rs}} |
|
57 |
|
58 \def\Brz{Brzozowski} |
|
59 \def\der{\backslash} |
|
60 \newtheorem{falsehood}{Falsehood} |
|
61 \newtheorem{conject}{Conjecture} |
|
62 |
|
63 \bibliographystyle{plainurl} |
|
64 |
|
65 \title{{POSIX} {L}exing with {B}itcoded {D}erivatives} |
|
66 \titlerunning{POSIX Lexing with Bitcoded Derivatives} |
|
67 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{} |
|
68 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{} |
|
69 \authorrunning{C.~Tan and C.~Urban} |
|
70 \keywords{POSIX matching and lexing, derivatives of regular expressions, Isabelle/HOL} |
|
71 \category{} |
|
72 \ccsdesc[100]{Design and analysis of algorithms} |
|
73 \ccsdesc[100]{Formal languages and automata theory} |
|
74 \Copyright{\mbox{}} |
|
75 \renewcommand{\DOIPrefix}{} |
|
76 \nolinenumbers |
|
77 |
|
78 |
|
79 \begin{document} |
|
80 \maketitle |
|
81 |
|
82 \begin{abstract} |
|
83 Sulzmann and Lu describe a lexing algorithm that calculates |
|
84 Brzozowski derivatives using bitcodes annotated to regular |
|
85 expressions. Their algorithm generates POSIX values which encode |
|
86 the information of \emph{how} a regular expression matches a |
|
87 string---that is, which part of the string is matched by which part |
|
88 of the regular expression. This information is needed in the |
|
89 context of lexing in order to extract and to classify tokens. |
|
90 The purpose of the bitcodes is to generate POSIX values incrementally while |
|
91 derivatives are calculated. They also help with designing |
|
92 an ``aggressive'' simplification function that keeps the size of |
|
93 derivatives finitely bounded. Without simplification the size of some derivatives can grow |
|
94 arbitrarily big, resulting in an extremely slow lexing algorithm. In this |
|
95 paper we describe a variant of Sulzmann and Lu's algorithm: Our |
|
96 variant is a recursive functional program, whereas Sulzmann |
|
97 and Lu's version involves a fixpoint construction. We \textit{(i)} |
|
98 prove in Isabelle/HOL that our algorithm is correct and generates |
|
99 unique POSIX values; we also \textit{(ii)} establish finite |
|
100 bounds for the size of the derivatives. |
|
101 |
|
102 %The size can be seen as a |
|
103 %proxy measure for the efficiency of the lexing algorithm: because of |
|
104 %the polynomial bound our algorithm does not suffer from |
|
105 %the exponential blowup in earlier works. |
|
106 |
|
107 % Brzozowski introduced the notion of derivatives for regular |
|
108 % expressions. They can be used for a very simple regular expression |
|
109 % matching algorithm. Sulzmann and Lu cleverly extended this |
|
110 % algorithm in order to deal with POSIX matching, which is the |
|
111 % underlying disambiguation strategy for regular expressions needed |
|
112 % in lexers. Their algorithm generates POSIX values which encode |
|
113 % the information of \emph{how} a regular expression matches a |
|
114 % string---that is, which part of the string is matched by which |
|
115 % part of the regular expression. In this paper we give our |
|
116 % inductive definition of what a POSIX value is and show $(i)$ that |
|
117 % such a value is unique (for given regular expression and string |
|
118 % being matched) and $(ii)$ that Sulzmann and Lu's algorithm always |
|
119 % generates such a value (provided that the regular expression |
|
120 % matches the string). We show that $(iii)$ our inductive definition |
|
121 % of a POSIX value is equivalent to an alternative definition by |
|
122 % Okui and Suzuki which identifies POSIX values as least elements |
|
123 % according to an ordering of values. We also prove the correctness |
|
124 % of Sulzmann's bitcoded version of the POSIX matching algorithm and |
|
125 % extend the results to additional constructors for regular |
|
126 % expressions. \smallskip |
|
127 \end{abstract} |
|
128 |
|
129 |
|
130 |
|
131 \input{session} |
|
132 |
|
133 |
|
134 |
|
135 \end{document} |
|
136 |
|
137 %%% Local Variables: |
|
138 %%% mode: latex |
|
139 %%% TeX-master: t |
|
140 %%% End: |