+----------------------------------------------------------+
|                                                          |
|                                                          |
|                                                          |
+----------------------------------------------------------+


$

RegElk

--> See the repository <--

RegElk - OCaml Linear Engine for JavaScript Regexes

Authors: Aurèle Barrière and Clément Pit-Claudel.

About

This is a linear regular expression engine for a subset of JavaScript regexes. The underlying algorithm is an extension of the PikeVM, supporting more JavaScript features. This engine implements the algorithms described in the paper Linear Matching of JavaScript Regular Expressions by the same authors.

In particular, it supports, for the first time with linear time and space complexity:

RegElk means Regex Engine with Linear looKarounds. Elks are diagonal walkers, meaning that they reuse their front legs prints for their rear legs to conserve energy, evoking how a PikeVM merges threads reaching the same state to preserve linearity.

RegElk

Complexity

Given a regex of size |r| and a string of size |s|, this engine has linear worst-case time complexity in both of them O(|r|*|s|). While counted quantifiers are supported, they increase the regex size. For instance, e{4-8} will multiply the size of e 8 times. However, the greedy plus (+ or {1,}) or the nonnullable lazy plus (as in (ab)+?) are handled without duplication.

The engine also has O(|r|*|s|) space complexity. If one wants to avoid a string-size dependent space complexity, we provide alternative register data-structures, presenting various time-space complexity tradeoff.

Time ComplexitySpace Complexity
List (default)O(|r|*|s|)O(|r|*|s|)
ArrayO(|r|^2*|s|)O(|r|^2)
TreeO(|r|*log(|r|)*|s|)O(|r|^2)

Note however that a O(|r|*|s|) space complexity cannot be avoided when using our linear lookaround algorithm.

Supported Features

FeatureExample
Lookaheadsa(?=(b)), a(?!=b)
Lookbehinds(?<=b)a, (?<!b)a
Capture Groups(a*)b
Noncapturing Groups(?:a*)b
Greedy Quantifiers*, +, ?
Lazy Quantifiers*?, +?, ??
Counted Quantifiersa{6,12}, a{7,}, a{9}, a{4,5}?
Character Classes[a-z], [^h], [aeiouy]
Character Groups\w, \d, \s, \W, \D, \S
Anchors$, ^
Word Boundaries\b, \B

Backreferences are not supported, as they make the matching problem NP-hard. Named capture groups, hexadecimal escapes, unicode escapes, unicode properties and regex flags are not supported yet, although they could be in the future.

Dependencies

You need the following Opam packages. Other version numbers may also work.

You also need to install Node.JS and have node in your path.

Usage

Build all executables with make. Make sure to have configured your opam switch so that it has all dependencies listed above. This creates several executables:

main.native, fuzzer.native and benchmark.native have command line options that can be printed with the argument --help.

Files

OCaml Engine Files

Other Tools

Correspondence between the Paper and the Code

Renamings

Algorithm 1

This is implemented by the functions advance_epsilon and find_match in interpreter.ml.

Section 4.1

Section 4.2

Section 4.3

Section 4.4

Switch to the strlb directory for this algorithm, and see the corresponding README.md file.

Section 4.5

Section 4.6

Section 5.1

Section 5.2

Tests