I want to determine whether a given string can be created by joining substrings drawn from a given set. As a specific example, I want to split the string "sgene" according to which alternative of the regex sg|ge|ne|n|s each part matches. The answer should be "s", "ge", "ne", because those three parts are how the string decomposes into pieces from the regex, which encodes the desired set of substrings.
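The yes/no half of this can be sketched in Go by anchoring the alternation at both ends and repeating it. This is only a sketch: canCompose is a hypothetical helper name, and it assumes the alternation passed in is already a valid pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// canCompose (hypothetical helper) reports whether s consists entirely of
// pieces matched by the alternation. It wraps the alternation in a repeated
// non-capturing group and anchors it at both ends of the string.
func canCompose(alternation, s string) bool {
	re := regexp.MustCompile(`^(?:` + alternation + `)+$`)
	return re.MatchString(s)
}

func main() {
	fmt.Println(canCompose("sg|ge|ne|n|s", "sgene")) // decomposable as s, ge, ne
	fmt.Println(canCompose("sg|ge|ne|n|s", "sgenx")) // "x" is not in the set
}
```

This only tells me whether a decomposition exists; it does not return the individual pieces.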
Go has regexp.(*Regexp).FindAllString, and Ruby has String#scan, for finding all matches. In my code, one match is lost regardless of whether I list the substrings before or after their superstrings, since the alternatives in my regexes overlap.
Here is a program to reproduce the problem in Go:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	str := "sgene"
	superBeforeSub := regexp.MustCompile("sg|ge|ne|n|s")
	subBeforeSuper := regexp.MustCompile("n|s|sg|ge|ne")
	regexes := []*regexp.Regexp{superBeforeSub, subBeforeSuper}
	for _, rgx := range regexes {
		fmt.Println(rgx.MatchString(str), rgx.FindAllString(str, -1))
	}
}
This program outputs:
true [sg ne]
true [s ge n]
And here is the same program in Ruby (problem for Ruby is also seen here):
str = "sgene"
regexes = [/sg|ge|ne|n|s/, /n|s|sg|ge|ne/]
regexes.each do |regex|
  puts "%s %s" % [(regex === str).to_s, str.scan(regex).inspect]
end
It outputs:
true ["sg", "ne"]
true ["s", "ge", "n"]
The regex engines are aware that the string can be matched by the regex, but FindAllString and scan do not split it the way the boolean match suggests. They seem to use a greedy left-to-right search that skips at least one e. How can I use a regex to split the string into [s ge ne] in either language?