I'm looking for a smart, lightweight way to convert a title string into an array of tokens while treating known multi-word entries from a predefined dictionary as single, unsplittable tokens.
For example, the dictionary contains over 300 words / word sets such as: sheet set, jacket, suit, oxford shoes
The title string may contain something like: 4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory
I would like the resulting array to be stripped of all noisy words (i.e. any word that contains digits or is not long enough),
so first I apply a regex and drop everything that is not a-zA-Z and at least 2 characters long ({2,}).
Then I want to receive the following array:
- cotton
- queen
- sheet set
- ivory
where "sheet set" remains a single token, since it is contained in our dictionary.
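To make the expected behaviour concrete, here is a minimal Python sketch of the two steps described above: filter out noisy words, then merge adjacent words that form a dictionary entry. The names `DICTIONARY`, `WORD_RE` and `tokenize` are hypothetical. Two assumptions are baked in: words are compared in lowercase, and the minimum length is 3 rather than 2, because the expected array omits "in". It also only handles two-word entries.

```python
import re

# Hypothetical sample; the real dictionary has 300+ entries.
DICTIONARY = {"sheet set", "jacket", "suit", "oxford shoes"}

# Assumption: letters only, at least 3 of them, so that "in" is dropped.
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def tokenize(title, phrases=DICTIONARY):
    # Keep only whitespace-separated words made purely of letters.
    words = [w.lower() for w in title.split() if WORD_RE.fullmatch(w)]
    tokens = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in phrases:
            # Two adjacent words form a dictionary entry: keep them together.
            tokens.append(pair)
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize("4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory"))
# ['cotton', 'queen', 'sheet set', 'ivory']
```

Note that because filtering happens before merging, two words separated only by a noisy word in the original title would also be merged.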
I'm looking for a solution that works very fast: there are thousands of parallel processes, the dictionary keeps growing, and I'm trying to save as many iterations as possible.
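One possible way to keep the per-title work small as the dictionary grows is to build an index once per process, keyed by the first word of each multi-word entry. The sketch below is an illustration under the same assumptions as before (lowercase matching, minimum word length 3), not a benchmarked solution; `build_index` and `tokenize` are hypothetical names. Each word then costs one dict lookup plus a comparison against the few entries that share its first word, rather than a pass over the whole dictionary.

```python
import re
from collections import defaultdict

# Assumption: letters only, at least 3 of them.
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def build_index(entries):
    """Map first word -> multi-word entries as tuples, longest first."""
    index = defaultdict(list)
    for entry in entries:
        parts = tuple(entry.lower().split())
        if len(parts) > 1:  # single words need no merging
            index[parts[0]].append(parts)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return dict(index)


def tokenize(title, index):
    words = [w.lower() for w in title.split() if WORD_RE.fullmatch(w)]
    tokens, i = [], 0
    while i < len(words):
        # Only entries starting with the current word are checked.
        for parts in index.get(words[i], ()):
            if tuple(words[i:i + len(parts)]) == parts:
                tokens.append(" ".join(parts))
                i += len(parts)
                break
        else:
            # No multi-word entry matched here: emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


# Build once, reuse for every title.
INDEX = build_index(["sheet set", "jacket", "suit", "oxford shoes"])
```

Usage would be `tokenize(title, INDEX)` for each title. Unlike a plain pair lookup, this also handles entries of three or more words, preferring the longest match.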