We run a C2C website and discourage the sale of branded products on it. We have built a database of brand words such as Nike and D&G, and an algorithm that scans product information for these words and disables any product whose information contains them.
Our current algorithm removes all whitespace and special characters from the provided text and then checks whether any word from the database occurs in the result as a case-insensitive substring. The following cases must be caught, and the algorithm catches them efficiently:
- i am nike world
- i have n ikee shoes
- i have nikeeshoes
- i sell i-phone casings
- i sell iphone-casings
- you can have iphone
The problem is that it also catches the following:
- rapiD Garment factory (for D&G)
- rosNIK Electronics (for Nike)
What can be done to prevent such false matches while still catching the true cases efficiently?
EDIT
Here's the code for those of you who understand code better:
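// Strip HTML tags and HTML entities (e.g. &amp;) from the product text.
$orignal_txt = preg_replace('/&.{0,}?;/', '', strip_tags($orignal_txt));
// Remove every non-word character (whitespace, punctuation, symbols).
$orignal_txt_nospace = preg_replace('/\W/', '', $orignal_txt);

$ipr_banned_keywords = array();
$qry_kws = array("nike", "iphone", "d&g"); // in production these come from the brand-word database
foreach ($qry_kws as $rs_kw) {
    // Normalise the keyword the same way, e.g. "d&g" becomes "dg".
    $no_space_db_kw = preg_replace('/\W/', '', $rs_kw);
    // Case-insensitive substring match on either form of the keyword.
    if (stripos($orignal_txt_nospace, $rs_kw) !== false
        || stripos($orignal_txt_nospace, $no_space_db_kw) !== false) {
        $ipr_banned_keywords[] = strtolower($rs_kw);
    }
}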
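To make the problem easy to reproduce, here is a minimal sketch that runs a few of the sample strings through the same strip-then-substring check. The function name `find_banned_keywords` is a hypothetical wrapper introduced only for this illustration (it is not in our production code), and the keyword list is the same three-word sample as above.

```php
<?php
// Hypothetical helper (assumption, not the production code): wraps the same
// normalise-then-substring check so sample strings can be tested in isolation.
function find_banned_keywords(string $text, array $keywords): array
{
    // Same normalisation as above: drop tags, HTML entities, then non-word characters.
    $text = preg_replace('/&.{0,}?;/', '', strip_tags($text));
    $text_nospace = preg_replace('/\W/', '', $text);

    $found = array();
    foreach ($keywords as $kw) {
        // "d&g" becomes "dg" here, which is what causes the false match below.
        $kw_nospace = preg_replace('/\W/', '', $kw);
        if (stripos($text_nospace, $kw_nospace) !== false) {
            $found[] = strtolower($kw);
        }
    }
    return $found;
}

$keywords = array("nike", "iphone", "d&g");
$samples = array(
    "i have n ikee shoes",     // intended match: nike
    "i sell i-phone casings",  // intended match: iphone
    "rapiD Garment factory",   // false match: d&g
    "rosNIK Electronics",      // false match: nike
);
foreach ($samples as $sample) {
    echo $sample, " => ", implode(", ", find_banned_keywords($sample, $keywords)), "\n";
}
```

The last two samples are flagged because removing the space joins the end of one word to the start of the next ("rapiD" + "Garment", "rosNIK" + "Electronics"), so the keyword spans a word boundary that no longer exists after normalisation.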