dtla92562 2012-12-21 09:37

Filtering text for banned words in PHP

We run a C2C website and discourage selling branded products on it. We have built a database of brand words such as Nike and D&G, and an algorithm that scans product information for these words and disables any product that contains them.

Our current algorithm removes all whitespace and special characters from the provided text and matches the result against each word from the database. The following cases must be caught, and the algorithm catches them efficiently:

  • i am nike world
  • i have n ikee shoes
  • i have nikeeshoes
  • i sell i-phone casings
  • i sell iphone-casings
  • you can have iphone

Now the problem is that it also catches the following:

  • rapiD Garment factory (for D&G)
  • rosNIK Electronics (for Nike)
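
These false matches follow directly from the normalisation step: once whitespace and special characters are gone, word boundaries disappear and a keyword can match across two adjacent words. A minimal sketch of the effect, assuming a hypothetical helper `collapse()` (not part of the original code) that applies the same `\W` stripping:

```php
<?php
// Hypothetical helper for illustration only: lower-cases the text and
// strips every non-word character, as the algorithm described above does.
function collapse($text)
{
    return strtolower(preg_replace('/\W/', '', $text));
}

// "d&g" collapses to "dg". "rapiD Garment factory" collapses to
// "rapidgarmentfactory", where the final "d" of "rapid" and the
// initial "g" of "garment" end up next to each other.
var_dump(strpos(collapse('rapiD Garment factory'), collapse('d&g')) !== false); // bool(true)

// "rosNIK Electronics" collapses to "rosnikelectronics", which contains
// "nike" because the "e" of "Electronics" directly follows "nik".
var_dump(strpos(collapse('rosNIK Electronics'), collapse('nike')) !== false);   // bool(true)
```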

What can be done to prevent such false matches while still catching the true cases efficiently?

EDIT

Here's the code for those of you who understand code better:

// Strip HTML tags, then drop anything between "&" and the next ";" (HTML entities).
$orignal_txt = preg_replace('/&.{0,}?;/', '', strip_tags($orignal_txt));
// Remove every non-word character, including whitespace.
$orignal_txt_nospace = preg_replace('/\W/', '', $orignal_txt);

$qry_kws = array("nike", "iphone", "d&g");
foreach ($qry_kws as $rs_kw)
{
    // Normalise the keyword the same way, e.g. "d&g" becomes "dg".
    $no_space_db_kw = preg_replace('/\W/', '', $rs_kw);
    if (stristr($orignal_txt_nospace, $rs_kw))
    {
        $ipr_banned_keywords[] = strtolower($rs_kw);
    }
    else if (stristr($orignal_txt_nospace, $no_space_db_kw))
    {
        $ipr_banned_keywords[] = strtolower($rs_kw);
    }
}

4 answers

  • dongxie548548 2012-12-21 12:03

    Just playing around... (not to be used in production)

    $data = array(
            "i am nike world",
            "i have n ikee shoes",
            "i have nikeeshoes",
            "i sell i-phone casings",
            "i sell iphone-casings",
            "you can have iphone",
            "rapiD Garment factor",
            "rosNIK Electronics",
            "Buy you self N I K E",
            "B*U*Y I*P*H*O*N*E BABY",
            "My Phone Is not available");
    
    
    $ban = array("nike","d&g","iphone");
    

    Example 1:

    $filter = new BrandFilterIterator($data);
    $filter->parseBan($ban);
    foreach ( $filter as $word ) {
        echo $word, PHP_EOL;
    }
    

    Output 1

    rapiD Garment factor
    rosNIK Electronics
    My Phone Is not available
    

    Example 2

    $filter = new BrandFilterIterator($data,true); //reverse filter
    $filter->parseBan($ban);
    foreach ( $filter as $word ) {
        echo $word, " " , json_encode($word->getBan()) ,  PHP_EOL;
    }
    

    Output 2

    i am nike world ["nike"]
    i have n ikee shoes ["nike"]
    i have nikeeshoes ["nike"]
    i sell i-phone casings ["iphone"]
    i sell iphone-casings ["iphone"]
    you can have iphone ["iphone"]
    Buy you self N I K E ["nike"]
    B*U*Y I*P*H*O*N*E BABY ["iphone"]
    

    Class Used

    class BrandFilterIterator extends FilterIterator {
        private $words = array();
        private $reverse = false;
    
        function __construct(array $words, $reverse = false) {
            $this->reverse = $reverse;
            foreach ( $words as $word ) {
                $this->words[] = new Word($word);
            }
            parent::__construct(new ArrayIterator($this->words));
        }
    
        function parseBan(array $ban) {
            foreach ( $ban as $item ) {
                foreach ( $this->words as $word ) {
                    $word->checkMetrix($item);
                }
            }
        }
    
        public function accept() {
            // In reverse mode, yield only the entries that were rejected (banned).
            $accepted = $this->getInnerIterator()->current()->accept();
            return $this->reverse ? !$accepted : $accepted;
        }
    }
    
    
    class Word {
        private $ban = array();
        private $word;
        private $parts;
        private $accept = true;
    
        function __construct($word) {
            $this->word = $word;
            $this->parts = explode(" ", $word);
        }
    
        function __toString() {
            return $this->word;
        }
    
        function getTrim() {
            return preg_replace('/\W/', '', $this->word);
        }
    
        function accept() {
            return $this->accept;
        }
    
        function getBan() {
            return array_unique($this->ban);
        }
    
        function reject($ban = null) {
            $ban === null or $this->ban[] = $ban;
            $this->accept = false;
            return $this->accept;
        }
    
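        function checkMetrix($ban) {
            $ban = strtolower($ban);
            foreach ( $this->parts as $part ) {
                if ($part === '') {
                    continue; // consecutive spaces give empty parts; avoid dividing by zero
                }
                $part = strtolower($part);
                // $t: length of the banned word as a percentage of the part's length
                $t = ceil(strlen($ban) / strlen($part) * 100);
                // $p: similarity between the part and the banned word, in percent
                $s = similar_text($part, $ban, $p);
                // Both arguments are $part, so $l is always 0
                $l = levenshtein($part, $part);
                if (ceil($p) >= $t || ($t == 100 && $p >= 75 && $l == 0)) {
                    $this->reject($ban);
                }
            }
            // Detect Bad Use of space: if stripping non-word characters removes more
            // than about a quarter of the text, search the stripped text directly
            if ($this->word !== '' && ceil(strlen($this->getTrim()) / strlen($this->word) * 100) < 75) {
                if (stripos($this->getTrim(), $ban) !== false) {
                    $this->reject($ban);
                }
            }
            return $this->accept;
        }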
    }
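
To make the thresholds in `checkMetrix()` easier to follow, here is a small sketch that computes the same numbers outside the class: `$t` (the banned word's length as a percentage of the part's length), `$p` (the `similar_text()` percentage), and the ratio used by the space check. The function names `partMetrics()` and `spaceRatio()` are made up for illustration and are not part of the answer's code.

```php
<?php
// Hypothetical helper: returns array($t, $p) as computed per part in checkMetrix().
function partMetrics($part, $ban)
{
    $part = strtolower($part);
    $ban  = strtolower($ban);
    $t = ceil(strlen($ban) / strlen($part) * 100); // ban length as % of part length
    similar_text($part, $ban, $p);                 // $p receives the similarity in %
    return array($t, $p);
}

// Hypothetical helper: percentage of characters that survive \W stripping.
function spaceRatio($text)
{
    return ceil(strlen(preg_replace('/\W/', '', $text)) / strlen($text) * 100);
}

// "ikee" vs "nike": $t = 100 and $p = 75, so the
// ($t == 100 && $p >= 75) branch rejects the part.
list($t, $p) = partMetrics('ikee', 'nike');
echo "$t $p", PHP_EOL;

// "rosnik" vs "nike": $t = 67 and $p = 60, so ceil($p) >= $t fails
// and this part is not rejected.
list($t, $p) = partMetrics('rosnik', 'nike');
echo "$t $p", PHP_EOL;

// Heavily spaced text drops below the 75% threshold, so the stripped
// text is searched for the banned word; normally spaced text does not.
var_dump(spaceRatio('Buy you self N I K E') < 75);  // bool(true)
var_dump(spaceRatio('rapiD Garment factor') < 75);  // bool(false)
```

This also shows why "rosNIK Electronics" passes here but not in the question's code: the comparison is made word by word, so "nik" and the following "e" are never joined.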
    
    This answer was selected by the asker as the best answer.