dongyumiao5210 2012-03-22 18:43

Tuning Sphinx match-any / partial matching [via PHP]

We're running Sphinx on a mid-sized product database (10 million records, 2 GB) using the standard SPH_MATCH_EXTENDED2 / SPH_RANK_PROXIMITY_BM25 approach. Speed is great, and relevancy is spot on.

However, we're getting a growing number of complaints from end users whose search terms are more specific than what our database contains, so they get no results.

For example, we have the product "KitchenAid Artisan 5-Quart Mixers" while a common search is "KitchenAid Artisan 5-Quart Stand Mixers brown". The result with our current settings is no match when we should be able to return the item we have.

We've tried the SPH_MATCH_ANY mode, sorting by @weight, but relevancy goes completely sideways [think dolls and board games showing up] as Sphinx picks up other products that match only individual words.

Is there a best-practice way to build our query parameters that allows for more open matching while still ranking on proximity and word density?

Here are our current PHP API calls, if that helps:

$cl = new SphinxClient();
$cl->SetServer('1.23.4', 456);
$cl->SetMaxQueryTime(15000);
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);
$cl->SetRankingMode(SPH_RANK_PROXIMITY_BM25);
$cl->SetArrayResult(true);
$cl->SetFilter('active', array(1)); 
$cl->SetSortMode(SPH_SORT_RELEVANCE, '@weight DESC, priced ASC');
$cl->SetLimits(intval($try), 1, 20, 500);
$cl->SetFieldWeights(array('ptitle' => 60, 'description' => 40));
$res = $cl->query($searchterm,"products");

1 answer

  • doulanyan6455 2012-03-22 20:36

    One thing to explore is the quorum operator. It can be useful for long queries because it lets you require a certain number of keywords to match. While ANY requires only one word to match, quorum can require, say, 4 out of 7.

    This will rule out a number of really bad matches right off.

    And because quorum is just an operator in the extended query syntax, you can try all the different ranking modes with it. SPH_RANK_MATCHANY is still available and should be reasonably good with 'partial' matches, but you can also try the other modes.

    If you are using morphology, you can also enable index_exact_words and give exact (unstemmed) word matches a boost in the rankings.

    So I would do something like:

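    // This works as long as the user is not using special syntax; if the input contains -="() etc., you need to be more clever.
    $bits = preg_split('/\s+/', trim($searchterm));
    // Require roughly two thirds of the words to match.
    $quorum = ceil(count($bits) * 0.66);
    // The same words prefixed with '=' (exact form; needs index_exact_words).
    $searchterm2 = '=' . implode(' =', $bits);

    // Quorum over the plain words OR quorum over the exact forms.
    $searchterm = '"' . implode(' ', $bits) . '"/' . $quorum . ' | "' . $searchterm2 . '"/' . $quorum;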
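
The snippet above assumes the input contains plain words only. As a minimal sketch of one way to be "more clever", the helper below replaces characters that act as operators in the extended query syntax with spaces before building the quorum query. The function name `buildQuorumQuery`, the `$ratio` parameter, and the exact set of stripped characters are assumptions made for illustration; they are not part of the Sphinx API, and the right character set depends on your `charset_table`.

```php
<?php
// Hypothetical helper (not part of the Sphinx API): builds a quorum query
// from free-form user input. The '=' exact-form operator in the second half
// only has an effect when index_exact_words is enabled on the index.
function buildQuorumQuery(string $input, float $ratio = 0.66): string
{
    // Assumed set of extended-syntax operator characters; each is replaced
    // with a space so it cannot be interpreted as an operator.
    $clean = preg_replace('/[\\\\()|\-!@~"&\/^$=<>]/', ' ', $input);

    // Split on whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return '';
    }

    // Require roughly $ratio of the words to match (always at least one).
    $quorum = (int) ceil(count($words) * $ratio);

    $plain = implode(' ', $words);
    $exact = '=' . implode(' =', $words);

    // Quorum over the plain words OR quorum over the exact forms.
    return '"' . $plain . '"/' . $quorum . ' | "' . $exact . '"/' . $quorum;
}

echo buildQuorumQuery('KitchenAid "Artisan" (stand) mixers'), "\n";
// prints: "KitchenAid Artisan stand mixers"/3 | "=KitchenAid =Artisan =stand =mixers"/3
```

Note that stripping hyphens turns a token such as 5-Quart into two words, which changes the word count used for the quorum.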
    

    Also, I have doubts about your SetLimits call. A max_matches of 20 seems very low, and the cutoff looks unnecessary; it might even be causing your issues. With a cutoff of 500, Sphinx stops searching as soon as it has found 500 matching documents, even if there are better matches later in the dataset.
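
As a rough sketch of the SetLimits point, the helper below computes the four SetLimits() arguments (offset, limit, max_matches, cutoff) for a page of results, with a larger max_matches and the cutoff disabled. The name `pageLimits`, the page-based paging scheme, and the default of 1000 for max_matches are assumptions for illustration; the per-query value is also bounded by the max_matches setting in the searchd configuration.

```php
<?php
// Hypothetical helper: returns [offset, limit, max_matches, cutoff]
// for a 1-based page number.
function pageLimits(int $page, int $perPage = 20, int $maxMatches = 1000): array
{
    $page = max(1, $page);              // treat anything below 1 as page 1
    $offset = ($page - 1) * $perPage;   // index of the first result on the page

    // A cutoff of 0 means "no cutoff": the whole index is searched.
    return [$offset, $perPage, $maxMatches, 0];
}

[$offset, $limit, $max, $cutoff] = pageLimits(3);

// With the SphinxClient instance from the question this would be passed as:
// $cl->SetLimits($offset, $limit, $max, $cutoff);
```

Pages whose offset reaches max_matches cannot return results, so deep paging would need a larger value.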

    This answer was selected by the asker as the best answer.
