doujing5435 2014-06-22 09:04
浏览 27
已采纳

使用标题确定SphinxQL中的可能类别

I have a database with over 60 million records indexed by SphinxQL 2.1.1. Each record has a title and a catid (among other things). When a new record is inserted into the database, I am trying to get sphinx to guess the catid based on the text in the title.

I have managed to get it working for single words like so:

SELECT @groupby, catid, count(*) c FROM sphinx WHERE MATCH('*LANDLORDS*') group by catid order by c desc

However the actual title is likely to be something like this:

Looking for Landlords - Long term lease - No fees!!!

Is there any way to just dump the whole title string into sphinx and have it break down each of the words and perform some sort of fuzzy match, returning the most likely category?

  • 写回答

1条回答 默认 最新

  • donglu4633 2014-06-23 14:38
    关注

    Well as such sphinx isnt 'magical', and it doesn't have a 'fuzzy match' function.

    But can approximate one :) Two main steps...

    1. Changing from requiring all 'words', to just requiring some,

    2. changing ranking, to try to make the best 'intersection' between the query and the title, get a high weight, and therefore 'bubble' to the top.

    Can then just take the top result, and take it be a 'best guess'.

    (there is actully a third, words lie 'for' and 'the' are likly to cause lots of false positives, so may want to exclude them, either using stopwords on the index, or just strip then from the query)

    A prototype of such a query might be something like

    SELECT catid FROM sphinx WHERE MATCH('"Looking Landlords Long term lease No fees"/1') OPTION ranker=wordcount LIMIT 1;
    

    Thats using quorum to affect matching, and choosing a different ranker.


    Using this version with grouping, proabbly wont work, as will include lots of low quality matches. Although could perhap try using avg, or sum to get a composite weight?

    SELECT SUM(WEIGHT()) as w, catid FROM sphinx WHERE MATCH('"Looking Landlords Long term lease No fees"/1') GROUP BY catid ORDER BY w DESC OPTION ranker=wordcount LIMIT 1
    

    There are lots of ways to tweak this...

    You can try other rankers, eg matchany. Or even some custom ranking expressions.

    Or change the quorum, eg rather rank requiring 1 word, could result at least a few.

    Or if can extract phrases, eg

    '"Looking Landlords" | "Long term lease" | "No fees"'

    might work?

    ALso could rather than just taking the top result, take the top 5-10 results, and show them all to the user, compenstates for the fact the results are very approximate.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 为什么使用javacv转封装rtsp为rtmp时出现如下问题:[h264 @ 000000004faf7500]no frame?
  • ¥15 乘性高斯噪声在深度学习网络中的应用
  • ¥15 运筹学排序问题中的在线排序
  • ¥15 关于docker部署flink集成hadoop的yarn,请教个问题 flink启动yarn-session.sh连不上hadoop,这个整了好几天一直不行,求帮忙看一下怎么解决
  • ¥30 求一段fortran代码用IVF编译运行的结果
  • ¥15 深度学习根据CNN网络模型,搭建BP模型并训练MNIST数据集
  • ¥15 C++ 头文件/宏冲突问题解决
  • ¥15 用comsol模拟大气湍流通过底部加热(温度不同)的腔体
  • ¥50 安卓adb backup备份子用户应用数据失败
  • ¥20 有人能用聚类分析帮我分析一下文本内容嘛