dousi5501 2011-04-13 14:50
浏览 69

在solr / lucene中过滤存储在远程数据库中的字段的最佳方法?

I have an index of about 100k documents that represent a movie entity.

Users can put films on various lists (like favorites etc.)

These lists are stored in a mysql database and are not indexed in solr.

I could store the user ids in multivalued fields that represent a list, but that is quite bad because the fields would get very, very long and the indexing would be problematic too.

So currently i do the following (pseudocode):

$favorites = SELECT document_id FROM favorites WHERE user_id = $user_id
$documents = 'http://solr.com:8393/select/?q=XYZ&fq=document_id:('.join(' OR ',$favorites);

this works great and fast but the number of items in filter queries is limited to 1024 (i tried that). also filter queries add up. so if i have one filter query with 500 values to filter i can have another values to 524 filters on another field.

It's okay for now because I limited the entries per list to 1024, and that's quite a lot but I think this approach is very clumsy and produces a lot of overhead.

Isn't there a better solution? Like writing a solr module that directly connects to the database or something? I'd like to do it in php.

If there is no other way, can i somehow raise the 1024 limit? because it works very fast now! I think with good hardware more wouldn't be a problem.

Edit: as asked in the comments i here post my original schema and a working example query.

<field name="film_id" type="int" indexed="true" stored="true" required="true"/> 
<field name="imdb_id" type="int" indexed="true" stored="true" /> 
<field name="parent_id" type="int" indexed="true" stored="true"/> 
<field name="malus" type="int" indexed="true" stored="true"/> 
<field name="type" type="int" indexed="true" stored="true"/> 
<field name="year" type="int" indexed="true" stored="true" termVectors="true"/> 
<field name="locale_title" type="string" indexed="false" stored="true"/> 
<field name="aka_title" type="filmtitle" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" /> 
<field name="sort_title" type="string" indexed="true" stored="true"/> 
<field name="director" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="director_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="actor" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="actor_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="country" type="string" indexed="true" stored="true" multiValued="true"/> 
<field name="description" type="text" indexed="true" stored="true" /> 
<field name="genre" type="genre" indexed="true" stored="true" multiValued="true" termVectors="true"/> 
<field name="url" type="string" indexed="true" stored="true" multiValued="false"/> 
<field name="image_url" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="rating" type="int" indexed="true" stored="true" required="false" default="50"/>
<field name="affiliate" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="product_type" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="product_*" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="blockbuster" type="boolean" indexed="true" stored="true" /> 
<copyField source="film_id" dest="id"/>
<field name="director_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<field name="actor_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>

theese are my additions to the default schema.xml

a sample search result can be viewed here.

a sample query would be:

http://my-server.com:8983/solr/select/?
q=description:nazis
&fq=product_bluray:amazon
&fq=film_id:(1185616 1054606 88763 361748 78748)

here the user would search for movies that are:

  • available on amazon as a bluray
  • that have the term "nazis" in the description
  • AND that are on his favorite list

the list includes the movies (documents) with the ids 1185616 1054606 88763 361748 78748 and are stored in the mysql database.

ps: I don't know whether I formulated the question well, I hope its understandable. If not, please feel free to edit!

  • 写回答

1条回答 默认 最新

  • duanbinmi8970 2011-04-18 03:06
    关注

    Step one is to make sure you really want to use Solr. Looking at your schema, there's an awful lot in there that is susceptible to a normal RDBMS with basic text indexing. Take half an hour and look at postgresql unless you've already determined that a regular good old fashioned RDBMS with some extra bells an whistles just won't do for you.

    There's a lot of interest in this problem in the Solr community, but there isn't a real solution.

    The obvious approach is to reindex a "favorited" document every time someone favorites it with their username in a multivalued field. This is brain-dead, of course, but that doesn't mean it won't work, depending on how often one of your users mess with his/her favorites list. If your documents are on the small size (I assume they are only a few K) and you have can get enough hardware to keep the whole index in memory (likely since you've only got 100K documents) this might be the approach to consider. You can test it by building an index of a size that you can actually fit into the memory available and implement the strategy. See if it's fast enough.

    You may also be able to 'batch' these operations if people don't add a gazillion favorites in one go, like this:

    • Day 1: I add ten items to my favorites. You stick their ID's in a database and use that list of ID's to filter my queries.
    • Night 1: You update all the documents that have been favorited by anyone during the day, adding my username to the "favoritedBy" multiValued field. Remove my favorited list from the DB, since it's now represented in the Solr index itself.
    • Day 2: I add three more items to my favorites. You filter on both favorited:myusername and id:(newID1 OR newID2 or newID3).

    This may work for you if people add a reasonable number of favorites per day and you don't have a lot of traffic at night.

    评论

报告相同问题?

悬赏问题

  • ¥15 idea右下角设置编码是灰色的
  • ¥15 全志H618ROM新增分区
  • ¥20 jupyter保存图像功能的实现
  • ¥15 在grasshopper里DrawViewportWires更改预览后,禁用电池仍然显示
  • ¥15 NAO机器人的录音程序保存问题
  • ¥15 C#读写EXCEL文件,不同编译
  • ¥15 MapReduce结果输出到HBase,一直连接不上MySQL
  • ¥15 扩散模型sd.webui使用时报错“Nonetype”
  • ¥15 stm32流水灯+呼吸灯+外部中断按键
  • ¥15 将二维数组,按照假设的规定,如0/1/0 == "4",把对应列位置写成一个字符并打印输出该字符