Best way to filter in Solr / Lucene on fields stored in a remote database?

I have an index of about 100k documents that represent a movie entity.

Users can put films on various lists (like favorites etc.)

These lists are stored in a MySQL database and are not indexed in Solr.

I could store the user ids in multivalued fields that represent a list, but that is quite bad because the fields would get very, very long and the indexing would be problematic too.

So currently I do the following (pseudocode):

$favorites = SELECT document_id FROM favorites WHERE user_id = $user_id
$documents = 'http://solr.com:8393/select/?q=XYZ&fq=document_id:(' . join(' OR ', $favorites) . ')';

This works well and is fast, but the number of items in filter queries is limited to 1024 (I tried that). Filter queries also add up: if one filter query already has 500 values, a filter on another field can have at most another 524 values.

It's okay for now because I limited the entries per list to 1024, and that's quite a lot, but I think this approach is very clumsy and produces a lot of overhead.

Isn't there a better solution? Like writing a Solr module that directly connects to the database or something? I'd like to do it in PHP.

If there is no other way, can I somehow raise the 1024 limit? It works very fast now, and I think with good hardware more wouldn't be a problem.

Edit: as asked in the comments, here are my original schema and a working example query.

<field name="film_id" type="int" indexed="true" stored="true" required="true"/> 
<field name="imdb_id" type="int" indexed="true" stored="true" /> 
<field name="parent_id" type="int" indexed="true" stored="true"/> 
<field name="malus" type="int" indexed="true" stored="true"/> 
<field name="type" type="int" indexed="true" stored="true"/> 
<field name="year" type="int" indexed="true" stored="true" termVectors="true"/> 
<field name="locale_title" type="string" indexed="false" stored="true"/> 
<field name="aka_title" type="filmtitle" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" /> 
<field name="sort_title" type="string" indexed="true" stored="true"/> 
<field name="director" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="director_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="actor" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="actor_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="country" type="string" indexed="true" stored="true" multiValued="true"/> 
<field name="description" type="text" indexed="true" stored="true" /> 
<field name="genre" type="genre" indexed="true" stored="true" multiValued="true" termVectors="true"/> 
<field name="url" type="string" indexed="true" stored="true" multiValued="false"/> 
<field name="image_url" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="rating" type="int" indexed="true" stored="true" required="false" default="50"/>
<field name="affiliate" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="product_type" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="product_*" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="blockbuster" type="boolean" indexed="true" stored="true" /> 
<copyField source="film_id" dest="id"/>
<field name="director_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<field name="actor_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>

These are my additions to the default schema.xml.

A sample search result can be viewed here.

A sample query would be:

http://my-server.com:8983/solr/select/?
q=description:nazis
&fq=product_bluray:amazon
&fq=film_id:(1185616 1054606 88763 361748 78748)

Here the user would search for movies that:

are available on Amazon as a Blu-ray
have the term "nazis" in the description
AND are on his favorites list

The list includes the movies (documents) with the ids 1185616 1054606 88763 361748 78748, which are stored in the MySQL database.

PS: I don't know whether I formulated the question well; I hope it's understandable. If not, please feel free to edit!

1 answer

duanbinmi8970 2011-04-18 03:06
Step one is to make sure you really want to use Solr. Looking at your schema, there's an awful lot in there that a normal RDBMS with basic text indexing could handle. Take half an hour and look at PostgreSQL, unless you've already determined that a regular good old-fashioned RDBMS with some extra bells and whistles just won't do for you.

There's a lot of interest in this problem in the Solr community, but there isn't a real solution.

The obvious approach is to reindex a "favorited" document every time someone favorites it, with their username in a multivalued field. This is brain-dead, of course, but that doesn't mean it won't work, depending on how often your users mess with their favorites lists. If your documents are on the small side (I assume they are only a few KB each) and you can get enough hardware to keep the whole index in memory (likely, since you've only got 100K documents), this might be the approach to consider. You can test it by building an index of a size that actually fits into the available memory and implementing the strategy. See if it's fast enough.
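As a rough illustration of the data preparation behind this approach, the sketch below groups rows from the favorites table by film, so that each film document can be re-sent to Solr with a multivalued field listing the users who favorited it. The field name `favoritedBy`, the `(user_id, film_id)` row shape, and both helper names are assumptions made for the example; posting the documents to Solr is deliberately left out.

```python
from collections import defaultdict

def build_favorited_by(rows):
    """Group (user_id, film_id) rows into {film_id: sorted list of user ids}."""
    by_film = defaultdict(set)
    for user_id, film_id in rows:
        by_film[film_id].add(user_id)
    return {film_id: sorted(users) for film_id, users in by_film.items()}

def with_favorites(film_doc, favorited_by):
    """Return a copy of a film document with the hypothetical favoritedBy field set."""
    doc = dict(film_doc)
    doc["favoritedBy"] = favorited_by.get(doc["film_id"], [])
    return doc

# Example rows as they might come back from the favorites table.
rows = [(7, 1185616), (9, 1185616), (7, 88763)]
favorited_by = build_favorited_by(rows)
doc = with_favorites({"film_id": 1185616, "description": "..."}, favorited_by)
```

Only films whose set of users changed need to be reindexed, which is what keeps this manageable for an index of about 100K documents.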

You may also be able to 'batch' these operations if people don't add a gazillion favorites in one go, like this:

Day 1: I add ten items to my favorites. You stick their IDs in a database and use that list of IDs to filter my queries.

Night 1: You update all the documents that have been favorited by anyone during the day, adding my username to the "favoritedBy" multiValued field. Remove my favorited list from the DB, since it's now represented in the Solr index itself.

Day 2: I add three more items to my favorites. You filter on both favoritedBy:myusername and id:(newID1 OR newID2 OR newID3).

This may work for you if people add a reasonable number of favorites per day and you don't have a lot of traffic at night.
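The Day 2 filter can be sketched as below. This is a minimal example under a few assumptions: the two conditions are combined with OR inside a single `fq` (two separate `fq` parameters would intersect and return only films matching both), the multivalued field is called `favoritedBy`, and the ID field is `film_id` as in the question's schema. The function names are made up for the example, and Python is used only for illustration; the same string building carries over to PHP.

```python
from urllib.parse import urlencode

def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes for the query parser."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped

def favorites_filter(username, pending_ids):
    """Build one fq matching films already tagged in the index OR still pending in the DB."""
    clauses = ["favoritedBy:" + quote_term(username)]
    if pending_ids:
        ids = " OR ".join(str(int(i)) for i in pending_ids)
        clauses.append("film_id:(%s)" % ids)
    return " OR ".join(clauses)

def select_url(base_url, query, username, pending_ids):
    """Assemble the select URL with URL-encoded q and fq parameters."""
    params = [("q", query), ("fq", favorites_filter(username, pending_ids))]
    return base_url + "?" + urlencode(params)

fq = favorites_filter("myusername", [101, 102, 103])
url = select_url("http://my-server.com:8983/solr/select/",
                 "description:nazis", "myusername", [101, 102, 103])
```

Because only the favorites added since the last nightly update appear as explicit IDs, the clause count stays small and well under the 1024 limit mentioned in the question.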