How to search local Word documents with PyLucene

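After a lot of trial and error, I finally got JCC and PyLucene installed on Windows. I can now use IndexFiles.py (builds the index) and SearchFiles.py (searches documents) to search .txt files.
How can I use these two .py files to search Word documents, so that a query matches both the file name and the content? P.S. There is not much material online about using PyLucene.
IndexFiles.py is attached below: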


#!/usr/bin/env python

INDEX_DIR = "IndexFiles.index"

import sys, os, lucene, threading, time
from datetime import datetime

from java.nio.file import Paths
from org.apache.lucene.analysis.miscellaneous import LimitTokenCountAnalyzer
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import \
    FieldInfo, IndexWriter, IndexWriterConfig, IndexOptions
from org.apache.lucene.store import SimpleFSDirectory

"""
This class is loosely based on the Lucene (java implementation) demo class
org.apache.lucene.demo.IndexFiles.  It will take a directory as an argument
and will index all of the files in that directory and downward recursively.
It will index on the file path, the file name and the file contents.  The
resulting Lucene index will be placed in the current directory and called
'index'.
"""

class Ticker(object):

    def __init__(self):
        self.tick = True

    def run(self):
        while self.tick:
            sys.stdout.write('.')
            sys.stdout.flush()
            time.sleep(1.0)

class IndexFiles(object):
    """Usage: python IndexFiles <doc_directory>"""

    def __init__(self, root, storeDir, analyzer):

        if not os.path.exists(storeDir):
            os.mkdir(storeDir)

        store = SimpleFSDirectory(Paths.get(storeDir))
        analyzer = LimitTokenCountAnalyzer(analyzer, 1048576)
        config = IndexWriterConfig(analyzer)
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
        writer = IndexWriter(store, config)

        self.indexDocs(root, writer)
        ticker = Ticker()
        print 'commit index',
        threading.Thread(target=ticker.run).start()
        writer.commit()
        writer.close()
        ticker.tick = False
        print 'done'

    def indexDocs(self, root, writer):

        t1 = FieldType()
        t1.setStored(True)
        t1.setTokenized(False)
        t1.setIndexOptions(IndexOptions.DOCS_AND_FREQS)

        t2 = FieldType()
        t2.setStored(False)
        t2.setTokenized(True)
        t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)

        for root, dirnames, filenames in os.walk(root):
            for filename in filenames:
                #if not filename.endswith('.txt'):
                if not filename.endswith('.docx'):
                    continue
                print "adding", filename
                try:
                    path = os.path.join(root, filename)
                    # Read raw bytes; binary mode avoids truncation at Ctrl-Z on Windows.
                    with open(path, 'rb') as f:
                        contents = unicode(f.read(), 'iso-8859-1')
                    doc = Document()
                    doc.add(Field("name", filename, t1))
                    doc.add(Field("path", root, t1))
                    if len(contents) > 0:
                        doc.add(Field("contents", contents, t2))
                    else:
                        print "warning: no content in %s" % filename
                    writer.addDocument(doc)
                except Exception, e:
                    print "Failed in indexDocs:", e

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print IndexFiles.__doc__
        sys.exit(1)
    lucene.initVM(vmargs=['-Djava.awt.headless=true'])
    print 'lucene', lucene.VERSION
    start = datetime.now()
    try:
        base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
        IndexFiles(sys.argv[1], os.path.join(base_dir, INDEX_DIR),
                   StandardAnalyzer())
        end = datetime.now()
        print end - start
    except Exception, e:
        print "Failed: ", e
        raise e

SearchFiles.py

#!/usr/bin/env python
#-*-coding:utf-8-*-
INDEX_DIR = "IndexFiles.index"

import sys, os, lucene

from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.search import IndexSearcher

"""
This script is loosely based on the Lucene (java implementation) demo class
org.apache.lucene.demo.SearchFiles.  It will prompt for a search query, then it
will search the Lucene index in the current directory called 'index' for the
search query entered against the 'contents' field.  It will then display the
'path' and 'name' fields for each of the hits it finds in the index.  Note that
search.close() is currently commented out because it causes a stack overflow in
some cases.
"""
def run(searcher, analyzer):
    while True:
        print
        print "Hit enter with no input to quit."
        command = raw_input("Query:")
        if command == '':
            return

        print
        print "Searching for:", command
        query = QueryParser("contents", analyzer).parse(command)
        scoreDocs = searcher.search(query, 50).scoreDocs
        print "%s total matching documents." % len(scoreDocs)

        for scoreDoc in scoreDocs:
            doc = searcher.doc(scoreDoc.doc)
            #print 'path:', doc.get("path"), 'name:', doc.get("name")
            print 'path:', doc.get("path").encode('utf8', 'ignore'), 'name:', doc.get("name").encode('utf8', 'ignore')


if __name__ == '__main__':
    lucene.initVM(vmargs=['-Djava.awt.headless=true'])
    print 'lucene', lucene.VERSION
    base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    directory = SimpleFSDirectory(Paths.get(os.path.join(base_dir, INDEX_DIR)))
    searcher = IndexSearcher(DirectoryReader.open(directory))
    analyzer = StandardAnalyzer()
    run(searcher, analyzer)
    del searcher
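
One likely reason the .docx search returns nothing useful is that a .docx file is a ZIP archive containing XML, so reading it as text and decoding the bytes as iso-8859-1 puts compressed data into the "contents" field rather than the document text. Below is a minimal sketch of pulling the text out with only the standard library before handing it to Lucene. The helper name `docx_to_text` is hypothetical and not part of PyLucene. It assumes the main text is stored in `word/document.xml`, ignores headers, footers and footnotes, and does not handle legacy .doc files.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: accepts a file path or a file-like object.
    """
    # A .docx file is a ZIP archive; the main body is word/document.xml
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; its visible text sits in <w:t> elements
    for para in root.iter(W_NS + "p"):
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

In IndexFiles.py, the lines that read the file and decode it as iso-8859-1 could then be replaced with `contents = docx_to_text(path)`. For the file name to be searchable by individual words as well, one option would be to index it in a tokenized field (like `t2`, but stored), since `t1` is not tokenized and would only match the file name as a whole.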

Closed by the system on December 7
Question created on November 29
