Nutch+MongoDB+ElasticSearch+Kibana setup: exception during the inject operation

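I am setting up a Nutch+MongoDB+ElasticSearch+Kibana environment on Linux. Nutch was compiled from the apache-nutch-2.3.1-src.tar.gz source.
I followed http://blog.csdn.net/github_27609763/article/details/50597427 for the setup,
but running ./bin/nutch inject urls/ fails with an error. Any help would be greatly appreciated.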

The configuration is as follows. nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Hist Crawler</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(httphttpclient)urlfilter-regexindex-(basicmore)query-(basicsiteurllang)indexer-elasticnutch-extensionpointsparse-(texthtmlmsexcelmswordmspowerpointpdf)summary-basicscoring-opicurlnormalizer-(passregexbasic)parse-(htmltikametatags)index-(basicanchormoremetadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>hist</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
</configuration>

regex-urlfilter.txt is configured as follows:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
# +.

In addition, seed.txt under the urls directory contains the following (shown with cat):

[root@jdu4e00u53f7 urls]# pwd
/chen/nutch/runtime/local/urls
[root@jdu4e00u53f7 urls]# cat seed.txt 
http://blog.csdn.net/
[root@jdu4e00u53f7 urls]#
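
A side note that is separate from the exception reported in this question: the header comment of regex-urlfilter.txt says the first matching pattern decides and that a URL matching no pattern is ignored. The sketch below replays the rules from that file against the seed URL in Python. It assumes "matching" means the pattern is found anywhere in the URL (hence `re.search`) and that Python's regex syntax behaves like Nutch's for these particular patterns; the function name `accepts` is made up for illustration. Under those assumptions the only `+` rule accepts nutch.apache.org, so http://blog.csdn.net/ would not pass the filter wherever it is applied.

```python
import re

# Rules copied from the regex-urlfilter.txt shown above, in file order.
# "-" means reject, "+" means accept.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.)*nutch.apache.org/"),
]


def accepts(url: str) -> bool:
    """Return True if the first rule that matches `url` is a '+' rule.

    Assumption: a rule matches when its pattern is found anywhere in the URL.
    A URL that matches no rule is ignored, as the file's comment states.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ("http://nutch.apache.org/", "http://blog.csdn.net/"):
        print(url, "accepted" if accepts(url) else "ignored")
```

If the seed is meant to be crawled, the accept rule would presumably need to cover blog.csdn.net (or the commented-out `+.` rule would need to be enabled).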

Finally, the error output is as follows:

2017-09-25 23:35:17,648 INFO  crawl.InjectorJob - InjectorJob: starting at 2017-09-25 23:35:17
2017-09-25 23:35:17,649 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2017-09-25 23:35:18,058 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-09-25 23:35:19,115 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2017-09-25 23:35:20,006 WARN  conf.Configuration - file:/tmp/hadoop-root/mapred/staging/root1639902035/.staging/job_local1639902035_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-09-25 23:35:20,009 WARN  conf.Configuration - file:/tmp/hadoop-root/mapred/staging/root1639902035/.staging/job_local1639902035_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-09-25 23:35:20,172 WARN  conf.Configuration - file:/tmp/hadoop-root/mapred/local/localRunner/root/job_local1639902035_0001/job_local1639902035_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-09-25 23:35:20,175 WARN  conf.Configuration - file:/tmp/hadoop-root/mapred/local/localRunner/root/job_local1639902035_0001/job_local1639902035_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-09-25 23:35:20,504 WARN  mapred.LocalJobRunner - job_local1639902035_0001
java.lang.Exception: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
    at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:141)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:94)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2017-09-25 23:35:21,198 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local1639902035_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)

1 answer

西西里小胖 2021-06-18 10:08
I ran into the same problem and eventually found that the plugin.includes property was written incorrectly.

I changed it to:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

After that it worked.
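
A minimal sketch of why the `|` separators matter, assuming Nutch treats plugin.includes as a single regular expression that must match a whole plugin id (an assumption here, not something stated in this thread). The two values are copied from the question and from this answer; the helper name `included` is hypothetical. Without the separators the value becomes one long literal that no plugin id can match, which fits the "x point org.apache.nutch.net.URLNormalizer not found" error.

```python
import re

# plugin.includes as posted in the question (no "|" separators).
BROKEN = (
    "protocol-(httphttpclient)urlfilter-regexindex-(basicmore)"
    "query-(basicsiteurllang)indexer-elasticnutch-extensionpoints"
    "parse-(texthtmlmsexcelmswordmspowerpointpdf)summary-basicscoring-opic"
    "urlnormalizer-(passregexbasic)parse-(htmltikametatags)"
    "index-(basicanchormoremetadata)"
)

# plugin.includes as given in this answer.
ANSWER = (
    "protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|anchor)"
    "|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)


def included(plugin_includes: str, plugin_id: str) -> bool:
    """Return True if the whole plugin id matches the plugin.includes regex.

    Assumption: full-match semantics, modelled here with re.fullmatch.
    """
    return re.fullmatch(plugin_includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("urlnormalizer-basic", "protocol-http", "indexer-elastic"):
        print(
            plugin_id,
            "| question value:", included(BROKEN, plugin_id),
            "| answer value:", included(ANSWER, plugin_id),
        )
```

Under the same assumption, the value in this answer does not match `indexer-elastic`, so a setup that indexes into ElasticSearch would presumably still need that id added as another `|` alternative.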