Setting up Nutch on Ubuntu 16.04

Published: 2016-04-14 10:44:13 | Source: Linux website | Author: oba没有马

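Operating system: Ubuntu 16.04 LTS

Nutch version: 2.2.1

Before configuring Nutch you need to set up Ant. If you have not done that yet, see my other article on configuring Ant on Ubuntu: http://www.linuxdiyf.com/linux/19764.html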


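Next, download Nutch from the official site (http://nutch.apache.org/downloads.html). Version 2.3.1 would not build for me, though: even after switching the Maven repository, the build always hung at the following output: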

root@ubuntu:/opt/apache-nutch-2.3.1# ant runtime 
Buildfile: /opt/apache-nutch-2.3.1/build.xml 
ivy-probe-antlib: 
ivy-download: 
ivy-download-unchecked: 
ivy-init-antlib: 
ivy-init: 
init: 
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build 
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/classes 
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/release 
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/test 
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/test/classes 
clean-lib: 
resolve-default: 
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ :: 
[ivy:resolve] :: loading settings :: file = /opt/apache-nutch-2.3.1/ivy/ivysettings.xml 


So I gave up on 2.3.1 and installed Nutch 2.2.1 instead. Nutch 2.2.1 download: http://archive.apache.org/dist/nutch/2.2.1/

On Ubuntu, Firefox saves downloads to ~/Downloads by default.


1. Change to the download directory with cd ~/Downloads, then unpack the archive with tar -xvf apache-nutch-2.2.1-src.tar.gz

Then move the extracted directory to /opt with sudo mv apache-nutch-2.2.1 /opt/
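The unpack-and-move step can also be wrapped in a small script. The following is only a sketch: the function name unpack_nutch and its two arguments are made up for illustration, and it assumes the first entry in the archive is its top-level directory (apache-nutch-2.2.1 in this case).

```shell
#!/usr/bin/env bash
# Sketch only: unpack a Nutch source archive next to the download,
# then move the extracted directory to its final location.
# unpack_nutch is a hypothetical helper, not part of Nutch.
unpack_nutch() {
    local archive="$1"   # e.g. ~/Downloads/apache-nutch-2.2.1-src.tar.gz
    local dest="$2"      # e.g. /opt
    local workdir top

    # Stop early if the download is missing.
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }

    workdir="$(dirname "$archive")"
    # Assumption: the first archive entry is the top-level directory.
    top="$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1)"

    tar -xzf "$archive" -C "$workdir" || return 1
    mkdir -p "$dest" && mv "$workdir/$top" "$dest/"
}

# Example (writing to /opt normally needs root, so run the script with sudo):
# unpack_nutch ~/Downloads/apache-nutch-2.2.1-src.tar.gz /opt
```

Checking that the archive exists first avoids a half-finished move when the file name has been mistyped.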


2. Configure Nutch's MySQL support by editing ${NUTCH_HOME}/ivy/ivy.xml

First uncomment the following line:

<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Then change the following line from the default:

<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>

to:

<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/> 

Finally, uncomment the following line:

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" /> 
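The gora-core downgrade is a one-line change, so it can be scripted if you prefer. A minimal sketch follows, assuming the line in ivy.xml matches the default shown above exactly; pin_gora_core is a hypothetical helper name. The two uncommenting steps are still easier to do by hand in an editor.

```shell
#!/usr/bin/env bash
# Sketch only: change the gora-core revision from 0.3 to 0.2.1 in ivy.xml.
# pin_gora_core is a hypothetical helper, not part of Nutch.
pin_gora_core() {
    local ivy_xml="$1"   # e.g. ${NUTCH_HOME}/ivy/ivy.xml

    # Keep a backup, then rewrite only the gora-core revision attribute.
    cp "$ivy_xml" "$ivy_xml.bak" &&
    sed 's/name="gora-core" rev="0\.3"/name="gora-core" rev="0.2.1"/' \
        "$ivy_xml.bak" > "$ivy_xml"
}

# Example:
# pin_gora_core "${NUTCH_HOME}/ivy/ivy.xml"
```

The untouched original stays next to the file as ivy.xml.bak in case the build needs to be reverted.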


3. Configure the database connection. Edit ${NUTCH_HOME}/conf/gora.properties, comment out the default database connection settings, and add the following:

############################### 
# Default MySQL properties    # 
############################### 
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver 
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true 
gora.sqlstore.jdbc.user=xxxx (your MySQL username)
gora.sqlstore.jdbc.password=xxxx (your MySQL password)


4. Configure the table mapping

Edit ${NUTCH_HOME}/conf/gora-sql-mapping.xml

Change the primarykey length from 512 to 767, i.e. <primarykey column="id" length="767"/>
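This edit can be scripted in the same way. The sketch below assumes the shipped mapping file contains the element exactly as <primarykey column="id" length="512"/> with straight quotes; widen_primary_key is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: raise the primary key length from 512 to 767
# in gora-sql-mapping.xml.
# widen_primary_key is a hypothetical helper, not part of Nutch.
widen_primary_key() {
    local mapping="$1"   # e.g. ${NUTCH_HOME}/conf/gora-sql-mapping.xml

    # Keep a backup, then rewrite only the id primary key length.
    cp "$mapping" "$mapping.bak" &&
    sed 's/<primarykey column="id" length="512"/<primarykey column="id" length="767"/' \
        "$mapping.bak" > "$mapping"
}

# Example:
# widen_primary_key "${NUTCH_HOME}/conf/gora-sql-mapping.xml"
```

If the file uses different attribute order or spacing, the pattern will not match and the file is left unchanged, so check the result with grep primarykey afterwards.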


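5. Edit the nutch-site.xml configuration file

In ${NUTCH_HOME}/conf you can simply save nutch-default.xml as nutch-site.xml with sudo cp nutch-default.xml nutch-site.xml

Then run sudo gedit nutch-site.xml and add the following just before the closing </configuration> tag: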

<property> 
<name>http.agent.name</name> 
<value>YourNutchSpider</value> 
</property> 
<property> 
<name>http.accept.language</name> 
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value> 
<description>Value of the Accept-Language request header field. 
This allows selecting non-English language as default one to retrieve. 
It is a useful setting for search engines build for certain national group. 
</description> 
</property> 
<property> 
<name>storage.data.store.class</name> 
<value>org.apache.gora.sql.store.SqlStore</value> 
<description>The Gora DataStore class for storing and retrieving data. 
Currently the following stores are available:. 
</description> 
</property>  
<property> 
<name>parser.character.encoding.default</name> 
<value>utf-8</value> 
<description>The character encoding to fall back to when no other information 
is available</description> 
</property> 
<property> 
<name>generate.batch.id</name> 
<value>*</value> 
</property> 


6. Build with Ant

Change to the Nutch directory

cd ${NUTCH_HOME} 
ant runtime 


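Problems you may run into:

1) Insufficient permissions: creating directories such as build fails. Switch to a root shell with sudo -i and run the Ant build again.


2) The build prints: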
Trying to override old definition of task javac [taskdef]  
Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. 

First download sonar-ant-task-2.2.jar (https://repo1.maven.org/maven2/org/codehaus/sonar-plugins/sonar-ant-task/2.2/sonar-ant-task-2.2.jar) and copy it to the ${NUTCH_HOME}/lib directory

Then open the build file with sudo gedit ${NUTCH_HOME}/build.xml

Press Ctrl+F to open search and enter antlib:org.sonar.ant to find the block below. Add the classpath element containing the fileset (the last classpath line here) to it:

<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
<classpath path="${ant.library.dir}" />
<classpath path="${mysql.library.dir}" />
<classpath><fileset dir="lib/" includes="sonar*.jar" /></classpath>
</taskdef>


3) BUILD FAILED, with output such as:

[ivy:resolve]         :: com.google.code.findbugs#jsr305;1.3.9!jsr305.jar 
[ivy:resolve]         :::::::::::::::::::::::::::::::::::::::::::::: 
[ivy:resolve]  
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS 
BUILD FAILED 
/opt/apache-nutch-2.2.1/build.xml:444: impossible to resolve dependencies: 
resolve failed - see output for details 

If this or any other dependency problem causes BUILD FAILED, changing the Maven central repository address can fix it.
Run sudo gedit ${NUTCH_HOME}/ivy/ivysettings.xml and find the following:
<property name="repo.maven.org" 
value="http://repo1.maven.org/maven2/" 
override="false"/> 

Replace the Maven central repository address http://repo1.maven.org/maven2/ with the mirror provided by OSC in China: http://maven.oschina.net/content/groups/public/
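If you end up trying several mirrors, the swap can be scripted. This is a sketch under stated assumptions: use_maven_mirror is a hypothetical helper, the settings file still contains the default http://repo1.maven.org/maven2/ address, and the mirror URL contains no | or & characters (both are special inside the sed expression).

```shell
#!/usr/bin/env bash
# Sketch only: point ivysettings.xml at a different Maven repository.
# use_maven_mirror is a hypothetical helper, not part of Nutch or Ivy.
use_maven_mirror() {
    local settings="$1"  # e.g. ${NUTCH_HOME}/ivy/ivysettings.xml
    local mirror="$2"    # replacement repository URL, ending in /

    # Keep a backup, then replace the default central repository address.
    cp "$settings" "$settings.bak" &&
    sed "s|http://repo1\.maven\.org/maven2/|$mirror|" \
        "$settings.bak" > "$settings"
}

# Example:
# use_maven_mirror "${NUTCH_HOME}/ivy/ivysettings.xml" \
#     "http://maven.oschina.net/content/groups/public/"
```

To try another mirror later, restore ivysettings.xml.bak first so that the default address is present again.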


4) The build hangs at the following output

resolve-default: 
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ :: 
[ivy:resolve] :: loading settings :: file = /opt/apache-nutch-2.3.1/ivy/ivysettings.xml 

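Solution: be patient, because loading takes time. If nothing happens for more than 10 minutes, give up and switch to another Maven repository (see problem 3).

A build usually takes about half an hour. Here is a screenshot of my successful build:

[Screenshot: successful build of Nutch on Ubuntu 16.04]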


7. Test crawl

7.1. Set the site to crawl

cd ${NUTCH_HOME}/runtime/local 
sudo mkdir -p urls 

cd urls 
sudo gedit seed.txt 

Enter a site URL in seed.txt, for example http://blog.csdn.net/u010317005

Then save the file and close gedit (Ctrl+S). The :wq command applies only if you edit the file in vi.
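The seed file can also be written without opening an editor. A minimal sketch, where write_seed is a hypothetical helper name and the directory must be writable by the user running the script:

```shell
#!/usr/bin/env bash
# Sketch only: create the urls directory and write a one-line seed list.
# write_seed is a hypothetical helper, not part of Nutch.
write_seed() {
    local urls_dir="$1"  # e.g. ${NUTCH_HOME}/runtime/local/urls
    local url="$2"       # the site to start crawling from

    mkdir -p "$urls_dir" && printf '%s\n' "$url" > "$urls_dir/seed.txt"
}

# Example:
# write_seed "${NUTCH_HOME}/runtime/local/urls" "http://blog.csdn.net/u010317005"
```

The file holds one URL per line, so further sites can be appended with >> instead of >.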

7.2. Run the crawler

Go back up to ${NUTCH_HOME}/runtime/local with cd .. and run:

bin/nutch crawl urls -depth 3 -topN 5


Permanent link to this article: http://www.linuxdiyf.com/linux/19765.html