
Installation problems with the Python crawler framework Scrapy on Linux

Published 2016-07-12 10:44:08 · Source: linux website · Author: 江洋大盗与鸭子
My installation process:
sudo pip install scrapy
Simple enough, right?
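A quick sanity check worth doing at this point (not part of the original steps): ask the freshly installed Scrapy for its version. An old release here, one that still predates the scrapy.Spider class, is the usual cause of the error below.

scrapy version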
 
But when I ran the first spider example:
scrapy crawl dmoz
I got the following error:
AttributeError: 'module' object has no attribute 'Spider'
The solution is as follows:
http://stackoverflow.com/questions/30695866/attributeerror-module-object-has-no-attribute-spider
sudo pip install scrapy --upgrade
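Why upgrading helps: the tutorial's dmoz example is written against the newer API, where spiders subclass scrapy.Spider, while older Scrapy releases (before roughly 0.22) only shipped scrapy.spider.BaseSpider, hence the AttributeError. For reference, a minimal sketch of what the tutorial's spider file (tutorial/spiders/dmoz_spider.py) typically looks like, with the two start URLs taken from the crawl log further down:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"  # the name passed to "scrapy crawl dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # save each fetched page to a local file named after its last path segment
        filename = response.url.split("/")[-2] + ".html"
        with open(filename, "wb") as f:
            f.write(response.body)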
 
After the steps above, the problem should be solved. But then I hit a new problem: lxml would not install:
creating build/temp.linux-x86_64-2.7/src/lxml
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Isrc/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o -w
In file included from src/lxml/lxml.etree.c:320:0:
src/lxml/includes/etree_defs.h:14:31: fatal error: libxml/xmlversion.h: No such file or directory
#include "libxml/xmlversion.h"
compilation terminated.
Compile failed: command 'x86_64-linux-gnu-gcc' failed with exit status 1
creating tmp
cc -I/usr/include/libxml2 -c /tmp/xmlXPathInitM_KXBh.c -o tmp/xmlXPathInitM_KXBh.o
cc tmp/xmlXPathInitM_KXBh.o -lxml2 -o a.out
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Rolling back uninstall of lxml
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-F1ulO4/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-OMbiRQ-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-F1ulO4/lxml/
The solution is as follows:
http://stackoverflow.com/questions/5178416/pip-install-lxml-error
sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev
After installing the dependencies (python-dev supplies the Python headers, libxml2-dev and libxslt1-dev the C libraries lxml compiles against, and zlib1g-dev the compression library):
sudo pip install lxml --upgrade
This time it installed successfully. The full transcript:
 
beast@beast:~/Code/python/tutorial$ sudo pip install lxml --upgrade
The directory '/home/beast/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/beast/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
SNIMissingWarning
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Collecting lxml
Downloading lxml-3.6.0.tar.gz (3.7MB)
100% |████████████████████| 3.7MB 213kB/s 
Installing collected packages: lxml
Found existing installation: lxml 3.3.3
Uninstalling lxml-3.3.3:
Successfully uninstalled lxml-3.3.3
Running setup.py install for lxml ... done
Successfully installed lxml-3.6.0
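Two asides on this transcript. First, the permission warnings at the top are pip pointing out that sudo switched to root but kept my home directory; running it as sudo -H pip install lxml --upgrade would let pip use root's own cache instead. Second, a quick check (my sketch, version tuples illustrative) that the new lxml imports and links cleanly:

from lxml import etree
print(etree.LXML_VERSION)    # e.g. (3, 6, 0, 0)
print(etree.LIBXML_VERSION)  # the libxml2 version lxml was compiled against
print(etree.tostring(etree.HTML("<p>it works</p>")))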
 
Now the first spider example runs:
beast@beast:~/Code/python/tutorial$ scrapy crawl dmoz
/usr/local/lib/python2.7/dist-packages/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
BOT_VERSION: no longer used (user agent defaults to Scrapy now)
warnings.warn(msg, ScrapyDeprecationWarning)
2016-07-11 16:41:56 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-07-11 16:41:56 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'}
2016-07-11 16:41:56 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-07-11 16:41:56 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-11 16:41:56 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-11 16:41:56 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-11 16:41:56 [scrapy] INFO: Spider opened
2016-07-11 16:41:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-11 16:41:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-11 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-07-11 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-07-11 16:41:58 [scrapy] INFO: Closing spider (finished)
2016-07-11 16:41:58 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 472,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16392,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 6, 8, 41, 58, 337488),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 6, 8, 41, 56, 777087)}
2016-07-11 16:41:58 [scrapy] INFO: Spider closed (finished)
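One piece of cleanup the log suggests: the ScrapyDeprecationWarning at the top comes from a BOT_VERSION entry that old project templates put in settings.py and that Scrapy 1.x no longer uses. A sketch of the relevant part of tutorial/settings.py, with values taken from the "Overridden settings" line above (the commented-out BOT_VERSION value is hypothetical):

# tutorial/settings.py
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
USER_AGENT = 'tutorial/1.0'
# BOT_VERSION = '1.0'  # remove this line; the user agent defaults to Scrapy now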
 