
Installation problems with the Python crawler framework Scrapy on Linux

Published 2016-07-12 10:44:08 · Source: linux website · Author: 江洋大盗与鸭子
My installation process:
sudo pip install scrapy
Simple enough, right?
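A quick sanity check worth doing at this point (not part of the original steps): ask the freshly installed Scrapy for its version. An old release here, one that still predates the scrapy.Spider class, is the usual cause of the error below.

scrapy version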
 
But when I ran the first spider example:
scrapy crawl dmoz
I got the following error:
AttributeError: 'module' object has no attribute 'Spider'
The solution is as follows:
http://stackoverflow.com/questions/30695866/attributeerror-module-object-has-no-attribute-spider
sudo pip install scrapy --upgrade
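Why upgrading helps: the tutorial's dmoz example is written against the newer API, where spiders subclass scrapy.Spider, while older Scrapy releases (before roughly 0.22) only shipped scrapy.spider.BaseSpider, hence the AttributeError. For reference, a minimal sketch of what the tutorial's spider file (tutorial/spiders/dmoz_spider.py) typically looks like, with the two start URLs taken from the crawl log further down:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"  # the name passed to "scrapy crawl dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # save each fetched page to a local file named after its last path segment
        filename = response.url.split("/")[-2] + ".html"
        with open(filename, "wb") as f:
            f.write(response.body)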
 
After the steps above, the problem should be solved. But then I hit a new problem: lxml would not install:
creating build/temp.linux-x86_64-2.7/src/lxml
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Isrc/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o -w
In file included from src/lxml/lxml.etree.c:320:0:
src/lxml/includes/etree_defs.h:14:31: fatal error: libxml/xmlversion.h: No such file or directory
#include "libxml/xmlversion.h"
compilation terminated.
Compile failed: command 'x86_64-linux-gnu-gcc' failed with exit status 1
creating tmp
cc -I/usr/include/libxml2 -c /tmp/xmlXPathInitM_KXBh.c -o tmp/xmlXPathInitM_KXBh.o
cc tmp/xmlXPathInitM_KXBh.o -lxml2 -o a.out
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Rolling back uninstall of lxml
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-F1ulO4/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-OMbiRQ-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-F1ulO4/lxml/
The solution is as follows:
http://stackoverflow.com/questions/5178416/pip-install-lxml-error
sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev
After installing the dependencies (python-dev supplies the Python headers, libxml2-dev and libxslt1-dev the C libraries lxml compiles against, and zlib1g-dev the compression library):
sudo pip install lxml --upgrade
This time it installed successfully. The full transcript:
 
beast@beast:~/Code/python/tutorial$ sudo pip install lxml --upgrade
The directory '/home/beast/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/beast/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
SNIMissingWarning
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Collecting lxml
Downloading lxml-3.6.0.tar.gz (3.7MB)
100% |████████████████████| 3.7MB 213kB/s 
Installing collected packages: lxml
Found existing installation: lxml 3.3.3
Uninstalling lxml-3.3.3:
Successfully uninstalled lxml-3.3.3
Running setup.py install for lxml ... done
Successfully installed lxml-3.6.0
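Two asides on this transcript. First, the permission warnings at the top are pip pointing out that sudo switched to root but kept my home directory; running it as sudo -H pip install lxml --upgrade would let pip use root's own cache instead. Second, a quick check (my sketch, version tuples illustrative) that the new lxml imports and links cleanly:

from lxml import etree
print(etree.LXML_VERSION)    # e.g. (3, 6, 0, 0)
print(etree.LIBXML_VERSION)  # the libxml2 version lxml was compiled against
print(etree.tostring(etree.HTML("<p>it works</p>")))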
 
Now the first spider example runs:
beast@beast:~/Code/python/tutorial$ scrapy crawl dmoz
/usr/local/lib/python2.7/dist-packages/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
BOT_VERSION: no longer used (user agent defaults to Scrapy now)
warnings.warn(msg, ScrapyDeprecationWarning)
2016-07-11 16:41:56 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-07-11 16:41:56 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'}
2016-07-11 16:41:56 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-07-11 16:41:56 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-11 16:41:56 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-11 16:41:56 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-11 16:41:56 [scrapy] INFO: Spider opened
2016-07-11 16:41:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-11 16:41:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-11 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-07-11 16:41:58 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-07-11 16:41:58 [scrapy] INFO: Closing spider (finished)
2016-07-11 16:41:58 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 472,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16392,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 6, 8, 41, 58, 337488),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 6, 8, 41, 56, 777087)}
2016-07-11 16:41:58 [scrapy] INFO: Spider closed (finished)
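One piece of cleanup the log suggests: the ScrapyDeprecationWarning at the top comes from a BOT_VERSION entry that old project templates put in settings.py and that Scrapy 1.x no longer uses. A sketch of the relevant part of tutorial/settings.py, with values taken from the "Overridden settings" line above (the commented-out BOT_VERSION value is hypothetical):

# tutorial/settings.py
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
USER_AGENT = 'tutorial/1.0'
# BOT_VERSION = '1.0'  # remove this line; the user agent defaults to Scrapy now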
 