Crawling with selenium + python + firefox on a headless Linux server

Published: 2016-04-06 10:24:07  Source: Linux website  Author: piaotiejun

Environment

Ubuntu 12.04 64-bit (Azure virtual machine)
Python 2.7.6
selenium 2.46.0


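Environment setup:

Install Xvfb, an X virtual framebuffer. Browsers such as firefox need a GUI environment, and Xvfb provides one on a machine without a monitor.
sudo apt-get install xvfb
Run Xvfb:
Xvfb can be started on a numbered display. Choosing the number explicitly avoids conflicts if you add more displays later. Here we use display 10 (any other free number works); -ac turns off Xvfb's access control.
sudo Xvfb :10 -ac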
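
If the crawler should bring up the virtual display itself, the same command can be launched from Python. The following is only a minimal sketch: the helper names xvfb_command and start_xvfb are made up for illustration, and it assumes the user running the script is allowed to start Xvfb without sudo.

```python
import os
import subprocess


def xvfb_command(display_number=10, disable_access_control=True):
    """Build the argument list for starting Xvfb on a given display."""
    cmd = ["Xvfb", ":{0}".format(display_number)]
    if disable_access_control:
        cmd.append("-ac")  # same flag as in the shell command above
    return cmd


def start_xvfb(display_number=10):
    """Start Xvfb in the background and point DISPLAY at it.

    Returns the Popen handle so the caller can terminate() it later.
    """
    process = subprocess.Popen(xvfb_command(display_number))
    # Programs started from this process afterwards inherit DISPLAY.
    os.environ["DISPLAY"] = ":{0}".format(display_number)
    return process
```

Calling start_xvfb() before creating the browser, and terminate() on the returned handle afterwards, would keep the display's lifetime tied to the script.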


Starting Xvfb printed the following errors:
[dix] Could not init font path element /usr/share/fonts/X11/cyrillic, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/100dpi/:unscaled, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/75dpi/:unscaled, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/Type1, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/100dpi, removing from list!
[dix] Could not init font path element /usr/share/fonts/X11/75dpi, removing from list!


This means Xvfb could not find some of its fonts. cyrillic, 100dpi and 75dpi are all X11 font sets. On Ubuntu, install them with:
sudo apt-get install xfonts-base xfonts-cyrillic xfonts-100dpi xfonts-75dpi
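
To check which font directories Xvfb actually complained about, the log can be parsed with a few lines of Python. This is a rough sketch: the function name missing_font_dirs is hypothetical, and the pattern is based only on the messages shown above.

```python
import re

# Matches lines such as:
# [dix] Could not init font path element /usr/share/fonts/X11/75dpi, removing from list!
_FONT_ERROR = re.compile(
    r"Could not init font path element (\S+), removing from list!"
)


def missing_font_dirs(log_text):
    """Return the font directories named in the errors, without duplicates."""
    dirs = []
    for match in _FONT_ERROR.finditer(log_text):
        path = match.group(1)
        # Drop the ":unscaled" suffix so both variants map to one directory.
        if path.endswith(":unscaled"):
            path = path[:-len(":unscaled")]
        path = path.rstrip("/")
        if path not in dirs:
            dirs.append(path)
    return dirs
```

If the list is empty after installing the font packages and restarting Xvfb, the font problem is resolved.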
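defoma is a font management program, and x-ttcidfont-conf is a TrueType font configuration tool that automatically generates the fonts.dir and fonts.scale files. TrueType is an outline font format (developed by Apple and adopted by Microsoft). On Ubuntu, install both with:
sudo apt-get install defoma x-ttcidfont-conf
With the font problem solved, test Xvfb by running:
sudo Xvfb :10 -ac &
ps -A | grep Xvfb
The output should contain a line such as 62192 pts/1 00:00:00 Xvfb, which shows that Xvfb is running. As long as the process is listed, other messages printed by Xvfb can be ignored.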
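
The same check can be done from Python before the browser is started. The sketch below is an illustration only; process_listed and xvfb_is_running are made-up names, and unlike grep the match is on the exact command name in the last column of the ps output.

```python
import subprocess


def process_listed(ps_output, name="Xvfb"):
    """Return True if a line of `ps -A` output has `name` as its command."""
    for line in ps_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == name:
            return True
    return False


def xvfb_is_running():
    """Run `ps -A` and look for an Xvfb entry, similar to `ps -A | grep Xvfb`."""
    output = subprocess.check_output(["ps", "-A"])
    return process_listed(output.decode("utf-8", "replace"))
```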


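Install firefox
sudo apt-get install firefox
Running firefox produced the following error:
(firefox:61533): Gtk-WARNING **: Locale not supported by C library.
Using the fallback ‘C’ locale.
This is a locale problem. Fix it as follows:
sudo vim /etc/profile
Add the line export LC_ALL=en_US.UTF-8
Then run:
sudo locale-gen en_US.UTF-8
source /etc/profile
With the locale fixed, and with DISPLAY pointing at the Xvfb display (export DISPLAY=:10), run:
firefox
Firefox now starts and stays attached to the terminal. Error messages printed there can be ignored.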
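
Instead of editing /etc/profile, the two settings the browser depends on can also be passed per process. The following sketch assumes the display number 10 and the en_US.UTF-8 locale used in this article; browser_env is a hypothetical helper name.

```python
import os


def browser_env(display=":10", locale="en_US.UTF-8", base_env=None):
    """Return a copy of the environment with DISPLAY and LC_ALL set.

    base_env defaults to the current process environment; it is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display  # the Xvfb display started earlier
    env["LC_ALL"] = locale    # the locale must already be generated on the system
    return env
```

The result can be passed to subprocess.Popen(["firefox"], env=browser_env()), which leaves the settings of the login shell untouched.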


Install selenium
sudo pip install selenium


Python code
Fetch the original URLs of Baidu image search results for a keyword

#!/usr/bin/python
# coding=utf8
# Written for Python 2.7 and selenium 2.46.0 (the versions listed above).
# Xvfb must be running and DISPLAY must point at it, e.g. export DISPLAY=:10.

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


class baidu_image_spider(object):
    def __init__(self, url="http://image.baidu.com"):
        self.driver = webdriver.Firefox()
        self.driver.get(url)
        # Submit a placeholder search first; get_url_list() replaces it later.
        elem = self.driver.find_element_by_name("word")
        elem.send_keys(unicode("query", "utf8"))
        elem.send_keys(Keys.RETURN)

    def get_url_list(self, query):
        # Search for the real keyword.
        elem = self.driver.find_element_by_name("word")
        elem.clear()
        elem.send_keys(unicode(query, "utf8"))
        elem.send_keys(Keys.RETURN)
        js_down = "var q=document.documentElement.scrollTop=100000"
        pagemore = self.driver.find_element_by_id("pageMore")
        # Keep triggering "load more"; stop after 5 consecutive failures.
        error_cnt = 0
        while error_cnt < 5:
            try:
                pagemore.send_keys(Keys.RETURN)
                error_cnt = 0
            except Exception:
                # Scroll to the bottom, wait a moment, then retry.
                self.driver.execute_script(js_down)
                time.sleep(1)
                error_cnt += 1

        # The original image URL is read from each item's data-objurl attribute.
        urls = []
        for item in self.driver.find_elements_by_class_name("imgitem"):
            url = item.get_attribute("data-objurl")
            if url:  # skip items without the attribute
                urls.append(url)

        # Print the URLs and save them one per line.
        with open("url_list", "w") as fp:
            for url in urls:
                print(url)
                fp.write(url + "\n")
        print(len(urls))
        return urls


bs = baidu_image_spider()
bs.get_url_list("money")
bs.driver.quit()  # close the browser when done
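
The "load more" loop in get_url_list follows a general pattern: repeat an action, and give up only after several failures in a row. A possible way to isolate that pattern so it can be tried without a browser is sketched below; repeat_until_failures is a hypothetical name and not part of selenium.

```python
def repeat_until_failures(action, on_failure, max_failures=5):
    """Call action() repeatedly until it fails max_failures times in a row.

    on_failure() runs after every failed attempt (for example: scroll down
    and sleep). Returns the number of successful calls.
    """
    successes = 0
    failures = 0
    while failures < max_failures:
        try:
            action()
            successes += 1
            failures = 0  # a success resets the consecutive-failure count
        except Exception:
            on_failure()
            failures += 1
    return successes
```

In the spider, action would press RETURN on the pageMore element and on_failure would run the scroll script and sleep for a second. Note that, like the original loop, this only ends once the action keeps failing.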


Permanent link to this article: http://www.linuxdiyf.com/linux/19543.html