Pyspider 中采用了 tornado 库来做 http 请求,在请求过程中可以添加各种参数,例如请求链接超时时间,请求传输数据超时时间,请求头等等,但是根据pyspider的原始框架,给爬虫添加参数只能通过 crawl_config这个Python字典来完成(如下所示),框架代码将这个字典中的参数转换成 task 数据,进行http请求。这个参数的缺点是不方便给每一次请求做随机请求头。
crawl_config = { "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36", "timeout": 120, "connect_timeout": 60, "retries": 5, "fetch_type": 'js', "auto_recrawl": True, }
这里写出给爬虫添加随机请求头的方法:
1、编写脚本,将脚本放置在 pyspider 的 libs 文件夹下,命名为 header_switch.py
#!/usr/bin/env python # -*- coding:utf-8 -*- # Created on 2017-10-18 11:52:26 import random import time class HeadersSelector(object): """ Header 中缺少几个字段 Host 和 Cookie """ headers_1 = { "Proxy-Connection": "keep-alive", "Pragma": "no-cache", "Cache-Control": "no-cache", "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", "DNT": "1", "Accept-Encoding": "gzip, deflate, sdch", "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4", "Referer": "https://www.baidu.com/s", "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7", } # 网上找的浏览器 headers_2 = { "Proxy-Connection": "keep-alive", "Pragma": "no-cache", "Cache-Control": "no-cache", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0", "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*", "DNT": "1", "Referer": "https://www.baidu.com/link", "Accept-Encoding": "gzip, deflate, sdch", "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4", } # window 7 系统浏览器 headers_3 = { "Proxy-Connection": "keep-alive", "Pragma": "no-cache", "Cache-Control": "no-cache", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0", "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*", "DNT": "1", "Referer": "https://www.baidu.com/s", "Accept-Encoding": "gzip, deflate, sdch", "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6", } # Linux 系统 firefox 浏览器 headers_4 = { "Proxy-Connection": "keep-alive", "Pragma": "no-cache", "Cache-Control": "no-cache", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0", "Accept": "*/*", "DNT": "1", "Referer": "https://www.baidu.com/link", "Accept-Encoding": "gzip, deflate, sdch", "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6", } # Win10 系统 firefox 浏览器 headers_5 = { "Connection": "keep-alive", "Pragma": "no-cache", "Cache-Control": "no-cache", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64;) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", "Referer": "https://www.baidu.com/link", "Accept-Encoding": "gzip, deflate, sdch", "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6", "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7", } # Win10 系统 Chrome 浏览器 headers_6 = { "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", "Accept-Encoding": "gzip, deflate, sdch", "Accept-Language": "zh-CN,zh;q=0.8", "Pragma": "no-cache", "Cache-Control": "no-cache", "Connection": "keep-alive", "DNT": "1", "Referer": "https://www.baidu.com/s", "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0", } # win10 系统浏览器 def __init__(self): pass def select_header(self): n = random.randint(1, 6) switch={ 1: self.headers_1 2: self.headers_2 3: self.headers_3 4: self.headers_4 5: self.headers_5 6: self.headers_6 } headers = switch[n] return headers
其中,我只写了6个请求头,如果爬虫的量非常大,完全可以写更多的请求头,甚至上百个,然后将 random的随机范围扩大,进行选择。
2、在pyspider 脚本中编写如下代码:
#!/usr/bin/env python # -*- encoding: utf-8 -*- # Created on 2017-08-18 11:52:26 from pyspider.libs.base_handler import * from pyspider.addings.headers_switch import HeadersSelector import sys defaultencoding = 'utf-8' if sys.getdefaultencoding() != defaultencoding: reload(sys) sys.setdefaultencoding(defaultencoding) class Handler(BaseHandler): crawl_config = { "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36", "timeout": 120, "connect_timeout": 60, "retries": 5, "fetch_type": 'js', "auto_recrawl": True, } @every(minutes=24 * 60) def on_start(self): header_slt = HeadersSelector() header = header_slt.select_header() # 获取一个新的 header # header["X-Requested-With"] = "XMLHttpRequest" orig_href = 'http://sww.bjxch.gov.cn/gggs.html' self.crawl(orig_href, callback=self.index_page, headers=header) # 请求头必须写在 crawl 里,cookies 从 response.cookies 中找 @config(age=24 * 60 * 60) def index_page(self, response): header_slt = HeadersSelector() header = header_slt.select_header() # 获取一个新的 header # header["X-Requested-With"] = "XMLHttpRequest" if response.cookies: header["Cookies"] = response.cookies
其中最重要的就是在每个回调函数 on_start,index_page 等等 当中,每次调用时,都会实例化一个 header 选择器,给每一次请求添加不一样的 header。要注意添加的如下代码:
header_slt = HeadersSelector() header = header_slt.select_header() # 获取一个新的 header # header["X-Requested-With"] = "XMLHttpRequest" header["Host"] = "www.baidu.com" if response.cookies: header["Cookies"] = response.cookies
当使用 XHR 发送 AJAX 请求时会带上 Header,常被用来判断是不是 Ajax 请求, headers 要添加 {‘X-Requested-With': ‘XMLHttpRequest'} 才能抓取到内容。
确定了 url 也就确定了请求头中的 Host,需要按需添加,urlparse包里给出了根据 url解析出 host的方法函数,直接调用netloc即可。
如果响应中有 cookie,就需要将 cookie 添加到请求头中。
如果还有别的伪装需求,自行添加。
如此即可实现随机请求头,完。
以上这篇Pyspider中给爬虫伪造随机请求头的实例就是小编分享给大家的全部内容了,希望能给大家一个参考,也希望大家多多支持。
免责声明:本站资源来自互联网收集,仅供用于学习和交流,请遵循相关法律法规,本站一切资源不代表本站立场,如有侵权、后门、不妥请联系本站删除!
稳了!魔兽国服回归的3条重磅消息!官宣时间再确认!
昨天有一位朋友在大神群里分享,自己亚服账号被封号之后居然弹出了国服的封号信息对话框。
这里面让他访问的是一个国服的战网网址,com.cn和后面的zh都非常明白地表明这就是国服战网。
而他在复制这个网址并且进行登录之后,确实是网易的网址,也就是我们熟悉的停服之后国服发布的暴雪游戏产品运营到期开放退款的说明。这是一件比较奇怪的事情,因为以前都没有出现这样的情况,现在突然提示跳转到国服战网的网址,是不是说明了简体中文客户端已经开始进行更新了呢?
更新日志
- 凤飞飞《我们的主题曲》飞跃制作[正版原抓WAV+CUE]
- 刘嘉亮《亮情歌2》[WAV+CUE][1G]
- 红馆40·谭咏麟《歌者恋歌浓情30年演唱会》3CD[低速原抓WAV+CUE][1.8G]
- 刘纬武《睡眠宝宝竖琴童谣 吉卜力工作室 白噪音安抚》[320K/MP3][193.25MB]
- 【轻音乐】曼托凡尼乐团《精选辑》2CD.1998[FLAC+CUE整轨]
- 邝美云《心中有爱》1989年香港DMIJP版1MTO东芝首版[WAV+CUE]
- 群星《情叹-发烧女声DSD》天籁女声发烧碟[WAV+CUE]
- 刘纬武《睡眠宝宝竖琴童谣 吉卜力工作室 白噪音安抚》[FLAC/分轨][748.03MB]
- 理想混蛋《Origin Sessions》[320K/MP3][37.47MB]
- 公馆青少年《我其实一点都不酷》[320K/MP3][78.78MB]
- 群星《情叹-发烧男声DSD》最值得珍藏的完美男声[WAV+CUE]
- 群星《国韵飘香·贵妃醉酒HQCD黑胶王》2CD[WAV]
- 卫兰《DAUGHTER》【低速原抓WAV+CUE】
- 公馆青少年《我其实一点都不酷》[FLAC/分轨][398.22MB]
- ZWEI《迟暮的花 (Explicit)》[320K/MP3][57.16MB]