
Avoiding Duplicate Crawling with a Custom Scrapy Middleware in Python


Date: 2021-11-09 10:30  Author: admin610456


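This article walks through an example of a custom Scrapy middleware, written in Python, that avoids crawling the same item pages twice. It is shared here for reference. The code is as follows: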

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they
    were already visited before.

    The requests to be filtered have a meta['filter_visited']
    flag enabled and optionally define an id to use
    for identifying them, which defaults to the request fingerprint,
    although you'd want to use the item id,
    if you already have it beforehand, to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        # Visited ids are kept in a dict stored on the spider's 'context'.
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                # Only requests that carry the flag are checked.
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                # An item was scraped: mark the page that produced it as visited.
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # Replace the repeated request with a placeholder item.
                ret.append(MyItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        # Prefer an explicit id from meta, else fall back to the fingerprint.
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
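
To make the filtering rule easier to follow, here is a minimal sketch of the same logic that runs without Scrapy installed. It is not the middleware itself: `FakeRequest`, `visited_id`, and `filter_output` are hypothetical stand-ins introduced only for illustration, plain dicts play the role of items, and a SHA-1 of the URL is assumed as a crude substitute for Scrapy's request fingerprint.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request (url and meta only)."""
    url: str
    meta: dict = field(default_factory=dict)


def visited_id(request):
    # Prefer an explicit id from meta; otherwise hash the URL
    # (an assumption standing in for Scrapy's request fingerprint).
    explicit = request.meta.get("visited_id")
    return explicit or hashlib.sha1(request.url.encode("utf-8")).hexdigest()


def filter_output(response_request, result, visited_ids):
    """Mimic process_spider_output from the middleware above."""
    kept = []
    for x in result:
        if isinstance(x, FakeRequest):
            # Flagged requests whose id was already seen become 'old' markers.
            if "filter_visited" in x.meta and visited_id(x) in visited_ids:
                kept.append({"visit_id": visited_id(x), "visit_status": "old"})
            else:
                kept.append(x)
        elif isinstance(x, dict):
            # An item: record the id of the page that produced it.
            vid = visited_id(response_request)
            visited_ids[vid] = True
            kept.append({**x, "visit_id": vid, "visit_status": "new"})
        else:
            kept.append(x)
    return kept


visited = {}  # plays the role of spider.context['visited_ids']

# First visit: the item page yields an item, so its id is recorded.
page = FakeRequest("http://example.com/item/1",
                   {"filter_visited": True, "visited_id": "item-1"})
first = filter_output(page, [{"name": "widget"}], visited)

# Later, a listing page yields a request for the same item id.
again = FakeRequest("http://example.com/item/1?ref=list",
                    {"filter_visited": True, "visited_id": "item-1"})
second = filter_output(FakeRequest("http://example.com/list"), [again], visited)

print(first)
print(second)
```

In a real project the class would presumably be enabled through Scrapy's `SPIDER_MIDDLEWARES` setting, and spiders would opt in per request with something like `meta={'filter_visited': True, 'visited_id': item_id}`. Note also that the middleware reads `spider.context` with a `{}` default, so the spider most likely needs to define a persistent `context` dict attribute; otherwise a fresh dict is created on every call and no ids are remembered.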

I hope this article is helpful for your Python programming.

(Editor: admin)





