Scrapy 2.3 Advanced Customization

Updated 2021-06-10 17:20

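Because Scrapy uses the stdlib logging module, you can customize logging using all the features of stdlib logging.

For example, suppose you are scraping a website that returns many HTTP 404 and 500 responses, and you want to hide all messages like this: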

2016-12-16 22:00:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring
response <500 http://quotes.toscrape.com/page/1-34/>: HTTP status code
is not handled or not allowed

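The first thing to note is the logger name, which appears in brackets: [scrapy.spidermiddlewares.httperror]. If you get just [scrapy], then LOG_SHORT_NAMES is likely set to True; set it to False and re-run the crawl.

Next, we can see that the message has INFO level. To hide it, we should set the logging level for scrapy.spidermiddlewares.httperror higher than INFO; the next level after INFO is WARNING. This could be done, for example, in the spider's __init__ method: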
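As a minimal sketch of that check, assuming the project keeps its settings in the usual settings.py module (the file location is an assumption; LOG_SHORT_NAMES is the setting named above), full logger names can be kept in the output like this:

```python
# settings.py (assumed location of the project settings)

# Keep the full logger name, e.g. [scrapy.spidermiddlewares.httperror],
# instead of the shortened [scrapy] prefix.
LOG_SHORT_NAMES = False
```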

import logging
import scrapy


class MySpider(scrapy.Spider):
    # ...
    def __init__(self, *args, **kwargs):
        # Raise the threshold for this one logger so its INFO messages are dropped.
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)

If you run this spider again, INFO messages from the scrapy.spidermiddlewares.httperror logger will be gone.
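The effect can be checked without running a crawl. The following is a rough sketch using only the stdlib logging module: the in-memory handler, the format string, and the sample messages are illustrative assumptions, not Scrapy's own configuration. It shows that raising the level on one named logger drops its INFO records while other loggers are unaffected.

```python
import io
import logging

# Capture log output in memory so the effect is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.DEBUG)

# Same call as in the spider: only this logger gets a higher threshold.
noisy = logging.getLogger("scrapy.spidermiddlewares.httperror")
noisy.setLevel(logging.WARNING)

# INFO is below WARNING, so this record is dropped.
noisy.info("Ignoring response <500 http://quotes.toscrape.com/page/1-34/>")
# WARNING meets the threshold, so this record is kept.
noisy.warning("A warning from the same logger")
# A different logger still inherits the root level, so INFO is kept.
logging.getLogger("scrapy.core.engine").info("Spider opened")

output = stream.getvalue()
print(output)
```

Because the level is set on the logger itself rather than on a handler, the change applies only to records created through that logger name; WARNING and higher messages from it still appear.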
