Scrapy 2.3: Data Scraping Example

Updated 2021-06-02 11:18

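Now that you know how to extract data from pages, let's see how to follow links from them.

The first step is to extract the link to the page we want to follow. Examining our page, we can see a link to the next page with the following markup: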

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

There is also an attrib property available (see "Selecting element attributes" for more):

>>> response.css('li.next a').attrib['href']
'/page/2/'

Now let's see our spider modified to recursively follow the link to the next page, extracting data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

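Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL with the urljoin() method (since the link can be relative), and yields a new request to the next page, registering itself as the callback to handle data extraction for that page and keep the crawl going through all the pages.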
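
To see why the join step matters, the resolution can be reproduced with the standard library. This is a sketch that assumes `response.urljoin()` behaves roughly like `urllib.parse.urljoin` with the page URL as the base; the `page_url` value and the extra sample links are illustrative.

```python
from urllib.parse import urljoin

# Illustrative URL of the page the link was found on.
page_url = "http://quotes.toscrape.com/page/1/"

# Root-relative link, as in the pager markup.
absolute_next = urljoin(page_url, "/page/2/")
# Path-relative link, resolved against the current path.
relative_next = urljoin(page_url, "../2/")
# An already absolute link comes back unchanged.
unchanged = urljoin(page_url, "http://quotes.toscrape.com/page/3/")

print(absolute_next)
print(relative_next)
print(unchanged)
```

The first two calls produce the same absolute URL, which is the form a request needs.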

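What you see here is Scrapy's mechanism for following links: when you yield a request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when the request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following the link to the next page until it doesn't find one. This is handy for crawling blogs, forums, and other sites with pagination.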
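
The loop described above can be illustrated with a toy model in plain Python. This is not how Scrapy's engine is implemented (the real one is asynchronous and also filters duplicate requests); `FAKE_SITE`, `parse`, and `crawl` are hypothetical names used only to show how yielding a request from a callback keeps the crawl going until no next link is found.

```python
from collections import deque

# Hypothetical in-memory "site": each page lists its quotes and next link.
FAKE_SITE = {
    "/page/1/": {"quotes": ["q1", "q2"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["q3"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["q4"], "next": None},
}


def parse(url, page):
    """Toy callback: yield items (dicts) and follow-up requests (tuples)."""
    for text in page["quotes"]:
        yield {"text": text, "page": url}
    if page["next"] is not None:
        # A "request" in this model is just (url, callback).
        yield (page["next"], parse)


def crawl(start_url):
    """Toy engine: run each callback and queue any requests it yields."""
    pending = deque([(start_url, parse)])
    items = []
    while pending:
        url, callback = pending.popleft()
        for result in callback(url, FAKE_SITE[url]):
            if isinstance(result, dict):
                items.append(result)
            else:
                pending.append(result)
    return items


items = crawl("/page/1/")
print([item["text"] for item in items])
```

The crawl stops on its own at the last page because that page yields no further request.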

A shortcut for creating requests

As a shortcut for creating Request objects, you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

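Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow only returns a Request instance; you still have to yield this request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attribute: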

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

Or, shortening it further:

yield from response.follow_all(css='ul.pager a', callback=self.parse)
