Tag: Python Web Crawler Column

03-Implementing a Powerful, Simple, and Easy-to-Use URL Pool

Hello, I'm 悦创.

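For a larger crawler, URL management is a core problem: manage it poorly and pages may be downloaded more than once, or missed altogether. Here we design a URL Pool to manage URLs.
This URL Pool follows the producer-consumer pattern:

[Figure: producer-consumer flow diagram]

Following the same pattern, the URLPool looks like this:

[Figure: the URLPool designed for the web crawler]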
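To make the producer-consumer idea concrete, here is a minimal in-memory sketch. It is not the URLPool from the article: the class name `SimpleUrlPool`, its methods, and the `max_failures` parameter are all hypothetical, and it assumes that the link-discovery code acts as the producer while the downloader acts as the consumer.

```python
from collections import deque


class SimpleUrlPool:
    """Hypothetical in-memory URL pool; a sketch, not the article's URLPool."""

    def __init__(self, max_failures=3):
        self.pending = deque()   # URLs waiting to be handed to a consumer
        self.known = set()       # every URL ever accepted, used for de-duplication
        self.failures = {}       # url -> number of failed download attempts
        self.max_failures = max_failures

    def add(self, url):
        """Producer side: queue a URL once. Returns True if it was new."""
        if url in self.known:
            return False
        self.known.add(url)
        self.pending.append(url)
        return True

    def add_many(self, urls):
        """Queue several URLs and return how many of them were new."""
        return sum(self.add(u) for u in urls)

    def pop(self, count=1):
        """Consumer side: take up to `count` URLs in first-in, first-out order."""
        batch = []
        while self.pending and len(batch) < count:
            batch.append(self.pending.popleft())
        return batch

    def set_status(self, url, status):
        """Report a download result; failed URLs are re-queued a limited number of times."""
        if status == 200:
            self.failures.pop(url, None)
            return
        attempts = self.failures.get(url, 0) + 1
        self.failures[url] = attempts
        if attempts < self.max_failures:
            self.pending.append(url)  # try again later, behind the other URLs

    def __len__(self):
        return len(self.pending)


if __name__ == '__main__':
    pool = SimpleUrlPool(max_failures=2)
    pool.add_many(['https://news.baidu.com/', 'https://news.baidu.com/'])
    for url in pool.pop(10):
        pool.set_status(url, 0)  # pretend the download failed
    print(len(pool), pool.failures)
```

Because `known` is never pruned, a URL that was downloaded or abandoned cannot be queued again, which covers the "repeated download" problem. The retry counter addresses the "missed download" problem by giving a failed URL a few more chances before it is dropped.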

AI悦创 | Original | 2023/2/4 ... about 2 minutes

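02-Implementing a Better Network Request Function

Hello, I'm 悦创.

In the previous section we built a news crawler that could hardly be any simpler. It has plenty of weak points, and you probably look down on it. At the end of that section we discussed those weak points; now we remove them to improve our news crawler.

The problems are already clearly described and the solutions are known, so without further ado, here is the code (Talk is cheap, show me the code!).

Implementing downloader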

# -*- coding: utf-8 -*-
# @Time    : 2023/1/18 08:28
# @Author  : AI悦创
# @FileName: demo.py
# @Software: PyCharm
# @Blog    ：https://bornforthis.cn/
import requests
import cchardet
import traceback


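def downloader(url, timeout=10, headers=None, debug=False, binary=False):
    """Download url and return (status, html, redirected_url); status is 0 on failure."""
    # Default headers, replaced entirely when the caller passes its own
    _headers = {
        'User-Agent': ('Mozilla/5.0 (compatible; MSIE 9.0; '
                       'Windows NT 6.1; Win64; x64; Trident/5.0)'),
    }
    redirected_url = url
    if headers:
        _headers = headers
    try:
        r = requests.get(url, headers=_headers, timeout=timeout)
        if binary:
            html = r.content
        else:
            # cchardet can report None as the encoding; fall back to UTF-8 in that case
            encoding = cchardet.detect(r.content)['encoding'] or 'utf-8'
            html = r.content.decode(encoding)
        status = r.status_code
        redirected_url = r.url  # final URL after any redirects
    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl+C still stops the crawler
        if debug:
            traceback.print_exc()
        msg = 'failed download: {}'.format(url)
        print(msg)
        if binary:
            html = b''
        else:
            html = ''
        status = 0
    return status, html, redirected_url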


if __name__ == '__main__':
    url = 'https://news.baidu.com/'
    s, html, last_url = downloader(url)
    print(s, len(html), last_url)
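
One way a caller might use the `(status, html, redirected_url)` tuple is sketched below. The helper `summarize` and the stand-in `fake_fetch` are hypothetical names introduced only for illustration. The stand-in mimics the return shape of `downloader` so the sketch runs without network access.

```python
def summarize(fetch, urls):
    """Call fetch(url) for each URL and split the results by outcome.

    `fetch` is any callable returning (status, html, final_url),
    which is the shape that downloader returns.
    """
    ok, failed, moved = [], [], {}
    for url in urls:
        status, html, final_url = fetch(url)
        if status == 200:
            ok.append(url)
        else:
            # status 0 means the request itself failed; other values are HTTP errors
            failed.append(url)
        if final_url != url:
            # the final URL differs from the requested one (for example after a redirect)
            moved[url] = final_url
    return ok, failed, moved


def fake_fetch(url):
    """Hypothetical stand-in for downloader, used only to exercise summarize."""
    if url.endswith('/old'):
        return 200, '<html>moved</html>', url[:-4] + '/new'
    if url.endswith('/broken'):
        return 0, '', url
    return 200, '<html>ok</html>', url


if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/old', 'https://example.com/broken']
    print(summarize(fake_fetch, urls))
```

In a real run, `summarize(downloader, urls)` could be called instead. Keying later bookkeeping on the final URL rather than the requested one is a design choice that helps avoid storing the same page under two addresses.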

AI悦创 | Original | 2023/1/18 ... about 5 minutes

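01-A Simple Baidu News Crawler

Hello, I'm 悦创.

This hands-on example builds a large-scale asynchronous news crawler. We get there in several steps, from simple to complex, building the Python crawler incrementally.

To crawl news, we first need news sources, that is, the target websites to crawl.

In China there are several thousand news websites, large and small, from central to local and from general-interest to vertical-industry. Baidu News (https://news.baidu.com/) indexes a little over two thousand of them, so we start with Baidu News.

Open the Baidu News homepage: https://news.baidu.com/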

AI悦创 | Original | 2023/1/17 ... about 7 minutes