1. Preface
Lately I have been busy solving tricky problems and, as a result, have let the basics slip: coroutines, XPath, CSS selectors, and so on. This assignment is a chance to review that earlier material.
2. Scraping Target
Scrape every chapter of the novel Animal Farm (《动物庄园》) and save each chapter as its own txt file.
Target URL: https://www.kanunu8.com/book3/6879/
3. Analysis
This is the table-of-contents page. We could extract each chapter's URL by locating the link nodes, but that costs one extra request and extra selector code. Looking at the 10 chapter URLs, there is a clear pattern: the numeric part runs from 131779 to 131788, increasing by one with each chapter, so the 10 URLs are easy to construct directly.
def get_url():
    """Construct the chapter URLs.

    :return: None
    """
    for page in range(131779, 131789):
        yield f'https://www.kanunu8.com/book3/6879/{page}.html'
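A quick sanity check (my addition, not part of the original post): print what the generator yields and compare it with the links on the contents page.

for url in get_url():
    print(url)
# https://www.kanunu8.com/book3/6879/131779.html
# ...
# https://www.kanunu8.com/book3/6879/131788.html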
Next, open a chapter page and locate the chapter title and body. For review purposes I used three different matching approaches (regex, XPath, and CSS selectors), plus asynchronous IO for the requests.
async def get_content(url):
    """Fetch the chapter title and body of one page.

    :param url: chapter page URL
    :return: None
    """
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            if response.status == 200:
                response = await response.text()
                # Regex matching
                section = re.search(r'<font.*?>(.*?)</font>', response).group(1)
                content = re.search(r'<p>(.*?)</p>', response, re.S).group(1)
                # # XPath matching
                # html = etree.HTML(response)
                # section = html.xpath('//font/text()')[0]
                # content = ''
                # for paragraph in html.xpath('//tr/td[2]/p/text()'):
                #     content += paragraph
                # # CSS selector matching (pyquery)
                # doc = pq(response)
                # section = doc('font').text()
                # content = doc('tr > td:nth-child(2) > p').text()
                save_to_txt(section, content)
            else:
                logger.error(f'{url} request failed')
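To make the three approaches easier to compare, here is a minimal, self-contained sketch (my addition) that runs them against a hand-written HTML snippet; the snippet only mimics the table layout the selectors above assume, it is not the real page.

import re
from lxml import etree
from pyquery import PyQuery as pq

html = '''<table><tr><td>x</td><td>
<font color="#dc143c">Chapter 1</font>
<p>first line<br />second line<br /></p>
</td></tr></table>'''

# 1) Regex: grab the raw text inside the first <font> and <p> blocks.
section_re = re.search(r'<font.*?>(.*?)</font>', html).group(1)
content_re = re.search(r'<p>(.*?)</p>', html, re.S).group(1)  # still contains <br />

# 2) XPath: text() returns the fragments between tags, so <br /> is already gone.
tree = etree.HTML(html)
section_xp = tree.xpath('//font/text()')[0]
content_xp = ''.join(tree.xpath('//tr/td[2]/p/text()'))

# 3) CSS selectors via pyquery: .text() strips the tags for us.
doc = pq(html)
section_css = doc('font').text()
content_css = doc('tr > td:nth-child(2) > p').text()

print(section_re, section_xp, section_css)  # Chapter 1 Chapter 1 Chapter 1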
With that, both the chapter title and the body are captured. The remaining step is to save them to txt files, stripping dirty data such as extra spaces, stray newlines, and <br> nodes along the way.
def save_to_txt(section, content):
    """Save one chapter as a txt file.

    :param section: chapter title
    :param content: chapter body
    :return: None
    """
    if not os.path.exists(path):
        os.mkdir(path)
    # The <br /> replacement is not needed when the XPath approach is used
    content = content.replace('<br />', '').strip('\n')
    with open(f'{path}/{section}.txt', 'w', encoding='utf-8') as file:
        file.write(content)
    logger.success(f'{section} downloaded successfully')
Note: if you extract with XPath, the <br /> tags never make it into the extracted text (text() returns only the fragments between them), so the replacement line in save_to_txt should be commented out.
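A tiny demonstration of that note (my own snippet, not from the original post): with lxml, text() only returns the fragments between the <br /> tags, so there is nothing left to strip afterwards.

from lxml import etree

snippet = '<p>line one<br />line two<br />line three</p>'
tree = etree.HTML(snippet)

# The <br /> tags never show up; each fragment comes back as its own text node.
print(tree.xpath('//p/text()'))  # ['line one', 'line two', 'line three']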
4. Full Code
# -*- encoding: utf-8 -*-
'''
@File        : spider.py
@Author      : Pineapple
@Modify Time : 2020/9/29 21:11
@Contact     : cppjavapython@foxmail.com
@GitHub      : https://github.com/Pineapple666
@Blog        : https://blog.csdn.net/pineapple_C
@Description : None
'''
# import lib
from pyquery.pyquery import PyQuery as pq
from os.path import dirname, abspath
from loguru import logger
from lxml import etree
import aiohttp
import asyncio
import time
import re
import os
start = time.time()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
loop = asyncio.get_event_loop()
tasks = []
path = dirname(abspath(__file__)) + '/content'
def get_url():
    """Construct the chapter URLs.

    :return: None
    """
    for page in range(131779, 131789):
        yield f'https://www.kanunu8.com/book3/6879/{page}.html'
async def get_content(url):
    """Fetch the chapter title and body of one page.

    :param url: chapter page URL
    :return: None
    """
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            if response.status == 200:
                response = await response.text()
                # Regex matching
                section = re.search(r'<font.*?>(.*?)</font>', response).group(1)
                content = re.search(r'<p>(.*?)</p>', response, re.S).group(1)
                # # XPath matching
                # html = etree.HTML(response)
                # section = html.xpath('//font/text()')[0]
                # content = ''
                # for paragraph in html.xpath('//tr/td[2]/p/text()'):
                #     content += paragraph
                # # CSS selector matching (pyquery)
                # doc = pq(response)
                # section = doc('font').text()
                # content = doc('tr > td:nth-child(2) > p').text()
                save_to_txt(section, content)
            else:
                logger.error(f'{url} request failed')
def save_to_txt(section, content):
    """Save one chapter as a txt file.

    :param section: chapter title
    :param content: chapter body
    :return: None
    """
    if not os.path.exists(path):
        os.mkdir(path)
    # The <br /> replacement is not needed when the XPath approach is used
    content = content.replace('<br />', '').strip('\n')
    with open(f'{path}/{section}.txt', 'w', encoding='utf-8') as file:
        file.write(content)
    logger.success(f'{section} downloaded successfully')
if __name__ == '__main__':
    for url in get_url():
        tasks.append(get_content(url))
    loop.run_until_complete(asyncio.wait(tasks))
    end = time.time()
    logger.info(f'Elapsed {end - start}s')
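One caveat from me rather than the original post: on recent Python versions asyncio.get_event_loop() is deprecated outside a running loop, and since 3.11 asyncio.wait() no longer accepts bare coroutine objects, so on a newer interpreter the entry point can be rewritten roughly as follows, reusing the functions defined above.

# Sketch of an entry point for newer Python (3.11+); same behaviour,
# only the scheduling boilerplate changes.
async def main():
    await asyncio.gather(*(get_content(url) for url in get_url()))

if __name__ == '__main__':
    start = time.time()
    asyncio.run(main())
    logger.info(f'Elapsed {time.time() - start}s')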
If you spot any mistakes, feel free to message me with corrections!
Technology never stands still. Thanks for your support!