Baidu Yuedu (百度阅读) Crawler: Project Summary

Environment and Dependencies (Python 2.7)

  • HTTP requests: urllib2 or requests (prefer an older requests release that still supports 2.7; a requests-based fetch sketch follows the install commands below)
  • Parsing: BeautifulSoup (bs4) or lxml, or regular expressions via re
# install (example versions pinned for Python 2.7 compatibility)
pip install requests==2.20.1
pip install beautifulsoup4==4.6.3
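
If you go with requests instead of urllib2, a fetch helper equivalent to the one used below could look like this sketch (the session and default headers are my own illustration, not part of the original project):

import requests

SESSION = requests.Session()

def fetch_requests(url, timeout=15):
    # same desktop UA as the urllib2 version in the next section
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"}
    resp = SESSION.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return resp.content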

Crawl Scope and Strategy

  • Only crawl public list pages and the public fields of book previews/detail pages (title, author, category, intro snippet, etc.).
  • Check robots.txt before requesting (a robotparser-based check is sketched right after this list); set a User-Agent and Referer, and throttle requests (e.g. 1 request per second).
  • Retry failures (up to 3 times) with exponential backoff between attempts.
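
A minimal sketch of the robots.txt check using the standard-library robotparser module (the '/booklist' path is only an example, not a confirmed Baidu Yuedu URL):

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url('https://yuedu.baidu.com/robots.txt')
rp.read()

# only crawl a path if it is allowed for our user agent
print rp.can_fetch('Mozilla/5.0', 'https://yuedu.baidu.com/booklist')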

Entry Point and robots.txt Check

import urllib2

BASE = 'https://yuedu.baidu.com'

def fetch(url, headers=None, timeout=15):
    headers = headers or {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"}
    req = urllib2.Request(url, headers=headers)
    return urllib2.urlopen(req, timeout=timeout).read()

robots = fetch(BASE + '/robots.txt')
print robots.splitlines()[:10]

List Page Crawling (Example)

  • Assume a public category list page exists (the URL is only illustrative); parse the title and link from each book card.
import time
from bs4 import BeautifulSoup

def get_list(url):
    html = fetch(url)
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    # adjust the selectors to match the actual DOM:
    for card in soup.select('div.book-list div.book-item'):
        a = card.select_one('a.book-title')
        author = card.select_one('.book-author')
        if a:
            items.append({
                'title': a.get_text(strip=True),
                'link': a['href'] if a.has_attr('href') else '',
                'author': author.get_text(strip=True) if author else ''
            })
    return items

url = BASE + '/booklist'  # example: replace with the site's actual section URL
items = get_list(url)
print 'list size:', len(items)
for it in items[:5]:
    print it['title'], it['author'], it['link']
time.sleep(1)
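
To walk more than one list page, the same get_list can be reused in a throttled loop; the ?page= parameter below is an assumption about how the category pages might be paginated and needs to be checked against the real site:

def crawl_pages(base_url, max_pages=5, delay=1):
    all_items = []
    for page in range(1, max_pages + 1):
        page_url = '%s?page=%d' % (base_url, page)  # hypothetical pagination scheme
        rows = get_list(page_url)
        if not rows:
            break  # stop at the first empty page
        all_items.extend(rows)
        time.sleep(delay)  # stay around 1 request per second
    return all_items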

Detail Page Parsing (Public Fields Only)

def get_detail(url):
    if url.startswith('/'):
        url = BASE + url
    html = fetch(url)
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('h1.book-title')
    author = soup.select_one('.author a, .author')
    intro = soup.select_one('.intro, .book-intro')
    tags = [t.get_text(strip=True) for t in soup.select('.tags a')]
    return {
        'title': title.get_text(strip=True) if title else '',
        'author': author.get_text(strip=True) if author else '',
        'intro': intro.get_text(strip=True) if intro else '',
        'tags': tags
    }

# fetch the first few details as a test (mind the rate limit)
details = []
for it in items[:3]:
    d = get_detail(it['link'])
    details.append(d)
    time.sleep(1)
print details
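
The startswith('/') branch only covers root-relative links; if the cards can also carry relative or already-absolute URLs, urlparse.urljoin resolves all of them uniformly (a small sketch of my own, not from the original post):

from urlparse import urljoin  # urllib.parse.urljoin in Python 3

def absolutize(link, base=BASE):
    # '', '/ebook/123', 'ebook/123' and full URLs all come back absolute
    return urljoin(base + '/', link) if link else ''

print absolutize('/ebook/123')  # https://yuedu.baidu.com/ebook/123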

Retry on Failure with Exponential Backoff

import socket

def fetch_retry(url, retries=3):
    delay = 1
    for i in range(retries):
        try:
            return fetch(url)
        except (urllib2.URLError, socket.timeout) as e:
            if i == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # 1s → 2s → 4s
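
Pointing the earlier helpers at fetch_retry instead of fetch is enough to make transient network errors recoverable; a quick usage check (this wiring is my addition, not in the original code):

# same parsing as before, but via the retried fetch
html = fetch_retry(BASE + '/booklist')
soup = BeautifulSoup(html, 'html.parser')
print 'cards found:', len(soup.select('div.book-list div.book-item'))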

Cleaning and Storage (CSV Example)

import csv

def save_csv(rows, path='baidu_reading_public.csv'):
    with open(path, 'wb') as f:
        w = csv.writer(f)
        w.writerow(['title', 'author', 'intro', 'tags'])
        for r in rows:
            w.writerow([
                r.get('title', '').encode('utf-8'),
                r.get('author', '').encode('utf-8'),
                r.get('intro', '').encode('utf-8'),
                ','.join(r.get('tags', [])).encode('utf-8')
            ])

save_csv(details)
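
If you would rather keep the tag list structured instead of joining it into one comma-separated string, a JSON Lines dump is a simple alternative to the CSV (my sketch, not part of the original write-up):

import io
import json

def save_jsonl(rows, path='baidu_reading_public.jsonl'):
    with io.open(path, 'w', encoding='utf-8') as f:
        for r in rows:
            # ensure_ascii=False keeps the Chinese text readable in the file
            f.write(json.dumps(r, ensure_ascii=False) + u'\n')

save_jsonl(details)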
