Scraping Book Information and Comments from Douban Books

I'm working on my graduation project and need to collect user rating data for a collaborative filtering algorithm, as well as comment data for sentiment analysis.

Pitfalls

  1. A Douban book can have no rating at all, and a user may leave a comment without giving a score. Worse, the way Douban assigns book IDs is frustrating: popular books sit right next to obscure ones with no ratings and no comments, so the script prints "failed" a lot.
  2. Don't crawl too fast. I could only fetch about 40-50 pages per minute, and a single run of requests managed roughly a thousand visits before the site started answering with status code 403 (see the pacing sketch after this list).
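Here is a minimal sketch of the pacing this implies, assuming a hypothetical `get_page` helper; the delay bounds and back-off times are illustrative, not tuned values from the final script:

```python
import time
import random
import requests

def get_page(url, headers, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url=url, headers=headers)
        if response.status_code == 403:        # banned for crawling too fast
            time.sleep(60 * (attempt + 1))     # back off longer on each retry
            continue
        time.sleep(random.uniform(3, 6))       # random pause keeps the rate well under 40-50 pages/min
        return response
    return None
```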

Usage of fake_useragent

This crawler uses fake_useragent to fake the request headers, since Douban's anti-crawling measures are reportedly quite good.
Basic usage is shown below; `ua.random` produces a random User-Agent string:

```python
from fake_useragent import UserAgent
import requests

ua = UserAgent()
url = "https://www.baidu.com"        # URL to request
headers = {"User-Agent": ua.random}  # request headers with a random User-Agent
response = requests.get(url=url, headers=headers)  # send the request
print(headers)
print(response.status_code)          # response status code
text = response.headers
for line in text.items():
    print(line)
```
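One detail worth flagging: `ua.random` returns a different string on every access, but the full script below builds `header` only once at module level, so every request ends up carrying the same User-Agent. A small sketch of regenerating it per request instead (the `fetch` helper is my own name, not from the original code):

```python
from fake_useragent import UserAgent
import requests

ua = UserAgent()

def fetch(url):
    # build the headers inside the call so each request
    # picks up a fresh random User-Agent
    headers = {"User-Agent": ua.random}
    return requests.get(url=url, headers=headers)
```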

Scraping Book Information and Comments from Douban Books

First, look at the structure of the links:
https://book.douban.com/subject/26953606/ book info page
https://book.douban.com/subject/26953606/comments/ first page of comments
https://book.douban.com/subject/26953606/comments/hot?p=2 second page of comments
They all share the prefix https://book.douban.com/subject/ followed by a book ID; the comments page appends /comments/, and the second comment page appends hot?p=2, so by induction page 3 is hot?p=3.
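This pattern can be captured in a tiny helper; `book_url` and `comment_url` here are hypothetical names for illustration, not functions from the script below:

```python
BASE = "https://book.douban.com/subject/"

def book_url(book_id):
    return BASE + str(book_id) + "/"       # book info page

def comment_url(book_id, page=1):
    url = book_url(book_id) + "comments/"  # first comment page
    if page > 1:
        url += "hot?p=" + str(page)        # page N (N >= 2) appends hot?p=N
    return url

print(comment_url(26953606, 2))
# -> https://book.douban.com/subject/26953606/comments/hot?p=2
```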
The code also contains some file-writing operations, since the whole point is to collect the data.
The next day I revised it again: popular books are spread far too sparsely across the ID space, so the program first checks whether the total comment count exceeds one thousand; if it does, scraping continues, otherwise it skips to the next ID.
Fixed another bug as well: some values are numbers, and they must be converted to str before being written to a file.

```python
# coding=utf-8
# Download Douban book ratings and comments; four tables are produced. author: wuyou
# Table 1: book ID, book name, average score
# Table 2: user ID, user name
# Table 3: book ID, hot comments
# Table 4: user ID, book ID, rating, rating year
import requests
import time
import random
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {
    'User-Agent': ua.random
}


def get_score(book_id, text):  # extract (book ID, book name, book score)
    soup = BeautifulSoup(text, 'lxml')
    try:
        book_name = soup.select("#wrapper > h1 > span")  # list holding the book name
        name = book_name[0].string
        book_score = soup.select("#interest_sectl > div > div.rating_self.clearfix > strong")  # list holding the score
        score = book_score[0].string
        line = str(book_id) + "," + name + "," + str(score) + "\n"  # assemble the record
        with open("BookInfo.txt", "a", encoding="utf-8") as file:  # Table 1: book ID, book name, average score
            file.write(line)
    except:
        print("book " + str(book_id) + " get score failed!")


def write_txt(soup, book_id):  # write the comments and user info found on one comment page
    try:  # guard against errors: a user may comment without rating, leaving only one span under comment-info
        comment_list = soup.find_all("span", "short")  # region holding the comment text
        comments = ""
        flag = 0
        for line in comment_list:
            bc = line.string
            bc = bc.replace(",", "。")   # replace English commas with full stops
            bc = bc.replace(",", "。")  # replace Chinese commas with full stops
            bc = bc.replace(";", "。")   # replace semicolons with full stops
            if flag == 0:  # first comment: no separator
                flag += 1
            else:
                comments += ";"  # separate comments with semicolons
            comments += bc
        with open("BookComments.txt", "a", encoding="utf-8") as file:  # Table 3: book ID, hot comments
            BookComments = str(book_id) + "," + comments + "\n"
            file.write(BookComments)
        user_list = soup.find_all("span", "comment-info")  # region holding user name and rating
        user_info_txt = open("UserInfo.txt", "a", encoding="utf-8")
        user_score_txt = open("UserScore.txt", "a", encoding="utf-8")
        for user_info in user_list:
            user_name = user_info.find("a").string  # user name sits inside <a></a>
            user_url = user_info.find("a").attrs["href"]  # pull out the profile link
            user_id = user_url.split("/")[-2]  # user ID is the last path segment
            score = user_info.find_all("span")[0].attrs["title"]  # rating lives in the span's title attribute
            time_info = user_info.find_all("span")[1].string  # rating timestamp
            time_info = time_info.split("-")
            score_year = time_info[0]  # keep only the year
            user_info_txt.write(user_id + "," + user_name + "\n")  # Table 2: user ID, user name
            user_score_txt.write(user_id + "," + str(book_id) + "," + score + "," + str(score_year) + "\n")  # Table 4: user ID, book ID, rating, rating year
        user_info_txt.close()
        user_score_txt.close()
    except:
        print("cannot find!")


def get_comments(soup, comment_url, book_id, page):  # collect (book ID, comments), (user ID, book ID, rating), (user ID, user name)
    while page <= 2:  # number of comment pages to crawl
        if int(page) == 1:  # first page was already fetched by the caller
            write_txt(soup, book_id)
            page += 1
        else:
            page_url = comment_url + "hot?p=" + str(page)  # build the page URL locally; appending to comment_url itself would mangle the link for page 3
            time.sleep(random.uniform(3, 6))
            html = requests.get(url=page_url, headers=header)
            if html.status_code == 200:
                comment_text = html.text
                soup = BeautifulSoup(comment_text, "lxml")
                write_txt(soup, book_id)
            page += 1


# https://book.douban.com/subject/1007305/
if __name__ == '__main__':
    url = "https://book.douban.com/subject/"
    startID = 1007304  # starting book ID
    st = 0             # counter of books actually kept
    lens = 20000       # target number of popular books to crawl
    while st < lens:   # st and lens exist so that only popular books count toward the target
        if startID - 1007304 >= 1000:
            print("stop! " + str(startID))  # IDs are numbers; convert to str before concatenating
            break
        try:
            startID += 1  # advance the book ID
            score_url = url + str(startID) + "/"  # book info page
            html = requests.get(url=score_url, headers=header)
            html.encoding = "utf-8"
            time.sleep(random.uniform(3, 6))  # random pause of 3-6 s
            if html.status_code == 200:
                comment_url = score_url + "comments/"  # comments page
                comment_html = requests.get(url=comment_url, headers=header).text
                time.sleep(random.uniform(3, 6))  # random pause of 3-6 s
                soup = BeautifulSoup(comment_html, "lxml")
                total_comments = soup.select("#total-comments")[0].string
                comment_num = total_comments.replace("全部共 ", "")  # strip the label around the comment count
                comment_num = comment_num.replace(" 条", "")
                if int(comment_num) >= 1000:  # only keep books with at least 1000 comments
                    st += 1
                    print(str(startID) + " is success! " + score_url + " comment_num is " + comment_num)
                    text = html.text
                    get_score(startID, text)
                    get_comments(soup, comment_url, startID, 1)  # collect the comment pages
                else:
                    print(score_url + " is failed!" + " comment_num is " + comment_num)
            else:
                print(str(startID) + " is failed!")
        except:
            print(str(startID) + " is failed!", end='')
            print(html.status_code)
```

The output looks like this (from an earlier version of the code that still had print statements):

[screenshot: sample output]

A long stretch of data in the middle is omitted. Here the crawler hit some obscure books with pitifully few comments, so they were simply skipped:

[screenshot: output for low-comment books]
