A First Look at Web Crawlers 01 - Notes on the 慕课网 Course 《Python开发简单爬虫》

A summary of what I learned about web crawlers

1. Notes

  • The original course is taught with Python 2.7; this summary is based on Python 3.5
  • The original course and the quoted material below come from **《Python开发简单爬虫》** by instructor 乒乓球鸡蛋 on 慕课网

2. Review and Summary

2.1 What a Crawler Is and Why It Is Valuable

  • Crawler: a program that automatically fetches information from the internet
  • Value: put the data of the internet to work for you!

2.2 Simple Crawler Architecture and Run Flow

Simple crawler architecture (diagram)
Run flow (diagram)

2.3 The URL Manager and How to Implement It

  • URL manager: maintains the set of URLs waiting to be crawled and the set of URLs already crawled
    • Prevents duplicate and circular crawling
    • Adds new URLs to the to-crawl set
    • Checks whether a URL to be added is already in the container
    • Checks whether any URLs are still waiting to be crawled
    • Hands out the next URL to crawl
  • Implementation options (a Redis-backed sketch follows this list)
    • In memory
      • a Python set() held in memory
    • Relational database
      • MySQL urls(url, is_crawled)
    • Cache database
      • redis
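A minimal sketch of the Redis option, assuming a local Redis server and the redis-py package; the key names new_urls and old_urls are made up for illustration. The interface mirrors the in-memory manager built in section 3.2, but the crawl state survives a restart.

import redis


class RedisUrlManager(object):
    def __init__(self, host='localhost', port=6379):
        self.r = redis.Redis(host=host, port=port, decode_responses=True)

    def add_new_url(self, url):
        if url is None:
            return
        # only queue URLs that are in neither set yet
        if not self.r.sismember('new_urls', url) and not self.r.sismember('old_urls', url):
            self.r.sadd('new_urls', url)

    def has_new_url(self):
        return self.r.scard('new_urls') != 0

    def get_new_url(self):
        # pop one pending URL and mark it as crawled
        new_url = self.r.spop('new_urls')
        self.r.sadd('old_urls', new_url)
        return new_url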

2.4 The Web Page Downloader

  • Web page downloader: a tool that downloads the page behind a URL to the local machine
  • Direct download
    import urllib.request

    # make the request directly
    response = urllib.request.urlopen('http://python.org/')
    # a status code of 200 means the download succeeded
    print(response.getcode())
    # read the page content
    html = response.read()
  • Sending data and headers
    import urllib.parse
    import urllib.request

    url = 'http://localhost/login.php'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {
        'act': 'login',
    }
    headers = {'User-Agent': user_agent}
    # in Python 3 the POST body passed to Request() must be bytes
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data, headers)
    response = urllib.request.urlopen(req)
    the_page = response.read()
    print(the_page.decode("utf8"))
  • Adding handlers for special scenarios (a combined sketch follows this list)
    • HTTPCookieProcessor
    • ProxyHandler
    • HTTPSHandler
    • HTTPRedirectHandler
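A minimal sketch of wiring such handlers into urllib; the proxy address below is only a placeholder, not a working proxy.

import http.cookiejar
import urllib.request

# keep cookies across requests
cookie_handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
# route traffic through a proxy (placeholder address)
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})

opener = urllib.request.build_opener(cookie_handler, proxy_handler)
urllib.request.install_opener(opener)  # urlopen() now goes through these handlers

response = urllib.request.urlopen('http://python.org/')
print(response.getcode())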

2.5 The Web Page Parser

  • Web page parser: a tool that extracts valuable data from a web page
    • Extracts the valuable data itself
    • Collects new URLs to crawl
  • Types (a BeautifulSoup sketch follows this list)
    • Regular expressions
      • Fuzzy matching on the raw string
    • html.parser
      • Structured (DOM) parsing
    • BeautifulSoup
      • Structured (DOM) parsing
    • lxml
      • Structured (DOM) parsing
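A minimal structured-parsing sketch with BeautifulSoup; the HTML snippet is invented for illustration.

from bs4 import BeautifulSoup

html_doc = '<html><body><a href="/view/1.htm" class="ref">Python</a></body></html>'

soup = BeautifulSoup(html_doc, 'html.parser')
# search by tag name and attribute, then read the node's attribute and text
link = soup.find('a', class_='ref')
print(link['href'], link.get_text())  # /view/1.htm Python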

3. Hands-on: Crawling Baike Data

Create a Python project with a baike_spider package; the package holds five modules: spider_main, url_manager, html_downloader, html_parser, and html_outputer. The assumed layout is sketched below.
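A sketch of the assumed layout (only baike_spider and the five module names come from the course; the annotations are illustrative):

baike_spider/
    __init__.py
    spider_main.py       # scheduler that drives the crawl
    url_manager.py       # to-crawl / crawled URL sets
    html_downloader.py   # fetches pages with urllib
    html_parser.py       # extracts the title, summary and new URLs
    html_outputer.py     # writes the collected data to output.html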

3.1 The Scheduler: spider_main.py

from baike_spider import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        # wire the four collaborating components together
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 10:  # stop after ten pages
                    break
                count = count + 1
            except:
                print("craw failed")
        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "http://baike.baidu.com/view/21087.htm"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)

3.2 The URL Manager: url_manager.py

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
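A quick interactive check of the manager's behaviour (the URL is simply the crawl's root URL):

manager = UrlManager()
manager.add_new_url("http://baike.baidu.com/view/21087.htm")
print(manager.has_new_url())   # True
print(manager.get_new_url())   # returns the URL and moves it to old_urls
print(manager.has_new_url())   # False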


3.3 The HTML Downloader: html_downloader.py


import urllib.request


class HtmlDownloader(object):

    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        # anything other than 200 is treated as a failed download
        if response.getcode() != 200:
            return None
        return response.read()
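If the target site rejects urllib's default user agent, the same download can send a browser-like header instead; a minimal variant (the User-Agent string here is only an example):

import urllib.request


def download_with_headers(url):
    # any common browser User-Agent string can be substituted here
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib.request.urlopen(req)
    if response.getcode() != 200:
        return None
    return response.read()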


3.4 The HTML Parser: html_parser.py


import re
import urllib.parse

from bs4 import BeautifulSoup


class HtmlParser(object):

    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # lemma links look like /view/12345.htm
        links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))
        for link in links:
            new_url = link['href']
            # resolve the relative href against the current page's URL
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        res_data['url'] = page_url
        # <dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()
        # <div class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return

        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
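For reference, urllib.parse.urljoin is what turns the relative lemma links into absolute URLs; a quick illustration (the paths are made up):

import urllib.parse

# a relative href found on the page is resolved against the page's own URL
print(urllib.parse.urljoin('http://baike.baidu.com/view/21087.htm', '/view/123.htm'))
# -> http://baike.baidu.com/view/123.htm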


3.5 The HTML Outputer: html_outputer.py

import os


class HtmlOutputer(object):
    def __init__(self):
        self.file_name = 'output.html'
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open(self.file_name, 'w', encoding='utf-8')
        css = '<style type="text/css">body{padding:40px 50px;color:#444;font-size:14px} \
            table{width:100%;border:solid #ccc 1px;border-spacing:0}td,th{padding:10px; \
            border-top:1px solid #ccc;border-left:1px solid #ccc} \
            tbody tr:nth-child(even){background:#f5f5f5;box-shadow:0 1px 0 hsla(0,0%,100%,.8) \
            inset}th{background-color:#eee;text-align:left}</style>'

        fout.write("<html>")
        fout.write("<head><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">%s</head>" % css)
        fout.write("<body>")
        fout.write("<table>")
        fout.write("<tr>")
        fout.write("<th>URL</th>")
        fout.write("<th>TITLE</th>")
        fout.write("<th>SUMMARY</th>")
        fout.write("</tr>")
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'])
            fout.write("<td>%s</td>" % data['summary'])
            fout.write("</tr>")

        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()
        # on Windows this opens output.html with its default application (e.g. the browser)
        os.system(self.file_name)

3.6 Run Results

Eclipse console output (screenshot)

Rendered output.html (screenshot)