2016-01-18

[Python] A Breadth-First, Multithreaded Web Crawling Script in Python

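1. Purpose of the Script

The script batch-downloads selected resources from the content of specified web pages. It should provide the following features:

Reading command-line arguments: argparse
Reading the configuration file: ConfigParser
HTTP requests, URL handling and HTML parsing: urllib2, urllib, HTMLParser
Breadth-first parallel requests: a multithreaded design
Saving resources locally: only resources that match the rules are saved
Logging: logging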

2. Main Components

2.1 Reading Command-Line Arguments

argparse is currently the recommended module for parsing command-line arguments in Python.

import argparse
import logging
import sys

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--version", help="show current script version", action="version", version="%(prog)s 1.0")
# required takes a boolean; argparse itself exits with a usage error when -c is missing
parser.add_argument("-c", "--conf", help="set config file", required=True)
args = parser.parse_args()
if args.conf:
    conf = args.conf
    logging.info("Success to read conf args : %s", conf)
else:
    # only reached for an empty value such as -c ""
    logging.error("Fail to read conf args")
    sys.exit(1)

2.2 Reading the Configuration File

The configuration file is read with the ConfigParser module.

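import ConfigParser
import logging
import sys

conf = "spider.conf"
conf_parser = ConfigParser.ConfigParser()
try:
    # read() returns the list of files it parsed; a missing file is skipped silently
    loaded = conf_parser.read(conf)
except ConfigParser.Error as e:
    logging.error("Fail to load conf(%s) as ConfigParser.Error: %s", conf, e)
    sys.exit(1)
if not loaded:
    logging.error("Conf file not found: %s", conf)
    sys.exit(1)

# Read the "author" option from the [spider] section
author = conf_parser.get("spider", "author")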
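
For reference, a configuration file that this kind of call can read might look like the sketch below. Only the [spider] section and the author option come from the text above; thread_count, the sample values and the helper name load_settings are hypothetical. The sketch is written for Python 3, where the module is named configparser.

```python
import configparser

# Hypothetical contents of spider.conf; only [spider] and "author" appear in the article.
SAMPLE_CONF = """
[spider]
author = example_author
thread_count = 4
crawl_timeout = 2.5
"""

def load_settings(text):
    """Parse configuration text and return the [spider] options with proper types."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "author": parser.get("spider", "author"),
        # getint/getfloat convert the stored strings to numbers
        "thread_count": parser.getint("spider", "thread_count"),
        "crawl_timeout": parser.getfloat("spider", "crawl_timeout"),
    }

settings = load_settings(SAMPLE_CONF)
print(settings)
```

Converting the values once, at load time, keeps the rest of the crawler free of string-to-number conversions.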

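2.3 HTTP Requests and URL Handling

2.3.1 HTTP Requests

In Python, HTTP requests and responses can be handled with urllib, urllib2 or the third-party Requests library.
The example below uses urllib2: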

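import logging
import urllib2

def fetch(url):
    """Return the raw HTML of url, or None if the request fails."""
    try:
        page = urllib2.urlopen(url)
    except IOError as e:  # urllib2.URLError is a subclass of IOError
        logging.error('Url open failed with exception: %s', e)
        return None
    return page.read()

html = fetch("http://www.baidu.com")
print "HTML of the url:", html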

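2.3.2 URL Handling

Two kinds of URL processing are needed:

URL joining: some links are relative, so they must be resolved into real absolute URLs before they are requested.
urlparse.urljoin('http://www.baidu.com/news/123','/images/baidu.png')

Result: http://www.baidu.com/images/baidu.png

URL escaping: when a URL is used as the file name of a saved file, the special characters in it have to be escaped.
file_name = urllib.quote_plus(url)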
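
The two operations can be tried side by side. This is a small sketch for Python 3, where urljoin and quote_plus both live in urllib.parse; the helper name url_to_file_name and the sample URLs other than the one above are made up for illustration.

```python
from urllib.parse import quote_plus, urljoin

base = "http://www.baidu.com/news/123"

# A root-relative link replaces the whole path of the base URL.
root_relative = urljoin(base, "/images/baidu.png")
# A plain relative link is resolved against the base URL's directory (/news/).
relative = urljoin(base, "detail.html")
# A link that is already absolute is returned unchanged.
absolute = urljoin(base, "http://example.com/a.png")

def url_to_file_name(url):
    """Escape a URL so it contains no path separators or other special characters."""
    return quote_plus(url)

file_name = url_to_file_name("http://www.baidu.com/news/123?id=1&page=2")
for value in (root_relative, relative, absolute, file_name):
    print(value)
```

Because quote_plus also escapes "/", the resulting name is a single path component and cannot create nested directories by accident.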

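2.3.3 HTML Parsing

Detecting the page encoding: web pages use different character encodings, and parsing a page with the wrong one raises errors, so the HTML should be decoded according to the encoding the page actually uses.

The third-party library chardet can detect a page's encoding:

pip install chardet  # install the chardet library

import urllib
import chardet
html = urllib.urlopen('http://www.google.cn/').read()
print chardet.detect(html)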

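Parsing the charset attribute in the page's HTML by hand to determine the encoding.
Implementation: a regular expression matches the charset= attribute in the HTML, which gives the page's encoding.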

import logging
import re

def decode_html(html, url):
    """Decode raw HTML using the charset declared in its meta tag."""
    regex = r'meta.*charset=("?)(.*?)("|>)'
    match = re.search(regex, html)
    if not match:
        logging.error("Fail to match charset Regex for url:%s", url)
        logging.info(html)
        return html  # returned undecoded
    html_charset = match.group(2)

    # GB18030 is a superset of GB2312 and GBK, so it decodes both safely
    if html_charset.lower() in ("gb2312", "gbk"):
        html_charset = "GB18030"
    elif html_charset.lower() == "iso-8859-1":
        html_charset = "latin-1"
    return html.decode(html_charset)
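
The same idea can be sketched for Python 3, where the downloaded page is a bytes object and the regular expression therefore has to be a bytes pattern. The pattern, the utf-8 fallback and the errors="replace" policy are assumptions of this sketch rather than the script's actual behaviour.

```python
import re

# Matches charset=... inside a <meta> tag. The pattern works on raw bytes
# because the encoding is unknown until this attribute has been read.
CHARSET_RE = re.compile(rb'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(raw, default="utf-8"):
    """Decode raw page bytes using the charset declared in the page itself."""
    match = CHARSET_RE.search(raw)
    charset = match.group(1).decode("ascii").lower() if match else default
    if charset in ("gb2312", "gbk"):
        charset = "gb18030"  # superset of both, so it decodes either one
    # errors="replace" keeps a single bad byte from aborting the whole page
    return raw.decode(charset, errors="replace")

page = '<html><head><meta charset="gb2312"></head><body>你好</body></html>'.encode("gb2312")
text = decode_html(page)
print(text)
```

The pattern covers both the short form `<meta charset="...">` and the older `<meta http-equiv="Content-Type" content="text/html; charset=...">` form.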

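Using the character encoding declared in the HTTP response header: in testing, this header field did not match the actual page encoding on most websites in China.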
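
One way to observe such a mismatch is to compare the charset announced in the Content-Type header with the one declared in the page. The sketch below is written for Python 3 and works on a header value given as a plain string; the function names and the sample values are illustrative and not taken from the script.

```python
import re
from email.message import Message

def charset_from_header(content_type):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()

def charset_from_meta(raw_html):
    """Return the lower-cased charset declared in a <meta> tag, or None."""
    match = re.search(rb'<meta[^>]*charset=["\']?([\w-]+)', raw_html, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None

header_value = "text/html; charset=ISO-8859-1"             # sample header value
body = b'<html><head><meta charset="GBK"></head></html>'   # sample page

header_charset = charset_from_header(header_value)
meta_charset = charset_from_meta(body)
if header_charset != meta_charset:
    print("header says %s, page says %s" % (header_charset, meta_charset))
```

With urllib.request in Python 3, response.headers.get_content_charset() performs the same header lookup on a live response.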

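Python offers several ways to parse HTML, such as BeautifulSoup, HTMLParser and SGMLParser; HTMLParser is used here.
The function get_sub_urls(cur_url, cur_html) returns all the sub-links in the HTML of the given URL (links that appear in <a> tags).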

import HTMLParser
import re
import urlparse
import logging

class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        # Collect href values of <a> tags, skipping placeholder links
        if tag != "a":
            return
        ignored = ("/", "javascript:;", "javascript:void(0)", "#", "")
        for (k, v) in attrs:
            if k == "href" and v not in ignored:
                self.links.append(v)

# HtmlParser to find all links
def get_sub_urls(cur_url, cur_html):
    # get the current scheme
    urlparser = urlparse.urlparse(cur_url)

    # get all the sub href links
    myhtmlparser = MyParser()
    myhtmlparser.feed(cur_html)
    myhtmlparser.close()
    logging.info('get all sub urls succ of : %s ' % cur_url)

    return myhtmlparser.links
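
The links collected this way are still raw href values, so relative ones have to be joined with the page URL before they can be requested (see 2.3.2). A possible Python 3 variant that does both steps is sketched below; LinkParser and get_absolute_sub_urls are hypothetical names, and dropping fragments and non-HTTP schemes is an assumption of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value:
                self.links.append(value)

def get_absolute_sub_urls(cur_url, cur_html):
    """Return the page's links as absolute http(s) URLs, without duplicates."""
    parser = LinkParser()
    parser.feed(cur_html)
    parser.close()
    result = []
    for link in parser.links:
        # resolve against the page URL, then drop any #fragment
        absolute = urldefrag(urljoin(cur_url, link)).url
        if absolute.startswith(("http://", "https://")) and absolute not in result:
            result.append(absolute)
    return result

sample_html = ('<a href="/images/baidu.png">img</a> <a href="javascript:;">x</a> '
               '<a href="detail.html#top">top</a> <a href="detail.html">detail</a>')
sub_urls = get_absolute_sub_urls("http://www.baidu.com/news/123", sample_html)
print(sub_urls)
```

Filtering on the resolved scheme replaces the explicit list of placeholder values, since a javascript: link never resolves to an http(s) URL.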

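2.4 Breadth-First Traversal

A queue is the first data structure that comes to mind for breadth-first traversal. Python's Queue class is thread-safe, which also prepares the ground for the multithreaded version in the next step.
The idea: send the HTTP request -> get the HTML -> parse the HTML to extract the sub-links -> put the sub-links into the queue -> take the next element from the queue and repeat, until the queue is empty.
Below is a simple pseudocode description; the complete code is linked at the end of the article.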

# webpage_urlopen, webpage_parse, conf and init_url are assumed to be defined elsewhere
def crawl_request(url_queue):
    while not url_queue.empty():
        url = url_queue.get()
        content = webpage_urlopen.webpage_urlopen(url, conf.crawl_timeout)
        # actions like saving pages
        sub_urls = webpage_parse.webpage_parse(content, url)
        for sub_url in sub_urls:
            url_queue.put(sub_url)

def main():
    url_queue = Queue.Queue()
    url_queue.put(init_url)
    crawl_request(url_queue)
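
As written, the loop above would queue the same page again every time another page links to it, and it never ends if pages link to each other in a cycle. A minimal runnable sketch of the same loop with a visited set and a depth limit follows (Python 3, where the module is named queue). The in-memory LINKS table stands in for real fetching and parsing, and max_depth is a hypothetical parameter, not an option confirmed by the article.

```python
import queue

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}

def crawl(init_url, max_depth):
    """Visit pages breadth-first up to max_depth and return them in visiting order."""
    url_queue = queue.Queue()
    url_queue.put((init_url, 0))
    seen = {init_url}          # URLs that have already been queued
    order = []
    while not url_queue.empty():
        url, depth = url_queue.get()
        order.append(url)      # a real crawler would fetch and save the page here
        if depth == max_depth:
            continue           # do not expand links beyond the depth limit
        for sub_url in LINKS.get(url, []):
            if sub_url not in seen:
                seen.add(sub_url)
                url_queue.put((sub_url, depth + 1))
    return order

print(crawl("/", max_depth=2))
```

Storing the depth next to each URL in the queue is what makes the limit cheap to enforce: every sub-link simply inherits its parent's depth plus one.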

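2.5 Multithreading

Multithreading has to deal with data synchronization. The Queue used above is thread-safe, so this implementation avoids most of the synchronization concerns.
In addition, the worker threads can be marked as daemon threads. The main thread then does not have to wait for each thread to finish; it only calls queue.join() to wait until every task in the queue has been processed, and then exits.
When the main thread exits, the daemon threads are terminated automatically.
Using a Queue simplifies the design of both data synchronization and thread control, but it requires a clear understanding of what daemon threads and the queue's join() actually do.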

import Queue
import threading
import urllib2

# sample list for illustration; replace with the hosts to fetch
hosts = ["http://www.baidu.com"]
queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # grab a host from the queue
            host = self.queue.get()
            try:
                # fetch the page and print its first 1024 bytes
                url = urllib2.urlopen(host)
                print url.read(1024)
            except IOError as e:
                print "Url open failed:", e
            finally:
                # signal that the job is done, even on failure, so join() can return
                self.queue.task_done()

def main():
    # populate the queue with data
    for host in hosts:
        queue.put(host)
    # spawn a pool of threads and pass them the queue instance
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()

    # list the live threads (the main thread plus the five workers)
    cur_ths = threading.enumerate()
    print "cur enumerate threadings len:", len(cur_ths)
    for t in cur_ths:
        print "cur enumerate threadings:", t
    # wait on the queue until everything has been processed
    queue.join()

main()
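
In a crawler the workers do not only consume tasks, they also add the sub-links they discover, and join() still has to return only when nothing is left. The following Python 3 sketch shows that pattern on a hypothetical in-memory link graph (LINKS), so it runs without network access; crawl_parallel and its parameters are illustrative names, not the script's.

```python
import queue
import threading

# Hypothetical link graph standing in for fetching and parsing real pages.
LINKS = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c", "/d"], "/c": [], "/d": ["/a"]}

def crawl_parallel(init_url, thread_count=3):
    """Visit every page reachable from init_url with a pool of daemon workers."""
    task_queue = queue.Queue()
    seen = {init_url}
    seen_lock = threading.Lock()   # the set is shared, so check-and-add must be atomic

    def worker():
        while True:
            url = task_queue.get()
            try:
                for sub_url in LINKS.get(url, []):
                    with seen_lock:
                        is_new = sub_url not in seen
                        seen.add(sub_url)
                    if is_new:
                        # put() runs before task_done() below, so the queue's
                        # unfinished-task count never reaches zero too early
                        task_queue.put(sub_url)
            finally:
                task_queue.task_done()

    for _ in range(thread_count):
        # daemon threads do not keep the interpreter alive once the main thread exits
        threading.Thread(target=worker, daemon=True).start()
    task_queue.put(init_url)
    task_queue.join()              # returns when every queued URL has been processed
    return seen

visited = crawl_parallel("/")
print(sorted(visited))
```

The Queue itself needs no extra locking, but the visited set does. With several workers the visiting order is only approximately breadth-first, because pages taken from the queue in order may finish out of order.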

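2.6 Logging

Logging is done with the logging module: define the format and the level, and there is not much more to say.
With logging.handlers, logs of different severity can be written to separate files; keeping the high-severity logs in their own file makes errors easier to spot.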

# main.py
import logging
import log
log.init_log("./log.txt", level=logging.DEBUG)

# log.py
import os
import logging
import logging.handlers

def init_log(log_path, level=logging.INFO, when="D", backup=7,
             format="%(levelname)s: %(asctime)s: %(filename)s:%(lineno)d * %(thread)d %(message)s",
             datefmt="%m-%d %H:%M:%S"):
    formatter = logging.Formatter(format, datefmt)
    logger = logging.getLogger()
    logger.setLevel(level)

    log_dir = os.path.dirname(log_path)
    # dirname is "" for a bare file name, which means the current directory
    if log_dir and not os.path.isdir(log_dir):
        os.makedirs(log_dir)

    handler = logging.handlers.TimedRotatingFileHandler(log_path + ".log",
                                                        when=when,
                                                        backupCount=backup)
    handler.setLevel(level)
    handler.setFormatter(formatter)
    logger.addHandler(handler)

    handler = logging.handlers.TimedRotatingFileHandler(log_path + ".log.wf",
                                                        when=when,
                                                        backupCount=backup)
    handler.setLevel(logging.WARNING)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
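
The effect of the two handlers can be checked with a short experiment: every record goes to the first file, while only WARNING and above reaches the second. The sketch below (Python 3) writes into a temporary directory; the logger name spider_demo, the file name all and the log messages are made up for the demonstration.

```python
import logging
import logging.handlers
import os
import tempfile

def build_logger(log_dir):
    """Return a logger that writes everything to all.log and WARNING+ to all.log.wf."""
    logger = logging.getLogger("spider_demo")   # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(levelname)s: %(asctime)s: %(message)s")
    for suffix, level in ((".log", logging.DEBUG), (".log.wf", logging.WARNING)):
        handler = logging.handlers.TimedRotatingFileHandler(
            os.path.join(log_dir, "all" + suffix), when="D", backupCount=7)
        handler.setLevel(level)     # each handler filters records on its own level
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger

log_dir = tempfile.mkdtemp()
logger = build_logger(log_dir)
logger.info("page saved")
logger.warning("request timed out")
for handler in logger.handlers:
    handler.flush()

with open(os.path.join(log_dir, "all.log")) as f:
    all_lines = f.read().splitlines()
with open(os.path.join(log_dir, "all.log.wf")) as f:
    warn_lines = f.read().splitlines()
print(len(all_lines), len(warn_lines))
```

The logger's own level decides which records are created at all; the handler levels then decide which file each record ends up in.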

3. Source Code

GitHub project: Fivezh/py_spider

小武