Parsing HTML with Python's HTMLParser - 马育民老师

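# Overview
HTMLParser is the HTML parsing class in Python's built-in html.parser module. It is simple to use, but it offers fewer features than third-party modules.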

# Usage
To use it, define a class that inherits from HTMLParser.

**Commonly used attribute:**
lasttag: holds the name of the most recently parsed start tag, as a string
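
As a small illustration of lasttag, the sketch below records its value every time a piece of text is seen. The class name LastTagDemo and its records list are made up for this example. lasttag is not described in the official documentation, so treat the behaviour shown here as an observation about the current CPython implementation rather than a guarantee.

```python
from html.parser import HTMLParser


class LastTagDemo(HTMLParser):
    """Hypothetical demo class: pairs each text chunk with self.lasttag."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_data(self, data):
        # self.lasttag is the name of the start tag that was parsed most recently
        self.records.append((self.lasttag, data))


demo = LastTagDemo()
demo.feed('<p>hello <b>world</b></p>')
demo.close()
print(demo.records)
```

In this run the text "hello " is recorded together with p, and "world" together with b.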

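**Then override the following methods:**
1. handle_starttag(tag, attrs): handles a start tag
tag is the tag name in lowercase, e.g. ```p``` for ```<p>```
attrs is the list of attributes, each given as a ```(name, value)``` tuple

2. handle_endtag(tag): handles an end tag
tag is the tag name, e.g. ```div``` for ```</div>```

3. handle_startendtag(tag, attrs): handles a self-closing tag
tag is the name of a self-closing tag such as ```<br/>```, ```<input/>``` or ```<img />```
attrs is the list of attributes, each given as a ```(name, value)``` tuple

4. handle_data(data): handles data, i.e. the text between tags
data is "Python is fast and efficient" in ```<p>Python is fast and efficient</p>```

5. handle_comment(data): handles a comment, i.e. the text between ```<!--``` and ```-->```
data is "Developed by Zhang San" in ```<!--Developed by Zhang San-->```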
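
To make the shape of attrs concrete, here is a minimal sketch that turns the list of tuples into a dict for each tag. The class name AttrCollector, its seen list, and the sample markup are assumptions made for this illustration. Only handle_starttag is overridden, because the base class's handle_startendtag calls handle_starttag followed by handle_endtag by default.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attribute dict) for every tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples, so dict() converts it directly
        self.seen.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed('<a href="/home" class="nav">Home</a><img src="logo.png"/>')
collector.close()
print(collector.seen)
# [('a', {'href': '/home', 'class': 'nav'}), ('img', {'src': 'logo.png'})]
```

Converting to a dict is convenient for lookups, but it keeps only the last value if an attribute name is repeated; keep the original list when that matters.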

# 演示代码
Suppose the HTML to parse is as follows:
```
<html>
	<head></head>
	<body>Haha
		<h1>Parsing HTML with Python</h1>
		<br/>
		<p>This article explains how to parse HTML with HTMLParser</p>
		<!-- Login section -->
		<input name='username'/>
		<input name='password'/>
		<input type='button' value='Log in'/>
	</body>
</html>
```

The Python code is as follows:
```
html='''
<html>
	<head></head>
	<body>Haha
		<h1>Parsing HTML with Python</h1>
		<br/>
		<p>This article explains how to parse HTML with HTMLParser</p>
		<!-- Login section -->
		<input name='username'/>
		<input name='password'/>
		<input type='button' value='Log in'/>
	</body>
</html>
'''

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    # Called for every start tag, e.g. <h1>
    def handle_starttag(self, tag, attrs):
        print('Start tag:', tag, ',attributes:', attrs, sep='')

    # Called for every end tag, e.g. </h1>
    def handle_endtag(self, tag):
        print('End tag:', tag, sep='')

    # Called for every self-closing tag, e.g. <br/>
    def handle_startendtag(self, tag, attrs):
        print('Self-closing tag:', tag, ',attributes:', attrs, sep='')

    # Called for the text between tags, including whitespace and line breaks
    def handle_data(self, data):
        print('Text:', data, sep='')

    # Called for the text inside <!-- ... -->
    def handle_comment(self, data):
        print('Comment:', data, sep='')

# Create the parser and feed it the document
parser = MyHTMLParser()
parser.feed(html)
```
The output is as follows:
```
Text:

Start tag:html,attributes:[]
Text:

Start tag:head,attributes:[]
End tag:head
Text:

Start tag:body,attributes:[]
Text:Haha

Start tag:h1,attributes:[]
Text:Parsing HTML with Python
End tag:h1
Text:

Self-closing tag:br,attributes:[]
Text:

Start tag:p,attributes:[]
Text:This article explains how to parse HTML with HTMLParser
End tag:p
Text:

Comment: Login section 
Text:

Self-closing tag:input,attributes:[('name', 'username')]
Text:

Self-closing tag:input,attributes:[('name', 'password')]
Text:

Self-closing tag:input,attributes:[('type', 'button'), ('value', 'Log in')]
Text:

End tag:body
Text:

End tag:html
Text:
```
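
Many of the text events in the output above are nothing but the line breaks and indentation between tags. One possible way to ignore them, shown here as a sketch rather than as part of the original demo, is to strip each chunk in handle_data and keep only non-empty text. The class name TextOnlyParser, its chunks list, and the shortened sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical variant: collects text and skips whitespace-only chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # an empty string means the chunk held only whitespace
            self.chunks.append(text)


parser = TextOnlyParser()
parser.feed("<body>Haha\n\t<h1>Title</h1>\n\t<br/>\n</body>\n")
parser.close()
print(parser.chunks)
# ['Haha', 'Title']
```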

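# Extracting content from a web page
Suppose we want to extract the title and the content, shown below:
```
<h1>Parsing HTML with Python</h1>
<p>This article explains how to parse HTML with HTMLParser</p>
```
To get the text above, the Python code is as follows:
```
from html.parser import HTMLParser

# HTML document to parse
html='''
<html>
	<head></head>
	<body>Haha
		<h1>Parsing HTML with Python</h1>
		<br/>
		<p>This article explains how to parse HTML with HTMLParser</p>
		<!-- Login section -->
		<input name='username'/>
		<input name='password'/>
		<input type='button' value='Log in'/>
	</body>
</html>
'''

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Flags that record which tag was just opened
        self.is_h1=False
        self.is_p=False

    def handle_starttag(self, tag, attrs):
        if tag=='h1':
            self.is_h1=True
        elif tag=='p':
            self.is_p=True

    def handle_data(self, data):
        # Print the first text chunk after <h1> or <p>, then clear the flag
        if self.is_h1:
            print('Title:',data,sep='')
            self.is_h1=False
        elif self.is_p:
            print('Content:',data,sep='')
            self.is_p=False

parser = MyHTMLParser()
parser.feed(html)

```
The output is as follows:
```
Title:Parsing HTML with Python
Content:This article explains how to parse HTML with HTMLParser
```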
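
The flag-based parser above clears its flag after the first text chunk, so a paragraph that contains nested tags such as ```<b>``` would be cut short. A possible variation, given here only as a sketch, buffers text until the matching end tag arrives. The class name TextInTags, its wanted parameter, and the sample markup are assumptions for this illustration and are not part of the original article.

```python
from html.parser import HTMLParser


class TextInTags(HTMLParser):
    """Hypothetical variant: collects the full text of each wanted element."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # name of the wanted tag that is currently open
        self.buffer = []     # text chunks seen since that tag was opened
        self.results = []    # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Start collecting only when no wanted element is already open
        if self.current is None and tag in self.wanted:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # The matching end tag closes the element and flushes the buffer
        if tag == self.current:
            self.results.append((tag, "".join(self.buffer).strip()))
            self.current = None


parser = TextInTags(["h1", "p"])
parser.feed("<h1>Parsing HTML</h1><p>Uses <b>HTMLParser</b> from the standard library</p>")
parser.close()
print(parser.results)
# [('h1', 'Parsing HTML'), ('p', 'Uses HTMLParser from the standard library')]
```

This sketch does not handle a wanted element nested inside another element of the same name; a depth counter would be needed for that case.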

Original source: http://www.malaoshi.top/show_1EF2cIZGZDCO.html