6. Beautiful Soup

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

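Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically; in that case you only have to state the original encoding. Beautiful Soup sits on top of Python parsers such as lxml and html5lib, so you can flexibly choose between different parsing strategies or raw speed.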
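
As a rough illustration of the encoding behaviour described above, the sketch below feeds Beautiful Soup a byte string that declares no encoding and names the encoding explicitly. The sample markup, the choice of Latin-1, and the use of html.parser (so that nothing beyond bs4 is required) are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no <meta charset> declaration.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup what the original encoding was.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string               # text is held as Unicode inside the tree
utf8_bytes = soup.encode("utf-8")  # serialise the document to UTF-8 bytes

print(text)
print(utf8_bytes)
```

If the encoding is declared in the document or can be detected, the from_encoding argument can simply be left out.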

Installation

Beautiful Soup 3 is no longer developed, so current projects should use Beautiful Soup 4. The library now lives in the bs4 package, which means it is imported with import bs4.

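Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers such as lxml. If lxml is not installed, Beautiful Soup falls back to Python's built-in parser. lxml is more capable and faster, so installing it is recommended.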

pip install lxml beautifulsoup4

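Pros and cons

html.parser (Python's built-in parser):

Part of Python's standard library
Moderate speed
Lenient with malformed documents
Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2

lxml (third-party parser):

Fast
Lenient with malformed documents
Requires a C library

html5lib (third-party parser):

Best tolerance of malformed documents
Parses the document the way a browser does
Produces valid HTML5
Slow
Pure Python, no C extension needed (but it is a separate install)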
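
To make the difference in error tolerance concrete, the sketch below parses one deliberately broken fragment with each parser that happens to be installed. The fragment and the results dictionary are illustrative choices, not part of the original text; parsers that are missing are skipped rather than assumed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # stray closing tag, deliberately invalid

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")
```

html.parser simply drops the stray tag and yields `<a></a>`; the third-party parsers additionally wrap the fragment in a full document, each in its own way.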

Basic usage

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
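# Create the soup object; a local HTML file works too: soup = BeautifulSoup(open('./text.html'))
# The second argument names the parser; if it is omitted, Beautiful Soup picks the best one installed (e.g. lxml) and warns about it
soup = BeautifulSoup(html, 'lxml')
# Pretty-print the parsed document
print(soup.prettify())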

The four kinds of objects

Tag

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

# Get the first title tag in the document
print(type(soup.title))
print(soup.title)
# Get the first p tag in the document
print(soup.p)

'''
<class 'bs4.element.Tag'>
<title>Page title</title>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
'''

A Tag has two important attributes: name and attrs.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# The soup object itself is a special case: its name is [document]
print(soup.name)

pTag = soup.p
# Get the tag's name
print(pTag.name)
# Get all of the tag's attributes as a dictionary
print(pTag.attrs)

'''
[document]
p
{'id': 'firstpara', 'align': 'center'}
'''

Getting attribute values

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

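print(soup.p['id'])      # indexing raises KeyError if the attribute is missing
print(soup.p.get('id'))  # get() returns None if the attribute is missing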

'''
firstpara
firstpara
'''

Removing a tag

soup = BeautifulSoup(html, 'lxml')
# Remove the first p tag from the tree and return it
print(soup.p.extract())

'''
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
'''
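
A small sketch of what extract() leaves behind may help here. The two-paragraph markup and the use of html.parser are assumptions chosen to keep the example self-contained.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p id="a">one</p><p id="b">two</p></body>', "html.parser"
)

removed = soup.p.extract()  # detach the first <p> and keep a handle on it

print(removed)         # the detached tag is still a usable Tag
print(removed.parent)  # it no longer has a parent
print(soup)            # the tree now contains only the second <p>
```

The returned tag can be inspected or inserted somewhere else, which is the main difference from simply discarding it.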

NavigableString

A tag's string attribute returns the text inside the tag; the value is a NavigableString.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

titleTag = soup.title

print(type(titleTag.string))
print(titleTag.string)

'''
<class 'bs4.element.NavigableString'>
Page title
'''

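Note: for a tag like the p tags above, string cannot be used to get the content. Use the text attribute or the strings attribute (which returns a generator) instead. The official documentation explains:

1. When a tag contains more than one child node, it is ambiguous which child .string should refer to, so .string is None.
2. text returns all of the strings inside the tag joined into a single string.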

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

pTag = soup.p

print(type(pTag.text))
print(pTag.text)

'''
<class 'str'>
This is paragraph one.
'''

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

p = soup.p
print(p.strings)

for s in p.strings:
    print(s)
'''
<generator object _all_strings at 0x0000020E0DD9A570>
This is paragraph 
one
.
'''
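
The three ways of reading text can be put side by side in one short sketch. It uses get_text() and stripped_strings, two Beautiful Soup members not mentioned above, and html.parser instead of lxml so that it runs with bs4 alone; the markup is the first paragraph of the sample document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>This is paragraph <b>one</b>.</p>", "html.parser")
p = soup.p

single = p.string                    # None: <p> has several children
joined = p.get_text()                # same result as p.text
pieces = list(p.stripped_strings)    # each string with whitespace trimmed

print(single)
print(joined)
print(pieces)
```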

BeautifulSoup

The BeautifulSoup object represents the whole document. It is also a node, and its name is [document].
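
A minimal sketch of these properties follows; the one-line markup and html.parser are assumptions made so the example stands alone.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

print(soup.name)              # the special name of the document node
print(isinstance(soup, Tag))  # the soup object supports the Tag API
print(soup.parent)            # the document node sits at the top of the tree
print(soup.p.string)          # so searching and navigation start from it
```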

Comment

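A Comment is a special kind of NavigableString. When printed, its content does not include the comment delimiters, so if it is not handled carefully it can cause unexpected trouble in text processing.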
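
The pitfall is easier to see in code. In the sketch below, the markup and the helper visible_string are hypothetical; the point is only that .string can hand back a Comment, which has to be told apart with isinstance.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<b><!--Hey, buddy--></b><i>plain</i>", "html.parser")

comment = soup.b.string
print(type(comment).__name__)                # Comment
print(isinstance(comment, NavigableString))  # a Comment is also a NavigableString
print(comment)                               # printed without <!-- and -->


def visible_string(tag):
    """Return tag.string unless it is a comment (hypothetical helper)."""
    value = tag.string
    return None if isinstance(value, Comment) else value


print(visible_string(soup.b))  # comment filtered out
print(visible_string(soup.i))  # ordinary text kept
```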

Nodes

Direct children

A tag's contents attribute returns its direct children as a list.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

tags = soup.html.contents

print(type(tags))
print(tags)

'''
<class 'list'>
['\n', <head><title>Page title</title></head>, '\n', <body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>, '\n']
'''

A tag also has a children attribute. It returns an iterator rather than a list, and iterating over it yields the same nodes as contents.
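
A short sketch of iterating over children, assuming a compact two-paragraph document and html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")
body = soup.body

# children must be iterated; it cannot be indexed like contents.
names = [child.name for child in body.children]
same_nodes = list(body.children) == body.contents

print(names)
print(same_nodes)
```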

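All descendants

A tag's descendants attribute yields all of its descendants (direct children, grandchildren, and so on recursively). Like children, it is lazy: it returns a generator rather than a list.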

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

tags = soup.descendants
print(type(tags))
for tag in tags:
    print(tag)

'''
<class 'generator'>
<html>
<head><title>Page title</title></head>
<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>
</html>


<head><title>Page title</title></head>
<title>Page title</title>
Page title


<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>


<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
This is paragraph 
<b>one</b>
one
.


<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
This is paragraph 
<b>two</b>
two
.
'''

Direct parent

A tag's parent attribute returns the node's direct parent.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

print(p.parent.name)
print(p.parent)
'''
body
<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>
'''

All ancestors

A tag's parents attribute walks up through all of the element's ancestors; it is a generator.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

print(p.parents)

for parent in p.parents:
    print(parent.name)

'''
<generator object parents at 0x000001C57839A570>
body
html
[document]
'''

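Adjacent siblings

Siblings are nodes at the same level as the current node. The next_sibling attribute returns the node's next sibling and previous_sibling returns the previous one; if there is no such node, the result is None. Note: in a real document, a tag's .next_sibling and .previous_sibling are usually strings or whitespace, because the whitespace and line breaks between tags are nodes too, so the result may be blank or a newline.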

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

# The p tag has a newline on each side, so the attribute is accessed twice
print(p.next_sibling.next_sibling)
print(p.previous_sibling.previous_sibling)

'''
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
None
'''

All siblings

The next_siblings and previous_siblings attributes iterate over the siblings that follow or precede the current node.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

for pre in p.previous_siblings:
    print(pre)
# Named nxt to avoid shadowing the built-in next()
for nxt in p.next_siblings:
    print(nxt)

'''



<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>



'''

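Adjacent elements

A tag's next_element and previous_element attributes return the nodes parsed immediately after and before it; these are not necessarily siblings at the same level.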

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

print(p.previous_element)
print(p.next_element)

'''


This is paragraph 

'''
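
The difference between sibling and element navigation is easiest to see on a document without whitespace between tags. The markup below is an assumption made for this sketch, and html.parser is used so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one <b>bold</b></p><p>two</p></body>", "html.parser")
first_p = soup.p
bold = soup.b

print(first_p.next_sibling)      # same level: the second <p>
print(first_p.next_element)      # parse order: the text inside the first <p>
print(bold.next_sibling)         # <b> is the last child of its parent
print(bold.string.next_element)  # after the text "bold" the parser met the second <p>
```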

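All elements before and after

The next_elements and previous_elements iterators move forward or backward through the document in the order it was parsed, as if the document were being parsed again.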

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

for pre in p.previous_elements:
    print(pre)
# Named nxt to avoid shadowing the built-in next()
for nxt in p.next_elements:
    print(nxt)

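Searching the tree

find_all( name , attrs , recursive , text , **kwargs ) # searches all tag descendants of the current tag and returns a list of the matches (since Beautiful Soup 4.4.0 the text argument is also available as string)
find( name , attrs , recursive , text , **kwargs ) # searches all tag descendants of the current tag and returns the first match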
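
The sketch below contrasts find_all() with find() and shows the effect of recursive. It also uses limit, a find_all() parameter that is not listed in the signature above. The compact markup and html.parser are assumptions made for the example.

```python
from bs4 import BeautifulSoup

html = (
    "<html><body>"
    "<p>This is <b>one</b>.</p>"
    "<p>This is <b>two</b>.</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

all_b = soup.find_all("b")                           # every <b> in the document
direct_b = soup.body.find_all("b", recursive=False)  # only direct children of <body>
first_only = soup.find_all("b", limit=1)             # stop after one match
first = soup.find("b")                               # first match, not a list
missing = soup.find("table")                         # no match gives None

print(all_b, direct_b, first_only, first, missing)
```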


find_parents() | find_parent() # find_all() and find() only search below the current node (children, grandchildren, and so on). find_parents() and find_parent() search the current node's ancestors instead, using the same matching rules as an ordinary tag search
find_next_siblings() | find_next_sibling() # These two methods iterate over the siblings that follow the current tag, via .next_siblings. find_next_siblings() returns all matching following siblings; find_next_sibling() returns only the first one
find_previous_siblings() | find_previous_sibling() # These two methods iterate over the siblings that precede the current tag, via .previous_siblings. find_previous_siblings() returns all matching preceding siblings; find_previous_sibling() returns only the first one
find_all_next() | find_next() # These two methods iterate over the tags and strings that come after the current tag, via .next_elements. find_all_next() returns all matching nodes; find_next() returns the first one
find_all_previous() | find_previous() # These two methods iterate over the tags and strings that come before the current node, via .previous_elements. find_all_previous() returns all matching nodes; find_previous() returns the first one
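
The directional search methods listed above can be tried on a small document. The markup below is a compact variant of the sample page, and html.parser is used only so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body>'
    '<p id="firstpara">This is <b>one</b>.</p>'
    '<p id="secondpara">This is <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p
second_p = soup.find(id="secondpara")
b = soup.b  # the <b> inside the first paragraph

parent_id = b.find_parent("p")["id"]                               # nearest <p> ancestor
ancestors = [t.name for t in b.find_parents(["p", "body"])]        # matching ancestors, nearest first
next_id = first_p.find_next_sibling("p")["id"]                     # following sibling
prev_id = second_p.find_previous_sibling("p")["id"]                # preceding sibling
after = [t.string for t in first_p.find_all_next("b")]             # later in parse order
before = [t.string for t in second_p.find_all_previous("b")]       # earlier in parse order

print(parent_id, ancestors, next_id, prev_id, after, before)
```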

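Arguments

name

The name argument accepts:
A string: matches tags whose name equals the string
A regular expression: matches tags whose name matches the pattern
A list: matches tags whose name equals any item in the list
A function: takes a single argument, the tag; the tag matches if the function returns True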

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='b'))

'''
[<b>one</b>, <b>two</b>]
'''

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

# The pattern is applied with search(), so 'b' also matches the body tag
print(soup.find_all(name=re.compile(r'b|title')))
'''
[<title>Page title</title>, <body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>, <b>one</b>, <b>two</b>]
'''

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

def findTag(tag):
    # Match only title and b tags
    return tag.name in ('title', 'b')

print(soup.find_all(name=findTag))
'''
[<title>Page title</title>, <b>one</b>, <b>two</b>]
'''
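
The list form of name is the one option not exercised above. The sketch below shows it, together with an anchored regular expression that avoids the accidental body match seen earlier. The compact markup and html.parser are assumptions.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>Page title</title></head><body>"
    "<p>This is paragraph <b>one</b>.</p>"
    "<p>This is paragraph <b>two</b>.</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

by_list = soup.find_all(["title", "b"])      # name equals any list item
loose = soup.find_all(re.compile(r"b"))      # unanchored: body matches too
strict = soup.find_all(re.compile(r"^b$"))   # anchored: only <b>

print([t.name for t in by_list])
print([t.name for t in loose])
print([t.name for t in strict])
```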

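Keyword arguments

Any keyword argument that is not one of the built-in search parameters is treated as a filter on the tag attribute of that name. For example, passing an argument named id makes Beautiful Soup filter on each tag's "id" attribute.

To filter on class, which is a reserved word in Python, append an underscore: class_=value.

You can also pass a dictionary of attributes through the attrs argument.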

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

# Find tags whose align attribute is center
print(soup.find_all(align='center'))
print('********************************')
attrs = {
    'id':'secondpara',
    'align':'blah'
}
# Find tags whose attributes match every entry in the dictionary
print(soup.find_all(attrs=attrs))

'''
[<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>]
********************************
[<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
'''
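
The class_ keyword described above is not covered by the example, so here is a small sketch of it. The three paragraphs, their class names, and the use of html.parser are made up for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="myp intro" id="firstpara">one</p>'
    '<p class="myp" id="secondpara">two</p>'
    '<p id="third">three</p>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches if any one of the tag's classes equals the value.
by_class = [t["id"] for t in soup.find_all(class_="myp")]
# The same kind of filter written through the attrs dictionary.
by_attrs = [t["id"] for t in soup.find_all("p", attrs={"class": "intro"})]
# Passing True matches every tag that has the attribute at all.
with_id = [t["id"] for t in soup.find_all(id=True)]

print(by_class, by_attrs, with_id)
```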

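CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to select elements here through soup.select(), which returns a list.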

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center" class="myp">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah" class="myp">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# Select the tag whose id is secondpara
print(soup.select('#secondpara'))
print("****************************************")
# Select tags whose class is myp
print(soup.select('.myp'))

'''
[<p align="blah" class="myp" id="secondpara">This is paragraph <b>two</b>.</p>]
****************************************
[<p align="center" class="myp" id="firstpara">This is paragraph <b>one</b>.</p>, 
<p align="blah" class="myp" id="secondpara">This is paragraph <b>two</b>.</p>]
'''
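
Selectors can also be combined. The sketch below adds a few common combinations and select_one(), which returns the first match instead of a list. It assumes a Beautiful Soup version that ships with CSS selector support (the soupsieve package) and uses html.parser on a compact copy of the sample markup.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body>'
    '<p id="firstpara" class="myp">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" class="myp">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")                       # tag name
children = soup.select("p > b")                 # direct child combinator
nested = soup.select("#firstpara b")            # descendant of an id
by_attr = soup.select('p[id="secondpara"]')     # attribute selector
first = soup.select_one(".myp")                 # first match only
nothing = soup.select_one("table")              # no match gives None

print(len(by_tag), children, nested, by_attr, first["id"], nothing)
```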
