6. Beautiful Soup

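In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract, and because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding. Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, which lets you try different parsing strategies or trade flexibility for speed.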
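The encoding behaviour described above can be sketched as follows. This is a minimal illustration, assuming the input arrives as ISO-8859-1 bytes with no declared encoding; the sample markup is made up for the example, and html.parser is used only so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Bytes in ISO-8859-1 with no encoding declaration anywhere in the markup.
raw = "<p>café</p>".encode("iso-8859-1")

# from_encoding states the original encoding so the bytes decode correctly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

text = soup.p.string             # already decoded to a Unicode str
encoded = soup.encode("utf-8")   # serialized back out as UTF-8 bytes

print(text)
print(encoded)
```

Inside the tree everything is a Unicode string; bytes only appear again when the document is explicitly encoded for output.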

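Installation

Beautiful Soup 3 is no longer being developed, so current projects should use Beautiful Soup 4. The package was renamed to bs4, which means the import is written as import bs4 (or from bs4 import BeautifulSoup).

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers such as lxml. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. lxml is more capable and faster, so installing it is recommended.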

pip install lxml beautifulsoup4
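If it is not known whether lxml is installed on the target machine, one possible approach is to try it first and fall back to the built-in parser. The helper name make_soup below is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is missing, so fall back to the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Keep in mind that different parsers may build slightly different trees from invalid markup, so a fallback like this is best suited to reasonably clean HTML.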

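Pros and cons

html.parser (Python's built-in parser): ships with Python, so there is nothing extra to install, and its speed is reasonable; it is less lenient with broken markup than html5lib.

lxml (third-party parser): very fast and fairly lenient; it depends on an external C library.

html5lib (third-party parser): extremely lenient, parses pages the way a web browser does and produces valid HTML5; it is very slow and is an extra Python dependency.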

Basic usage

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
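# Create the soup object. A local HTML file works too: soup = BeautifulSoup(open('./text.html'))
# You can name the parser explicitly. If you omit it, Beautiful Soup picks the best parser installed, for example lxml
soup = BeautifulSoup(html,'lxml')
# Pretty-print the parsed document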
print(soup.prettify())
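When the document lives in a local file, a with block closes the file handle once parsing is done. The sketch below writes a throwaway file first so that it runs on its own; the file content is made up for illustration, and html.parser is used to avoid requiring lxml.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small page to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><head><title>Page title</title></head>"
              "<body><p>Hello</p></body></html>")
    path = tmp.name

# An open file handle can be passed in exactly like a string.
with open(path, encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "html.parser")

os.remove(path)

title = soup.title.string
print(title)
```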

The four kinds of objects

Tag

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

# Get the first title tag in the document
print(type(soup.title))
print(soup.title)
# Get the first p tag in the document
print(soup.p)

'''
<class 'bs4.element.Tag'>
<title>Page title</title>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
'''

A Tag has two important attributes: name and attrs.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# The soup object itself is a special case: its name is [document]
print(soup.name)

pTag = soup.p
# Get the tag's name, then its attribute dictionary
print(pTag.name)
print(pTag.attrs)

'''
[document]
p
{'id': 'firstpara', 'align': 'center'}
'''

Getting attribute values

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

print(soup.p['id'])
print(soup.p.get('id'))

'''
firstpara
firstpara
'''
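The two access styles above differ when the attribute is missing: indexing raises KeyError, while get() returns None or a default. The sketch below also shows that attributes can be assigned and deleted like dictionary items. The class and id values are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="firstpara" class="lead intro">Hi</p>', "html.parser")
p = soup.p

missing = p.get("align")            # None rather than a KeyError
fallback = p.get("align", "left")   # explicit default value
classes = p["class"]                # class is multi-valued, so this is a list
has_id = p.has_attr("id")           # True

p["id"] = "renamed"                 # assign an attribute like a dict item
del p["class"]                      # delete an attribute

print(p)
```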

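Removing a tag

from bs4 import BeautifulSoup

# html is the same sample document used in the examples above
soup = BeautifulSoup(html,'lxml')
# extract() removes the tag from the tree and returns it
print(soup.p.extract())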

'''
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
'''
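Besides extract(), bs4 also offers decompose(), which removes a tag and destroys it instead of returning it. A small sketch of the difference, using made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>This is paragraph <b>one</b>.<i>note</i></p>", "html.parser"
)

removed = soup.b.extract()   # detached from the tree but still usable
soup.i.decompose()           # removed and destroyed; nothing is returned

remaining = soup.p.get_text()
print(removed)
print(remaining)
```

Use extract() when the removed fragment is needed later, for example to move it elsewhere in the tree, and decompose() when it should simply disappear.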

NavigableString

The text inside a tag is available through the tag's string attribute, which is of type NavigableString.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

titleTag = soup.title

print(type(titleTag.string))
print(titleTag.string)

'''
<class 'bs4.element.NavigableString'>
Page title
'''

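Note: for a tag like the p tag above, string cannot retrieve the content. Use the text attribute or the strings attribute (which returns a generator) instead. The official documentation explains:

1. When a tag contains more than one child node, it is not clear which child .string should refer to, so .string is None.
2. text returns all of the strings inside the tag joined into a single string.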

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

pTag = soup.p

print(type(pTag.text))
print(pTag.text)

'''
<class 'str'>
This is paragraph one.
'''

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

p = soup.p
print(p.strings)

for s in p.strings:
    print(s)
'''
<generator object _all_strings at 0x0000020E0DD9A570>
This is paragraph 
one
.
'''
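The three ways of reading text can be compared side by side. This sketch uses one paragraph from the sample document; get_text() is the method form of the text attribute, and stripped_strings is like strings with surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="firstpara">This is paragraph <b>one</b>.</p>', "html.parser"
)
p = soup.p

single = p.b.string                # <b> has exactly one child string
none_value = p.string              # <p> has several children, so None
joined = p.get_text()              # every string joined together
pieces = list(p.stripped_strings)  # each string, whitespace stripped

print(single, none_value, joined, pieces)
```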

BeautifulSoup

A BeautifulSoup object represents the entire document. It is also a node, and its name is [document].

Comment

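A Comment object is a special type of NavigableString. When printed, its content does not include the comment markers, so if it is not handled deliberately it can cause unexpected trouble in text processing.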
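One way to avoid that trouble is to check the type before using a string. The following is a sketch with made-up markup: it detects a comment with isinstance and filters comments out of the collected strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<p><!-- a hidden note --></p><p>visible</p>", "html.parser"
)

first = soup.p.string                  # the only child is the comment
is_comment = isinstance(first, Comment)

# Collect every string in the document, then drop the comments.
visible = [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]

print(repr(first), is_comment, visible)
```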

Nodes

Direct children

A tag's contents attribute returns its direct children as a list.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

tags = soup.html.contents

print(type(tags))
print(tags)

'''
<class 'list'>
['\n', <head><title>Page title</title></head>, '\n', <body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>, '\n']
'''

A tag also has a children attribute. It returns an iterator over the same direct children, so looping over it yields the same items as contents.
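A short sketch of children, using one paragraph from the sample document. Because children includes text nodes as well as tags, an isinstance check is a simple way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>This is paragraph <b>one</b>.</p>", "html.parser")
p = soup.p

# Materializing the iterator gives the same items as contents.
same = list(p.children) == p.contents

# Keep only the child nodes that are tags.
child_tags = [child.name for child in p.children if isinstance(child, Tag)]

print(same, child_tags)
```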

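All descendants

A tag's descendants attribute returns all of its descendants (direct children, grandchildren, and so on recursively). Like children, it is an iterator rather than a list; here it is a generator.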

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

tags = soup.descendants
print(type(tags))
for tag in tags:
    print(tag)

'''
<class 'generator'>
<html>
<head><title>Page title</title></head>
<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>
</html>


<head><title>Page title</title></head>
<title>Page title</title>
Page title


<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>


<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
This is paragraph 
<b>one</b>
one
.


<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
This is paragraph 
<b>two</b>
two
.
'''
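As the output above shows, descendants mixes tags and strings in document order. A minimal sketch that separates the two, using made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")

items = list(soup.div.descendants)

# Tags and text nodes arrive interleaved, in document order.
tag_names = [d.name for d in items if isinstance(d, Tag)]
strings = [str(d) for d in items if not isinstance(d, Tag)]

print(tag_names, strings)
```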

Direct parent

A tag's parent attribute returns the node's direct parent.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

print(p.parent.name)
print(p.parent)
'''
body
<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>
'''

All ancestors

A tag's parents attribute walks up the tree and yields every ancestor of the element. It is an iterator.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

print(p.parents)

# Use a separate loop variable so that p is not overwritten
for parent in p.parents:
    print(parent.name)
    
'''
<generator object parents at 0x000001C57839A570>
body
html
[document]
'''
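A common use of parents is building a path from the document root down to a tag. This is a sketch with made-up markup; the last ancestor is always the BeautifulSoup object itself, named [document], so it is dropped here.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p>This is <b>one</b></p></body></html>", "html.parser"
)
b = soup.b

ancestors = [parent.name for parent in b.parents]

# Drop the trailing "[document]" entry and read the path from the root down.
path = " > ".join(reversed(ancestors[:-1]))

print(ancestors)
print(path)
```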

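Adjacent siblings

Siblings are the nodes at the same level as the current node. The next_sibling attribute returns the node's next sibling and previous_sibling returns the previous one; if there is no such node, the result is None. Note: in real documents, a tag's .next_sibling and .previous_sibling are often strings or whitespace, because whitespace and line breaks count as nodes too, so the result may be a blank string or a newline.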

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

# The p tag has a newline before and after it, so the attribute is applied twice
print(p.next_sibling.next_sibling)
print(p.previous_sibling.previous_sibling)

'''
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
None
'''
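Chaining the attribute twice only works when exactly one whitespace node sits between the tags. A less fragile sketch is to iterate over next_siblings and keep only the Tag nodes. The third paragraph below is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<body>
<p id="firstpara">one</p>
<p id="secondpara">two</p>
<p id="thirdpara">three</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")
p = soup.p

raw_next = p.next_sibling   # the newline between the two tags

# Skip the whitespace strings and keep only sibling tags.
ids = [s["id"] for s in p.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), ids)
```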

All siblings

The next_siblings and previous_siblings attributes iterate over the siblings of the current node.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

for prev in p.previous_siblings:
    print(prev)
# nxt avoids shadowing the built-in next()
for nxt in p.next_siblings:
    print(nxt)

'''



<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>



'''

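Adjacent elements

A tag's next_element and previous_element attributes return the node parsed immediately after or before it. That node is not necessarily a sibling at the same level.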

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

print(p.previous_element)
print(p.next_element)

'''


This is paragraph 

'''
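The difference between next_element and next_sibling is easiest to see on markup without whitespace between tags. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b></p><p>three</p>", "html.parser")
first_p = soup.p
b = soup.b

sibling = first_p.next_sibling            # the second <p>, same level
element = first_p.next_element            # the string "one ", parsed next
b_sibling = b.next_sibling                # None: <b> is the last child
after_b = b.next_element.next_element     # "two", then the second <p>

print(sibling, repr(element), b_sibling, after_b)
```

next_element follows parse order, so it can step into a tag's own children or out into a later part of the document, whereas next_sibling stays on one level.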

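All preceding and following elements

The next_elements and previous_elements iterators move forward or backward through the document in parse order, as if the document were being parsed again.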

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p

for prev in p.previous_elements:
    print(prev)
# nxt avoids shadowing the built-in next()
for nxt in p.next_elements:
    print(nxt)

Searching the tree

find_all( name , attrs , recursive , text , **kwargs ) # Searches the descendants of the current tag and returns a list of every match (recent releases name the text argument string)
find( name , attrs , recursive , text , **kwargs ) # Searches the descendants of the current tag and returns only the first match


find_parents() | find_parent() # find_all() and find() only search below the current node (children, grandchildren, and so on). find_parents() and find_parent() search the ancestors of the current node instead, using the same matching rules as an ordinary tag search
find_next_siblings() | find_next_sibling() # Both iterate over .next_siblings, the sibling nodes that follow the current tag. find_next_siblings() returns every matching sibling; find_next_sibling() returns only the first match
find_previous_siblings() | find_previous_sibling() # Both iterate over .previous_siblings, the sibling nodes that precede the current tag. find_previous_siblings() returns every matching sibling; find_previous_sibling() returns only the first match
find_all_next() | find_next() # Both iterate over .next_elements, the tags and strings that come after the current tag. find_all_next() returns every match; find_next() returns only the first match
find_all_previous() | find_previous() # Both iterate over .previous_elements, the tags and strings that come before the current node. find_all_previous() returns every match; find_previous() returns only the first match
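The directional search methods listed above can be tried against the same sample document used throughout this section. This is a sketch only; html.parser is used here so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p")
second = soup.find(id="secondpara")

parent_id = soup.b.find_parent("p")["id"]            # search upward
next_id = first.find_next_sibling("p")["id"]         # later sibling
prev_id = second.find_previous_sibling("p")["id"]    # earlier sibling
next_b = first.find_next("b").string                 # forward in parse order
prev_b = second.find_previous("b").string            # backward in parse order
later_b = [b.string for b in first.find_all_next("b")]

print(parent_id, next_id, prev_id, next_b, prev_b, later_b)
```

The sibling variants skip the whitespace strings between tags automatically, because the name filter only matches tags.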

Arguments

name

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='b'))

'''
[<b>one</b>, <b>two</b>]
'''

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

print(soup.find_all(name=re.compile(r'b|title')))
'''
[<title>Page title</title>, <body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>, <b>one</b>, <b>two</b>]
'''

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

def find_tag(tag):
    # Match only <title> and <b> tags
    return tag.name in ('title', 'b')

print(soup.find_all(name=find_tag))
'''
[<title>Page title</title>, <b>one</b>, <b>two</b>]
'''
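Besides a string, a regular expression, or a function, the name argument also accepts a list of names or True (match every tag). The sketch below also shows the limit and recursive arguments of find_all(), again on the sample document with html.parser.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

by_list = [t.name for t in soup.find_all(["title", "b"])]   # any listed name
everything = [t.name for t in soup.find_all(True)]          # every tag
first_only = soup.find_all("b", limit=1)                    # stop after one
direct = soup.body.find_all("b", recursive=False)           # children only

print(by_list, everything, first_only, direct)
```

The last call returns an empty list because the b tags are grandchildren of body, not direct children.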

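Keyword arguments

If a named argument is not one of the built-in search parameters, it is treated as a filter on the tag attribute of that name. For example, passing an argument named id makes Beautiful Soup filter on each tag's "id" attribute.

Filtering by class needs a workaround, because class is a reserved word in Python. Append an underscore and write class_=value.

You can also pass a dictionary of attributes to the attrs parameter.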

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')

# Find tags whose align attribute is "center"
print(soup.find_all(align='center'))
print('********************************')
attrs = {
    'id':'secondpara',
    'align':'blah'
}
# Find tags whose attributes match every entry in the dictionary
print(soup.find_all(attrs=attrs))

'''
[<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>]
********************************
[<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
'''
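The class_ form mentioned above, and a few other keyword filters, can be sketched like this. The class names and the data-role attribute are made up for the example; attrs is the usual route for attribute names such as data-role that are not valid Python keyword names.

```python
import re

from bs4 import BeautifulSoup

html = """
<p id="firstpara" class="myp lead" data-role="intro">one</p>
<p id="secondpara" class="myp">two</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of a tag's classes.
by_class = [p["id"] for p in soup.find_all("p", class_="lead")]
# Attribute values can be matched with a regular expression.
by_regex = [p["id"] for p in soup.find_all(id=re.compile("^second"))]
# Names that are not valid keyword arguments go through attrs.
by_attrs = [p["id"] for p in soup.find_all(attrs={"data-role": "intro"})]
# True matches any tag that has the attribute at all.
with_id = len(soup.find_all(id=True))

print(by_class, by_regex, by_attrs, with_id)
```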

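CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same notation can be used to filter elements here through soup.select(), which returns a list.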

from bs4 import BeautifulSoup
import re

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center" class="myp">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah" class="myp">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# Find the tag whose id is secondpara
print(soup.select('#secondpara'))
print("****************************************")
# Find tags whose class is myp
print(soup.select('.myp'))

'''
[<p align="blah" class="myp" id="secondpara">This is paragraph <b>two</b>.</p>]
****************************************
[<p align="center" class="myp" id="firstpara">This is paragraph <b>one</b>.</p>, 
<p align="blah" class="myp" id="secondpara">This is paragraph <b>two</b>.</p>]
'''
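Selectors can also be combined. The sketch below uses a child combinator, an attribute selector, and select_one(), which returns the first match or None instead of a list. It assumes a bs4 release recent enough to bundle the Soup Sieve selector engine.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
    <body>
        <p id="firstpara" align="center" class="myp">This is paragraph <b>one</b>.</p>
        <p id="secondpara" align="blah" class="myp">This is paragraph <b>two</b>.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

bold = [b.string for b in soup.select("p.myp > b")]               # child combinator
centered = [p["id"] for p in soup.select('p[align="center"]')]    # attribute selector
first = soup.select_one("#secondpara b").string                   # first match only
nothing = soup.select_one("div.missing")                          # no match gives None

print(bold, centered, first, nothing)
```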