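In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect one; in that case you only have to state the original encoding. Beautiful Soup works with Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.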
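To make the encoding behaviour concrete, here is a minimal sketch. The sample markup and the choice of GBK are assumptions made for the illustration, and html.parser is used only so that no third-party parser is needed. The from_encoding argument is how the original encoding is stated when it cannot be detected.

```python
from bs4 import BeautifulSoup

# Sample bytes in a legacy encoding, with no <meta charset> declaration
raw = '<html><head><title>中文标题</title></head></html>'.encode('gbk')

# State the original encoding instead of letting Beautiful Soup guess
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

title = soup.title.string          # already decoded to a Unicode string
encoding = soup.original_encoding  # the encoding that was used for decoding
utf8_bytes = soup.encode('utf-8')  # serialise the tree back to UTF-8 bytes

print(title)
print(encoding)
```

If the guess made without from_encoding is wrong, the text usually comes out garbled, which is the signal to pass the encoding explicitly.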
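Beautiful Soup 3 is no longer developed, so use Beautiful Soup 4 in current projects. Version 4 ships as the bs4 package, which means the library is imported from bs4 (for example, from bs4 import BeautifulSoup).
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers such as lxml. If lxml is not installed, Beautiful Soup falls back to Python's built-in parser. lxml is more capable and faster, so installing it is recommended.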
pip install lxml beautifulsoup4
html.parser (Python's built-in parser):
lxml (third-party parser):
html5lib (third-party parser):
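Since lxml is optional, a script may want to prefer it and still run where it is missing. The sketch below shows one way to do that; make_soup is a hypothetical helper name chosen for this example, not part of Beautiful Soup. It relies on the FeatureNotFound exception that bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred='lxml'):
    """Parse with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<html><head><title>Page title</title></head><body></body></html>')
print(soup.title.string)
```

Keep in mind that different parsers can build slightly different trees from invalid HTML, so a fallback like this is safest when the input is reasonably well formed.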
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
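# Create the soup object. A local HTML file works too: soup = BeautifulSoup(open('./text.html'))
# Pass the parser explicitly. If it is omitted, Beautiful Soup picks the best parser installed (for example lxml) and emits a warning
soup = BeautifulSoup(html,'lxml')
# Pretty-print the parse tree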
print(soup.prettify())
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# Get the first <title> tag in the document
print(type(soup.title))
print(soup.title)
# Get the first <p> tag in the document
print(soup.p)
'''
<class 'bs4.element.Tag'>
<title>Page title</title>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
'''
A Tag has two important attributes: name and attrs.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# The soup object itself is special: its name is [document]
print(soup.name)
pTag = soup.p
# Get the tag's name
print(pTag.name)
print(pTag.attrs)
'''
[document]
p
{'id': 'firstpara', 'align': 'center'}
'''
Getting an attribute value:
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p['id'])
print(soup.p.get('id'))
'''
firstpara
firstpara
'''
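The two lookups above differ when the attribute is missing, and attributes can also be changed in place. The following sketch illustrates both points; the extra class attribute in the sample markup is an assumption added for the example.

```python
from bs4 import BeautifulSoup

html = '<p id="firstpara" align="center" class="intro lead">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

missing = p.get('title')   # .get() returns None for a missing attribute
try:
    p['title']             # square brackets raise KeyError instead
    raised = False
except KeyError:
    raised = True

classes = p['class']       # class is multi-valued, so it comes back as a list

p['id'] = 'intro'          # assign to change (or add) an attribute
del p['align']             # delete an attribute

print(missing, raised, classes)
print(p.attrs)
```

Using .get() is the safer choice when scraping pages where an attribute may or may not be present.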
Removing a tag
soup = BeautifulSoup(html,'lxml')
# Remove the first <p> tag from the tree; extract() returns the removed tag
print(soup.p.extract())
'''
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
'''
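Beautiful Soup also offers decompose(), which removes a tag without handing it back. The sketch below contrasts the two on a shortened version of the sample markup (the shortening is an assumption made to keep the output small).

```python
from bs4 import BeautifulSoup

html = ('<p id="firstpara">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara">This is paragraph <b>two</b>.</p>')
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches the tag from the tree and returns it for reuse
removed = soup.b.extract()
print(removed)
print(removed.parent)  # None: the tag no longer belongs to the tree

# decompose() removes the tag and destroys it; nothing is returned
soup.find(id='secondpara').decompose()

print(soup)
```

Reach for extract() when the removed piece is still needed, and decompose() when it is simply unwanted.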
The string attribute of a tag returns the text inside it; the value is of type NavigableString.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
titleTag = soup.title
print(type(titleTag.string))
print(titleTag.string)
'''
<class 'bs4.element.NavigableString'>
Page title
'''
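Note: .string cannot retrieve the content of a tag such as the <p> tags above. Use the text attribute instead, or the strings attribute (which returns an iterator). The official documentation explains:
1. When a tag contains more than one child node, it is unclear which child .string should refer to, so .string is None.
2. text returns all the strings inside the tag joined into a single string.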
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
pTag = soup.p
print(type(pTag.text))
print(pTag.text)
'''
<class 'str'>
This is paragraph one.
'''
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p
print(p.strings)
for s in p.strings:
    print(s)
'''
<generator object _all_strings at 0x0000020E0DD9A570>
This is paragraph
one
.
'''
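To see the difference between these accessors side by side, here is a small sketch on the first paragraph of the sample markup. It also uses get_text() and stripped_strings, two related Beautiful Soup accessors that the text above does not mention.

```python
from bs4 import BeautifulSoup

html = '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

whole = p.string                   # None: <p> has several child nodes
inner = p.b.string                 # <b> has exactly one string child
joined = p.get_text()              # same result as p.text
pieces = list(p.stripped_strings)  # like .strings, with whitespace stripped

print(whole, inner)
print(joined)
print(pieces)
```

Checking for None before using .string avoids a common source of AttributeError in scraping scripts.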
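The BeautifulSoup object represents the whole document. It also behaves like a node, and its name is [document].
A Comment object is a special type of NavigableString. When printed, its content does not include the comment markers, so unless it is handled deliberately it can cause unexpected trouble in text processing.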
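The pitfall is easier to see in code. In this sketch (the markup is an assumption made for the example), the comment comes back from .string looking like ordinary text, and only a type check reveals what it is. The last part shows one possible way to strip comments before further processing.

```python
from bs4 import BeautifulSoup, Comment

html = '<p><!-- a hidden note --></p><p>Visible text</p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.p.string
# The comment markers are gone, so the value looks like ordinary text
print(repr(first))
# Checking the type is how to tell a comment from real text
is_comment = isinstance(first, Comment)

# Collect every comment in the document, then remove each from the tree
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
for c in comments:
    c.extract()

print(is_comment, len(comments))
print(soup)
```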
A tag's contents attribute returns its direct children as a list.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
tags = soup.html.contents
print(type(tags))
print(tags)
'''
<class 'list'>
['\n', <head><title>Page title</title></head>, '\n', <body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>, '\n']
'''
A tag also has a children attribute. It returns an iterator rather than a list, and looping over it yields the same direct children as contents.
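A short sketch of contents next to children, on a reduced version of the sample markup (an assumption for brevity). Whitespace between tags counts as a child too, so the example filters on Tag to keep only the elements.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<body><p id="firstpara">one</p>\n<p id="secondpara">two</p></body>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# .contents is a real list, so it supports len() and indexing
print(len(body.contents))

# .children iterates over the same direct children
child_ids = [child['id'] for child in body.children if isinstance(child, Tag)]
print(child_ids)
```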
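A tag's descendants attribute yields all of its descendants recursively (children, grandchildren, and so on). Like children it has to be iterated over; it is a generator, as the type printed in the output below shows.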
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
tags = soup.descendants
print(type(tags))
for tag in tags:
    print(tag)
'''
<class 'generator'>
<html>
<head><title>Page title</title></head>
<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>
</html>
<head><title>Page title</title></head>
<title>Page title</title>
Page title
<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
This is paragraph
<b>one</b>
one
.
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
This is paragraph
<b>two</b>
two
.
'''
A tag's parent attribute returns the node's direct parent.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p
print(p.parent.name)
print(p.parent)
'''
body
<body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>
'''
A tag's parents attribute walks up through all of the element's ancestors; it is an iterator.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p
print(p.parents)
# Use a separate loop variable so that p is not overwritten
for parent in p.parents:
    print(parent.name)
'''
<generator object parents at 0x000001C57839A570>
body
html
[document]
'''
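Sibling nodes are nodes at the same level as the current node. The next_sibling attribute returns the node's next sibling and previous_sibling returns its previous sibling; if there is no such node, the result is None. Note: in real documents, .next_sibling and .previous_sibling are often strings or whitespace, because whitespace and line breaks count as nodes too, so the result may be a blank string or a newline.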
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p
# The <p> tag has a newline on each side, so the attribute is accessed twice
print(p.next_sibling.next_sibling)
print(p.previous_sibling.previous_sibling)
'''
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
None
'''
The next_siblings and previous_siblings attributes iterate over the siblings of the current node.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p
for prev_node in p.previous_siblings:
    print(prev_node)
# Avoid naming the loop variable next, which would shadow the built-in
for next_node in p.next_siblings:
    print(next_node)
'''
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
'''
A tag's next_element and previous_element attributes return the node parsed immediately after or before it, which is not necessarily a sibling at the same level.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p
print(p.previous_element)
print(p.next_element)
'''
This is paragraph
'''
The next_elements and previous_elements iterators move forward or backward through the parsed content, as if the document were being parsed again.
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
p = soup.p
for prev_node in p.previous_elements:
    print(prev_node)
# Avoid naming the loop variable next, which would shadow the built-in
for next_node in p.next_elements:
    print(next_node)
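Because these iterators mix tags and strings, it often helps to separate the two. The sketch below does so for the nodes that follow the first <p>; the markup is the sample document written without line breaks (an assumption, to keep whitespace nodes out of the result).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

# Keep only the tags that come after the start of the first <p>
following_tags = [el.name for el in p.next_elements if isinstance(el, Tag)]
# Keep only the strings that come after the start of the first <p>
following_text = [str(el) for el in p.next_elements if not isinstance(el, Tag)]

print(following_tags)
print(following_text)
```

Note that next_elements starts inside the tag itself: the first items are the paragraph's own children, not its siblings.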
find_all(name, attrs, recursive, string, limit, **kwargs)  # Searches all descendants of the current tag and returns every match as a list. Older releases name the string parameter text
find(name, attrs, recursive, string, **kwargs)  # Searches all descendants of the current tag and returns the first match
find_parents() | find_parent()  # find_all() and find() only search below the current node (children, grandchildren, and so on). find_parents() and find_parent() search the node's ancestors instead, using the same filters as an ordinary tag search
find_next_siblings() | find_next_sibling()  # Both iterate over the siblings that follow the current tag via .next_siblings. find_next_siblings() returns all matching later siblings; find_next_sibling() returns only the first matching one
find_previous_siblings() | find_previous_sibling()  # Both iterate over the siblings that precede the current tag via .previous_siblings. find_previous_siblings() returns all matching earlier siblings; find_previous_sibling() returns only the first matching one
find_all_next() | find_next()  # Both iterate over the tags and strings that follow the current tag via .next_elements. find_all_next() returns all matching nodes; find_next() returns the first matching one
find_all_previous() | find_previous()  # Both iterate over the tags and strings that precede the current node via .previous_elements. find_all_previous() returns all matching nodes; find_previous() returns the first matching one
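The whole family can be tried on one document. This is a minimal sketch that uses the sample markup without line breaks (an assumption, so whitespace nodes do not get in the way); each line exercises one of the methods listed above.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')                 # first match only
second_p = soup.find(id='secondpara')

parent_id = soup.find('b').find_parent('p')['id']     # search upwards
next_id = first_p.find_next_sibling('p')['id']        # a later sibling
prev_id = second_p.find_previous_sibling('p')['id']   # an earlier sibling
next_b = first_p.find_next('b').string                # later in document order
prev_b = second_p.find_previous('b').string           # earlier in document order
all_b_after = [b.string for b in first_p.find_all_next('b')]
limited = soup.find_all('p', limit=1)                 # stop after one match

print(parent_id, next_id, prev_id)
print(next_b, prev_b, all_b_after, len(limited))
```

Unlike the raw next_sibling attribute, find_next_sibling('p') skips over whitespace strings, which makes it the more convenient choice on real pages.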
from bs4 import BeautifulSoup
import re
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='b'))
'''
[<b>one</b>, <b>two</b>]
'''
from bs4 import BeautifulSoup
import re
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# The pattern is applied with search(), so "b" also matches the <body> tag
print(soup.find_all(name=re.compile(r'b|title')))
'''
[<title>Page title</title>, <body>
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>
</body>, <b>one</b>, <b>two</b>]
'''
from bs4 import BeautifulSoup
import re
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
def findTag(tag):
    # Return True for <title> and <b> tags, False for everything else
    return tag.name == 'title' or tag.name == 'b'
print(soup.find_all(name=findTag))
'''
[<title>Page title</title>, <b>one</b>, <b>two</b>]
'''
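A keyword argument whose name is not one of the built-in search parameters is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute.
To filter by class, note that class is a reserved word in Python, so append an underscore: class_=value.
You can also pass a dictionary of attributes to the attrs parameter.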
from bs4 import BeautifulSoup
import re
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# Find tags whose align attribute is "center"
print(soup.find_all(align='center'))
print('********************************')
attrs = {
    'id':'secondpara',
    'align':'blah'
}
# Find tags whose attributes match every entry in the dictionary
print(soup.find_all(attrs=attrs))
'''
[<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>]
********************************
[<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
'''
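The id and class_ keyword filters described earlier can be sketched the same way. The class values in this markup are assumptions added for the example, since the sample document used so far has no class attributes.

```python
from bs4 import BeautifulSoup

html = ('<p id="firstpara" class="myp intro">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" class="myp">This is paragraph <b>two</b>.</p>')
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='secondpara')           # keyword argument used as attribute filter
by_class = soup.find_all(class_='myp')           # class_ because class is reserved
by_both = soup.find_all('p', class_='intro')     # tag name plus class
by_dict = soup.find_all(attrs={'class': 'myp'})  # same filter without the underscore

print([p['id'] for p in by_id])
print([p['id'] for p in by_class])
print([p['id'] for p in by_both])
```

A class filter matches when any one of the tag's classes equals the value, which is why the first paragraph with class "myp intro" is found by both class searches.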
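In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to select elements here through soup.select(), which returns a list.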
from bs4 import BeautifulSoup
import re
html = """
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center" class="myp">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah" class="myp">This is paragraph <b>two</b>.</p>
</body>
</html>
"""
soup = BeautifulSoup(html,'lxml')
# Find the tag whose id is secondpara
print(soup.select('#secondpara'))
print("****************************************")
# Find tags whose class is myp
print(soup.select('.myp'))
'''
[<p align="blah" class="myp" id="secondpara">This is paragraph <b>two</b>.</p>]
****************************************
[<p align="center" class="myp" id="firstpara">This is paragraph <b>one</b>.</p>,
<p align="blah" class="myp" id="secondpara">This is paragraph <b>two</b>.</p>]
'''
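Selectors can be combined much as in a stylesheet. The sketch below tries a few combinations and also select_one(), which returns a single element instead of a list. It assumes a Beautiful Soup 4 release recent enough to bundle the soupsieve package, which provides the CSS selector support.

```python
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<p id="firstpara" align="center" class="myp">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="blah" class="myp">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Child combinator: <b> tags directly inside a <p>
children = [b.string for b in soup.select('p > b')]
# Attribute selector: <p> tags with align="center"
centered = [p['id'] for p in soup.select('p[align="center"]')]
# Tag name, class, and id in one selector
combined = soup.select('p.myp#secondpara')
# select_one() returns the first match, or None when nothing matches
first_only = soup.select_one('.myp')
nothing = soup.select_one('.missing')

print(children, centered)
print(combined)
print(first_only['id'], nothing)
```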