HTML Parsing Methods in Python

2020.05.08

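I. The powerful BeautifulSoup: BeautifulSoup is a Python library for extracting data from HTML or XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify the document. In Python development, its search and extraction features are what is mostly used; the modification features are rarely needed.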

1. Install BeautifulSoup

pip3 install beautifulsoup4

2. Install the third-party HTML parser lxml

pip3 install lxml

3. Install html5lib, a parser implemented in pure Python

pip3 install html5lib
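
Before choosing a parser it can help to check which of the packages above are actually installed. The following is only a sketch: the helper name available_parsers is made up for this illustration, and it checks whether the modules can be found, not whether BeautifulSoup will accept them.

```python
import importlib.util


def available_parsers():
    """Return the names of parser backends that can be found in this environment."""
    # 'html.parser' is part of the standard library, so it is always present.
    found = ['html.parser']
    # lxml and html5lib are optional third-party packages.
    for module_name in ('lxml', 'html5lib'):
        if importlib.util.find_spec(module_name) is not None:
            found.append(module_name)
    return found


parsers = available_parsers()
print(parsers)
```

Any name in the printed list can then be passed as the second argument when creating a BeautifulSoup object.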

II. Using BeautifulSoup:

1. Import the bs4 library

from bs4 import BeautifulSoup  # import the bs4 library

2. Create a string containing HTML code

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
"""

3. Create a BeautifulSoup object

(1) Create it directly from a string

soup = BeautifulSoup(html_str, 'lxml')  # 'lxml' is the parser here; 'html.parser' also works

print(soup.prettify())  # print the contents of the soup object, indented

(2) Create it from an existing file

with open('/home/index.html') as f:  # the with-block closes the file afterwards
    soup = BeautifulSoup(f, features='html.parser')  # 'html.parser' is the parser; 'lxml' also works
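
The two ways of creating the object can be tried side by side. The sketch below assumes the built-in html.parser so that it does not depend on lxml, and it writes a throwaway file with tempfile instead of relying on /home/index.html; the sample markup and variable names are invented for the illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

sample_html = "<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"

# (1) Parse directly from a string.
soup_from_str = BeautifulSoup(sample_html, 'html.parser')

# (2) Parse from a file: write the same markup to a temporary file first.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample_html)
    path = tmp.name

with open(path, encoding='utf-8') as f:
    soup_from_file = BeautifulSoup(f, 'html.parser')

os.remove(path)  # clean up the temporary file

print(soup_from_str.title.string)
print(soup_from_file.title.string)
```

Both objects hold the same tree, so everything that follows applies regardless of how the soup was created.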

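4. Kinds of BeautifulSoup objects: BeautifulSoup converts a complex HTML document into a complex tree structure in which every node is a Python object

(1) BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag. Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no real name or attributes (its .name is the special value '[document]')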

(2) Tag: corresponds to a tag in the original XML or HTML document; informally, a markup element

For example:

Extract title: print(soup.title)

Extract a: print(soup.a)

Extract p: print(soup.p)
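
Accessing a tag by name returns only the first match, and returns None when the document contains no such tag. A minimal sketch, assuming the built-in html.parser and an invented sample document:

```python
from bs4 import BeautifulSoup

doc = ("<html><head><title>Demo</title></head>"
       "<body><p>first</p><p>second</p></body></html>")
soup = BeautifulSoup(doc, 'html.parser')

first_p = soup.p   # only the first <p> is returned
missing = soup.a   # this document has no <a>, so the result is None

print(soup.title)
print(first_p)
print(missing)
```

Checking for None before going further avoids an AttributeError on pages where the tag is absent.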

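A Tag has two important properties: name and attributes. Every Tag has its own name, available through .name

print(soup.title.name)

Tag attributes are read and written the same way as dictionary entries

For example, given <p class='p1'>Hello World</p>:

print(soup.p['class'])

Attributes can also be reached with dot access; .attrs returns all of a Tag's attributes as a dictionary

print(soup.p.attrs)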
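
The name and attribute access described above can be sketched as follows. The id attribute and the variable names are additions for this illustration, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='p1' id='greeting'>Hello World</p>", 'html.parser')
tag = soup.p

tag_name = tag.name            # the tag's name
classes = tag['class']         # class can hold several values, so a list comes back
tag_id = tag['id']             # single-valued attributes come back as strings
all_attrs = tag.attrs          # every attribute, as a dictionary
title_attr = tag.get('title')  # .get() gives None instead of raising KeyError

print(tag_name, classes, tag_id, all_attrs, title_attr)
```

Note that tag['title'] would raise KeyError here, exactly as a missing dictionary key would, which is why .get() is often the safer choice.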

(3) NavigableString: the text inside a tag, obtained through .string

BeautifulSoup uses the NavigableString class to wrap the strings inside a Tag. A NavigableString behaves like an ordinary Python Unicode string, and in Python 3 it can be converted to one directly with str() (in Python 2 this was unicode())

For example: u_string = str(soup.p.string)
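
A short sketch of the conversion, assuming Python 3 and html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", 'html.parser')

nav = soup.p.string   # .string reaches through the single <b> child
plain = str(nav)      # an ordinary str with no reference to the parse tree

print(type(nav).__name__)
print(type(plain).__name__)
```

Converting to a plain str is useful when the text is kept after the soup is discarded, since a NavigableString keeps a reference to the whole tree.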

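(4) Comment: a special kind of NavigableString that holds an HTML comment. If you do not know what a tag's .string contains, comment text can end up mixed into the extracted data. When extracting strings, it is therefore worth checking the type first:

import bs4
if isinstance(soup.a.string, bs4.element.Comment):  # assumes the document has an <a> tag
    print(soup.a.string)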
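
The same type check can be used to filter comments out while collecting text. This is a sketch on invented markup, assuming html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = ("<p><a href='#'><!-- hidden note --></a>"
          "<a href='/x'>visible link</a></p>")
soup = BeautifulSoup(markup, 'html.parser')

visible_texts = []
for link in soup.find_all('a'):
    text = link.string
    # Comment is a subclass of NavigableString, so it needs an explicit check.
    if isinstance(text, Comment):
        continue  # skip text that came from an HTML comment
    visible_texts.append(str(text))

print(visible_texts)
```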

5. Traversing the document

(1) Child nodes:

A. Direct children can be accessed through .contents and .children

.contents: returns the Tag's child nodes as a list

print(soup.head.contents)

.children: returns an iterator for looping over the Tag's child nodes

for child in soup.head.children:
    print(child)
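
The difference between the two can be seen on a small document. The markup below is invented for the illustration and html.parser is assumed.

```python
from bs4 import BeautifulSoup

doc = ("<html><head><title>Demo</title><meta charset='utf-8'></head>"
       "<body></body></html>")
soup = BeautifulSoup(doc, 'html.parser')

as_list = soup.head.contents   # a real list: supports len() and indexing
first_child = as_list[0]

# .children is meant for iteration only.
child_names = [child.name for child in soup.head.children]

print(len(as_list), first_child, child_names)
```

Only direct children are included. When the markup has line breaks between tags, those whitespace strings also show up as children, with a .name of None.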

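B. Getting the content of a child node

.string: if the tag contains no further tags, it returns the tag's text; if the tag contains exactly one child tag, it returns the innermost text; if the tag contains several child nodes, so that it cannot tell which one .string should refer to, it returns None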
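
The three cases can be checked on a small example. The markup and ids are made up for this sketch, html.parser is assumed, and get_text() is shown only as one possible fallback for the None case.

```python
from bs4 import BeautifulSoup

doc = ("<div>"
       "<p id='a'>plain text</p>"
       "<p id='b'><b>nested text</b></p>"
       "<p id='c'>one <b>two</b></p>"
       "</div>")
soup = BeautifulSoup(doc, 'html.parser')

only_text = soup.find(id='a').string         # no child tags: the text itself
single_child = soup.find(id='b').string      # one child tag: the innermost text
several_children = soup.find(id='c').string  # several children: None

# get_text() joins all the text below a tag, whatever its structure.
all_text = soup.find(id='c').get_text()

print(only_text, single_child, several_children, all_text)
```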