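html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""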
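Code:
from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, 'html.parser')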
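Next, you can start using the various features.
soup.X (X is any tag name; this returns the first matching tag in full, including its attributes and contents)
For example: soup.title
# <title>The Dormouse's story</title>
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.a (note: returns only the first match)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a') (find_all returns every match)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
find can also search by attribute
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>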
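The attribute lookup above extends to other filters. The following is a minimal sketch, assuming a trimmed-down copy of the sample markup (the class value "other" on the third link is made up for illustration) and Python's built-in html.parser. It shows filtering by class, by an attrs dictionary, and with a result limit.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup; class="other" is a hypothetical value for this sketch
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="other" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# "class" is a reserved word in Python, so the keyword argument is class_
sisters = soup.find_all("a", class_="sister")
names = [a.get_text() for a in sisters]

# attrs takes a dictionary, which works for any attribute name
third = soup.find("a", attrs={"id": "link3"})

# find returns None when nothing matches, so check before using the result
missing = soup.find(id="link9")

# limit stops the search after the given number of matches
first_two = soup.find_all("a", limit=2)
```

Because find returns None rather than raising an error, code that chains calls on its result usually needs an explicit None check.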
To read an attribute from each matching tag, combine find_all with get
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
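One detail worth seeing in code is what happens when a tag lacks the attribute. This is a small sketch, assuming a made-up two-link snippet in which the second link has no href. It contrasts get, which returns None for a missing attribute, with filtering on href=True so that only tags carrying the attribute are returned.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> deliberately has no href attribute
html = '<a href="http://example.com/elsie">Elsie</a><a name="anchor">no href</a>'
soup = BeautifulSoup(html, "html.parser")

# get returns None when the attribute is absent
all_values = [a.get("href") for a in soup.find_all("a")]

# href=True matches only tags that have an href, so indexing is safe here
only_with_href = [a["href"] for a in soup.find_all("a", href=True)]
```

Indexing with a["href"] on a tag without that attribute would raise KeyError, which is why the second list filters first.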
To get all of the text in the HTML document, use get_text()
print(soup.get_text())
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
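The raw output of get_text() keeps the whitespace of the source markup. As a hedged sketch, assuming a small made-up fragment with padded text, the separator and strip arguments and the stripped_strings generator can produce tidier results:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace around the text
html = "<div><p> The Dormouse's story </p>\n<p> Once upon a time </p></div>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text piece and drops pieces that are only whitespace;
# the first argument is the separator placed between the remaining pieces
joined = soup.get_text("|", strip=True)

# stripped_strings yields the same trimmed pieces one at a time
pieces = list(soup.stripped_strings)
```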
To parse an HTML file instead, pass an open file object:
soup = BeautifulSoup(open("index.html"), 'html.parser')
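A bare open() call leaves the file handle open. The sketch below is one possible alternative that uses a with block and an explicit encoding. It first writes a small sample file to a temporary directory so that it can run anywhere; the file contents and the UTF-8 encoding are assumptions for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small sample file so the sketch is runnable; normally the file already exists
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Sample page</title></head></html>")

# The with block closes the file as soon as parsing is done
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

title = soup.title.string
```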
Objects in BeautifulSoup
tag (corresponds to a tag in the HTML)
tag.attrs (returns all of the tag's attributes as a dictionary)
You can add, delete, and modify a tag's attributes directly, just as you would with a dictionary
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
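The steps above can be run end to end as a self-contained sketch. The blockquote markup and the id values here are assumed for illustration. The sketch also shows that bs4 returns class as a list, since an element can carry several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a two-part class value
soup = BeautifulSoup(
    '<blockquote class="very bold" id="q1">Extremely bold</blockquote>',
    "html.parser",
)
tag = soup.blockquote

# class is multi-valued, so it comes back as a list of strings
classes_before = tag["class"]
attrs_before = dict(tag.attrs)

# Modify and delete attributes exactly as with a dictionary
tag["id"] = "q2"
del tag["class"]

# get with a default avoids the KeyError that tag["class"] would raise now
classes_after = tag.get("class", [])
rendered = str(tag)
```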
X.contents (X is a tag; returns a list of the tag's direct children)
For example:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
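As a brief sketch related to .contents, assuming only the head fragment of the sample document, the following compares .contents (a list), .children (an iterator over the same direct children), and .string (a shortcut when a tag holds a single text node):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

# .contents is a list, so it supports len() and indexing
title_tag = head_tag.contents[0]

# .children iterates over the same direct children without building a list
child_names = [child.name for child in head_tag.children]

# The text inside <title> is a NavigableString, not a Tag
text_node = title_tag.contents[0]
is_text = isinstance(text_node, NavigableString)

# .string is a shortcut for a tag whose only child is a text node
title_text = title_tag.string
```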
Fixing garbled characters (mojibake) when parsing a web page:
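from urllib.request import urlopen
from bs4 import BeautifulSoup

# Tell the parser which encoding the raw bytes use
# (BeautifulSoup 3 spelled these names fromEncoding and originalEncoding)
page = urlopen('http://www.leeon.me')
soup = BeautifulSoup(page, 'html.parser', from_encoding='gb18030')

# Show the encoding that was used to decode the page
print(soup.original_encoding)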
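The same idea can be tried without a network connection. This is a minimal sketch, assuming a made-up page whose bytes are encoded as GB18030. It passes the raw bytes to bs4 with from_encoding and then reads back original_encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded to bytes the way a GB18030 site would serve it
raw = "<html><head><title>中文标题</title></head></html>".encode("gb18030")

# from_encoding only applies to bytes input; it tells bs4 how to decode them
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

decoded_title = soup.title.string
used_encoding = soup.original_encoding
```

If the text still comes out garbled, the declared encoding is probably wrong for that page, and trying another candidate such as "gbk" or "utf-8" is a reasonable next step.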