Beautiful Soup, a Python HTML/XML Parsing Library: Quick-Start Tutorial

Published: 2013-8-2


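Beautiful Soup is a library for parsing HTML and XML documents.

A simple way to install Beautiful Soup on Windows 7

1. Use git to download the latest code, then change into the top-level directory of the source tree.

2. Run: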

python setup.py install

The installation is then complete.
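
One way to confirm that the installation worked is to import the package and print its version. This is a minimal sketch, assuming Python 3 and that the package is importable as bs4 (the name used by the import statement later in this tutorial); the variable name installed_version is chosen only for illustration.

```python
import bs4

# __version__ holds the installed Beautiful Soup release as a string.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails with ImportError, the install step did not complete for the interpreter being used.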

Quick-start tutorial

Suppose the HTML document to parse is:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Print the parsed document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
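
The constructor call above lets Beautiful Soup pick whichever parser is installed, so results can differ between machines. The sketch below names the parser explicitly. It assumes Python 3 and uses "html.parser", the parser bundled with the Python standard library; the short snippet string is a stand-in invented for this example, not the html_doc above.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (an assumption for this sketch).
snippet = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'>Hi</p></body></html>"
)

# Passing the parser name as the second argument makes the choice explicit.
soup = BeautifulSoup(snippet, "html.parser")

title_text = soup.title.string
print(title_text)
```

Other parsers such as lxml can be named the same way if they are installed; they may build slightly different trees for malformed markup.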

Some other common operations:

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u"The Dormouse's story"
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# [u'title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
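
The lookups above can also be narrowed by attribute. The following is a minimal sketch, assuming Python 3 and the "html.parser" parser; the snippet, including the extra "nav" link, is invented for illustration and is not part of html_doc.

```python
from bs4 import BeautifulSoup

# Stand-in markup for this sketch; the "nav" link is hypothetical.
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/about" class="nav" id="about">About</a>
</p>
"""
soup = BeautifulSoup(snippet, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python.
sisters = soup.find_all("a", class_="sister")
sister_ids = [a["id"] for a in sisters]
print(sister_ids)

# find() returns the first match, or None when nothing matches.
missing = soup.find(id="link3")
print(missing)

# .get() returns None for an absent attribute instead of raising KeyError.
title_attr = sisters[0].get("title")
print(title_attr)
```

Checking for None before using the result of find() avoids an AttributeError when the element is absent from the page.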

Extract every URL in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
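
Real pages often contain relative links and anchors without an href attribute. The sketch below is one possible way to handle both, assuming Python 3: href=True keeps only tags that carry the attribute, and urljoin from the standard library resolves each link against a base URL. The base_url value and the snippet are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the markup came from.
base_url = "http://example.com/stories/"

# Stand-in markup mixing absolute, relative, and missing hrefs.
snippet = """
<a href="/elsie">Elsie</a>
<a href="lacie">Lacie</a>
<a name="no-href">anchor without a link</a>
<a href="http://example.com/tillie">Tillie</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

# href=True skips <a> tags that have no href attribute at all.
urls = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for url in urls:
    print(url)
```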

Extract all of the text:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
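
As the output above shows, get_text() keeps the whitespace of the original markup. A minimal sketch of tidier extraction follows, assuming Python 3 and the "html.parser" parser: the separator argument is placed between text fragments, and strip=True trims each fragment and drops empty ones. The two-paragraph snippet is invented for this example.

```python
from bs4 import BeautifulSoup

# Stand-in markup for this sketch.
snippet = "<p>  Once upon a time </p>\n<p> there were three sisters  </p>"
soup = BeautifulSoup(snippet, "html.parser")

# One fragment per line, with surrounding whitespace removed.
clean_text = soup.get_text(separator="\n", strip=True)
print(clean_text)

# stripped_strings yields the same fragments one at a time.
fragments = list(soup.stripped_strings)
print(fragments)
```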

When reposting, please credit the source: [169IT-IT技术资讯]