2017 soup (math) key points and solutions!

Posted by zx69258273 on 2022-3-1 10:59:35

① GET request (searching for something)

Automatically search Baidu for a keyword and fetch the results page

<pre> <code class="language-text"># GET request
import urllib.request
import urllib.parse

keywd = "Python"
url = "http://www.baidu.com/s?wd=" + keywd  # build the URL
print(url)
req = urllib.request.Request(url)  # wrap the URL in a Request object
data = urllib.request.urlopen(req).read()
with open("C:/Users/admin/Desktop/a.html", "wb") as fh:  # write the HTML file in binary mode
    fh.write(data)

# If the search keyword is Chinese
keywd1 = "亚马孙"
keywd1 = urllib.parse.quote(keywd1)  # percent-encode the Chinese text with quote
url1 = "http://www.baidu.com/s?wd=" + keywd1
req = urllib.request.Request(url1)
data = urllib.request.urlopen(req).read()
with open("C:/Users/admin/Desktop/a.html", "wb") as fh:  # binary mode
    fh.write(data)</code></pre>
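The string concatenation and quote() call above can also be done in one step with urllib.parse.urlencode. The following is only a minimal sketch and not part of the original post: the helper name build_search_url is made up for illustration, and no network request is sent.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_search_url(keyword, base="http://www.baidu.com/s"):
    """Return a search URL with the keyword percent-encoded (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text as UTF-8, so Chinese keywords work too
    return base + "?" + urlencode({"wd": keyword})


english_url = build_search_url("Python")
chinese_url = build_search_url("亚马孙")
print(english_url)
print(chinese_url)

# Decoding the query string gives back the original keyword
decoded = parse_qs(urlsplit(chinese_url).query)["wd"][0]
print(decoded)
```

Building the query from a dict also scales to several parameters without manual "&" handling.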
② POST request (logging in to a website)

<pre> <code class="language-text"># POST request
import urllib.request
import urllib.parse

url = "https://www.iqianyue.com/mypost/"  # target address
login = urllib.parse.urlencode(
    {"name": "1121640425@qq.com", "pass": "123"}
).encode("utf-8")  # login form data, encoded to bytes
req = urllib.request.Request(url, login)  # passing data makes this a POST
data = urllib.request.urlopen(req).read()
with open("C:/Users/admin/Desktop/a.html", "wb") as fh:
    fh.write(data)</code></pre>
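To see what the POST code prepares before anything is sent, the Request object can be inspected offline. This is a rough sketch under the assumption that only the request construction matters here; the form values are placeholders, not real credentials, and urlopen() is deliberately not called.

```python
import urllib.parse
import urllib.request

# Placeholder form fields for illustration only (not real credentials)
form = {"name": "user@example.com", "pass": "secret"}
body = urllib.parse.urlencode(form).encode("utf-8")

# Supplying data turns the request into a POST; nothing is sent until urlopen() is called
req = urllib.request.Request("https://www.iqianyue.com/mypost/", data=body)

method = req.get_method()
print(method)
print(body)
```

Without the data argument, get_method() would report GET instead.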
III. The bs4 library

Purpose: parse, traverse, and maintain the tag tree.

...


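3.1 Supported parsers

1. Python standard library (html.parser): built in, moderate speed, tolerant of malformed documents;

2. lxml HTML parser: fast, tolerant of malformed documents (recommended);

3. lxml XML parser: fast, the only parser here that supports XML;

4. html5lib: the most tolerant, parses documents the way a browser does, and produces HTML5-format documents.

Usage: soup = BeautifulSoup(html, from_encoding="encoding name")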

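<pre> <code class="language-text">html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # print the document as a clearly indented tree</code></pre>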
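Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object (Tag, NavigableString, BeautifulSoup, or Comment).

3.2 Basic usage

Navigating the document tree: soup.tagname (for example soup.a) returns the first tag with that name in document order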

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_1.png

How do you get all tags with a given name? Use find_all(), for example soup.find_all('a')

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_2.png

tag.contents returns the tag's child nodes as a list

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_3.png

tag.children: iterate over the tag's direct child nodes

tag.descendants: iterate over all descendant nodes

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_4.png

tag.string: get the text of a tag that has exactly one child node

Iterating (for example over tag.strings) finds all the text content

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_6.png

soup.get_text()  # get all the text content of the document

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_7.png

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_8.jpg
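The navigation attributes from 3.2 can be tried out together. The following is a minimal sketch rather than the code from the screenshots: it assumes bs4 is installed, uses the built-in html.parser so lxml is not required, and the shortened sample document is made up for illustration.

```python
from bs4 import BeautifulSoup, Tag

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install

first_link = soup.a                  # first <a> tag in document order
all_links = soup.find_all("a")       # every <a> tag
story = soup.p

# .children yields direct children only; keep the ones that are tags
child_tags = [c for c in story.children if isinstance(c, Tag)]
# .descendants also walks into nested nodes (here, the link texts)
descendant_count = len(list(story.descendants))

link_texts = [a.string for a in all_links]
print(soup.title.string)             # text of a tag with a single child
print(first_link["href"])
print(link_texts)
print(len(story.contents), len(child_tags), descendant_count)
print(list(story.stripped_strings))  # all text pieces, whitespace trimmed
print(soup.get_text().strip()[:30])
```

Note that .contents and .children include the text between the links, which is why the isinstance check is needed to keep only tags.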

IV. Regular expressions (information extraction)

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_9.jpg

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_10.jpg

<pre> <code class="language-text">import requests
from bs4 import BeautifulSoup
import bs4

# Fetch the page
def getHtmlText(url):
    try:
        res = requests.get(url, timeout=30)
        res.raise_for_status()
        res.encoding = res.apparent_encoding
        return res.text
    except requests.RequestException:
        print("error")
        return ""

# Extract the information
def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find("tbody").children:
        if isinstance(tr, bs4.element.Tag):  # skip non-tag children such as whitespace strings
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

# Print the information
def printUnivList(ulist, num):  # num: number of schools to print
    tplt = "{0:^10}\t{1:{3}^12}\t{2:^9}"  # {} fields for formatted output
    # Table header
    print(tplt.format("排名", "学校", "地址", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))
    print("Suc" + str(num))
    # chr(12288) is the full-width (Chinese) space; it fixes alignment of mixed Chinese/English text

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHtmlText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)

main()</code></pre>
http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_11.jpg
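The formatting template in the code above pads the school column with chr(12288), the full-width space, so that Chinese text lines up. Here is a small offline sketch of just that trick; the two sample rows are made up for illustration and are not scraped data.

```python
# Made-up sample rows (rank, school, location), for illustration only
rows = [("1", "清华大学", "北京"), ("2", "浙江大学", "浙江")]

FULLWIDTH_SPACE = chr(12288)  # U+3000, as wide as one Chinese character
template = "{0:^10}\t{1:{3}^12}\t{2:^9}"

# {1:{3}^12} centres field 1 in 12 characters, padding with argument 3
header = template.format("排名", "学校", "地址", FULLWIDTH_SPACE)
print(header)
for rank, school, place in rows:
    print(template.format(rank, school, place, FULLWIDTH_SPACE))

# The same nested-field syntax on a single cell
school_cell = "{0:{1}^12}".format("清华大学", FULLWIDTH_SPACE)
print(repr(school_cell))
```

With the default ASCII space as fill, the padding would be half as wide as the Chinese characters and the columns would drift.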

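1. Common atom types: atoms are the most basic units of a regular expression

2. Metacharacters: characters that have a special meaning in a regular expression

3. Pattern modifiers: adjust the matching result without changing the regular expression itself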

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_16.png
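As a rough illustration of pattern modifiers, the sketch below uses three standard flags from Python's re module (re.I, re.S, re.M). These are an assumption on my part and may not be exactly the set shown in the screenshot; the sample text is made up.

```python
import re

text = "Python\npython is fun"

# re.I: ignore case
case_sensitive = re.findall("python", text)
case_insensitive = re.findall("python", text, re.I)

# re.S: let "." match newlines as well
without_dotall = re.search("P.*fun", text)
with_dotall = re.search("P.*fun", text, re.S)

# re.M: let "^" match at the start of every line, not just the first
line_starts = re.findall(r"^p\w+", text, re.I | re.M)

print(case_sensitive, case_insensitive)
print(without_dotall, repr(with_dotall.group()))
print(line_starts)
```

In every case the pattern string stays the same; only the flag changes what it matches.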

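4. Greedy mode and lazy mode

Greedy mode: match as many characters as possible, so the match covers a wider range

pat1="p.*y"

Lazy mode: match as few characters as possible, which pinpoints the target more precisely

pat2="p.*?y"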

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_18.png
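The difference between pat1 and pat2 is easy to check on a short string. A minimal sketch, assuming a made-up sample string (the one in the screenshot may differ):

```python
import re

pat1 = "p.*y"   # greedy: runs to the LAST "y" it can reach
pat2 = "p.*?y"  # lazy: stops at the FIRST "y" after the "p"
text = "abcdphp345pythony_py"  # made-up sample string

greedy = re.search(pat1, text).group()
lazy = re.search(pat2, text).group()

print(greedy)  # php345pythony_py
print(lazy)    # php345py
```

Both matches start at the same "p"; only where they end differs.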

5. Regular expression functions
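The original post does not list the functions here, so the following is only a sketch of the ones most commonly used from Python's re module (match, search, findall, sub, compile); the choice of functions and the sample text are my own assumptions.

```python
import re

text = "spam 2017, eggs 2022"

at_start = re.match(r"\d+", text)     # match() only tries the start of the string
first_hit = re.search(r"\d+", text)   # search() finds the first match anywhere
all_hits = re.findall(r"\d+", text)   # findall() returns every non-overlapping match
replaced = re.sub(r"\d+", "#", text)  # sub() replaces each match

# compile() builds a reusable pattern object; groups come back as tuples
year = re.compile(r"(\w+) (\d{4})")
pairs = year.findall(text)

print(at_start)
print(first_hit.group())
print(all_hits)
print(replaced)
print(pairs)
```

match() returning None here is a common surprise: the string starts with letters, so there is nothing to match at position 0.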

V. The Scrapy crawler framework

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_20.jpg

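Prerequisite: Scrapy is installed (conda works great for this!)

In cmd, enter: scrapy -h

This lists all the available commands

Options are given with two dashes, for example: --nolog  # do not show the log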

part1: Global commands

① fetch: shows the process of crawling a page

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_21.jpg

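② runspider: runs a spider file without needing a crawler project

In cmd:

cd to the directory that contains the .py file

scrapy runspider ××.py

Crawler project: may contain several spider files

③ shell: starts Scrapy's interactive console for development and debugging; you can debug without starting a project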

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_22.jpg

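exit()  # leave the shell

④ startproject: creates a crawler project

scrapy startproject first

⑤ version: shows version information

⑥ view: downloads a page and opens it in the browser

part2: Project commands

① bench: tests the performance of the local hardware; it only measures crawl speed and ignores the crawled content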

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_26.jpg

② check: checks a spider file

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_27.png

③ crawl: runs a spider; control options can be added

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_28.jpg

④ list: lists the spiders available in the current project

⑤ edit: edits a spider file; Linux only, on Windows it reports an error

⑥ parse:

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_30.jpg

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_31.jpg

-------------------------fake divider-------------------------------------

The first folder

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_32.png

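spiders: the folder that holds the spider files

__init__.py: initialization file

items.py: defines the target items to scrape

pipelines.py: post-processing of the scraped items

settings.py: settings, for example spoofing a browser or switching optional components on

scrapy.cfg: configuration file

Creating a spider file:

List the available spider templates: scrapy genspider -l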

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_33.png

Use the basic template to create the spider file lina.py

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_35.png

Comparing the two libraries

Similarities: how both deal with JavaScript, form submission, CAPTCHAs, and similar tasks

Differences:

http://tt.ccoox.cn/data/attachment/forum/20220301/1646103575809_36.jpg

Other practical examples:

<pre> <code class="language-text"># Exercise: download every blog post linked from the CSDN home page
import urllib.request
import urllib.error
import re

url = "http://blog.csdn.net/"
homepage = urllib.request.urlopen(url).read().decode('utf-8', "ignore")
print(len(homepage))
pat = '<a href="(.*?)" target="_blank'
urls = re.compile(pat).findall(homepage)  # find all article links on the blog home page
print(len(urls))
for i in range(len(urls)):
    try:
        print("Fetch #" + str(i))
        file = "C:/Users/admin/Desktop/urls/" + str(i) + ".html"
        urllib.request.urlretrieve(urls[i], file)  # download one link per iteration
        print("--fetched successfully--")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)</code></pre>
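The link-extraction step of the exercise can be tested without touching the network. A minimal sketch, assuming a made-up HTML fragment in place of the real CSDN page; the de-duplication step is my own addition and not part of the original exercise.

```python
import re

# Made-up fragment standing in for the downloaded home page
homepage = (
    '<a href="https://blog.csdn.net/u1/article/1" target="_blank">A</a>'
    '<a href="https://blog.csdn.net/u2/article/2" target="_blank">B</a>'
    '<a href="https://blog.csdn.net/u1/article/1" target="_blank">A again</a>'
)

# The lazy (.*?) stops at the first closing quote, so each href is captured separately
pat = '<a href="(.*?)" target="_blank'
urls = re.findall(pat, homepage)

# dict.fromkeys keeps the first occurrence of each link and preserves order
unique_urls = list(dict.fromkeys(urls))

print(len(urls), len(unique_urls))
print(unique_urls)
```

Removing duplicates before the download loop avoids fetching the same article several times.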
Other notes:
