383606558 posted on 2022-1-8 23:02:51

Scraping video content from a US TV series site: an interview question for 25-30K web scraping engineers

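Preface

Hello everyone, this is 魔王~

Resources for foreign films and TV series are generally hard to find, and many people are at a loss about what to do.

This is where a scraper comes in handy: as long as you can name the show, a few lines of code will do the job~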

[In this post]:

Scraping video content from a US TV series site, an interview question for 25-30K web scraping engineers

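[Development environment]: [Modules used]:

Press Win + R, type cmd and press Enter, then run the install command: pip install followed by the module name. If the install fails with red error text, the likely cause is a network connection timeout; switch to a domestic mirror source.

The corresponding installers, installation guides, activation codes, usage tutorials, learning materials, and tool plugins are available from me on request.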

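First, the overall approach for the scraper.

I. Data source analysis

<p><pre>    <code>Work out where the data we want comes from.
Use the browser's developer tools to capture network traffic and locate the data source.
</code></pre></p>
II. Implementation flow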

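<p><pre>    <code>Send request >>> get data >>> parse data >>> save data
1. Send a request to the page URL
2. Get the data: the response returned by the server (the page source)
3. Parse the data: extract what we want, the video playback address
4. Send a request to that playback address
5. Get the data returned by the server (the m3u8 playlist that lists the ts files)
6. Parse the data: extract all the ts files
7. Save the data
</code></pre></p>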
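Basic use of requests: the get request method and request headers

for loops, string concatenation, list indexing, and f'{}' string formatting

Basic regular expressions, including replacement with sub

File operations: saving with open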
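As a small illustration of the loop and f-string points, the sketch below numbers a list of episode identifiers with `enumerate` instead of a manual counter. The title, the identifiers, and the helper name `episode_names` are made-up sample values, not data or code from the site.

```python
# Hypothetical sample values standing in for data parsed from a page
title = "Sample Show"
video_ids = ["abc123", "def456", "ghi789"]


def episode_names(title, ids):
    """Pair each identifier with a display name such as 'Sample Show episode 1'."""
    # enumerate(..., start=1) replaces a manually incremented counter
    return [(f"{title} episode {num}", vid) for num, vid in enumerate(ids, start=1)]


for name, vid in episode_names(title, video_ids):
    print(name, vid)
```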

Import the modules

<p><pre>    <code class="prism language-python">import requests  # HTTP request module; install with: pip install requests
import re  # regular expression module (built in)
</code></pre></p>
<p><pre>    <code class="prism language-c">url = 'https://www.meiju11.com/Play/7816-0-0.html'
headers = {
    'Referer': 'https://www.meiju11.com/Play/7816-0-0.html',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
response.encoding = response.apparent_encoding  # detect the encoding automatically and decode with it
# print(response.text)
# Every site structures its data differently, so the structure has to be analysed again for each site you scrape (unlike a general-purpose crawler such as the Baidu search engine)
title = re.findall("var playn = '(.*?)',", response.text)[0]
# Raw string so that \d and \$ reach the regex engine unchanged
video_data = re.findall(r'第\d+集\$(.*?)\$rrm3u8', response.text)
print(title)
print(video_data)
num = 1
for index in video_data:
    video_name = f'{title}第{num}集'  # show title plus episode number
    print(video_name, index)
</code></pre></p>
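The two patterns can be tried offline before pointing them at a live page. The sketch below runs them against an invented fragment shaped like what the patterns expect; the real page source may look different, and `parse_play_page` is a hypothetical helper name. It also guards against an empty match list, which indexing with `[0]` would otherwise turn into an `IndexError`.

```python
import re

# Invented page fragment shaped like what the two patterns expect;
# the real page source may differ.
sample_html = """
var playn = 'Sample Show', playp = '';
第1集$abc123$rrm3u8#第2集$def456$rrm3u8
"""


def parse_play_page(html):
    """Return (title, [episode ids]) extracted with the tutorial's patterns."""
    titles = re.findall(r"var playn = '(.*?)',", html)
    # findall returns an empty list when nothing matches, so check before indexing
    title = titles[0] if titles else None
    episode_ids = re.findall(r"第\d+集\$(.*?)\$rrm3u8", html)
    return title, episode_ids


print(parse_play_page(sample_html))
```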
1. Send a request to the URL

<p><pre>    <code class="prism language-c">    # Build the URL to request
    video_url = f'https://www.meiju11.com/ckplayerx/m3u8.php?url={index}&f=ck_m3u8'
    # headers: request headers, used to disguise the Python code so it is not identified as a scraper
    # User-Agent: the browser's basic identity string; worth adding to every request
    # Referer: anti-hotlinking header that tells the server which page the request was sent from
    # Sending headers is like wearing clothes; sending none is going out naked. A User-Agent alone is like wearing only underpants
    # Cookie: user information, used to check whether an account is logged in
    # Host: the domain name
    headers_1 = {
        'Referer': 'https://www.meiju11.com/js/player/rrm3u8.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    }
    # Send a GET request to the URL with requests.get, passing the headers, and store the returned data in response
    response = requests.get(url=video_url, headers=headers_1)
    # A Response [200] object: status code 200 means the request succeeded
    # Even though the request succeeded, it did not return the data we wanted
    # print(response.text)  # text data
</code></pre></p>
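The header notes can be checked without touching the network. This sketch swaps in the standard library's `urllib.request.Request` in place of `requests`, purely so it runs with no third-party install, and it only builds the request object without sending it. The URLs and the helper name `build_request` are placeholders.

```python
from urllib.request import Request


def build_request(url, referer, user_agent):
    """Build (but do not send) a GET request carrying Referer and User-Agent."""
    return Request(url, headers={"Referer": referer, "User-Agent": user_agent})


req = build_request(
    "https://example.com/player?id=1",   # placeholder URL
    "https://example.com/play/1.html",   # placeholder referring page
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
)
# urllib stores header names capitalized, i.e. "Referer" and "User-agent"
print(req.get_method(), req.header_items())
```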
http://tt.ccoox.cn/data/attachment/forum/20220108/1641654171443_3.jpg

3. Parse the data: extract the video playback address we want

<p><pre>    <code class="prism language-c">    m3u8_url = re.findall("url: '(.*?)'", response.text)[0]
    print(m3u8_url)
</code></pre></p>
4. Send a request to the playback address

<p><pre>    <code class="prism language-c">    m3u8_data_1 = requests.get(url=m3u8_url, headers=headers_1).text
    # . matches any character except the newline \n; * matches the preceding character zero or more times
    m3u8_data = re.sub('#EX.*', '', m3u8_data_1).split()
    for link in m3u8_data:
        # Full segment URL: https://fangao.stboda.com/concat/20211115/54b420f81ccc4339b56b11b74adc15a7/cloudv-transfer/a0941a88a1784001b46560ea05257010_0000031.ts
        # link holds only the file name: a0941a88a1784001b46560ea05257010_0000000.ts
</code></pre></p>
6. Parse the data: extract all the ts files

<p><pre>    <code class="prism language-c">        ts_url = 'https://fangao.stboda.com/concat/20211115/54b420f81ccc4339b56b11b74adc15a7/cloudv-transfer/' + link
</code></pre></p>
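A hard-coded segment prefix only fits one video. One possible alternative, sketched below, resolves each segment name against the playlist's own URL with `urllib.parse.urljoin`. The playlist text, the URL, and the helper name `segment_urls` are invented for illustration, and the approach assumes segment names are relative to the playlist location, which may not hold for every site.

```python
import re
from urllib.parse import urljoin

# Invented sample playlist; real playlists carry more tags
sample_m3u8 = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg_0000000.ts
#EXTINF:10.0,
seg_0000001.ts
#EXT-X-ENDLIST
"""


def segment_urls(playlist_text, playlist_url):
    """Drop the #EX... tag lines and resolve each remaining name to a full URL."""
    names = re.sub(r"#EX.*", "", playlist_text).split()
    return [urljoin(playlist_url, name) for name in names]


urls = segment_urls(sample_m3u8, "https://example.com/video/abc/index.m3u8")
print(urls)
```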
7. Save the data (binary data: images, audio, video, or files in specific formats): send a request and get the data

<p><pre>    <code class="prism language-c">        # ts_content = requests.get(url=ts_url).content  # response.content returns the binary data
        # # a: append mode; b: binary mode
        # with open(video_name + '.mp4', mode='ab') as f:
        #     f.write(ts_content)
        print(ts_url)
    num += 1
</code></pre></p>
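The append-and-binary file mode can be tried in isolation. The sketch below writes a few placeholder byte strings into one file inside a temporary directory, opening the file once instead of once per segment; `append_segments` is a hypothetical helper and the bytes are not real video data. Note that joining ts segments this way produces an MPEG-TS stream, so giving the result an .mp4 name changes only the file name, not the container format.

```python
import os
import tempfile


def append_segments(path, segments):
    """Append each bytes chunk to path in binary append mode and return the file size."""
    # 'ab' creates the file if it does not exist and always writes at the end
    with open(path, mode="ab") as f:
        for chunk in segments:
            f.write(chunk)
    return os.path.getsize(path)


# Placeholder byte strings standing in for downloaded .ts segment contents
fake_segments = [b"\x47segment-0", b"\x47segment-1"]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "episode.ts")
    print(append_segments(out_path, fake_segments))
```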
http://tt.ccoox.cn/data/attachment/forum/20220108/1641654171443_5.png

Code for automatically sending bullet comments (danmaku) on a certain site

<p><pre>    <code class="prism language-c">import random
import time

import requests  # pip install requests

while True:
    # Candidate messages ("streamer 666", "the streamer is so handsome", "666")
    lis = ['主播666', '主播真帅', '666']
    word = random.choice(lis)
    url = 'https://api.live.bilibili.com/msg/send'
    data = {
      'bubble': '0',
      'msg': word,
      'color': '16777215',
      'mode': '1',
      'fontsize': '25',
      'rnd': '1640181316',
      'roomid': '22819508',
      'csrf': '131d24bf2b92a3609208ed83758ab8f2',
      'csrf_token': '131d24bf2b92a3609208ed83758ab8f2',
    }
    headers = {
    <span class="token string">&#39;cookie&#39;</span><span class="token operator">:</span> <span class="token string">&#39;_uuid=CF79473B-0F83-6087-BCB0-23A7E36C479296281infoc; buvid3=8AEBDC64-0CEE-4F6F-9866-B1508FAB2871148805infoc; blackside_state=1; rpdid=|(kmJY|kmu|)0J\&#39;uYJRYkJu|Y; CURRENT_QUALITY=32; buvid_fp=8AEBDC64-0CEE-4F6F-9866-B1508FAB2871148805infoc; video_page_version=v_old_home; sid=abkn3and; i-wanna-go-back=-1; b_ut=5; bp_video_offset_65901796=602085772740229500; LIVE_BUVID=AUTO5916400017585242; CURRENT_BLACKGAP=1; CURRENT_FNVAL=2000; b_lsid=FB21108D7_17DE26BCC89; fingerprint=3c7949ca118951ae1be0b5b37b321306; buvid_fp_plain=556F173C-C5B7-4583-86F9-FD1074EB421D143076infoc; DedeUserID=523606542; DedeUserID__ckMd5=909861ec223d26d8; SESSDATA=cec4dda4%2C1655733287%2Cc5b88*c1; bili_jct=131d24bf2b92a3609208ed83758ab8f2; innersign=0; _dfcaptcha=fbb016616944e9b94331aeb2627c9d3d; PVID=3&#39;</span><span class="token punctuation">,</span>
    'origin': 'https://live.bilibili.com',
    'referer': 'https://live.bilibili.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }
    response = requests.post(url=url, data=data, headers=headers)
    print(response.text)
    time.sleep(5)
</code></pre></p>
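That's it for this post!

If you have more suggestions or questions, feel free to comment or message me privately! Let's keep working at it together (ง•_•)ง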

w96680 posted on 2023-2-14 16:00:22

What an impressive post

w96680 posted on 2023-2-25 09:57:34

Ahhhhhhhhhhhh

w96680 posted on 2023-3-18 23:06:11

Just taking a look