Let's Use Python To Scrap Some Online Movies - Videos - by Peng Cao - Freedium
Let's walk through the Python steps for scraping and reconstructing videos from
the web. Along the way we will look at how websites store videos: as many small
segments listed in an M3U8 file. By obtaining these segments and arranging them
in order, we can reconstruct the complete video.
<video src="xxx.mp4"></video>
The src attribute inside this <video> tag is not the actual download address of
the video. Almost no video website directly provides a download address within
the <video> tag.
A better solution than serving one large file is to slice the video into segments
(ts files). Each segment is assigned a unique URL. Once all the segments are
obtained, they can be arranged in order and merged to create a complete video.
Since the video needs to be divided into numerous small fragments, a file is
required to record the paths of these fragments. This file is generally an M3U file.
After encoding the content of the M3U file in UTF-8, it becomes an M3U8 file.
Nowadays, most major video platforms use M3U8 files.
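To make the file format concrete, here is a minimal sketch of reading segment paths out of an M3U8 playlist. The playlist text, the example.com address, and the helper name parse_segment_urls are made up for illustration; real playlists are served by the video site and usually carry more tags.

```python
from urllib.parse import urljoin

# A made-up media playlist, for illustration only
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXTINF:4.5,
seg2.ts
#EXT-X-ENDLIST
"""


def parse_segment_urls(playlist_text, playlist_url):
    """Return the absolute segment URLs in playback order."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments, not segment paths
        if not line or line.startswith("#"):
            continue
        # Segment paths are usually relative to the playlist's own URL
        urls.append(urljoin(playlist_url, line))
    return urls


segments = parse_segment_urls(SAMPLE_PLAYLIST, "https://example.com/video/index.m3u8")
print(segments)
```

The order of the lines in the file is the playback order, so the list can be used directly when merging later.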
Almost all video websites now adopt a similar approach: the player first loads
the M3U8 file, then requests the ts segments it lists, in order.
This method offers several advantages, such as saving network resources. When
a user fast-forwards, only the ts file covering the new position needs to be
located and loaded, which greatly improves the user experience and reduces
server pressure.
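One way to picture the fast-forward case: given the duration of each segment (the values after #EXTINF in an M3U8 file), the segment that contains a given timestamp can be found by a running sum. The function name find_segment and the durations below are assumptions for illustration, not taken from any real player.

```python
def find_segment(durations, seek_seconds):
    """Return (index, offset) of the segment that contains seek_seconds."""
    start = 0.0
    for index, duration in enumerate(durations):
        # The segment covers the half-open interval [start, start + duration)
        if seek_seconds < start + duration:
            return index, seek_seconds - start
        start += duration
    raise ValueError("seek position is past the end of the video")


# Hypothetical segment durations in seconds
durations = [10.0, 10.0, 10.0, 4.5]
index, offset = find_segment(durations, 23.0)
print(index, offset)
```

Only the segment at that index (and the ones after it) has to be requested, which is why skipping ahead does not require downloading the whole video.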
1. Fetch the page source and extract the first-level M3U8 file address.
import requests
from lxml import etree
import json


def get_first_m3u8_url():
    # Fetch the page source code
    url = "https://www.yunbtv.org/vodplay/sandadui-2-1.html"
    resp = requests.get(url)
    resp.encoding = "utf-8"
    tree = etree.HTML(resp.text)
    # url_value is the first-level M3U8 address taken from the parsed page;
    # the extraction step is not shown here
    print(url_value)
2. Download the first-level M3U8 file and extract the second-level M3U8 file
address.
import requests


def download_m3u8_file(first_m3u8_url):
    resp = requests.get(first_m3u8_url)
    resp.encoding = "utf-8"
    # The last non-blank token of the first-level file is the path
    # of the second-level M3U8 file
    url2 = resp.text.split()[-1]
    # Remove the last segment of the first URL (remove '/index.m3u8')
    base_url = first_m3u8_url.rsplit('/', 1)[0]
    # Second-level M3U8 address
    second_m3u8_url = f"{base_url}/{url2}"
    return second_m3u8_url
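Joining the two addresses by hand works when the second-level path is relative to the first-level file. As a hedged alternative, the standard library's urljoin also handles paths that start with "/" or that are already full URLs. The addresses and the helper name resolve_second_m3u8 below are hypothetical.

```python
from urllib.parse import urljoin


def resolve_second_m3u8(first_m3u8_url, second_path):
    """Resolve the second-level M3U8 path against the first-level URL."""
    return urljoin(first_m3u8_url, second_path)


# Hypothetical first-level address
first = "https://example.com/20240101/abc/index.m3u8"

# Relative path: resolved against the directory of the first-level file
print(resolve_second_m3u8(first, "1000kb/hls/index.m3u8"))
# Root-relative path: resolved against the site root
print(resolve_second_m3u8(first, "/20240101/abc/1000kb/hls/index.m3u8"))
```

Both calls print the same address, whereas plain string concatenation would duplicate the directory part in the second case.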
3. Parse the second-level M3U8 file and crawl the video segments.
import aiohttp
import aiofiles
import asyncio
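Only the imports of this step are shown above, so the following is a rough sketch of how the segments could be crawled concurrently, not the original code. The names download_segment, download_all, aiohttp_main, and the fetch parameter are assumptions. The file is written with the built-in open rather than aiofiles to keep the sketch short, and a semaphore caps the number of simultaneous requests.

```python
import asyncio
import os


async def download_segment(fetch, url, out_dir, sem):
    """Download one segment and save it under its own file name."""
    file_name = url.split("/")[-1]
    async with sem:  # limit how many requests run at the same time
        data = await fetch(url)
    path = os.path.join(out_dir, file_name)
    # Blocking write, kept simple here; aiofiles could be used instead
    with open(path, mode="wb") as f:
        f.write(data)
    return path


async def download_all(fetch, urls, out_dir, limit=10):
    """Download every segment; results come back in playlist order."""
    os.makedirs(out_dir, exist_ok=True)
    sem = asyncio.Semaphore(limit)
    tasks = [download_segment(fetch, url, out_dir, sem) for url in urls]
    return await asyncio.gather(*tasks)


async def aiohttp_main(urls, out_dir):
    """Wire the helpers above to aiohttp, sharing one session."""
    import aiohttp  # imported here so the helpers above work without it

    async with aiohttp.ClientSession() as session:

        async def fetch(url):
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.read()

        return await download_all(fetch, urls, out_dir)
```

With a list of segment URLs in hand, `asyncio.run(aiohttp_main(urls, "video"))` would start the crawl. Passing fetch as a parameter keeps the download logic independent of the HTTP library.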
4. Merge the TS files to reconstruct the MP4 file. This relies on the ffmpeg
executable.
import os


def merge_ts_files():
    print("Merging files")
    name_list = []
    with open("m3u8.txt", mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and all lines starting with #
            if not line or line.startswith("#"):
                continue
            # Keep only the file name at the end of the segment path
            file_name = line.split("/")[-1]
            name_list.append(file_name)
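Once the ordered file names are collected, one possible way to hand them to ffmpeg is its concat demuxer, which reads a text file of `file '...'` lines. This is a sketch under assumptions rather than the original merge code: the helper names, concat_list.txt, and movie.mp4 are made up, and the ts files are assumed to sit in the same directory as the list file.

```python
import subprocess


def write_concat_list(name_list, list_path="concat_list.txt"):
    """Write the segment names in the format ffmpeg's concat demuxer reads."""
    with open(list_path, mode="w", encoding="utf-8") as f:
        for name in name_list:
            f.write(f"file '{name}'\n")
    return list_path


def build_merge_command(list_path, output="movie.mp4"):
    """Build the ffmpeg command; -c copy joins segments without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]


def merge(name_list, output="movie.mp4"):
    """Merge the segments; requires ffmpeg to be available on PATH."""
    list_path = write_concat_list(name_list)
    subprocess.run(build_merge_command(list_path, output), check=True)
```

Using a list file avoids very long command lines when a video has hundreds of segments.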