Let's Use Python To Scrap Some Online Movies - Videos - by Peng Cao - Freedium

Let's Use Python to Scrape Some Online Movies/Videos
Peng Cao
~3 min read · January 7, 2024 (Updated: January 7, 2024) · Free: No
Let's walk through the Python steps to scrape and reconstruct videos from the
web efficiently. Along the way we will look at how websites store videos: they
are usually split into small segments that are listed in an M3U8 file. By
obtaining these segments and arranging them in the order given by the M3U8
file, we can reconstruct the complete video.
How are videos stored

Typically, to display a video resource on a webpage, there must be a <video> tag:
<video src="xxx.mp4"></video>
On most video websites, however, the src attribute inside this <video> tag is
not the actual download address of the video. Almost no video website directly
provides a download address within the <video> tag.
Serving the whole video as a single file leads to a poor user experience, as it
negatively impacts both network speed and memory usage.
A better solution is to slice the video into segments (ts files). Each segment
is assigned its own URL. Once all the segments are obtained, they can be
arranged in order and merged to create a complete video.
Since the video is divided into numerous small fragments, a file is required to
record the paths of these fragments. This file is generally an M3U file. An M3U
file whose content is encoded in UTF-8 is called an M3U8 file.
Nowadays, most major video platforms use M3U8 files, and almost all video
websites adopt a similar approach. The loading sequence is as follows:
1. Request the M3U8 file.
2. Load the segment (ts) files.
3. Play the video normally.
This method offers several advantages, such as saving network resources. When
a user fast-forwards, only the ts file that covers the new position needs to be
located and loaded, which greatly enhances the user experience and reduces
server pressure.
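To make the playlist format concrete, here is a minimal sketch of what such a file might look like and how the segment paths can be pulled out of it. The playlist text and the parse_segments helper are made up for illustration. Real playlists may carry additional tags.

```python
# Hypothetical playlist content, for illustration only.
SAMPLE_M3U8 = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXTINF:4.5,
segment2.ts
#EXT-X-ENDLIST
"""


def parse_segments(playlist_text):
    """Return the segment paths listed in an M3U8 playlist, in order."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments, not segment paths.
        if not line or line.startswith("#"):
            continue
        segments.append(line)
    return segments


print(parse_segments(SAMPLE_M3U8))
```

Every line that does not start with # is treated as a segment path, which is the same rule the download and merge steps rely on.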
Steps to obtain and reconstruct the video

1. Obtain the first-level M3U8 file address by inspecting the webpage source
code.
import json

import requests
from lxml import etree


def get_first_m3u8_url():
    # Fetch the page source code
    url = "https://www.yunbtv.org/vodplay/sandadui-2-1.html"
    resp = requests.get(url)
    resp.encoding = "utf-8"
    tree = etree.HTML(resp.text)
    # Find the script that mentions player_aaaa and take its text.
    # (This line is cut off in the source; the closing "()')[0]" is an assumed completion.)
    script_content = tree.xpath('//script[contains(text(), "player_aaaa")]/text()')[0]
    # Extract the JSON part from the script.
    # (Also cut off in the source; the closing ") + 1]" is an assumed completion.)
    json_str = script_content[script_content.find('{'):script_content.rfind('}') + 1]
    # Parse the JSON string
    data = json.loads(json_str)
    # Extract the URL value
    url_value = data.get("url", "")
    print(url_value)
    # Return the address so the next step can use it
    return url_value
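The JSON extraction in this step can be tried offline. The following is a minimal sketch, assuming the script body is a JavaScript assignment that contains a single JSON object. The sample string and the extract_player_config helper are hypothetical and are not taken from the actual page.

```python
import json


def extract_player_config(script_text):
    """Parse the JSON object spanning the first '{' to the last '}'."""
    start = script_text.find("{")
    end = script_text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in script text")
    return json.loads(script_text[start:end + 1])


# Hypothetical script body, shaped like the assignment the step searches for.
sample = 'var player_aaaa={"url":"https://example.com/video/index.m3u8","from":"demo"}'
config = extract_player_config(sample)
print(config["url"])
```

Raising an error when no braces are found makes a changed page layout fail loudly instead of passing an empty string to the JSON parser.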
2. Download the first-level M3U8 file and extract the second-level M3U8 file
address.
import requests


def download_m3u8_file(first_m3u8_url):
    resp = requests.get(first_m3u8_url)
    resp.encoding = "utf-8"
    # The last whitespace-separated token of the first-level file
    # is the path of the second-level M3U8 file
    url2 = resp.text.split()[-1]
    # Remove the last segment of the first URL (remove '/index.m3u8')
    base_url = first_m3u8_url.rsplit('/', 1)[0]
    # Second-level M3U8 address
    second_m3u8_url = f"{base_url}/{url2}"
    # Download M3U8 file
    m3u8_resp = requests.get(second_m3u8_url)
    m3u8_resp.encoding = "utf-8"
    with open("m3u8.txt", mode="w", encoding="utf-8") as f:
        f.write(m3u8_resp.text)
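The code above joins the base URL and the playlist entry by hand, which works when the entry is a path relative to the first-level file. If a site instead lists a root-relative path or a full URL, the standard library's urljoin handles all three cases. This is a minimal sketch with made-up example.com addresses, not the site's real layout.

```python
from urllib.parse import urljoin

# Hypothetical addresses, for illustration only.
first_m3u8_url = "https://example.com/video/abc/index.m3u8"

# A relative entry is resolved against the directory of the first playlist.
relative = urljoin(first_m3u8_url, "1000k/hls/index.m3u8")
# An entry starting with '/' is resolved against the site root.
rooted = urljoin(first_m3u8_url, "/video/abc/1000k/hls/index.m3u8")
# A full URL is returned unchanged.
absolute = urljoin(first_m3u8_url, "https://cdn.example.com/x/index.m3u8")

print(relative)
print(rooted)
print(absolute)
```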
3. Parse the second-level M3U8 file and crawl the video segments.
import asyncio
import os

import aiofiles
import aiohttp


# Download a single ts file
async def download_one(url):
    print("Downloading: " + url)
    file_name = url.split("/")[-1]
    # Retry up to 10 times to prevent download failures
    for i in range(10):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    content = await resp.content.read()
            # (The open() call is cut off in the source; 'mode="wb") as f:' is an assumed completion.)
            async with aiofiles.open(f"./TsFiles/{file_name}", mode="wb") as f:
                await f.write(content)
            break
        except Exception:
            print("Download failed: " + url)
            # Wait a little longer after each failed attempt
            await asyncio.sleep((i + 1) * 5)


# Download all ts files
async def download_all_ts():
    # Make sure the output directory exists
    os.makedirs("./TsFiles", exist_ok=True)
    # Prepare the task list
    tasks = []
    # Read the m3u8 file
    with open("m3u8.txt", mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and all lines starting with #
            if not line or line.startswith("#"):
                continue
            task = asyncio.create_task(download_one(line))
            tasks.append(task)
    # Wait for all tasks to finish
    await asyncio.gather(*tasks)
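The download code above starts one task per segment, all at once. For long videos it may be gentler on the server to cap how many downloads run at the same time. Here is a minimal sketch of that idea using asyncio.Semaphore. The names run_limited and fake_download are hypothetical, and fake_download only simulates a download so the example runs without network access.

```python
import asyncio


async def run_limited(worker, items, limit=5):
    """Run worker(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Only `limit` coroutines can hold the semaphore at the same time.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as `items`.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_download(url):
    # Stand-in for a real download (assumption: replace with real I/O).
    await asyncio.sleep(0.01)
    return url.split("/")[-1]


urls = [f"https://example.com/seg{i}.ts" for i in range(20)]
names = asyncio.run(run_limited(fake_download, urls, limit=5))
print(names[:3])
```

In a real run, the worker would be the function that downloads one ts file, and the limit would be tuned to what the server tolerates.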
4. Merge the TS files to reconstruct the MP4 file. This relies on the ffmpeg
executable.
import os


def merge_ts_files():
    print("Merging files")
    name_list = []
    with open("m3u8.txt", mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and all lines starting with #
            if not line or line.startswith("#"):
                continue
            file_name = line.split("/")[-1]
            name_list.append(file_name)
    # Write the list file that ffmpeg's concat demuxer reads
    with open("./TsFiles/m3u8.txt", mode="w", encoding="utf-8") as f:
        for data in name_list:
            f.write("file " + "'" + data + "'" + "\n")
    # Record the current working directory
    now_dir = os.getcwd()
    # Change the working directory
    os.chdir("./TsFiles")
    # NOTE: the source line is cut off after "-c copy out", so the output
    # file name is not recoverable. Complete it before running.
    os.system("D:\\ffmpeg\\ffmpeg.exe -f concat -safe 0 -i m3u8.txt -c copy out...")
    # Switch back to the original working directory after all operations
    os.chdir(now_dir)
    print("File merging completed")
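As a possible variation on the merge step, the command can be built as an argument list and run with subprocess, which avoids changing the working directory and does not depend on a hard-coded Windows path. This is only a sketch: the helper names and the output name merged.mp4 are hypothetical, and it assumes ffmpeg is available on the PATH.

```python
import subprocess


def build_concat_list(file_names):
    """Build the text of an ffmpeg concat list, one file line per segment."""
    return "".join(f"file '{name}'\n" for name in file_names)


def build_ffmpeg_command(list_file, output_file, ffmpeg="ffmpeg"):
    """Return the ffmpeg argument list for a stream-copy concat."""
    return [ffmpeg, "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output_file]


def merge(list_file, output_file, workdir):
    """Run ffmpeg inside workdir; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(list_file, output_file),
                   cwd=workdir, check=True)


print(build_concat_list(["seg0.ts", "seg1.ts"]), end="")
command = build_ffmpeg_command("m3u8.txt", "merged.mp4")
print(command)
# To actually merge, with ffmpeg installed and the segments downloaded:
# merge("m3u8.txt", "merged.mp4", "./TsFiles")
```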
Thanks for reading! Happy hacking!

#coding #programming #web-scraping #python #software-engineering
