YouTube Data Scraping, Preprocessing and Analysis using Python
Last Updated: 24 Feb, 2023
YouTube is one of the oldest and most popular video distribution platforms in the world. It hosts an enormous amount of video content and has billions of users and viewers, a number that keeps growing.
Since its launch, YouTube and its content have changed considerably. It now offers Shorts, likes, and many other features.
Here we will analyze the GeeksforGeeks YouTube channel, looking at video durations, view counts, video titles, and so on.
Before that, we need the data. We can scrape it using web scraping.
Web scraping the data
Web Scraping is the automation of the data extraction process from websites. Web Scrapers automatically load and extract data from websites based on user requirements. These can be custom-built to work for one site or can be configured to work with any website.
Here, we will be using Selenium and BeautifulSoup for web scraping.
After extracting the data, we will save it to an Excel file. For that, we will use the XlsxWriter library.
Python3
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import xlsxwriter
The URL of the channel's main YouTube page must be provided in this format.
Python3
# provide the url of the channel whose data you want to fetch
urls = [
'https://www.youtube.com/c/GeeksforGeeksVideos/videos'
]
Now we will create the soup from the page content loaded by the Chrome driver.
You need to specify the path of the Chrome driver in place of path_of_chrome_driver.
If you don't have it, download it first and then specify the correct location.
Note: On Windows, a common download location is 'C:\Downloads\chromedriver.exe'.
Python3
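from selenium.webdriver.chrome.service import Service

# Separate lists for titles, views and durations
t, v, d = [], [], []
# Selenium 4 takes the driver path through a Service object;
# the executable_path argument was removed in recent releases
driver = webdriver.Chrome(service=Service('path_of_chrome_driver'))
for url in urls:
    # The URL already ends with /videos, so only the query string is appended
    driver.get('{}?view=0&sort=p&flow=grid'.format(url))
    times = 0
    # Scroll to the bottom five times so that more videos are loaded
    while times < 5:
        time.sleep(1)
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);")
        times += 1
    content = driver.page_source.encode('utf-8').strip()
    soup = BeautifulSoup(content, 'lxml')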
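The loop above scrolls a fixed five times, which may stop before all videos are loaded. A common alternative is to keep scrolling until the page height stops growing. The sketch below is one way to do that; the function name scroll_to_bottom and the pause and max_rounds parameters are assumptions made for illustration, and it only relies on the driver's execute_script method.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=30):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        rounds += 1
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so the end was probably reached
        last_height = new_height
    return rounds
```

Calling scroll_to_bottom(driver) before reading driver.page_source would replace the fixed while loop. The max_rounds cap keeps the function from scrolling forever on very large channels.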
Now we will extract the title, views, and duration of each video using their respective id/class, and store each of them in a separate list. You can add more columns in the same way.
Python3
# Title
titles = soup.find_all('a', id='video-title')
t = []
for i in titles:
    t.append(i.text)

# Views
views = soup.find_all('span', class_='style-scope ytd-grid-video-renderer')
v = []
for i in range(len(views)):
    # Keep only the even-indexed spans, which hold the view counts
    if i % 2 == 0:
        v.append(views[i].text)

# Duration
duration = soup.find_all(
    'span', class_='style-scope ytd-thumbnail-overlay-time-status-renderer')
d = []
for i in duration:
    d.append(i.text)
Once we have the lists, we are ready to create the Excel file.
Note: After creating the Excel file and adding all the items, close it using the workbook.close() command; otherwise the file will not appear at the specified location.
Python3
workbook = xlsxwriter.Workbook('file.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Title")
worksheet.write(0, 1, "Views")
worksheet.write(0, 2, "Duration")
row = 1
for title, view, dura in zip(t, v, d):
    worksheet.write(row, 0, title)
    worksheet.write(row, 1, view)
    worksheet.write(row, 2, dura)
    row += 1
workbook.close()
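One thing to keep in mind is that zip stops at the shortest list, so if the scraper found fewer views or durations than titles, the extra rows are dropped silently. The sketch below shows one possible way to make such a mismatch visible before writing; the helper name align_columns and its fill parameter are assumptions for illustration, not part of the original code.

```python
from itertools import zip_longest


def align_columns(titles, views, durations, fill=""):
    """Return rows of (title, view, duration), padding short columns with `fill`."""
    lengths = {len(titles), len(views), len(durations)}
    if len(lengths) > 1:
        # Report the mismatch instead of silently dropping rows
        print("Warning: column lengths differ:",
              len(titles), len(views), len(durations))
    return list(zip_longest(titles, views, durations, fillvalue=fill))


# Small made-up lists: the views column is one entry short
rows = align_columns(["Video A", "Video B"], ["1.2K views"], ["4:05", "0:59"])
```

Padding does not repair the data, but an empty cell in the sheet is easier to spot than a missing row.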
Data Preprocessing
Data preprocessing involves the following steps:
- Removal of extra characters from columns (like spaces in duration)
- Conversion of values as per the requirement (e.g. 2.6k must be in form 2600)
- Conversion of duration column into categories.
To implement the above steps, let's start by loading the Excel file we created above.
Python3
import pandas as pd
data = pd.read_excel('file.xlsx')
data.head()
To clean the Views column, we first remove the ' views' suffix. If the remaining value ends with 'K', we remove the 'K', convert the number to a float, and multiply it by 1000. Refer to the code below.
Python3
data['Views'] = data['Views'].str.replace(" views", "")
new = []
for i in data['Views']:
    if i.endswith('K'):
        # e.g. '2.6K' -> 2600.0
        i = i.replace('K', '')
        new.append(float(i) * 1000)
    else:
        # Values without a 'K' suffix are kept unchanged
        new.append(i)
data['Views'] = new
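The loop above only handles the 'K' suffix and leaves every other value as text. As a hedged sketch, a small helper could also cover 'M' and 'B' suffixes, thousands separators, and labels that are not numbers at all. The function name parse_views and the choice to return None for unparseable labels are assumptions made for this example.

```python
def parse_views(text):
    """Convert a label such as '2.6K views' to a number; return None if unparseable."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    # Drop the trailing word, thousands separators and surrounding spaces
    value = text.replace("views", "").replace("view", "").replace(",", "").strip()
    if not value:
        return None
    suffix = value[-1].upper()
    try:
        if suffix in multipliers:
            return float(value[:-1]) * multipliers[suffix]
        return float(value)
    except ValueError:
        # Labels such as 'No views' cannot be converted
        return None
```

With pandas, this could be applied to the raw column as data['Views'].map(parse_views), which gives a numeric column with missing values where the label could not be parsed.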
To clean the Duration column, we first remove the '\n' characters. Then we convert each value into seconds: we loop over the values, multiply the hour value by 3600 and the minute value by 60, and add them to the seconds value.
Python3
# Duration column cleaning
data['Duration'] = data['Duration'].str.replace("\n", "")
new2 = []
for i in data['Duration']:
    if i == 'SHORTS' or len(i.split(':')) == 1:
        # Not a time value (e.g. a SHORTS label): keep it as it is
        new2.append(i)
    elif len(i.split(':')) == 2:
        # MM:SS
        i = i.split(':')
        tim = int(i[0])*60 + int(i[1])
        new2.append(tim)
    elif len(i.split(':')) == 3:
        # HH:MM:SS
        i = i.split(':')
        tim = int(i[0])*3600 + int(i[1])*60 + int(i[2])
        new2.append(tim)
data['Duration'] = new2
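The same conversion can be written without a separate branch for each format, because every extra colon simply multiplies the running total by 60. The sketch below is an illustration under that assumption; the function name duration_to_seconds is hypothetical, and unlike the loop above it also strips surrounding whitespace from labels that are not times.

```python
def duration_to_seconds(label):
    """Convert 'M:SS' or 'H:MM:SS' to seconds.

    Labels that are not times (e.g. 'SHORTS') are returned as stripped text.
    """
    parts = label.strip().split(":")
    if len(parts) < 2 or not all(p.strip().isdigit() for p in parts):
        return label.strip()
    seconds = 0
    for part in parts:
        # Shift the running total one base-60 place, then add the next field
        seconds = seconds * 60 + int(part)
    return seconds
```

For example, '1:02:03' is processed as ((1 * 60) + 2) * 60 + 3.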
Once we have the seconds, we can easily categorize the values. In this article, we use 4 categories:
- SHORTS
- Mini-Videos
- Long-Videos
- Very-Long-Videos
You can use more or fewer categories, as per your choice.
Python3
# Duration column categorization
for i in data['Duration'].index:
    val = data.loc[i, 'Duration']
    if isinstance(val, str):
        # Text labels such as SHORTS were not converted to seconds; keep them
        continue
    elif val < 900:
        data.loc[i, 'Duration'] = 'Mini-Videos'
    elif val < 3600:
        data.loc[i, 'Duration'] = 'Long-Videos'
    else:
        data.loc[i, 'Duration'] = 'Very-Long-Videos'
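If you want to add or change categories, it can be easier to keep the thresholds in a list instead of a chain of elif branches. The following is a minimal sketch of that idea using the standard bisect module. The names BOUNDS, LABELS and categorize_duration are assumptions for illustration, and it assumes that a video of exactly 900 seconds counts as a Long-Video.

```python
from bisect import bisect_right

# Upper bounds in seconds (exclusive) for all but the last category
BOUNDS = [900, 3600]
LABELS = ["Mini-Videos", "Long-Videos", "Very-Long-Videos"]


def categorize_duration(value):
    """Map a duration in seconds to a category; text labels pass through."""
    if isinstance(value, str):
        # e.g. 'SHORTS' was never converted to seconds
        return value.strip()
    # bisect_right counts how many bounds are <= value
    return LABELS[bisect_right(BOUNDS, value)]
```

With pandas, data['Duration'].map(categorize_duration) would apply this to the whole column. Adding a category then only means adding one number to BOUNDS and one name to LABELS.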
After all the preprocessing, let's check the new dataset.
Python3
data.head()
Text Preprocessing
Text preprocessing refers to cleaning the text data through the following steps:
- Remove punctuation
- Lowercase the characters
- Create tokens
- Remove stopwords
We can do all of this using the NLTK library.
Python3
import re
from tqdm import tqdm
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
After importing the libraries, run the below code for processing the Title column.
Python3
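def preprocess_text(text_data):
    # Build the stopword set once instead of reloading it for every token
    stop_words = set(stopwords.words('english'))
    preprocessed_text = []
    for sentence in tqdm(text_data):
        # Remove punctuation: anything that is not a word character or whitespace
        sentence = re.sub(r'[^\w\s]', '', str(sentence))
        # Lowercase before the stopword check so that 'The' is removed like 'the'
        preprocessed_text.append(' '.join(token.lower()
                                          for token in sentence.split()
                                          if token.lower() not in stop_words))
    return preprocessed_text


preprocessed_review = preprocess_text(data['Title'].values)
data['Title'] = preprocessed_review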
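To see what these steps do to a single title, here is a small standalone sketch that does not need NLTK. The STOP_WORDS set below is a tiny made-up list used only for illustration (NLTK's English list is much longer), and clean_title is a hypothetical helper name.

```python
import re

# A tiny illustrative stopword set (an assumption, not NLTK's list)
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "for", "is"}


def clean_title(title, stop_words=STOP_WORDS):
    """Remove punctuation, lowercase, and drop stopwords from one title."""
    no_punct = re.sub(r"[^\w\s]", "", title)
    tokens = [tok.lower() for tok in no_punct.split()]
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(clean_title("Introduction to Python: Lists, Tuples and Sets!"))
# prints: introduction python lists tuples sets
```

The punctuation is removed first, then the words are lowercased, and finally 'to' and 'and' are dropped because they are in the stopword set.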
Data Visualization
Data visualization is the representation of information and data in a graphical format. Here we will use a WordCloud to see the most frequently used words in the titles.
We will also use the Seaborn library for better visualization.
Let's import the libraries for that.
Python3
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
Wordcloud for the title column.
Python3
consolidated = ' '.join(word for word in data['Title'].astype(str))
wordCloud = WordCloud(width=1600, height=800, random_state=21,
                      max_font_size=110, collocations=False)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()
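A word cloud shows which words dominate, but not the exact counts. As a complementary sketch, the most frequent words can also be listed with collections.Counter from the standard library. The helper name top_words and the sample titles below are made up for illustration.

```python
from collections import Counter


def top_words(titles, n=10):
    """Return the n most frequent words across an iterable of cleaned titles."""
    counts = Counter()
    for title in titles:
        counts.update(str(title).split())
    return counts.most_common(n)


# Made-up cleaned titles standing in for data['Title']
sample = ["python list tutorial", "python list", "python"]
print(top_words(sample, n=2))
```

With the real data, top_words(data['Title']) would give the words that appear largest in the word cloud, together with how often each one occurs.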
Countplot for Duration column.
Python3
# Pass the column by name; recent Seaborn versions expect keyword arguments
sns.countplot(x='Duration', data=data)
plt.show()