
Pythonlevel 2

This document provides an introduction to web scraping in Python. It discusses setting up a Python environment and installing necessary libraries. It also covers concepts like dictionaries, exceptions, reading and writing files, and making HTTP requests. The document then introduces HTML and scraping foundations.

Uploaded by

ROHIT BANSAL
Copyright
© All Rights Reserved
Programming with Python: Beyond the Basics

How to Write a Web Scraper in Python
Set up
• Python 3.6 or higher installed (I’ll be using 3.8)
• Latest version: https://www.python.org/downloads/

• An IDE for Python (PyCharm Community recommended)

• Course material downloaded and unzipped


• https://github.com/ariannedee/python-level-2

• Resources downloaded (PDF slides and course reference sheet)


Today’s schedule
• Introduction, set-up, and review (35 mins)
• Break, Q&A
• More concepts: dictionaries and exceptions (25 mins)
• Reading and writing to files (30 mins)
• Break, Q&A
• Scraper foundations (40 mins)
• Break, Q&A
• Build a scraper (50 mins)
• Further discussion (10 mins)
Questions and breaks
• Use group chat throughout class
• Only ask questions relevant to current discussion
• If it’s too specific or if I need to do research, put in the Q&A
• Anyone can answer

• 3 Breaks (10 mins each)


• Step away or work through code
• I’ll answer questions in the Q&A feature
• Ask general or more in-depth questions

• Email more in-depth questions at arianne.dee.studios@gmail.com


Introduction
Poll (single choice)
How long have you been programming?
• Less than a week
• Less than a month
• Less than a year
• 1 - 3 years
• 3 - 10 years
• 10+ years
Poll (multi choice)
• What are you looking forward to learning?
• Dictionaries
• Exception handling
• Reading and writing to files
• Using external libraries
• Making HTTP requests
• Writing simple web scrapers
• Writing complex web scrapers
• Other (say in chat)
Introduction

Installation
Set up
• Download the PDF of these slides and the Reference
document (Resources widget)

• Go to https://github.com/ariannedee/python-level-2
and follow the installation instructions in the Readme

• Python 3.6+ installed


• An IDE for Python (PyCharm recommended)
• Course material downloaded and unzipped
Install links
• Install Python 3.8.x for your operating system
• https://www.python.org/downloads/

• Download the free, community edition of PyCharm


• https://www.jetbrains.com/pycharm/download/

• Download the code


• https://github.com/ariannedee/python-level-2
PyCharm IDE
• Supports syntax and error highlighting for
Python

• Integrated Terminal/Command Line

• Package installation without command line


Reviewing Python Basics

Functions, conditionals, lists, and for-loops
Conditionals
Lists
For Loops
While Loops
Functions
Let’s do some practice
Syntax
• For certain keywords (e.g. if, for, while, def)
• Use colon at end of line
• Indent next line(s) to define a code block
• 4-spaces (by convention)
• All lines in block of code must be indented the same
amount
• Can be nested
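
The syntax rules above can be seen together in a small sketch. The function name `describe_numbers` and its sample input are made up for illustration and are not part of the course material.

```python
def describe_numbers(numbers):        # a "def" line ends with a colon
    """Label each number as even or odd."""
    labels = []                       # the function body is indented 4 spaces
    for n in numbers:                 # a "for" line ends with a colon
        if n % 2 == 0:                # nested block: indented 4 more spaces
            labels.append("even")
        else:
            labels.append("odd")
    return labels


print(describe_numbers([1, 2, 3]))
```

Each nested block sits one indentation level deeper than the keyword line that opens it, and every line inside a block uses the same indentation.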
Poll
• How much of the review content do you feel comfortable
with?
• None of it - This is too advanced for me
• Some of it - I’ll be struggling
• Most of it - I’ll be following along
• All of it - It was good to review
• All of it and more - It was very basic
Next Level Python LiveLessons
Lesson 1.2 – Review functions, conditionals and lists
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_01_02/
More Concepts

Dictionaries and exceptions


Dictionaries
Key-value relationship
Dictionary examples
• Key-value examples
• Dictionary: word (key), definition (value)
• Thesaurus: word (key), synonyms (value)
• Phone book: name (key), phone number (value)

• Can also be used to store data about objects


• User with keys: name, email, birthday, country
• Book with keys: title, author, published year
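
As a rough sketch, the two uses listed above might look like this in Python. The names and phone numbers are invented sample data.

```python
# A phone book: name (key) -> phone number (value). Sample data is made up.
phone_book = {"Alice": "555-0100", "Bob": "555-0199"}
phone_book["Carol"] = "555-0142"          # add a new key-value pair
print(phone_book["Alice"])                # look up a value by its key
print(phone_book.get("Dave", "unknown"))  # .get() avoids a KeyError for a missing key

# A dictionary storing data about one object (a book).
book = {"title": "Animal Farm", "author": "George Orwell", "published_year": 1945}
for key, value in book.items():
    print(f"{key}: {value}")
```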
Next Level Python LiveLessons
Lesson 1.3 – Store data in dictionaries
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_01_03/
Exceptions
Next Level Python LiveLessons
Lesson 1.4 – Handle exceptions
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_01_04/
Next Level Python LiveLessons
Lesson 1.5 – Work with dates and times
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_01_05/

Lesson 1.6 – Use regular expressions
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_01_06/
Questions and break
Q&A widget
Reading and Writing to Files


Reading from files
Sample data

Country list: https://gist.github.com/kalinchernev/486393efcca01623b18d

Comprehensive list: https://github.com/umpirsky/country-list


Writing to files
CSV files
Comma-separated values

As raw text:
Name, Age
Shehin, 23
Freddy, 85
Bob, 5
Gabriella, 62

As a table:
Name        Age
Shehin      23
Freddy      85
Bob         5
Gabriella   62
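
A minimal sketch of writing and reading this sample data with Python's built-in `csv` module. An in-memory `io.StringIO` buffer stands in for a real file here (an assumption made so the example runs anywhere); with a real file you would use `open("people.csv", "w", newline="")`, where the filename is hypothetical.

```python
import csv
import io

# The sample data from the slide, one list per row.
rows = [["Name", "Age"], ["Shehin", 23], ["Freddy", 85], ["Bob", 5], ["Gabriella", 62]]

buffer = io.StringIO()            # stands in for an open file
csv.writer(buffer).writerows(rows)

buffer.seek(0)                    # rewind, then read the rows back as dictionaries
people = list(csv.DictReader(buffer))
for person in people:
    # Values read from a CSV file are always strings, so convert the age.
    print(person["Name"], int(person["Age"]))
```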
Next Level Python LiveLessons
Lesson 2 – Work with Files
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_02_00/
Scraper Foundations

Installing external libraries


Pip
Pip commands
• Install package(s)
• $ pip install

• Uninstall package
• $ pip uninstall

• List all installed packages


• $ pip list
More advanced pip commands
• Install specific package
• $ pip install SomePackage # latest version
• $ pip install SomePackage==1.0.4 # specific version
• $ pip install 'SomePackage>=1.0.4' # minimum version

• Upgrade package to newest version


• $ pip install -U SomePackage

• Install with proxy


• pip install --proxy proxy.server:port package
• or --proxy [user:passwd@]proxy.server:port
Pip with requirements.txt
• Create text file with all installed packages + versions
• $ pip freeze > requirements.txt

• Install all packages + versions from text file


• $ pip install -r requirements.txt
Further reading
• Beginner tutorial
• What is Pip? A Guide for New Pythonistas

• Using requirements.txt files to save your list of libraries


• Why and how to make a requirements.txt
Next Level Python LiveLessons
Lesson 3.2 – Install external libraries using pip
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_03_02/
Scraper Foundations

Making HTTP Requests with the requests library
Requests
Next Level Python LiveLessons
Lesson 7.1 – Use the Requests library to make HTTP requests
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_07_01/
Scraper Foundations

Introduction to HTML
Poll
• Do you know HTML?
• Not at all
• A bit, and I would like to review it
• A bit, but I don’t want to review it
• Yes
HTML page structure
HTML elements

Element: Start tag, End tag

Tags start with “<” and end with “>”

Tags have a name (e.g. p is for paragraph)

End tags have a “/” before the name


Element: Start tag, Content, End tag
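
To make the three parts of an element concrete, here is a small sketch using `html.parser` from Python's standard library. The class name `TagLogger` is made up for this illustration; the course itself uses Beautiful Soup for parsing.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each start tag, piece of content, and end tag as it is parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start tag", tag))

    def handle_data(self, data):
        if data.strip():                      # skip whitespace-only text
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end tag", tag))


logger = TagLogger()
logger.feed("<p>This is a paragraph</p>")
logger.close()
for kind, value in logger.events:
    print(f"{kind}: {value}")
```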


Common tags
• p - paragraph
• div - division (a generic container)
• a - link (a.k.a anchor)
• h1 … h6 - heading

• table
• th - table header
• tr - table row
• td - table data
Nesting

<div>
  <p>This is a paragraph inside a div</p>
</div>
Nesting
<table>
  <tr>
    <th>Title</th>
    <th>Author</th>
  </tr>
  <tr>
    <td>Animal Farm</td>
    <td>George Orwell</td>
  </tr>
  <tr>
    <td>Pride and Prejudice</td>
    <td>Jane Austen</td>
  </tr>
</table>
Nesting

<table>
<tr><th>Title</th><th>Author</th>
</tr><tr>
<td>Animal Farm</td><td>George Orwell</td>
</tr><tr>
<td>Pride and Prejudice</td><td>Jane Austen</td>
</tr></table>
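
The spread-out and compact layouts describe the same nested structure. The sketch below checks this by parsing a shortened version of the table both ways with the standard library's `html.parser`. The `TableParser` class and `parse_table` helper are hypothetical names written for this example; Beautiful Soup, covered later, offers a more convenient interface.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every th/td cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])              # a new row starts
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")          # a new, empty cell

    def handle_data(self, data):
        if self.in_cell:                      # ignore whitespace between tags
            self.rows[-1][-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


spread_out = """
<table>
  <tr>
    <th>Title</th>
    <th>Author</th>
  </tr>
  <tr>
    <td>Animal Farm</td>
    <td>George Orwell</td>
  </tr>
</table>
"""
compact = ("<table><tr><th>Title</th><th>Author</th></tr>"
           "<tr><td>Animal Farm</td><td>George Orwell</td></tr></table>")

print(parse_table(spread_out))
print(parse_table(compact) == parse_table(spread_out))
```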
HTML page structure
Attributes

Attribute, Value

Attributes are listed after the tag name, and each attribute name is followed by an “=”
The value of an attribute is in quotes after the “=”
If there are multiple attributes, they are separated by spaces
Common attributes
• id
• A unique identifier for an element
• class
• Often used for determining the styling of an object
• E.g. Menu link has a class “active” so the styling is different for the current page
• href
• The URL for a link (hypertext reference)
• Required for links ("a" tag)
• src
• The source location of an image
• Required for images ("img" tag)
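
A short sketch of reading these attributes in Python, again with the standard library's `html.parser`. The `LinkCollector` class and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every link and the src of every image."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)              # attrs is a list of (name, value) pairs
        if tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])
        elif tag == "img" and "src" in attributes:
            self.images.append(attributes["src"])


# Hypothetical sample markup, not taken from the course material.
sample = '<a id="home" class="active" href="/index.html">Home</a><img src="/logo.png">'
collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links, collector.images)
```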
Sample HTML
Next Level Python LiveLessons
Lesson 7.2 – Review web pages and HTML
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_07_02/
Scraping websites

Scraping data
Python scraper options
• Beautiful Soup - simple
• lxml - more technical, supports xml
• Scrapy - advanced features, full scraper capability
• Selenium - handles JavaScript and user events, slow
• Requests-HTML - simple, but not production-ready
Beautiful Soup 4
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
Install BeautifulSoup
Practise: find the buttons
Next Level Python LiveLessons
Lesson 7.3 – Parse HTML documents with Beautiful Soup
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_07_03/
Questions and break
Q&A widget
Project data

Countries in the United Nations


https://en.wikipedia.org/wiki/Member_states_of_the_United_Nations
Next Level Python LiveLessons
Lesson 8 – Create a Web Scraping Application
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_08_00/
Ethics
• Let the website know who you are and how to contact you

• Limit the rate at which you make requests

• Use public APIs instead of scraping when you can

• Read more: Ethics in Web Scraping - James Densmore
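
One possible way to apply the first two points in code is sketched below. The helper names `polite_headers` and `fetch_politely`, the one-second default delay, and the `User-Agent` wording are all assumptions made for this example, not recommendations from the course.

```python
import time


def polite_headers(project, contact_email):
    """Build request headers that say who is scraping and how to reach them."""
    return {"User-Agent": f"{project} (contact: {contact_email})"}


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)              # rate limit: wait before the next request
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function, so no network access is needed.
pages = fetch_politely(["page-1", "page-2"], fetch=str.upper, delay_seconds=0)
print(pages)
print(polite_headers("un-scraper", "me@example.com"))
```

With the Requests library installed, `fetch` could be something like `lambda url: requests.get(url, headers=polite_headers("un-scraper", "me@example.com"), timeout=10)`. Passing `fetch` in as a parameter keeps the rate-limiting logic separate from the HTTP library.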


Wikipedia Terms of Use
4. Refraining from Certain Activities

Engaging in Disruptive and Illegal Misuse of Facilities

• Engaging in automated uses of the site that are abusive or disruptive of the services and have not been approved by the Wikimedia community;
• Disrupting the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;
• Disrupting the services by inundating any of the Project websites with communications or other traffic that suggests no serious intent to use the Project website for its stated purpose;
• …

https://foundation.wikimedia.org/wiki/Terms_of_Use/en
Let’s code!
Notes
• You might not be able to follow along the whole time

• Remember that you’ll get a recording of this lesson in a day or two AND all of it (and more) is covered in Lesson 8 of my video

• There are 3 versions of the solution file:


• Retrieving all country names and printing to .txt
• Countries and the date they joined the UN to .csv
• Countries and more info from their country page
Next Level Python LiveLessons
Lesson 8 – Create a Web Scraping Application
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_08_00/
Wrapping up

Follow-up, feedback, etc.


More examples
• Authenticated sites (login required)
• REST APIs
• GraphQL APIs
Next Level Python LiveLessons
Lesson 7.4 – Scrape authenticated sites
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_07_04/
What is an API?
• Short for Application Programming Interface

• A structured way of retrieving data without a graphical interface

• Can return multiple document types, but JSON is the most common

• REST APIs are the most common


• REpresentational State Transfer
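
Since JSON is the most common return format, here is a minimal sketch of turning a JSON response body into Python data with the built-in `json` module. The response text is a hypothetical, shortened example and not real API output.

```python
import json

# A hypothetical, shortened API response body; real responses have more fields.
response_text = '[{"name": "example-repo-a", "fork": false}, {"name": "example-repo-b", "fork": true}]'

repos = json.loads(response_text)         # JSON array -> list of dictionaries
names = [repo["name"] for repo in repos]
print(names)
print(repos[1]["fork"])                   # JSON true/false become Python True/False
```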
REST APIs
URL: https://api.github.com/users/ariannedee/repos
Returns the list of repositories for a user
REST APIs
• Have a root API URL
• e.g. https://api.github.com/

• Have a different endpoint for each resource you want to access


• e.g. https://api.github.com/users

• Filter via URL or query parameters


• e.g. https://api.github.com/users/ariannedee/repos?type=member
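
The root URL, endpoint, and query parameters above can be assembled with the standard library. `build_url` is a hypothetical helper written for this sketch, and no request is actually sent.

```python
from urllib.parse import urlencode, urljoin

ROOT_URL = "https://api.github.com/"


def build_url(endpoint, **params):
    """Join the root URL and an endpoint, then append query parameters."""
    url = urljoin(ROOT_URL, endpoint)
    if params:
        url += "?" + urlencode(params)
    return url


print(build_url("users"))
print(build_url("users/ariannedee/repos", type="member"))
```

With the Requests library, the same effect is usually achieved by passing a dictionary as the `params` argument of `requests.get`.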
REST API resources
• Wikipedia
• https://en.wikipedia.org/wiki/Representational_state_transfer
• Very basic intro video to APIs
• https://www.youtube.com/watch?v=s7wmiS2mSXY
• What are RESTful APIs video
• https://www.youtube.com/watch?v=SLwpqD8n3d0
• Tutorial
• https://realpython.com/api-integration-in-python/
Other types of APIs
• GraphQL APIs
• e.g. https://developer.github.com/v4/explorer
• Documentation: https://developer.github.com/v4/
• Short intro video:
https://www.youtube.com/watch?v=zvZP0PVAdR0

• SOAP (Simple Object Access Protocol)


• RPC (Remote Procedure Call)
Next Level Python LiveLessons
Lesson 7.5 – Make API requests
• https://learning.oreilly.com/videos/next-level-python/9780136904083/9780136904083-NLP1_01_07_05/
More web scraper ideas
• Other data gathering options:
• weather, news, sports, stock prices, currency exchange
rates, etc
• Custom notifications across different websites - jobs,
classifieds, flights
• Analyze Twitter or other social media content
Advanced scraper ideas
• Create your own visualizations of scraped data over time
• Create a “cron-job” to run your script every day and
gather new data
• Store that data either in a file or in a database (db)
• SQLite is the simplest db
• Create a new script that visualizes all of the data collected
in a time frame
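
A rough sketch of the storage step using the built-in `sqlite3` module. The table layout, the function names `save_price` and `prices_between`, and the sample values are all assumptions for illustration. An in-memory database is used so the example is self-contained; a file path in place of `":memory:"` would keep the data between runs.

```python
import sqlite3
from datetime import date

# ":memory:" keeps the example self-contained; a file path would persist the data.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE IF NOT EXISTS prices (day TEXT, symbol TEXT, price REAL)"
)


def save_price(day, symbol, price):
    """Store one scraped data point; a daily scheduled script could call this."""
    connection.execute(
        "INSERT INTO prices VALUES (?, ?, ?)", (day.isoformat(), symbol, price)
    )
    connection.commit()


def prices_between(start, end):
    """Return all rows collected in a time frame, oldest first."""
    cursor = connection.execute(
        "SELECT day, symbol, price FROM prices WHERE day BETWEEN ? AND ? ORDER BY day",
        (start.isoformat(), end.isoformat()),
    )
    return cursor.fetchall()


# Made-up sample values standing in for scraped data.
save_price(date(2021, 7, 1), "ABC", 10.5)
save_price(date(2021, 7, 2), "ABC", 11.0)
save_price(date(2021, 8, 1), "ABC", 12.0)
print(prices_between(date(2021, 7, 1), date(2021, 7, 31)))
```

Storing dates as ISO-format text (`YYYY-MM-DD`) means that plain string comparison in the query also orders them chronologically.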
More scraper tutorials
• Follow the stock price
• Free Code Camp tutorial - easy

• Trip Advisor reviews


• Medium tutorial - Advanced
• Use Selenium library to load JavaScript

• Advanced scraper tips and tricks


• Codementor article
Beginner Live Trainings by Arianne
• Introduction to Python Programming
• Next class: August 25 (link)
• Very beginner

• Python Environments and Best Practices


• Next class: July 13 (link)
• virtual envs, testing, debugging, PyCharm tips, git
• Beginner - recommended if you’re a new programmer

• Object-Oriented Programming in Python


• Next class: July 19 (link)
• Intermediate
More Live Trainings by Arianne
• Introduction to Django: a web application framework for Python
• Next class: July 27 (link)
• Intermediate – Recommended after all previous classes

• Rethinking REST: A hands-on guide to GraphQL and queryable APIs


• Next class: August
• Advanced – Recommended if you know Django
Video Trainings by Arianne
• Introduction to Python LiveLessons
• Very beginner content w/ brief intro to data analysis and web development
• Link

• Next Level Python LiveLessons


• Material from this class
• Setting up Python projects with virtual environments and git
• Testing, debugging, and understanding modules
• Link

• Rethinking REST: A hands-on guide to GraphQL and Queryable APIs


• Link
Thanks!

Questions?

Email me at arianne.dee.studios@gmail.com
