- Motivation
- Set up
- Workshop
- Full Code
- Exercise
- Learn more about Web Automation
- Learn more Python
- Ethics
- Social Impact of Open Source
- Ideas
- Requirements
Note: The pygame workshop with Ahmed has been moved to June 22. Make sure to show up; he's a great presenter who can teach just about anything.
Automation is fun! What if we could do fancy stuff through automation with the web browser?
print("Hello Browser")
Hosted by: Harsh Deep
Motivation
On the Subject of Talent - this is advice for artists, but in my experience, a lot of good art advice also applies to programming. We are more alike than a lot of people realize.
Set up
Fill out your requirements.txt file with this:
requests
bs4
pytest
pylint
And run
$ pip install -r requirements.txt
We’ll be coding a web scraper that uses the structure in HTML files to collect data. You don’t actually need to understand much HTML at all, but if you’ve never seen it before, I suggest a quick read through What is HTML? and My First Site.
Workshop
We’ll be making a tool that scrapes the UIUC CS 225 website for all the slide URLs and downloads them all locally. You can adapt this approach to scrape content for pretty much any of your classes with ease.
Getting the page HTML
First we start at the CS 225 Homepage in the web browser and poke around until we find the page we're looking for. After that, we import `get` from the `requests` library and use `.text` to get the page contents. Let's kick off `scrape.py`:
from requests import get
URL = "https://courses.engr.illinois.edu/cs225/sp2021/pages/lectures.html"
page_html = get(URL).text
print(page_html[0:200]) # get the first 200 chars
which prints out an HTML page:
<!DOCTYPE html>
<html lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-148919077-1"></script>
<script>
Generally, each web page has three components:
- Content: via HTML. Content sits between nested tags like `<tag></tag>` that tell the browser what the structure of the page is. Understanding the structure of the page is how we can extract our content.
- Style: via CSS. Style sheet rules tell the browser how to display the page. Sometimes we use `class` to group a bunch of elements together and `id` to single out individual elements. We don't really need to understand too much about how CSS works here, but exploiting `class` and `id` structure is common in web scraping tasks.
- Functionality: via JavaScript. This adds dynamic content to the page and all sorts of special functionality. For many pages, this doesn't make too much of a difference content-wise, but if you're working with a JS-heavy page then check out `selenium` (my Selenium workshop slides are linked at the bottom of the page, take a look).
In the old days, people coded all three by hand. Today, almost all pages on the internet generate these three components via other programs (including this workshop site, which uses `jekyll`). This automatic generation gives the page a strong structure that can be exploited to web scrape it. Fighting automation with more automation :)
Workflow
The general workflow for pretty much any type of web automation is:
- Identify Pattern - for this we'll inspect element around the page until we find some common aspects (generally looking for elements, classes, ids)
- Code Pattern - this is where BeautifulSoup comes in. We figure out how to pattern match what we need.
- Action - sometimes we're clicking buttons, filling forms, or performing more sub-extraction.
Extracting Features
However, HTML is not easy to parse with the string functions `split()` and `strip()` like we normally use. Instead, we'll use a library called `BeautifulSoup`. It gives us lots of neat selectors to navigate around and extract parts of our HTML page.
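As a quick taste of what those selectors look like, here's a minimal sketch on a tiny, made-up HTML snippet (the tag names, class, and file path here are hypothetical, not taken from the CS 225 page):

from bs4 import BeautifulSoup

# a tiny made-up HTML snippet, just to show the selectors
sample_html = """
<div class="lecture" id="week1">
  <h5>Introduction</h5>
  <a href="/slides/intro.pdf">slides</a>
</div>
"""

soup = BeautifulSoup(sample_html, features="html.parser")
print(soup.find('h5').text)                       # Introduction
print(soup.find('div', class_="lecture")['id'])   # week1
print(soup.find('a')['href'])                     # /slides/intro.pdf

With that in mind, here's the start of our actual scraper: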
from requests import get
from bs4 import BeautifulSoup
URL = "https://courses.engr.illinois.edu/cs225/sp2021/pages/lectures.html" # common constant
if __name__ == "__main__":
    page_html = get(URL).text
    page = BeautifulSoup(page_html, features="html.parser")
    breakpoint()
I've added a `breakpoint()` at the end, which drops us into an interactive console when the program reaches it. This is generally quite useful for debugging, and I think it's a good workflow for all web bots: you can figure out your next steps interactively, and once you're sure, add them to the program. Having the `__main__` check makes sure `pytest` doesn't hit the debugger and the actual scraping code.
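If you haven't used `pdb` before, a handful of commands cover most of what we'll do here (a quick reference, not an exhaustive list):
- `n` runs the next line
- `p some_expression` prints the value of an expression
- `c` continues running until the next breakpoint
- `q` (or `exit()`) quits the debugger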
Get Row
The page has each week's lectures in one row. A good strategy for scraping would be retrieving each row, and then getting the slides for each week separately.
Identify Pattern
Now let's open the browser and inspect element. After some poking about, we find that each lecture row is wrapped in a `<div>` element with the class `lecture-row`.
Code Pattern
Now let's try it out in our console. We'll use `pdb` (the Python debugger) so that we can run one line at a time and see what it does. This is good for trying things out, and for inspecting the state of existing objects too.
$ python scrape.py
--Return--
> /home/harsh183/Experiments/225lecture/scrape.py(10)<module>()->None
-> breakpoint()
(Pdb)
Let's use `find_all` in BeautifulSoup to match the pattern.
(Pdb) rows = page.find_all('div', class_="lecture-row")
(Pdb) len(rows)
15
Usually I use `len()` to check if my matching was successful. `15` corresponds to the number of weeks, so we're good here. Now let's inspect the order using `.text`, which shows the displayed web page text.
(Pdb) rows[0].text
'\n\n\n\nMon, 03 May\nAll Pairs Shortest Path (APSP) (Fall Dec 6)\n\nslides\nhandout\nTA lecture notes\n\n\n\n\n\n\n\nWed, 05 May\nReview\n\nslides\n\n\n\n\n\n\n\nThu, 06 May\nReading Day and Final Project\n\nexam details\n\n\n\n\n'
(Pdb) rows[1].text
'\n\n\n\nMon, 26 Apr\nEnd of MST and Single Source Shortest Path (SSSP) (Fall Dec 2)\n\nslides\nhandout\nTA lecture notes\n\n\n\n\n\n\n\nWed, 28 Apr\nSingle Source Shortest Path (SSSP) (Fall Dec 4)\n\nslides\nhandout\nTA lecture notes\n\n\n\n\n\n\n\nFri, 30 Apr\nExam 3\n\nexam details\n\n\n\n\n'
(Pdb) rows[14].text
'\n\n\n\nMon, 25 Jan\nIntroduction\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n\n\n\nWed, 27 Jan\nClasses\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n\n\n\nFri, 29 Jan\nPointers and Memory\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n'
Looks like Beautiful Soup captured the rows as they appear on the page. But the page puts the lecture rows in reverse order. Let’s try reversing it and seeing if it looks good.
(Pdb) rows.reverse()
(Pdb) rows[0].text
'\n\n\n\nMon, 25 Jan\nIntroduction\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n\n\n\nWed, 27 Jan\nClasses\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n\n\n\nFri, 29 Jan\nPointers and Memory\n\nslides\nhandout\ncode\nTA
(Pdb) rows[14].text
'\n\n\n\nMon, 03 May\nAll Pairs Shortest Path (APSP) (Fall Dec 6)\n\nslides\nhandout\nTA lecture notes\n\n\n\n\n\n\n\nWed, 05 May\nReview\n\nslides\n\n\n\n\n\n\n\nThu, 06 May\nReading Day and Final Project\n\nexam details\n\n\n\n\n'
(Pdb) exit() # to leave
Since this looks good, let's add it to our code.
def get_rows(page):
    rows = page.find_all('div', class_="lecture-row")
    rows.reverse() # to get the proper order
    return rows
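As an aside, BeautifulSoup also supports CSS selectors through `select` (via the soupsieve package that ships with recent `bs4` installs), so an equivalent version of this function, just as a sketch, could be:

def get_rows(page):
    # "div.lecture-row" is a CSS selector: a div element with the class lecture-row
    rows = page.select("div.lecture-row")
    rows.reverse()  # to get the proper order
    return rows

We'll stick with `find_all` here, but use whichever reads better to you.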
Let's also add a test to make sure this works correctly. Wrapping the logic in a function helps here, since a function is easy to test.
def test_get_rows():
    page = BeautifulSoup(get(URL).text, features="html.parser")
    rows = get_rows(page)
    assert len(rows) == 15, "There are 15 weeks of lectures"
    assert "Introduction" in rows[0].text, "The first row has the intro lecture"
At this stage `pytest` should pass everything; double-check if it doesn't. The full program should be:
from requests import get
from bs4 import BeautifulSoup
URL = "https://courses.engr.illinois.edu/cs225/sp2021/pages/lectures.html"
def get_rows(page):
    rows = page.find_all('div', class_="lecture-row")
    rows.reverse() # to get the proper order
    return rows

def test_extractors():
    page = BeautifulSoup(get(URL).text, features="html.parser")
    rows = get_rows(page)
    assert len(rows) == 15, "There are 15 weeks of lectures"
    assert "Introduction" in rows[0].text, "The first row has the intro lecture"

if __name__ == "__main__":
    page_html = get(URL).text
    page = BeautifulSoup(page_html, features="html.parser")
    rows = get_rows(page)
    breakpoint()
Action
In this case, everything we're looking for lives inside each row element, so our "action" is simply to perform further extraction on each row.
Getting Lecture Card Details
Our plan is to save each file as `lecture_name.pdf`. For this we need to extract the lecture name and figure out the download link.
Identify Pattern
Let's poke around using inspect element. Each day's details sit in a `div` with the class `card-body`. The `h5` element inside it holds the lecture title, and the slide link is a list item (`li`) whose text is `slides`. In some cases the slide links aren't present (university breaks, exams, etc.), so we can check whether the list is empty before trying to save anything.
Code Pattern
Let’s take a sample row with all valid links (Week 1) and see if we’re able to extract things. First, let’s get each day’s card:
$ python scrape.py
(Pdb) first_row = rows[0] # row with everything we want
(Pdb) cards = first_row.find_all('div', class_="card-body")
(Pdb) len(cards)
3
(Pdb) [card.text for card in cards]
['\nMon, 25 Jan\nIntroduction\n\nslides\nhandout\ncode\nTA lecture notes\n\n', '\nWed, 27 Jan\nClasses\n\nslides\nhandout\ncode\nTA lecture notes\n\n', '\nFri, 29 Jan\nPointers and Memory\n\nslides\nhandout\ncode\nTA lecture notes\n\n']
Having the right length and text is a good sign. Now let’s get the name:
(Pdb) sample_card = cards[1] # sample card with the right stuff
(Pdb) sample_card.find('h5')
<h5 class="card-title">Classes</h5>
(Pdb) title = sample_card.find('h5').text
(Pdb) title
'Classes'
Similarly, let's figure out how to get the slide link. We can use dictionary syntax to get element attributes, so here we'll use `['href']`. The `string=` argument lets BeautifulSoup match on the actual link text.
(Pdb) slide_links = sample_card.find_all('li', string="slides") # check the link title
(Pdb) len(slide_links)
1
(Pdb) slide_links = sample_card.find_all('li', string="slides") # check the link title
(Pdb) slide_links[0]['href']
*** KeyError: 'href'
(Pdb) slide_links[0].find('a')
<a href="/cs225/sp2021/assets/lectures/slides/cs225sp21-02-classes-slides.pdf">slides</a>
(Pdb) slide_links[0].find('a')['href']
'/cs225/sp2021/assets/lectures/slides/cs225sp21-02-classes-slides.pdf'
Looks like we didn't get the base URL, but we can just store it manually at the start of our program.
BASE_URL = "https://courses.engr.illinois.edu"
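Hard-coding the base works fine for this page. As an alternative, the standard library's `urllib.parse.urljoin` can combine the lectures page URL with a relative link; here's a quick sketch using the path we saw in the console:

from urllib.parse import urljoin

slide_path = "/cs225/sp2021/assets/lectures/slides/cs225sp21-02-classes-slides.pdf"
full_url = urljoin(URL, slide_path)  # URL is the lectures page constant defined earlier
print(full_url)
# https://courses.engr.illinois.edu/cs225/sp2021/assets/lectures/slides/cs225sp21-02-classes-slides.pdf

Either approach gives the same absolute URL.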
Next, let's check a point where there are no slides. We can check whether `slide_links` has a `len` of 0 or 1 to figure out whether there's a slide link. If not, we can skip it.
(Pdb) missing_row = rows[5]
(Pdb) missing_row.text
'\n\n\n\nMon, 22 Feb\nStacks and Queues\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n\n\n\nWed, 24 Feb\niterators\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n\n\n\nFri, 26 Feb\nIntro Trees\n\nslides\nhandout\ncode\nTA lecture notes\n\n\n\n\n'
(Pdb) missing_card = missing_row.find_all('div', class_="card-body")[2]
(Pdb) missing_card.text
'\nFri, 05 Mar\nExam 1\n\nexam details\n\n'
(Pdb) slide_links = missing_card.find_all('li', string="slides")
(Pdb) len(slide_links)
0
Now let’s quickly code this in:
def get_row_info(row):
    cards = row.find_all('div', class_="card-body")
    return [get_card_info(card) for card in cards]

def get_card_info(card):
    title = card.find('h5').text
    slide_links = card.find_all('li', string="slides")
    download_link = None # in case there is nothing
    if len(slide_links) > 0:
        download_link = BASE_URL + slide_links[0].find('a')['href']
    return title, download_link
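A minor design note: `find_all` plus a length check is what we tried in the console, so that's what the code uses. `find` would work too, since it returns `None` when nothing matches; here's a sketch of the same function written that way:

def get_card_info(card):
    title = card.find('h5').text
    slide_link = card.find('li', string="slides")  # None if this card has no slide link
    download_link = None
    if slide_link is not None:
        download_link = BASE_URL + slide_link.find('a')['href']
    return title, download_link

Both versions behave the same for our purposes.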
Now let's write some tests, similar to what we were trying in our console:
def test_extractors():
    page = BeautifulSoup(get(URL).text, features="html.parser")
    rows = get_rows(page)
    assert len(rows) == 15, "There are 15 weeks of lectures"
    assert "Introduction" in rows[0].text, "The first row has the intro lecture"

    # normal row
    third_row = rows[2]
    third_week_info = get_row_info(third_row)
    assert len(third_week_info) == 3, "Get all three lectures"
    sample_title, sample_url = third_week_info[0]
    assert sample_title == "Overloading"
    assert sample_url == "https://courses.engr.illinois.edu/cs225/sp2021/assets/lectures/slides/cs225sp21-07-overloading-slides.pdf"

    # missing row
    missing_row = rows[5]
    missing_row_info = get_row_info(missing_row)
    assert missing_row_info[2] == ("Exam 1", None), "Missing links get None"
All of this should pass. Writing tests as we go along is a good way to verify that our web scraper is working as intended.
At this point, your full code should be:
from requests import get
from bs4 import BeautifulSoup
BASE_URL = "https://courses.engr.illinois.edu"
URL = "https://courses.engr.illinois.edu/cs225/sp2021/pages/lectures.html"
def get_rows(page):
    rows = page.find_all('div', class_="lecture-row")
    rows.reverse() # to get the proper order
    return rows

def get_row_info(row):
    cards = row.find_all('div', class_="card-body")
    return [get_card_info(card) for card in cards]

def get_card_info(card):
    title = card.find('h5').text
    slide_links = card.find_all('li', string="slides")
    download_link = None # in case there is nothing
    if len(slide_links) > 0:
        download_link = BASE_URL + slide_links[0].find('a')['href']
    return title, download_link

def test_extractors():
    page = BeautifulSoup(get(URL).text, features="html.parser")
    rows = get_rows(page)
    assert len(rows) == 15, "There are 15 weeks of lectures"
    assert "Introduction" in rows[0].text, "The first row has the intro lecture"

    # normal row
    third_row = rows[2]
    third_week_info = get_row_info(third_row)
    assert len(third_week_info) == 3, "Get all three lectures"
    sample_title, sample_url = third_week_info[0]
    assert sample_title == "Overloading"
    assert sample_url == "https://courses.engr.illinois.edu/cs225/sp2021/assets/lectures/slides/cs225sp21-07-overloading-slides.pdf"

    # missing row
    missing_row = rows[5]
    missing_row_info = get_row_info(missing_row)
    assert missing_row_info[2] == ("Exam 1", None), "Missing links get None"

if __name__ == "__main__":
    page_html = get(URL).text
    page = BeautifulSoup(page_html, features="html.parser")
    rows = get_rows(page)
    breakpoint()
Actually downloading files
Now that our code functions are set up nicely, let's write a loop that actually creates the files. To create folders we use the built-in `os` module's `makedirs` function. Let's create a folder per week.
import os

for i, row in enumerate(rows):
    week_num = i + 1
    folder_name = f"week{week_num}"
    os.makedirs(folder_name, exist_ok=True)
To download each file, we use `with open` with the mode `wb` (write bytes). Here we use `.content` to get the response as `bytes` (raw data) so that we can save it as-is. We also add prints to show when each download is done, so if anything fails we know where it failed.
for i, row in enumerate(rows):
    week_num = i + 1
    folder_name = f"week{week_num}"
    os.makedirs(folder_name, exist_ok=True)

    row_info = get_row_info(row)
    for title, slide_url in row_info:
        if slide_url is None:
            continue

        file_name = f"{folder_name}/{title}.pdf"
        print(file_name)
        slide_content = get(slide_url).content
        with open(file_name, "wb") as slide_file:
            slide_file.write(slide_content)
        print(f"Downloaded: {file_name}")

    print(f"Downloaded Week{week_num}")
    print()
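One caveat worth flagging: we use the lecture title directly as the file name. The titles on this page happen to be safe, but if a title ever contained a character like `/`, the `open` call would fail. If you run into that, a small helper (my addition, not part of the original scraper) can clean titles up before you build `file_name`:

def safe_file_name(title):
    # keep letters, digits, and a few harmless characters; replace everything else with underscores
    return "".join(c if c.isalnum() or c in " -_()" else "_" for c in title)

You would then build the path as f"{folder_name}/{safe_file_name(title)}.pdf".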
Full Code
All tests should be passing as well.
from requests import get
from bs4 import BeautifulSoup
import os
BASE_URL = "https://courses.engr.illinois.edu"
URL = "https://courses.engr.illinois.edu/cs225/sp2021/pages/lectures.html"
def get_rows(page):
    rows = page.find_all('div', class_="lecture-row")
    rows.reverse() # to get the proper order
    return rows

def get_row_info(row):
    cards = row.find_all('div', class_="card-body")
    return [get_card_info(card) for card in cards]

def get_card_info(card):
    title = card.find('h5').text
    slide_links = card.find_all('li', string="slides")
    download_link = None # in case there is nothing
    if len(slide_links) > 0:
        download_link = BASE_URL + slide_links[0].find('a')['href']
    return title, download_link

def test_extractors():
    page = BeautifulSoup(get(URL).text, features="html.parser")
    rows = get_rows(page)
    assert len(rows) == 15, "There are 15 weeks of lectures"
    assert "Introduction" in rows[0].text, "The first row has the intro lecture"

    # normal row
    third_row = rows[2]
    third_week_info = get_row_info(third_row)
    assert len(third_week_info) == 3, "Get all three lectures"
    sample_title, sample_url = third_week_info[0]
    assert sample_title == "Overloading"
    assert sample_url == "https://courses.engr.illinois.edu/cs225/sp2021/assets/lectures/slides/cs225sp21-07-overloading-slides.pdf"

    # missing row
    missing_row = rows[5]
    missing_row_info = get_row_info(missing_row)
    assert missing_row_info[2] == ("Exam 1", None), "Missing links get None"

if __name__ == "__main__":
    page_html = get(URL).text
    page = BeautifulSoup(page_html, features="html.parser")
    rows = get_rows(page)

    for i, row in enumerate(rows):
        week_num = i + 1
        folder_name = f"week{week_num}"
        os.makedirs(folder_name, exist_ok=True)

        row_info = get_row_info(row)
        for title, slide_url in row_info:
            if slide_url is None:
                continue

            file_name = f"{folder_name}/{title}.pdf"
            print(file_name)
            slide_content = get(slide_url).content
            with open(file_name, "wb") as slide_file:
                slide_file.write(slide_content)
            print(f"Downloaded: {file_name}")

        print(f"Downloaded Week{week_num}")
        print()

    breakpoint()
Exercise
Currently we just use the name of the lecture as our file name. What if we could also include the date? Your job now is to figure out how to extract that and add it in.
Learn more about Web Automation
- A few weeks ago, I held a different workshop on Web Bots where I covered `webbrowser` and `selenium`. I highly recommend those. Slides, Code and Video.
- `scrapy` is a more serious framework for writing web scraping spiders in Python. Take a look at a starter tutorial here.
- Web bots are commonly used for testing. pytest-selenium combines `pytest` with `selenium`. Here is another guide on BeautifulSoup testing, which combines it with `unittest`.
Learn more Python
- Writing to File I/O: W3 Schools and Real Python
- The `os` module: Geeks for Geeks
Ethics
- Forbes - The Social Impact Of Bad Bots And What To Do About Them - Bots have the potential to be used in bad ways, like harassment or influencing politics.
Social Impact of Open Source
Open Source software has a large impact on the world today. Since it's free, it's accessible to people who otherwise couldn't afford it, which helps new startups grow out of communities that were priced out in the past. Open source software also provides decentralized options with good security, which allow for privacy and worry authoritarian governments. Many governments have also turned to open source projects as a way of giving the results back to the public, such as India's COVID-19 contact tracing app.
To learn more, check out: [Pfaff and David - Society and open source | Why open source software is better for society than proprietary closed source software](https://benpfaff.org/writings/anp/oss-is-better.html)
Ideas
- Extract information from a site you use: sports scores, class information, news sites, job postings, etc. Make sure the page is publicly accessible and you're not creating too much load on the website.
- You don't have to use `BeautifulSoup`. Feel free to use `Selenium` to interact dynamically with websites (click, fill in forms, scroll) as well. See the linked workshop for more information.
- You can also use a web bot to test a site you already have.
Requirements
- You should show it off in your README.md file, with animations/video and an explanation of the web bot.
- It has to use an Open Source license via a `LICENSE` file.
Contributors: Harsh, Maaheen