A Gentle Introduction to Web Scraping with Python
Build a simple web scraper that fetches data from websites using Python, requests, pandas and BeautifulSoup.
“The universe is made of stories, not of atoms.”
— Muriel Rukeyser
And if that’s true, then websites are little universes — bursting with data, meaning and stories just waiting to be told.
Web scraping is the act of finding clarity in digital noise. It’s one of the simplest ways to step into the world of data science, automation and storytelling — and the best part? You can start right now, even if you’re a complete beginner.
Step 1: Setting the Stage – What You'll Need
Before we dive in, here’s what you’ll need:
✅ Google Colab (or a local setup like VS Code, Anaconda or Jupyter)
✅ A web page to scrape – we’ll use Yahoo Finance’s Most Active Stocks
✅ Three Python libraries:
requests – the web-fetching courier
beautifulsoup4 – the HTML detective
pandas – the spreadsheet artist
Meet Your Dream Team (a.k.a. The Libraries)
Think of this like planning a wedding:
Requests is your planner – it gets you to the venue (aka, the webpage).
BeautifulSoup is your assistant – it helps you find exactly what you need once you’re inside.
Pandas is your designer – it turns raw material into a beautiful, usable format.
Installing the Tools
If you're using Google Colab, you're all set — these libraries are pre-installed. Just import them with the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
A quick note:
When we write import pandas as pd, we’re giving it a nickname — kind of like calling your friend “pd” instead of “Pandas” every time you talk. It just makes our code cleaner, shorter, and yes, easier for lazy fingers.
If you’re working locally, install the libraries using:
pip install requests
pip install beautifulsoup4
pip install pandas
If you’re in Google Colab and ever need to install a package that isn’t already there, use !pip install instead:
!pip install requests
!pip install beautifulsoup4
!pip install pandas
Depending on your setup, you might need to use pip3 instead of pip. If in doubt, check your Python version with python --version.
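If pip still misbehaves, one variation that usually works is invoking pip through the exact Python interpreter you’re running (a generic fallback, not specific to this tutorial):
python -m pip install requests beautifulsoup4 pandas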
Step 2: Fetching the Web Page
Now we’re reaching out to the site we want to scrape.
url = "https://finance.yahoo.com/most-active"
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
What does this all mean?
url = "...": This is the webpage we want to scrape. We store it in a variable for easy reference.headers = {'User-Agent': 'Mozilla/5.0'}: Some websites won’t give data to bots. This header tricks the site into thinking we’re just a normal browser. Think of it as dressing nicely to get into a members-only jazz club.page = requests.get(url, headers=headers): This is us sending a polite request: “Hey, can I have a copy of this page?” The site says yes, and we store the response in page.
Step 3: Reading the HTML
Now that we have the page, we need to make sense of it with BeautifulSoup.
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table')
What’s happening here?
BeautifulSoup(page.text, 'html.parser'): This turns the messy HTML into a beautiful, searchable structure. If HTML is a manuscript, BeautifulSoup is the editor.
soup.find('table'): Tells BeautifulSoup, “Find me the first table on this page.” Yahoo displays stock data in a table, so that’s where we’ll look.
Want to see what the soup looks like? Try:
print(soup.prettify()[:1000])
It’ll print the first 1000 characters of your parsed HTML.
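One more defensive habit: if Yahoo ever redesigns the page, soup.find('table') will quietly return None and the next step will crash with a confusing error. A small guard you can add (my addition, not part of the original flow):
# find() returns None when no <table> exists, so fail loudly and clearly.
if table is None:
    raise ValueError("No table found on the page - its layout may have changed.")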
Step 4: Extracting the Data (The Fun Part)
We’ve got our table — now let’s extract the stock info:
cleaned_data = []
for row in table.find_all('tr')[1:]:
    cols = [col.text.strip() for col in row.find_all('td')]
    if len(cols) >= 7:
        cleaned_data.append(cols[:7])
What each line does:
cleaned_data = []: An empty list where we’ll store our nice clean rows.
for row in table.find_all('tr')[1:]:: Loops through each row — skipping the header row.
cols = [col.text.strip() for col in row.find_all('td')]: Pulls out and cleans up each cell’s text.
if len(cols) >= 7:: Ensures we only grab rows with 7 or more columns.
cleaned_data.append(cols[:7]): Adds the first 7 columns of clean data to our list.
By the end, we’ve got a tidy list of lists — a.k.a. our dataset.
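In Step 5 we’ll type the column names by hand, but you can also lift them straight from the table’s header row. A little sketch, assuming Yahoo marks its headers with standard <th> tags:
# Header cells live in <th> tags; clean them the same way we cleaned <td> cells.
header_names = [th.text.strip() for th in table.find_all('th')]
print(header_names[:7])  # should roughly match the columns we define in Step 5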
Step 5: Wrangling It with Pandas
Now we turn that raw list into a DataFrame (like a spreadsheet):
columns = ['Symbol', 'Name', 'Price', 'Change', '% Change', 'Market Cap', 'Volume']
df = pd.DataFrame(cleaned_data, columns=columns)
df.head()
columns = [...]: Defines the column headers.
pd.DataFrame(...): Tells pandas to create a table from our cleaned data.
df.head(): Shows the first 5 rows.
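One caveat before you compute anything: every cell we scraped is a string, including the prices. If you want to sort or do math, convert the column first. A sketch, assuming prices arrive looking like '123.45' or '1,234.56':
# Strip thousands separators, then convert; errors='coerce' turns anything
# unparseable into NaN instead of raising an exception.
df['Price'] = pd.to_numeric(df['Price'].str.replace(',', ''), errors='coerce')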
💾 Bonus: Save Your Scraped Data
Want to keep your scraped data? Use the following:
df.to_csv('most_active_stocks.csv', index=False)
This creates a .csv file you can open in Excel, Google Sheets, or any other spreadsheet tool.
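And whenever you want it back in pandas later:
df = pd.read_csv('most_active_stocks.csv')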
Full Code Recap
If you want to copy and paste the whole thing into your Google Colab or IDE, here’s the full code we used from start to finish:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Step 1: Get the web page
url = "https://finance.yahoo.com/most-active"
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
# Step 2: Parse the HTML
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table')
# Step 3: Extract table rows
cleaned_data = []
for row in table.find_all('tr')[1:]:  # skip the header
    cols = [col.text.strip() for col in row.find_all('td')]
    if len(cols) >= 7:  # Make sure we have enough columns
        cleaned_data.append(cols[:7])  # Only take the first 7
# Step 4: Load into a pandas DataFrame
columns = ['Symbol', 'Name', 'Price', 'Change', '% Change', 'Market Cap', 'Volume']
df = pd.DataFrame(cleaned_data, columns=columns)
# Step 5: Display the first 5 rows
df.head()
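One small note: a bare df.head() on the last line only displays output inside a notebook like Colab or Jupyter. If you run this as a plain .py script, print it explicitly:
print(df.head())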
In Summary …
Web scraping isn’t just about collecting data — it’s about uncovering stories. Each stock symbol, each price change, each row in your DataFrame… they’re not just numbers. They’re fragments of a larger narrative unfolding in the markets, in the economy, in human behavior. Maybe you'll build a scraper to track prices over time, or compare companies, or generate stunning data visualizations. Or maybe you'll feed this data into a machine learning model and make predictions.
As Muriel Rukeyser reminds us: “The universe is made of stories, not of atoms.” And every line of code you write — especially in projects like these — is a step toward telling those stories with data.
💌 Stay in the Loop
If you found this helpful (or just kind of delightful), subscribe to The Literary Coder — where we make code poetic, data meaningful, and every tutorial a little bit literary. Thanks for scraping with me 🧡
➡️ Coming soon on The Literary Coder
How to scrape dynamic sites using Selenium, and even combine scraping with AI tools like LangChain for some real-time transformation magic.

