Python Web Scraping with BeautifulSoup
BeautifulSoup web scraping patterns - element search methods, data extraction, pagination, table scraping, and anti-bot countermeasures.
Key Facts
- Three parsers: html.parser (stdlib), lxml (faster), html5lib (most lenient)
- find() returns the first match; find_all() returns all matches; select() uses CSS selectors
- tag.text gets all text recursively; tag.string works only if the tag has exactly one string child
- tag['href'] raises KeyError if the attribute is missing; tag.get('href') returns None
- For JavaScript-rendered pages, use selenium or playwright instead
- Always use requests.Session() to maintain cookies across requests
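The lookup and attribute-access facts above can be checked on a small inline document. This is a minimal sketch; the HTML snippet and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
HTML = """
<ul>
  <li><a href="/a">First</a></li>
  <li><a>No link target</a></li>
  <li><a href="/c">Third</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

first = soup.find("a")              # first matching Tag (None if nothing matches)
all_links = soup.find_all("a")      # list-like ResultSet with every match
with_href = soup.select("a[href]")  # CSS selector: only anchors that have an href

second = all_links[1]
missing = second.get("href")        # None, because the attribute is absent

try:
    second["href"]                  # subscript access raises for a missing attribute
    raised = False
except KeyError:
    raised = True
```

Here `missing` ends up as None and `raised` as True, which is why `get()` is usually the safer choice when an attribute is optional.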
Patterns
Setup
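from bs4 import BeautifulSoup
import requests
url = 'https://example.com'  # placeholder: replace with the page to scrape
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
response.raise_for_status()  # fail early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')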
Finding Elements
# By tag and attributes
tag = soup.find('div', class_='product')
tag = soup.find('a', attrs={'href': True})
tag = soup.find('span', id='price')
# Find all with multiple tag names
tags = soup.find_all(['h1', 'h2', 'h3'])
tags = soup.find_all('a', limit=10)
# CSS selectors (often cleaner)
tags = soup.select('div.product > span.price')
tag = soup.select_one('table#results tbody tr')
Extracting Data
tag.text # all text content (recursive)
tag.get_text() # same, with optional separator
tag.string # only if exactly one string child
tag.strings # generator of all strings
tag.stripped_strings # stripped, no empty strings
tag['href'] # attribute (KeyError if missing)
tag.get('href') # safe (None if missing)
tag.attrs # dict of all attributes
# Navigation
tag.parent
tag.children # direct children (generator)
tag.descendants # all descendants (generator)
tag.find_next_sibling('tr')
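To make the difference between the text accessors concrete, here is a small sketch on an invented snippet. The HTML and variable names are illustrative assumptions, not part of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical product card used only for illustration.
HTML = '<div id="card"><h2> Widget </h2><p>Price: <b>9.99</b></p></div>'
soup = BeautifulSoup(HTML, "html.parser")
card = soup.find("div", id="card")

raw = card.text                                      # strings concatenated as-is
joined = card.get_text(separator=" | ", strip=True)  # each string stripped, then joined
parts = list(card.stripped_strings)                  # the same stripped strings as a list
child_names = [child.name for child in card.children]  # direct children only
```

`raw` keeps the original whitespace, while `joined` and `parts` drop it, so the stripped variants are usually what you want when building records.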
Table Scraping
table = soup.find('table')
headers = [th.text.strip() for th in table.find_all('th')]
rows = []
# Assumes the markup contains <tbody>; html.parser does not insert one
for tr in table.find('tbody').find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    rows.append(dict(zip(headers, cells)))
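Many real tables omit tbody, in which case table.find('tbody') returns None and the loop above fails. The following is one possible sketch of a more forgiving variant; the helper name `table_to_dicts` and the sample table are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table with no <tbody>; html.parser does not add one.
HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td> 9.99 </td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
"""

def table_to_dicts(table):
    """Return one dict per data row, keyed by the table's header text."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    body = table.find("tbody") or table    # fall back to the table itself
    rows = []
    for tr in body.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:                          # skip rows that contain only <th> cells
            rows.append(dict(zip(headers, cells)))
    return rows

soup = BeautifulSoup(HTML, "html.parser")
rows = table_to_dicts(soup.find("table"))
```

Note that zip() silently truncates when a row has more or fewer cells than there are headers, so tables with colspan may need extra handling.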
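Paginated Scraping
# get_page() and process() are placeholders for your own fetch and handling code
page = 1
while True:
    soup = get_page(base_url, page)
    items = soup.find_all('div', class_='item')
    if not items:
        break  # an empty page means we ran past the last one
    process(items)
    page += 1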
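The loop above can be exercised without a network connection by swapping the fetch step for an in-memory lookup. In this sketch `FAKE_SITE`, this version of `get_page`, and `scrape_all` are all hypothetical stand-ins; a real `get_page` would issue an HTTP request instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a website: page number -> HTML.
FAKE_SITE = {
    1: '<div class="item">A</div><div class="item">B</div>',
    2: '<div class="item">C</div>',
}

def get_page(pages, page):
    """Return parsed HTML for a page; unknown pages parse as an empty document."""
    return BeautifulSoup(pages.get(page, ""), "html.parser")

def scrape_all(pages, max_pages=100):
    """Collect item text page by page until a page has no items."""
    results = []
    for page in range(1, max_pages + 1):   # hard cap guards against endless loops
        soup = get_page(pages, page)
        items = soup.find_all("div", class_="item")
        if not items:
            break
        results.extend(item.get_text(strip=True) for item in items)
    return results

collected = scrape_all(FAKE_SITE)
```

The `max_pages` cap is a design choice rather than a requirement: some sites return the last page again for out-of-range page numbers, and an unbounded `while True` would then never terminate.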
URL Resolution
from urllib.parse import urljoin
links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
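A short illustration of how urljoin treats different kinds of href values may help when debugging broken links. The base URL and paths below are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page URL used as the base for resolution.
base_url = "https://example.com/catalog/page1.html"

relative = urljoin(base_url, "item.html")        # sibling of the current page
root_relative = urljoin(base_url, "/about")      # resolved from the site root
parent = urljoin(base_url, "../img/logo.png")    # climbs one directory
absolute = urljoin(base_url, "https://other.example/x")  # already absolute, unchanged
```

Passing the URL of the page that was actually fetched (for example response.url after redirects) as the base tends to give more reliable results than a hard-coded site root.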
Anti-Bot Countermeasures
| Technique | Implementation |
|---|---|
| User-Agent | headers={'User-Agent': 'Mozilla/5.0'} |
| Rate limiting | time.sleep(random.uniform(1, 3)) between requests |
| Session cookies | requests.Session() |
| Rotating proxies | Proxy services for serious scraping |
| JS rendering | selenium or playwright |
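The rate-limiting row can be wrapped in a small helper so the delay logic lives in one place. This is a sketch under stated assumptions: `polite_fetch_all` is a hypothetical name, and the `fetch` and `sleep` parameters are injected so the helper can be tried without real network calls or real waiting.

```python
import random
import time

def polite_fetch_all(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:                      # no need to wait before the first request
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

# Dry run with a fake fetch function and a recorder in place of time.sleep.
recorded_delays = []
pages = polite_fetch_all(
    ["a", "b", "c"],
    fetch=str.upper,
    min_delay=0.1,
    max_delay=0.2,
    sleep=recorded_delays.append,
)
```

With requests, the `fetch` argument could be something like `lambda u: session.get(u, timeout=10)`, where `session` is a `requests.Session()` whose headers were set once via `session.headers.update({'User-Agent': 'Mozilla/5.0'})`.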
Gotchas
- tag.string returns None if the tag contains multiple child elements, even if they all contain text
- next_sibling may return whitespace (a NavigableString); use find_next_sibling() instead
- find_all(class_='product') uses class_ with a trailing underscore because class is a Python keyword
- Some sites serve different content to different User-Agents; check what you actually receive
- requests.get() follows redirects by default; check response.url to verify the final destination
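The first two gotchas are easy to reproduce on a tiny document. The snippet below is a minimal sketch with invented HTML; the row ids are assumptions for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical table; the newline between the rows is what trips up next_sibling.
HTML = """<table>
<tr id="r1"><td><b>Bold</b> and <i>italic</i></td></tr>
<tr id="r2"><td>Plain</td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
first_row = soup.find("tr", id="r1")

mixed = first_row.td.string                  # None: the cell has several children
full_text = first_row.td.get_text()          # get_text() still sees everything
plain = soup.find("tr", id="r2").td.string   # exactly one string child, so it works

raw_sibling = first_row.next_sibling                 # whitespace between the rows
real_sibling = first_row.find_next_sibling("tr")     # skips straight to the next <tr>
```

In practice this means preferring get_text() over .string for anything that might contain inline markup, and the find_next_sibling() family over the bare sibling attributes.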
See Also
- python stdlib patterns - collections, itertools for processing scraped data
- sql advanced patterns - SQL for analyzing scraped datasets
- browser test automation - Selenium/Geb for JS-rendered pages