Regular Expressions

★★★★★ Intermediate

The re module provides Perl-style regular expressions for pattern matching, searching, and text manipulation. Always use raw strings r'...' for patterns to avoid double-escaping.

Key Facts

re.search() finds first match anywhere; re.match() only at start; re.fullmatch() entire string
re.findall() returns list of all matches; re.finditer() returns iterator of Match objects
re.sub() replaces matches; re.split() splits by pattern
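re.compile() pre-compiles a pattern into a reusable object; re also caches recently used patterns internally, so the speedup is usually small and shows mostly in tight loops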
Greedy (*, +) matches as much as possible; lazy (*?, +?) as little as possible
Match object: .group() full match, .group(1) first capture group, .groups() all groups

Patterns

Core Functions

import re

re.search(r'\d+', 'abc 123 def')     # Match at '123'
re.match(r'\d+', '123 abc')          # Match at '123' (start only)
re.fullmatch(r'\d+', '123')          # Match (entire string)
re.findall(r'\d+', 'a1 b22 c333')    # ['1', '22', '333']
re.sub(r'\s+', ' ', 'a  b   c')      # 'a b c'
re.split(r'[,;.]', 'a,b;c.d')        # ['a', 'b', 'c', 'd']
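
The two functions from Key Facts that the snippet above does not exercise, re.compile() and re.finditer(), can be sketched as follows. The names `number` and `text` are illustrative only.

```python
import re

# Compile once, then reuse the pattern object's own methods.
number = re.compile(r'\d+')
text = 'a1 b22 c333'

# finditer() yields Match objects, so positions are available as well as text.
matches = [(m.group(), m.span()) for m in number.finditer(text)]
# matches == [('1', (1, 2)), ('22', (4, 6)), ('333', (8, 11))]

# A compiled pattern offers the same operations as the module-level functions.
all_numbers = number.findall(text)   # ['1', '22', '333']
masked = number.sub('#', text)       # 'a# b# c#'
```

Prefer finditer() over findall() when the match positions matter or when the input is large, since it does not build the full list up front.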

Match Object

m = re.search(r'(\d+)-(\d+)', 'tel: 555-1234')
m.group()     # '555-1234' (entire match)
m.group(1)    # '555' (first group)
m.group(2)    # '1234' (second group)
m.groups()    # ('555', '1234')
m.start()     # 5
m.end()       # 13
m.span()      # (5, 13)
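
re.search(), re.match(), and re.fullmatch() return None when nothing matches, so calling .group() on the result unguarded raises AttributeError. A minimal sketch of the usual guard, using a hypothetical helper named `extract_phone`:

```python
import re

def extract_phone(text):
    """Return the two digit runs around the first hyphen, or None if absent."""
    m = re.search(r'(\d+)-(\d+)', text)
    if m is None:            # no match: there is no Match object to query
        return None
    return m.groups()

found = extract_phone('tel: 555-1234')      # ('555', '1234')
missing = extract_phone('no digits here')   # None
```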

Character Classes

Pattern	Matches
`.`	Any char except newline
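`\d` / `\D`	Digit / non-digit (`\d` includes Unicode digits for str patterns unless `re.ASCII` is set)
`\w` / `\W`	Word char / non-word (`\w` is `[a-zA-Z0-9_]` under `re.ASCII`; for str patterns it also matches Unicode letters and digits by default)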
`\s` / `\S`	Whitespace / non-whitespace
`[abc]`	Any of a, b, c
`[a-z]`	Range
`[^abc]`	NOT a, b, c
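
A short sketch of how the classes in the table behave on str patterns, including the effect of re.ASCII on `\w` and `\d`. The sample string is made up for illustration.

```python
import re

# Escapes keep the source ASCII-only: \u00ef is i-diaeresis, \u00e9 is e-acute,
# \u0663 is the Arabic-Indic digit three.
sample = 'na\u00efve caf\u00e9_42 \u0663'

unicode_words = re.findall(r'\w+', sample)            # 3 words, accents kept
ascii_words = re.findall(r'\w+', sample, re.ASCII)    # ['na', 've', 'caf', '_42']

unicode_digits = re.findall(r'\d', sample)            # ['4', '2', '\u0663']
ascii_digits = re.findall(r'\d', sample, re.ASCII)    # ['4', '2']

# A negated class matches any single character not listed.
not_abc = re.findall(r'[^abc]', 'abcde')              # ['d', 'e']
```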

Quantifiers

Pattern	Meaning
`*` / `*?`	0+ (greedy / lazy)
`+` / `+?`	1+ (greedy / lazy)
`?` / `??`	0 or 1 (greedy / lazy)
`{n}`	Exactly n
`{n,m}`	Between n and m repetitions, inclusive
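
The quantifiers in the table can be checked on small inputs. This is an illustrative sketch; the test strings are invented.

```python
import re

# ? makes the preceding item optional (0 or 1 occurrence).
optional_u = re.findall(r'colou?r', 'color colour colouur')   # ['color', 'colour']

# {n} requires exactly n repetitions; \b stops longer digit runs from matching.
four_digits = re.findall(r'\b\d{4}\b', '12 1234 12345')       # ['1234']

# {n,m} accepts n through m repetitions.
two_or_three = re.findall(r'\b\d{2,3}\b', '1 12 123 1234')    # ['12', '123']

# Appending ? to a bounded quantifier makes it lazy: it stops at the minimum.
lazy_bounded = re.search(r'\d{2,3}?', '12345').group()        # '12'
```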

Greedy vs Lazy

re.findall(r'<B>.*</B>', '<B>a</B> and <B>b</B>')   # ['<B>a</B> and <B>b</B>']
re.findall(r'<B>.*?</B>', '<B>a</B> and <B>b</B>')  # ['<B>a</B>', '<B>b</B>']

Anchors and Boundaries

# ^ start, $ end, \b word boundary
re.findall(r'\bcat\b', 'the cat scattered cats')  # ['cat']
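
The anchors ^ and $ refer to the whole string by default and to each line only under re.M. A small sketch with an invented three-line `log` string:

```python
import re

log = 'error: disk\nwarn: cpu\nerror: net'

# Without re.M, ^ matches only at the very start of the string.
start_only = re.findall(r'^error', log)           # ['error']
# With re.M, ^ also matches right after each newline.
each_line = re.findall(r'^error', log, re.M)      # ['error', 'error']

# Likewise, $ matches only at the end of the string unless re.M is set.
last_word = re.findall(r'\w+$', log)              # ['net']
line_ends = re.findall(r'\w+$', log, re.M)        # ['disk', 'cpu', 'net']
```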

Groups and Backreferences

# Capturing groups
m = re.search(r'((ab)(cd))', 'abcd')
# Groups are numbered by the position of their opening parenthesis
m.group(1)  # 'abcd' (outermost group)
m.group(2)  # 'ab'
m.group(3)  # 'cd'

# Non-capturing group
re.findall(r'(?:ab)+', 'ababab')  # ['ababab']

# Backreference (find repeated words)
re.findall(r'\b(\w+)\s+\1\b', 'the the cat')  # ['the']

# Sub with backreference
re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')  # 'world hello'

Flags

re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
# re.I - case insensitive
# re.M - ^ and $ match line boundaries
# re.S - . matches newline
# re.X - verbose mode (comments, whitespace in pattern)
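
Each flag above can be seen in isolation in the following sketch. The date pattern and sample strings are illustrative, not taken from any particular project.

```python
import re

# re.X: whitespace in the pattern is ignored and # starts a comment.
date = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
''', re.X)
parts = date.search('released 2024-01-15').groups()         # ('2024', '01', '15')

# re.I: letter case is ignored.
greetings = re.findall(r'hello', 'Hello HELLO hello', re.I)  # all three match

# re.S: . also matches a newline.
without_s = re.search(r'a.b', 'a\nb')                        # None
with_s = re.search(r'a.b', 'a\nb', re.S)                     # matches 'a\nb'
```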

Common Practical Patterns

# Email (simplified)
r'[\w.-]+@[\w.-]+\.\w+'

# Phone (Russian format: +7-XXX-XXX-XX-XX)
r'\+7-\d{3}-\d{3}-\d{2}-\d{2}'

# IPv4 address (shape only; does not check that each octet is 0-255)
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# URL (simplified)
r'https?://[\w./\-?=&#]+'

# Normalize whitespace
re.sub(r'\s+', ' ', text).strip()

# Extract all numbers
numbers = [int(x) for x in re.findall(r'\d+', text)]

# Split by multiple delimiters
re.split(r'[,;.\s]+', text)
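
A sketch of applying the email and IP patterns above to free text. Because the IP pattern checks only the shape, the octet range is verified in plain Python afterwards. The constant names and the sample `text` are assumptions made for this example.

```python
import re

EMAIL = re.compile(r'[\w.-]+@[\w.-]+\.\w+')
IPV4 = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

text = ('Contact bob@example.com or ann.lee@mail.example.org '
        'from 192.168.0.1 or 999.1.1.1')

emails = EMAIL.findall(text)
# ['bob@example.com', 'ann.lee@mail.example.org']

# The regex accepts 999.1.1.1, so filter on the numeric range separately.
candidates = IPV4.findall(text)
valid_ips = [ip for ip in candidates
             if all(0 <= int(part) <= 255 for part in ip.split('.'))]
# ['192.168.0.1']
```

For strict validation, the standard library's ipaddress module is usually a better fit than a longer regex.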

Gotchas

Always use raw strings r'...' so you can write \d rather than the doubled '\\d' a normal string literal needs
re.match() only matches at string start - use re.search() for anywhere
findall() with groups returns groups, not full match - use non-capturing (?:...) if needed
re.split() with capturing groups includes the groups in the result
re.escape(string) escapes regex metacharacters so arbitrary text can be used as a literal pattern
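
The group-related gotchas and re.escape() listed above can be reproduced in a few lines. The inputs are invented for illustration.

```python
import re

# findall() with capturing groups returns the groups, not the full match.
pairs = re.findall(r'(\d+)-(\d+)', '10-20 30-40')        # [('10', '20'), ('30', '40')]
# Non-capturing groups restore the full-match behaviour.
ranges = re.findall(r'(?:\d+)-(?:\d+)', '10-20 30-40')   # ['10-20', '30-40']

# split() keeps the delimiters when the pattern contains a capturing group.
plain_split = re.split(r',', 'a,b,c')                    # ['a', 'b', 'c']
kept_split = re.split(r'(,)', 'a,b,c')                   # ['a', ',', 'b', ',', 'c']

# Unescaped, '1+1' means "one or more 1s, then a 1", so it does not find '1+1'.
unescaped = re.search('1+1', 'so 1+1=2')                 # None
escaped = re.search(re.escape('1+1'), 'so 1+1=2')        # matches '1+1'
```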