Get Started with Regex in 5 Minutes
Regex — short for regular expressions — is a way to describe patterns in text. Even a basic understanding lets you search, validate, and transform strings in ways that would otherwise take 20 lines of code.
I remember the first time I had to extract all email addresses from a messy log file. I spent two hours writing string manipulation logic before a senior dev showed me a single regex that did the job in one line. That moment changed how I thought about text processing. Regex is one of those rare skills where a weekend’s worth of practice pays dividends for the rest of your career.
Let’s start with the simplest possible example. Open Python and try this:
import re
# Placeholder addresses, used here for illustration only
text = "Contact us at support@example.com or sales@example.com"
matches = re.findall(r'\w+@\w+\.\w+', text)
print(matches)
# ['support@example.com', 'sales@example.com']
That \w+@\w+\.\w+ is your first regex pattern. It found two email addresses without a single loop. That’s the payoff.
Deep Dive: The Building Blocks
Regex is built from a small set of core symbols. Master these ~15 constructs and you can write almost any pattern you’ll encounter in real work.
Character Classes
\w: any word character (letters, digits, underscore)
\d: any digit (0–9)
\s: any whitespace (space, tab, newline)
\W, \D, \S: the opposites (uppercase = negation)
[abc]: any one of a, b, or c
[^abc]: any character except a, b, or c
[a-z]: any lowercase letter
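To see the classes above in action, here is a minimal sketch that runs a few of them over one made-up sample string (the string and variable names are illustrative only):

```python
import re

# Hypothetical sample text, chosen to contain letters, digits, and symbols
sample = "Bob_7 paid $42 on 2025-03-15"

digits = re.findall(r'\d+', sample)       # runs of digits
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscore
capitals = re.findall(r'[A-Z]', sample)   # single uppercase ASCII letters
symbols = re.findall(r'[^\w\s]', sample)  # neither word character nor whitespace

print(digits)    # ['7', '42', '2025', '03', '15']
print(words)     # ['Bob_7', 'paid', '42', 'on', '2025', '03', '15']
print(capitals)  # ['B']
print(symbols)   # ['$', '-', '-']
```

Note how the underscore in Bob_7 counts as a word character, which is why \w+ keeps it in one piece.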
Quantifiers
+: one or more
*: zero or more
?: zero or one (optional)
{3}: exactly 3 times
{2,5}: between 2 and 5 times
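A quick way to get a feel for quantifiers is to run them against strings of repeated characters. The sketch below uses made-up sample strings and wraps the counted patterns in \b so that each run of letters is tested as a whole:

```python
import re

# Hypothetical sample: runs of 1, 2, 3, 4, and 6 letters
runs = "a aa aaa aaaa aaaaaa"

exactly_three = re.findall(r'\ba{3}\b', runs)    # {3}: exactly 3
two_to_five = re.findall(r'\ba{2,5}\b', runs)    # {2,5}: between 2 and 5
one_or_more = re.findall(r'\ba+\b', runs)        # +: one or more

print(exactly_three)  # ['aaa']
print(two_to_five)    # ['aa', 'aaa', 'aaaa']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa', 'aaaaaa']

# ?: the "s" is optional
optional_s = re.findall(r'https?', "http and https")
print(optional_s)     # ['http', 'https']

# *: zero or more "b" after an "a"
zero_or_more = re.findall(r'ab*', "a ab abbb")
print(zero_or_more)   # ['a', 'ab', 'abbb']
```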
Anchors and Boundaries
^: start of string
$: end of string
\b: word boundary
Combine them in Python like this:
import re
# Match a date in YYYY-MM-DD format
pattern = r'\d{4}-\d{2}-\d{2}'
text = "Deployment date: 2025-03-15, rollback date: 2025-03-16"
dates = re.findall(pattern, text)
print(dates)
# ['2025-03-15', '2025-03-16']
# Match only standalone word "error" (not inside "errors" or "error_code")
pattern = r'\berror\b'
log = "found 3 errors but only one error was critical"
print(re.findall(pattern, log))
# ['error']
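The ^ and $ anchors are most useful for validation, where the whole string has to fit the pattern. Here is a small sketch (looks_like_date is a hypothetical helper that checks the shape only, not whether the date is real). It also shows two Python details worth knowing: $ matches just before a trailing newline, and re.MULTILINE makes ^ match at the start of every line.

```python
import re

DATE_SHAPE = r'\d{4}-\d{2}-\d{2}'

def looks_like_date(value):
    # ^ and $ pin the pattern to both ends of the string
    return re.search(r'^' + DATE_SHAPE + r'$', value) is not None

print(looks_like_date("2025-03-15"))        # True
print(looks_like_date("date: 2025-03-15"))  # False
print(looks_like_date("2025-03-15\n"))      # True: $ also matches before a final newline

# re.fullmatch is stricter: the entire string must match
print(re.fullmatch(DATE_SHAPE, "2025-03-15\n"))  # None

# By default ^ matches only at the very start; re.MULTILINE changes that
lines = "alpha one\nbeta two"
first_only = re.findall(r'^\w+', lines)
every_line = re.findall(r'^\w+', lines, flags=re.MULTILINE)
print(first_only)  # ['alpha']
print(every_line)  # ['alpha', 'beta']
```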
Groups and Capturing
Parentheses () create capture groups — they let you pull specific parts out of a match instead of getting the whole thing.
import re
log_line = "2025-03-15 14:23:01 ERROR Database connection failed"
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)'
match = re.match(pattern, log_line)
if match:
    date, time, level, message = match.groups()
    print(f"Date: {date}, Level: {level}")
    print(f"Message: {message}")
# Date: 2025-03-15, Level: ERROR
# Message: Database connection failed
Advanced Usage
Non-capturing Groups and Alternation
Sometimes you need grouping for logic, not for capturing output. That’s what (?:...) is for — it groups without polluting your results.
import re
# Match "color" or "colour" (British/American spelling)
pattern = r'colo(?:u)?r'
text = "The color and colour are both valid spellings"
print(re.findall(pattern, text))
# ['color', 'colour']
# Alternation with |
pattern = r'\b(cat|dog|bird)\b'
text = "I have a cat and a dog, but no bird"
print(re.findall(pattern, text))
# ['cat', 'dog', 'bird']
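The difference between the two kinds of group shows up clearly in re.findall, which returns the group contents whenever the pattern has a capture group. A short sketch with a made-up sentence:

```python
import re

text = "cats and dogs"

# Capturing group: findall returns only what the group captured
captured = re.findall(r'(cat|dog)s?', text)
print(captured)      # ['cat', 'dog']

# Non-capturing group: findall returns the whole match
whole_match = re.findall(r'(?:cat|dog)s?', text)
print(whole_match)   # ['cats', 'dogs']
```

If you only need the parentheses to scope the | alternation, the non-capturing form keeps findall's output to the full match.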
Lookahead and Lookbehind
Zero-width assertions are one of regex’s most powerful tricks. They check what surrounds a position without consuming any characters in the match.
import re
# Lookahead: match dollar amounts only when followed by " USD"
pattern = r'\$\d+(?:\.\d{2})?(?= USD)'
text = "Total: $49.99 USD, Tax: $3.50 USD"
print(re.findall(pattern, text))
# ['$49.99', '$3.50']
# Lookbehind: match version numbers only after "version "
pattern = r'(?<=version )\d+\.\d+'
text = "Running version 3.11, upgrading to version 3.12"
print(re.findall(pattern, text))
# ['3.11', '3.12']
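Both assertions also come in a negative form, (?!...) and (?<!...), which succeed only when the surrounding text does not match. A minimal sketch with made-up sample strings:

```python
import re

# Negative lookahead: whole numbers that are NOT followed by "%"
not_percent = re.findall(r'\b\d+\b(?!%)', "10% of 200 users, 5% of 40")
print(not_percent)   # ['200', '40']

# Negative lookbehind: whole numbers that are NOT preceded by "$"
not_dollars = re.findall(r'(?<!\$)\b\d+\b', "$15 for 3 items")
print(not_dollars)   # ['3']
```

The \b on each side matters here: without it, the engine could backtrack and match part of a number (the "1" in "10%", for example).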
Using Regex in Bash
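Regex isn’t Python-exclusive. In shell scripting, grep uses basic regular expressions (BRE) by default. Pass -E for extended regular expressions (ERE), which lets you use +, ?, and | as operators without backslashes. The older egrep spelling still works on many systems, but recent GNU grep releases mark it as obsolescent, so prefer grep -E.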
# Find all lines with IP addresses in a log file
grep -E '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' /var/log/nginx/access.log
# Count HTTP status codes (200, 404, 500...)
grep -oE ' [0-9]{3} ' access.log | sort | uniq -c
# Find lines that DON'T contain "GET"
grep -v 'GET' access.log
# Case-insensitive match for any severity level
grep -iE 'error|warning|critical' app.log
Regex for String Substitution
Find-and-replace is probably the most common real-world use case. Here’s a practical example: normalizing phone numbers that arrive in three different formats.
import re
# Standardize all phone variants to XXX-XXX-XXXX
def normalize_phone(text):
    return re.sub(
        r'\(?([0-9]{3})\)?[.\s-]?([0-9]{3})[.\s-]?([0-9]{4})',
        r'\1-\2-\3',
        text
    )
print(normalize_phone("Call (123) 456-7890 or 123.456.7890 or 123-456-7890"))
# Call 123-456-7890 or 123-456-7890 or 123-456-7890
Practical Tips for Writing Better Regex
Tip 1: Use Raw Strings in Python
Always prefix patterns with r. Without it, Python interprets backslashes before regex ever sees them — \n becomes a newline, \b becomes a backspace. Raw strings prevent that.
# Wrong — \b gets interpreted as backspace (ASCII 8)
pattern = "\bword\b"
# Correct — raw string preserves backslashes
pattern = r'\bword\b'
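You can check what Python actually hands to the regex engine by comparing the two strings directly. A small sketch (the sample sentence is made up):

```python
import re

plain = "\bword\b"    # Python turns each \b into a backspace character
raw = r"\bword\b"     # backslashes survive, so regex sees \b

print(len(plain))     # 6: backspace + "word" + backspace
print(len(raw))       # 8: every backslash is kept

sentence = "a word here"
plain_hit = re.search(plain, sentence)
raw_hit = re.search(raw, sentence)
print(plain_hit)          # None: the sentence contains no backspace characters
print(raw_hit.group())    # word

# Doubling the backslashes by hand gives the same string as the raw form
print("\\bword\\b" == raw)  # True
```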
Tip 2: Be Specific, Not Greedy
By default, .* is greedy — it matches as much text as possible. Add ? to flip it to lazy mode, which stops at the earliest possible match.
import re
html = "<b>bold</b> and <b>more bold</b>"
# Greedy — matches from first <b> all the way to last </b>
print(re.findall(r'<b>.*</b>', html))
# ['<b>bold</b> and <b>more bold</b>']
# Lazy — stops at first closing tag
print(re.findall(r'<b>.*?</b>', html))
# ['<b>bold</b>', '<b>more bold</b>']
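Being specific can also mean saying exactly which characters are allowed between the tags. One common alternative to the lazy quantifier, sketched here on the same sample, is a negated character class, which cannot run past the next "<" at all:

```python
import re

html = "<b>bold</b> and <b>more bold</b>"

# [^<]* matches any run of characters that are not "<",
# so the match has to stop before the closing tag
specific = re.findall(r'<b>[^<]*</b>', html)
print(specific)
# ['<b>bold</b>', '<b>more bold</b>']
```

This only works when the content between the tags contains no "<", so it is a sketch for flat snippets like this one, not for nested markup.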
Tip 3: Compile Patterns You Reuse
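Running the same pattern in a tight loop? Compile it first. Python’s re module keeps a cache of recently compiled patterns, so passing a string to re.findall() does not recompile it on every call, but each call still pays for the cache lookup. On short inputs that overhead is a large share of the work: in benchmarks with 100,000 iterations, compiled patterns ran roughly 2–3× faster. Expect a smaller gap as lines get longer, and measure on your own data.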
import re
# Compile once
email_pattern = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')
# Reuse many times
with open('emails.txt') as f:  # the with block closes the file when done
    for line in f:
        matches = email_pattern.findall(line)
        if matches:
            print(matches)
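If you want to measure the difference on your own machine, the standard library's timeit module is enough. The sketch below is one way to set it up; the sample line is hypothetical, and the timings will vary with your Python version, pattern, and input, so no expected numbers are shown.

```python
import re
import timeit

PATTERN = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
LINE = "Contact: alice@example.com, bob@example.org"  # hypothetical sample line

compiled = re.compile(PATTERN)

def with_string():
    # Module-level call: re looks the pattern up in its cache each time
    return re.findall(PATTERN, LINE)

def with_compiled():
    # Pre-compiled pattern object: no lookup needed
    return compiled.findall(LINE)

t_string = timeit.timeit(with_string, number=100_000)
t_compiled = timeit.timeit(with_compiled, number=100_000)

print(f"string pattern:   {t_string:.3f}s")
print(f"compiled pattern: {t_compiled:.3f}s")
```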
Tip 4: Test Your Patterns Interactively
Paste your pattern into regex101.com before putting it in code. It highlights exactly which characters each group matched, explains every token in plain English, and flags common mistakes. I’ve debugged patterns in 30 seconds there that would have taken 20 minutes in a Python REPL.
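Once a pattern behaves the way you want in the browser, the cases you tried can live next to the code as a small table-driven check. A minimal sketch, with hypothetical test cases for the \berror\b pattern from earlier:

```python
import re

pattern = re.compile(r'\berror\b')

# (input, should the pattern be found?) pairs; illustrative cases only
cases = [
    ("one error was critical", True),
    ("found 3 errors", False),
    ("error_code=5", False),
]

results = []
for text, expected in cases:
    found = pattern.search(text) is not None
    results.append(found == expected)
    print(f"{text!r}: found={found}, expected={expected}")
```

If a later edit to the pattern breaks one of these cases, the mismatch shows up right away.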
Tip 5: Know When NOT to Use Regex
Regex is the wrong tool for deeply structured formats. Parsing HTML? Use BeautifulSoup or lxml. Validating JSON? Use a schema validator like jsonschema. The rule of thumb: if your regex needs more than two levels of nesting or stateful logic, reach for a proper parser instead.
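To make the HTML point concrete, here is a sketch that uses the standard library's html.parser (swapped in for BeautifulSoup or lxml so the example has no third-party dependencies). BoldTextCollector is a hypothetical helper, and the nested markup is made up to show where the lazy regex from Tip 2 goes wrong.

```python
import re
from html.parser import HTMLParser

class BoldTextCollector(HTMLParser):
    """Collects the text inside each top-level <b> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # how many <b> tags are currently open
        self.chunks = []   # one string per top-level <b> element

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self.chunks.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks[-1] += data

html = "<b>bold</b> and <b>outer <b>inner</b> tail</b>"

# The lazy regex stops at the first closing tag, even a nested one
regex_result = re.findall(r'<b>(.*?)</b>', html)
print(regex_result)      # ['bold', 'outer <b>inner']

# The parser tracks nesting depth, which a regex cannot do
collector = BoldTextCollector()
collector.feed(html)
collector.close()
print(collector.chunks)  # ['bold', 'outer inner tail']
```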
A Real-World Mini Project
Here’s a script that parses an nginx access log and tallies requests by HTTP status code — the kind of thing you’d actually write on the job:
import re
from collections import Counter
log_pattern = re.compile(
    r'(?P<ip>[\d.]+) .+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>[^ ]+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)
status_counts = Counter()
with open('/var/log/nginx/access.log') as f:
    for line in f:
        match = log_pattern.match(line)
        if match:
            status_counts[match.group('status')] += 1
for status, count in sorted(status_counts.items()):
    print(f"HTTP {status}: {count} requests")
Named groups ((?P<name>...)) make the code self-documenting. You can read match.group('status') and immediately know what it contains — no counting parentheses to figure out which group number you need.
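Named groups also give you match.groupdict(), which returns every field at once as a dictionary. The sketch below reuses the same pattern on a single made-up log line (the IP address, path, and sizes are illustrative, not taken from a real log):

```python
import re

log_pattern = re.compile(
    r'(?P<ip>[\d.]+) .+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>[^ ]+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

# Hypothetical line in the common nginx access log layout
sample = ('203.0.113.9 - - [15/Mar/2025:14:23:01 +0000] '
          '"GET /index.html HTTP/1.1" 200 5123')

match = log_pattern.match(sample)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name}: {value}")
# ip: 203.0.113.9
# time: 15/Mar/2025:14:23:01 +0000
# method: GET
# path: /index.html
# status: 200
# size: 5123
```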
Regex feels cryptic for the first few hours. Then something clicks. Suddenly you start recognizing patterns in data you’d have looped through manually — and your code shrinks by half.

