Mastering Regular Expressions in Python: A Practical Guide
Chapter 1: Introduction to Regex in Python
In this chapter, we look at the fundamentals of regular expressions (regex) in Python, using the standard-library re module, and apply them to short examples.
Let's start by building patterns. The dot (.) matches any single character except a newline. For instance, the pattern b...k matches a 'b', followed by any three characters, followed by a 'k'.
In the second example, we look for text that begins and ends with 'y'. The combination of dot and star (.*) means that any characters can appear between the two 'y's, zero or more times. The middle part is therefore optional: it can exist, or it may not.
The last example uses the question mark: the character immediately before ? may occur zero or one time, so the pattern blocks? matches both "block" and "blocks".
import re

text = """A blockchain, originally block chain,
is a growing list of records, called blocks,
which are linked using cryptography yy yay."""
print(re.findall(r'b...k', text)) # ['block', 'block', 'block']
print(re.findall(r'y.*y', text)) # ['yptography yy yay']
print(re.findall(r'blocks?', text)) # ['block', 'block', 'blocks']
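The y.*y result starts inside "cryptography" rather than at the 'y' of "originally" because the dot does not cross line breaks by default. A minimal sketch of that behaviour, reusing the same text; the variable names are only illustrative:

```python
import re

text = """A blockchain, originally block chain,
is a growing list of records, called blocks,
which are linked using cryptography yy yay."""

# Default: '.' does not match a line break, so the match stays on the last line.
default_matches = re.findall(r"y.*y", text)

# With re.DOTALL, '.' also matches line breaks, so the match can span lines.
dotall_matches = re.findall(r"y.*y", text, flags=re.DOTALL)

print(default_matches)                # ['yptography yy yay']
print(len(dotall_matches))            # 1
print(dotall_matches[0].count("\n"))  # 2
```

With re.DOTALL the single match runs from the 'y' in "originally" on the first line to the final 'y' in "yay", which is why it contains both line breaks.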
Chapter 2: Greedy vs. Lazy Matching
In this chapter, we will examine how greedy and lazy matching can yield different results.
html = "<h1>hello world</h1>"
print(re.findall(r'<.*>', html)) # greedy - ['<h1>hello world</h1>']
print(re.findall(r'<.*?>', html)) # lazy - ['<h1>', '</h1>']
The first pattern is greedy: .* captures as much text as possible, so the match runs from the first '<' to the last '>'. The second is lazy: .*? stops at the first '>' it reaches, so each tag is matched on its own.
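The difference is easier to see on a string with more than one pair of tags. The sketch below uses a made-up sample string and also shows a common alternative to lazy matching, a negated character class ([^>]*), which gives the same result for this input:

```python
import re

# Hypothetical sample input with two pairs of tags.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first '<' to the last '>' in the string.
greedy = re.findall(r"<.*>", html)

# Lazy: stops at the first '>' after each '<'.
lazy = re.findall(r"<.*?>", html)

# Negated class: "anything except '>'", then the closing '>'.
negated = re.findall(r"<[^>]*>", html)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```

This is only an illustration of quantifier behaviour; for real HTML, a dedicated parser is usually the safer choice.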
Chapter 3: Utilizing Grouping and Character Ranges
Now, let's say we need to extract uppercase words from a dataset. We can achieve this by using [A-Z] to match any uppercase letter, with the plus sign (+) indicating one or more occurrences. The dollar sign ($) anchors the match to the end of the string, so only an uppercase run at the very end is returned.
pattern = re.compile(r"[A-Z]+$")
print(pattern.findall("aaaaHIDDENTEXT")) # ['HIDDENTEXT']
print(pattern.findall("aaaaHIDDENTEXTxxx")) # []
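If the uppercase words can appear anywhere in the text rather than only at the end, one option is to replace the $ anchor with word boundaries (\b). A small sketch on a made-up sentence:

```python
import re

# Hypothetical sample sentence.
sentence = "send the REPORT to HQ by FRIDAY, not Monday"

# \b marks a word boundary, so only words made entirely of A-Z match.
anywhere = re.findall(r"\b[A-Z]+\b", sentence)

# The end-anchored pattern finds nothing here: the string ends in lowercase.
at_end = re.findall(r"[A-Z]+$", sentence)

print(anywhere)  # ['REPORT', 'HQ', 'FRIDAY']
print(at_end)    # []
```

"Monday" is skipped because the boundary after its capital "M" is missing: the next character is still part of the same word.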
Character Range Example
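Sometimes, we don't need an exact number of characters but rather a range. The quantifier {3,5} accepts between three and five repetitions of the preceding element, here a digit. This is often useful for validating personal data, such as phone numbers.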
pattern = re.compile(r"^[0-9]{3,5}$")
value = "4145"
print(pattern.findall(value)) # ['4145']
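To see where the {3,5} limits fall, the same pattern can be run against a few made-up inputs. The sketch also shows one subtlety: $ matches just before a trailing newline, whereas re.fullmatch requires the whole string to match.

```python
import re

pattern = re.compile(r"^[0-9]{3,5}$")

# Hypothetical test values around the 3-to-5 digit limits.
candidates = ["12", "123", "4145", "12345", "123456", "12a45"]
accepted = [c for c in candidates if pattern.match(c)]
print(accepted)  # ['123', '4145', '12345']

# '$' also matches just before a newline at the very end of the string.
matches_with_newline = pattern.match("4145\n") is not None
print(matches_with_newline)  # True

# fullmatch must consume the entire string, so the trailing newline is rejected.
fullmatch_with_newline = re.fullmatch(r"[0-9]{3,5}", "4145\n") is not None
print(fullmatch_with_newline)  # False
```

When validating user input, the fullmatch form is often the stricter and less surprising of the two.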
Handling Phone Numbers
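Let's consider a scenario where users might include a space between the dialing code and the number, or they might write them together. In the pattern below, the escaped plus (\+) matches a literal '+', (\d){3} matches the three-digit dialing code, [ ]? allows an optional space, and [0-9]{9} matches the nine-digit number.
pattern = re.compile(r"^\+(\d){3}[ ]?[0-9]{9}$")
value = "+420 734857080"
print(pattern.match(value)) # Match found (prints a Match object)
value = "+420734857080"
print(pattern.match(value)) # Match found (prints a Match object)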
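Groups can also be used to pull the parts of the number out, not just to validate it. The sketch below is one possible variation, not the pattern used above: it moves the quantifier inside the parentheses so each group captures a whole part. The helper name parse_phone is hypothetical.

```python
import re

# Variation on the pattern above: quantifiers sit inside the groups.
phone = re.compile(r"^\+(\d{3}) ?(\d{9})$")

def parse_phone(value):
    """Return (dialing_code, number) or None if the value does not match."""
    m = phone.match(value)
    return m.groups() if m else None

print(parse_phone("+420 734857080"))  # ('420', '734857080')
print(parse_phone("+420734857080"))   # ('420', '734857080')
print(parse_phone("420734857080"))    # None (missing '+')

# A repeated group such as (\d){3} keeps only its last repetition.
last_digit = re.match(r"^\+(\d){3}", "+420").group(1)
print(last_digit)  # '0'
```

The last line shows why the placement of the quantifier matters: with (\d){3} the group holds only the final digit, while (\d{3}) holds all three.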
Chapter 4: Further Learning Resources
The first video titled "Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)" provides a comprehensive overview of utilizing the re module in Python for effective pattern matching.
The second video, "RegEx / Regular Expressions for Python (Python Part 17)," offers further insights into applying regular expressions in Python programming.