Regular-Expression
Regular Expression (EN)
This blog is written in English and then translated into Chinese with Google Translate.
Abstract
Introduction
What is a Regular Expression? Let’s have a look at Wikipedia’s definition:
A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
From the text above, we can conclude that:
- Regex is a sequence of characters; in other words, a Regex is a special kind of string.
- Regex is a powerful tool for string-searching algorithms.
To explain this more clearly, let’s assume we have a C++ program built around a function called `SearchString`. Suppose this function uses an efficient algorithm (such as the KMP algorithm) to judge whether a target string exists anywhere in the whole text. For example, if `text` is `I love SJTU` and `target` is `SJTU`, then the program will output “Matched”!
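As a rough illustration of the behaviour described above, here is a minimal Python sketch. It is not the C++ program itself: the name `search_string` is a hypothetical stand-in for `SearchString`, and Python's built-in `in` operator takes the place of a hand-written KMP search.

```python
def search_string(text: str, target: str) -> bool:
    """Return True if target occurs anywhere in text (plain substring search)."""
    # Assumption: the built-in substring test stands in for a KMP implementation.
    return target in text


text = "I love SJTU"
target = "SJTU"
print("Matched" if search_string(text, target) else "Not matched")
```

Note that this kind of search only answers "does this exact string occur?", which is the limitation the next paragraphs explore.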
But we want to make the function more powerful. For example, I want to search for a sentence that has the structure `I love ...`, where `...` stands for any word consisting of four uppercase English letters. In other words, every sentence with this structure should be reported as matched, and text without it should not.
We can certainly implement this by adding some code to `SearchString`, but that compromises generality. Over time, the algorithm turns into a pile of if-else statements, a mountain of messy code. In short, the algorithm is weak when facing looser string matching problems with more detailed requirements.
This problem does exist in the real world! For example, if you want to find all the files with the suffix `.md`, how would you do it from the command line?
Regex counts!
This is where the power of Regex shines. Regex offers advanced pattern matching capabilities: it supports complex patterns, quantifiers, character classes, and groups, which simple string matching lacks, and so it enables more flexible and powerful text search and manipulation. Moreover, Regex is a powerful tool in programming languages like Python and JavaScript. All the demos in this text use Python as the demonstration language.
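For the `I love ...` requirement described earlier, a Regex version might look like the sketch below. The function name `matches_structure` and the sample sentences are assumptions made for illustration; the pattern uses a character class `[A-Z]`, the quantifier `{4}`, and a word boundary `\b` so that exactly four uppercase letters form the word.

```python
import re

# "I love " followed by exactly four uppercase letters that end at a word boundary.
PATTERN = re.compile(r"I love [A-Z]{4}\b")


def matches_structure(text: str) -> bool:
    """Return True if text contains 'I love' plus a four-uppercase-letter word."""
    return PATTERN.search(text) is not None


for sample in ["I love SJTU", "I love sjtu", "I love MIT", "I love SJTUX"]:
    print(sample, "->", "Matched" if matches_structure(sample) else "Not matched")
```

Changing the requirement (say, to three letters) only means editing the pattern, not adding more if-else branches.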
For the question above, Regex lets us find files with a certain suffix using a single command!
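Since the demos in this text use Python, the same idea can be sketched there as well. The helper `files_with_suffix` and the file names below are hypothetical; the pattern is the escaped suffix followed by `$`, so the suffix has to sit at the end of the name.

```python
import re


def files_with_suffix(names, suffix=".md"):
    """Keep only the names that end with the given suffix."""
    # re.escape turns the "." in ".md" into a literal dot; "$" anchors the end.
    pattern = re.compile(re.escape(suffix) + r"$")
    return [name for name in names if pattern.search(name)]


# Hypothetical directory listing, used only for illustration.
names = ["regex.md", "notes.txt", "md-images", "python-basics.md"]
md_files = files_with_suffix(names)
print(md_files)
```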
In today’s blog, we will learn the basic usage of Regex and implement some simple applications with it.
Basic Usage
Learning maps
Now that we know what Regex can do, the next question to clarify is what Regex truly is.
- The fundamental principle: Regex is used for string matching problems.
- More specifically, it is used for more complex string matching problems.
What is complex? Imagine a picky person asking you to find all the strings that follow some peculiar and bizarre rules. The rules are easy for you to understand but hard for a computer. Therefore we can say: Regex is a set of rules that lets computers understand the “peculiar and bizarre rules” of complex string matching problems.
As in the question above, “find all file names with the suffix `.md`” is one such peculiar and bizarre rule.
There are countless peculiar and bizarre rules for string matching, so people cleverly created a unified notation for writing them down.
Syntax | Explanations |
---|---|
Special Position Matching | ✅Characters that match the start and end of a string or a word. ✅Zero-Width Assertion: match custom positions. |
Special Character Matching | ✅Digits, word characters, whitespace characters. ✅Escape characters for metacharacters. ✅Wildcard |
Character Classes | ✅Match one character from a set of possible characters. |
Repetition and Capturing | ✅Quantifiers ✅Grouping ✅Capturing |
So Regex is just a hand-made set of syntax rules for string matching problems! Don’t be afraid: you can learn Regex gradually as you study and work. You don’t need to learn all of the Regex syntax up front, since much of it is rarely needed in practice.
Metacharacters
Metacharacters are characters that carry a special meaning instead of matching themselves literally; they are often used to mark special positions and kinds of characters. The table below shows several of the most frequently used metacharacters, and we will discuss them through a concrete example.
Metacharacter | Description |
---|---|
`.` | Matches any single character (except newline). |
`^` | Matches the start of a string. |
`$` | Matches the end of a string. |
`*` | Matches 0 or more repetitions of the preceding element. |
`+` | Matches 1 or more repetitions of the preceding element. |
`?` | Matches 0 or 1 repetition of the preceding element (makes it optional). |
`\` | Escapes a metacharacter (e.g., `\.` matches a literal dot). |
`[]` | Matches any single character within the brackets (e.g., `[abc]`). |
`[^]` | Matches any single character not within the brackets (e.g., `[^abc]`). |
`()` | Groups patterns and captures the matched text. |
`{}` | Specifies a repetition count or range (e.g., `a{2,4}` matches 2 to 4 `a`s). |
`\d` | Matches any digit (equivalent to `[0-9]` for ASCII text). |
`\w` | Matches any word character (letters, digits, underscore). |
`\s` | Matches any whitespace character (space, tab, newline). |
`\b` | Matches a word boundary. |
`\B` | Matches a non-word boundary. |
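To get a feel for the table, the short Python sketch below tries several of these metacharacters with the standard `re` module. The sample strings are made up for illustration only.

```python
import re

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cut c\nt")

# ^ and $ anchor the pattern to the start and the end of the string
starts = bool(re.search(r"^I love", "I love SJTU"))
ends = bool(re.search(r"SJTU$", "I love SJTU"))

# *, + and ? control how often the preceding element may repeat
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional = re.findall(r"colou?r", "color colour")

# [] matches one character from a set, [^] one character outside the set
in_set = re.findall(r"[abc]", "a1b2z")
not_in_set = re.findall(r"[^abc\d]", "a1b2z")

# {} gives a repetition range
ranged = re.findall(r"a{2,4}", "a aa aaaaa")

# \d, \s and \b: digits, whitespace, word boundaries
digits = re.findall(r"\d+", "room 101, floor 3")
words = re.split(r"\s+", "a  b\tc")
whole_word = re.findall(r"\bcat\b", "cat concat cats")

# () captures parts of the match; \. is a literal dot
name, ext = re.search(r"(\w+)\.(md)", "regex.md").groups()

print(dot, star, plus, optional, in_set, not_in_set, ranged)
print(starts, ends, digits, words, whole_word, name, ext)
```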
Examples
Learning metacharacters is the most fundamental step in learning Regex, as the power of Regex lies in describing different patterns and positions. In the rest of this blog, I will use a practical example to show readers how to use Regex with the `grep` command to search for specific files and string patterns.
Example 1: search certain file name
The current directory is where I store all my blog files in Markdown format. If I run the `ls` command, I will see all the files and folders in the current directory.
If one day I want to find all the blog files about Python, I can use a simple `grep` command that searches for `python`.
The result, however, also includes several sub-folders that store blog images, which I don’t want displayed. So I can use a Regex that also requires a specific suffix:
`"python.*\.md"`: How do you read a Regex? Just read it from left to right and pay attention to every metacharacter and its special meaning! In this pattern, `python` contains no metacharacters, so it matches itself literally, while `.` is a very important metacharacter that matches any single character.
`*` is a greedy quantifier: it matches 0 or more repetitions of the preceding element and makes the match as long as possible. So `.*` is a very powerful combination that matches any run of characters!
`\` is the escape character, which removes the special function of a metacharacter. If you want to match a metacharacter itself, put `\` in front of it! In this example, `\` escapes `.`, so `\.` matches a literal `.`.
So now you know how to search for strings with a certain suffix such as `.md`!
There are many other metacharacters for more complex string-pattern matching; you can look them up in the table above as needed.
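The effect of the pattern can also be checked in Python. The sketch below uses a hypothetical list of file and folder names (not the real contents of my blog directory) and `re.search`, which, like `grep`, looks for the pattern anywhere in each name.

```python
import re

# Hypothetical directory entries, used only for illustration.
names = [
    "python-basics.md",
    "python-images",
    "regex.md",
    "learn-python-regex.md",
    "python.mdx",
]

# Plain search: every name containing "python", folders included.
plain = [n for n in names if re.search(r"python", n)]

# The pattern from the example: "python", then anything, then a literal ".md".
with_suffix = [n for n in names if re.search(r"python.*\.md", n)]

# Adding "$" additionally requires ".md" to be the very end of the name.
anchored = [n for n in names if re.search(r"python.*\.md$", n)]

print(plain)
print(with_suffix)
print(anchored)
```

One design note: because the pattern is not anchored, `python.*\.md` still accepts a name like `python.mdx`. If that matters for your files, the `$` variant is the stricter choice.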
Advanced Techniques
Zero-Width Assertion
Zero-width assertions are advanced features in regular expressions that allow you to match specific conditions at a position in the text without consuming characters. They are called “zero-width” because they do not contribute to the match result but instead assert whether a pattern exists (or does not exist) at a certain position.
Types of Zero-Width Assertions:
- Positive Lookahead (`(?=...)`)
  Asserts that a pattern must follow the current position.
  Example: `\d+(?= dollars)` matches numbers followed by “ dollars” but does not include “dollars” in the match.
- Negative Lookahead (`(?!...)`)
  Asserts that a pattern must not follow the current position.
  Example: `\d+(?! dollars)` matches numbers not followed by “ dollars”. Be careful: on `100 dollars` it still matches `10`, because the engine backtracks to a shorter run of digits; `\d+(?!\d| dollars)` avoids this.
- Positive Lookbehind (`(?<=...)`)
  Asserts that a pattern must precede the current position.
  Example: `(?<=\$)\d+` matches numbers preceded by a dollar sign.
- Negative Lookbehind (`(?<!...)`)
  Asserts that a pattern must not precede the current position.
  Example: `(?<!\$)\d+` matches numbers not preceded by a dollar sign. Similarly, on `$250` it still matches `50`, because the `5` is preceded by `2` rather than `$`; `(?<![\$\d])\d+` avoids this.
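The four assertions can be tried out with Python's `re` module. The sentence below is made up for illustration, and the "guarded" patterns are one possible way to stop a match from starting or ending in the middle of a number, not the only one.

```python
import re

text = "I paid 100 dollars, then $250, and kept 42 coins."

# Positive lookahead: digits followed by " dollars" (which is not consumed).
pos_ahead = re.findall(r"\d+(?= dollars)", text)

# Negative lookahead: the naive form backtracks and keeps part of "100".
neg_ahead_naive = re.findall(r"\d+(?! dollars)", text)
# Guarded form: the number must not be followed by another digit either.
neg_ahead = re.findall(r"\d+(?!\d| dollars)", text)

# Positive lookbehind: digits preceded by a dollar sign.
pos_behind = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: the naive form starts matching inside "250".
neg_behind_naive = re.findall(r"(?<!\$)\d+", text)
# Guarded form: the number must not be preceded by "$" or by a digit.
neg_behind = re.findall(r"(?<![\$\d])\d+", text)

print(pos_ahead, neg_ahead_naive, neg_ahead)
print(pos_behind, neg_behind_naive, neg_behind)
```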
Conclusion
Just use Regex in real practice! I will update this blog when I gain new insights into Regex~
References
Regexr : a powerful website for practicing Regex.
https://www.regular-expressions.info/ : You can search for more advanced usage regarding Regex on this website.