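Python RegEx

A **Re**gular **Ex**pression (RegEx) is a sequence of characters that defines a search pattern. For example,

^a...s$

The above code defines a RegEx pattern. The pattern is: **any five-character string starting with a and ending with s**.

A pattern defined using RegEx can be used to match against a string.

Expression	String	Matched?
`^a...s$`	`abs`	No match
	`alias`	Match
	`abyss`	Match
	`Alias`	No match
	`An abacus`	No match

Python has a module named re to work with RegEx. Here's an example: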

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")

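Here, we used the re.match() function to check whether pattern matches test_string. re.match() only tries to match at the beginning of the string; it returns a match object if the match succeeds and None otherwise.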
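As a quick check, the rows of the table above can be reproduced with re.match(). This is a minimal sketch; the names test_strings and results are illustrative and not part of the re module.

```python
import re

# The pattern and the test strings from the table above.
pattern = r'^a...s$'
test_strings = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

# re.match() returns a match object (truthy) or None (falsy).
results = {s: bool(re.match(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s!r}: {'match' if matched else 'no match'}")
```

Only alias and abyss are reported as matches, which agrees with the table.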

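There are several other functions defined in the re module to work with RegEx. Before we explore them, let's learn about regular expressions themselves.

If you already know the basics of RegEx, jump to Python RegEx.

Specify Pattern Using RegEx

To specify regular expressions, metacharacters are used. In the above example, ^ and $ are metacharacters.

MetaCharacters

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters: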

[] . ^ $ * + ? {} () \ |

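[] - Square brackets

Square brackets specify a set of characters you wish to match.

Expression	String	Matched?
`[abc]`	`a`	1 match
	`ac`	2 matches
	`Hey Jude`	No match
	`abc de ca`	5 matches

Here, [abc] will match if the string you are trying to match contains any of a, b or c.

You can also specify a range of characters using - inside square brackets.

[a-e] is the same as [abcde].
[1-4] is the same as [1234].
[0-39] is the same as [01239].

You can complement (invert) the character set by using the caret ^ symbol at the start of a square bracket.

[^abc] means any character except a, b or c.
[^0-9] means any non-digit character.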
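The character-set rules above can be tried out with re.findall(), which is covered later in this tutorial. This is a small sketch; the variable names and the sample strings '0123456789' and 'a1b2' are illustrative.

```python
import re

# Each character in the set is matched one at a time.
set_matches = re.findall(r'[abc]', 'abc de ca')
print(set_matches)      # ['a', 'b', 'c', 'c', 'a']

# [0-39] is the range 0-3 plus the single character 9.
range_matches = re.findall(r'[0-39]', '0123456789')
print(range_matches)    # ['0', '1', '2', '3', '9']

# A leading ^ inverts the set: everything except digits.
non_digits = re.findall(r'[^0-9]', 'a1b2')
print(non_digits)       # ['a', 'b']
```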

. - **Period**

A period matches any single character (except newline '\n').

Expression	String	Matched?
`..`	`a`	No match
	`ac`	1 match
	`acd`	1 match
	`acde`	2 matches (contains 4 characters)
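The match counts in this table come from non-overlapping matches found from left to right. A short sketch (variable names and the string 'a\nb' are illustrative) shows this, along with the newline exception:

```python
import re

# Two characters are consumed per match; matches do not overlap.
four_chars = re.findall(r'..', 'acde')
print(four_chars)       # ['ac', 'de']

# The trailing 'd' has no partner, so only one match is found.
three_chars = re.findall(r'..', 'acd')
print(three_chars)      # ['ac']

# By default the period does not match the newline character.
around_newline = re.findall(r'.', 'a\nb')
print(around_newline)   # ['a', 'b']
```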

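^ - **Caret**

The caret symbol ^ is used to check if a string **starts with** a certain character.

Expression	String	Matched?
`^a`	`a`	1 match
	`abc`	1 match
	`bac`	No match
`^ab`	`abc`	1 match
`^ab`	`acb`	No match (starts with `a` but not followed by `b`)

$ - **Dollar**

The dollar symbol $ is used to check if a string **ends with** a certain character.

Expression	String	Matched?
`a$`	`a`	1 match
	`formula`	1 match
	`cab`	No match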
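Both anchors can be checked with re.search(), which is described later in this tutorial. This is a minimal sketch with illustrative variable names:

```python
import re

# ^ab must match at the very start of the string.
starts_with_ab = [bool(re.search(r'^ab', s)) for s in ['abc', 'acb']]
print(starts_with_ab)   # [True, False]

# a$ must match at the very end of the string.
ends_with_a = [bool(re.search(r'a$', s)) for s in ['a', 'formula', 'cab']]
print(ends_with_a)      # [True, True, False]
```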

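* - **Star**

The star symbol * matches **zero or more occurrences** of the pattern left to it.

Expression	String	Matched?
`ma*n`	`mn`	1 match
	`man`	1 match
	`maaan`	1 match
	`main`	No match (`a` is not followed by `n`)
	`woman`	1 match

+ - **Plus**

The plus symbol + matches **one or more occurrences** of the pattern left to it.

Expression	String	Matched?
`ma+n`	`mn`	No match (no `a` character)
	`man`	1 match
	`maaan`	1 match
	`main`	No match (`a` is not followed by `n`)
	`woman`	1 match

? - **Question Mark**

The question mark symbol ? matches **zero or one occurrence** of the pattern left to it.

Expression	String	Matched?
`ma?n`	`mn`	1 match
	`man`	1 match
	`maaan`	No match (more than one `a` character)
	`main`	No match (`a` is not followed by `n`)
	`woman`	1 match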
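The three quantifier tables can be compared side by side with a small helper. This is only a sketch; matching_words is a hypothetical helper written for this illustration, not a function of the re module.

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']

def matching_words(pattern):
    """Return the words in which the pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

star = matching_words(r'ma*n')       # zero or more 'a'
plus = matching_words(r'ma+n')       # one or more 'a'
optional = matching_words(r'ma?n')   # zero or one 'a'

print(star)       # ['mn', 'man', 'maaan', 'woman']
print(plus)       # ['man', 'maaan', 'woman']
print(optional)   # ['mn', 'man', 'woman']
```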

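{} - **Braces**

Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern left to it.

Expression	String	Matched?
`a{2,3}`	`abc dat`	No match
	`abc daat`	1 match (at `daat`)
	`aabc daaat`	2 matches (at `aabc` and `daaat`)
	`aabc daaaat`	2 matches (at `aabc` and `daaaat`)

Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 digits but not more than 4 digits. Write the quantifier without a space: in Python's re module, {2, 4} is treated as literal text, not as a repetition count.

Expression	String	Matched?
`[0-9]{2,4}`	`ab123csde`	1 match (`123`)
	`12 and 345673`	3 matches (`12`, `3456`, `73`)
	`1 and 2`	No match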
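The brace examples above can be verified with re.findall(). This is a minimal sketch with illustrative variable names. Note that the quantifier is greedy, so it takes as many repetitions as allowed before the next match starts.

```python
import re

# 'aaaa' yields a single match of three a's; the leftover 'a' is too short.
a_runs = re.findall(r'a{2,3}', 'aabc daaaat')
print(a_runs)        # ['aa', 'aaa']

# '345673' is split into a 4-digit match followed by a 2-digit match.
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')
print(digit_runs)    # ['12', '3456', '73']
```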

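| - **Alternation**

The vertical bar | is used for alternation (the or operator).

Expression	String	Matched?
`a|b`	`cde`	No match
	`ade`	1 match (`a`)
	`acdbea`	3 matches (`a`, `b`, `a`)

Here, a|b matches any string that contains either a or b.

() - **Group**

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b or c followed by xz.

Expression	String	Matched?
`(a|b|c)xz`	`ab xz`	No match
	`abxz`	1 match (`bxz`)
	`axz cabxz`	2 matches (`axz` and `bxz`)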
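The alternation and group tables can be reproduced as follows. This is a sketch with illustrative variable names. One detail worth knowing: when a pattern contains a capturing group, re.findall() returns the text of the group rather than the whole match, so re.finditer() is used here to get the full matched text.

```python
import re

alternation = re.findall(r'a|b', 'acdbea')
print(alternation)    # ['a', 'b', 'a']

# Full text of every match of the grouped pattern.
full_matches = [m.group() for m in re.finditer(r'(a|b|c)xz', 'axz cabxz')]
print(full_matches)   # ['axz', 'bxz']

# With one capturing group, findall() returns only the group contents.
group_only = re.findall(r'(a|b|c)xz', 'axz cabxz')
print(group_only)     # ['a', 'b']
```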

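\ - **Backslash**

The backslash \ is used to escape various characters including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way.

If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is not treated in a special way. Do not do this with letters or digits, because a backslash followed by a letter or digit can form a special sequence such as \d.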
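The effect of escaping can be seen by searching for $a with and without the backslash. This is a small sketch; the sample strings and variable names are illustrative. It also uses re.escape(), a function of the re module that escapes the metacharacters in a string for you.

```python
import re

price = 'cost: $a'

# Escaped: $ is a literal dollar sign.
escaped_match = re.search(r'\$a', price)
print(escaped_match.group())     # $a

# Unescaped: $ means end of string, so nothing can follow it.
unescaped_match = re.search(r'$a', price)
print(unescaped_match)           # None

# re.escape() builds a pattern that matches the text literally.
literal_pattern = re.escape('1+1=2')
literal_match = re.search(literal_pattern, 'so 1+1=2')
print(literal_match.group())     # 1+1=2
```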

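Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

\A - Matches if the specified characters are at the start of a string.

Expression	String	Matched?
`\Athe`	`the sun`	Match
`\Athe`	`In the sun`	No match

\b - Matches if the specified characters are at the beginning or end of a word.

Expression	String	Matched?
`\bfoo`	`football`	Match
	`a football`	Match
	`afootball`	No match
`foo\b`	`the foo`	Match
	`the afoo test`	Match
	`the afootest`	No match

\B - Opposite of \b. Matches if the specified characters are **not** at the beginning or end of a word.

Expression	String	Matched?
`\Bfoo`	`football`	No match
	`a football`	No match
	`afootball`	Match
`foo\B`	`the foo`	No match
	`the afoo test`	No match
	`the afootest`	Match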
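The \b and \B tables can be checked with a short sketch (variable names are illustrative). The patterns are written as raw strings with the r prefix, because in a normal Python string '\b' means a backspace character.

```python
import re

texts = ['football', 'a football', 'afootball']

# \bfoo: 'foo' must begin at a word boundary.
at_word_start = [bool(re.search(r'\bfoo', t)) for t in texts]
print(at_word_start)       # [True, True, False]

# \Bfoo: 'foo' must NOT begin at a word boundary.
not_at_word_start = [bool(re.search(r'\Bfoo', t)) for t in texts]
print(not_at_word_start)   # [False, False, True]
```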

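\d - Matches any decimal digit. Equivalent to [0-9] for ASCII text.

Expression	String	Matched?
`\d`	`12abc3`	3 matches (`1`, `2`, `3`)
`\d`	`Python`	No match

\D - Matches any character that is not a decimal digit. Equivalent to [^0-9]

Expression	String	Matched?
`\D`	`1ab34"50`	3 matches (`a`, `b`, `"`)
`\D`	`1345`	No match

\s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression	String	Matched?
`\s`	`Python RegEx`	1 match
`\s`	`PythonRegEx`	No match

\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression	String	Matched?
`\S`	`a b`	2 matches (`a` and `b`)
`\S`	`   `	No match

\w - Matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_] for ASCII text. By the way, underscore _ is also considered an alphanumeric character.

Expression	String	Matched?
`\w`	`12&": ;c`	3 matches (`1`, `2`, `c`)
`\w`	`%"> !`	No match

\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

Expression	String	Matched?
`\W`	`1a2%c`	1 match (`%`)
`\W`	`Python`	No match

\Z - Matches if the specified characters are at the end of a string.

Expression	String	Matched?
`Python\Z`	`I like Python`	1 match
	`I like Python Programming`	No match
	`Python is fun.`	No match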
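The character-class sequences above can be tried on the same strings used in the tables. This is a minimal sketch with illustrative variable names:

```python
import re

digits = re.findall(r'\d', '12abc3')
non_digits = re.findall(r'\D', '1ab34"50')
spaces = re.findall(r'\s', 'Python RegEx')
word_chars = re.findall(r'\w', '12&": ;c')
non_word_chars = re.findall(r'\W', '1a2%c')

print(digits)           # ['1', '2', '3']
print(non_digits)       # ['a', 'b', '"']
print(spaces)           # [' ']
print(word_chars)       # ['1', '2', 'c']
print(non_word_chars)   # ['%']

# \Z anchors the pattern to the end of the string.
ends_with_python = bool(re.search(r'Python\Z', 'I like Python'))
print(ends_with_python)   # True
```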

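**Tip:** To build and test regular expressions, you can use a RegEx tester tool such as regex101. This tool not only helps you create regular expressions, but also helps you learn them.

Now that you understand the basics of RegEx, let's discuss how to use RegEx in your Python code.

Python RegEx

Python has a module named re to work with regular expressions. To use it, we need to import the module.

import re

The module defines several functions and constants to work with RegEx.

re.findall()

The re.findall() method returns a list of strings containing all matches.

Example 1: re.findall()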


# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

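If the pattern is not found, re.findall() returns an empty list.

re.split()

The re.split() method splits the string where there is a match and returns a list of the resulting substrings.

Example 2: re.split()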


import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

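If the pattern is not found, re.split() returns a list containing the original string.

You can pass the maxsplit argument to the re.split() method. It's the maximum number of splits that will occur.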


import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

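By the way, the default value of maxsplit is 0, meaning all possible splits.

re.sub()

The syntax of re.sub() is:

re.sub(pattern, replace, string)

The method returns a string where matched occurrences are replaced with the content of the replace variable.

Example 3: re.sub()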


# Program to remove all whitespaces
import re

# string literal continued across two lines; it contains one newline (\n)
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456

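If the pattern is not found, re.sub() returns the original string.

You can pass count as the fourth parameter to the re.sub() method, preferably as a keyword argument. If omitted, it defaults to 0, which replaces all occurrences.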


import re

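# string literal continued across two lines; it contains one newline (\n)
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

# count=1 replaces only the first occurrence
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

# Output:
# abc12de 23
#  f45 6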

re.subn()

re.subn() is similar to re.sub() except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

Example 4: re.subn()


# Program to remove all whitespaces
import re

# string literal continued across two lines; it contains one newline (\n)
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)

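re.search()

The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None.

match = re.search(pattern, string)

Example 5: re.search()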


import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

# Output: pattern found inside the string

Here, match contains a match object.
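Because re.search() scans the whole string while re.match() only tries the beginning, the two can give different results for the same input. The following sketch illustrates the difference; the sample text and variable names are illustrative.

```python
import re

text = 'I like Python'

# re.search() finds 'Python' anywhere in the string.
found_by_search = re.search(r'Python', text)
print(found_by_search.group())   # Python

# re.match() only tries position 0, where the text starts with 'I'.
found_by_match = re.match(r'Python', text)
print(found_by_match)            # None
```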

Match object

You can get the methods and attributes of a match object using the dir() function.
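For instance, the following sketch lists the public names of a match object. The pattern, the sample string and the name public_names are illustrative; the exact list printed depends on your Python version.

```python
import re

match = re.search(r'\d+', 'hello 12')

# Keep only the public names (those without a leading underscore).
public_names = [name for name in dir(match) if not name.startswith('_')]
print(public_names)
```

The list includes the methods and attributes discussed below, such as group, start, end, span, re and string.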

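Some of the commonly used methods and attributes of match objects are:

match.group()

The group() method returns the part of the string where there is a match.

Example 6: Match object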


import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

# Output: 801 35

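Here, the match variable contains a match object.

Our pattern (\d{3}) (\d{2}) has two subgroups, (\d{3}) and (\d{2}). You can get the part of the string matched by each of these parenthesized subgroups. Here's how: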

>>> match.group(1)
'801'

>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')

>>> match.groups()
('801', '35')

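match.start(), match.end() and match.span()

The start() function returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring, which is one past its last character.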

>>> match.start()
2
>>> match.end()
8

The span() function returns a tuple containing the start and end index of the matched part.

>>> match.span()
(2, 8)
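Since the end index is exclusive, the two values from span() can be used directly as a slice of the original string. This is a small sketch that repeats the pattern and string from Example 6; the names matched_text and group_spans are illustrative.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing with the span gives the same text as match.group().
start, end = match.span()
matched_text = string[start:end]
print(matched_text)    # 801 35

# span() also accepts a group number.
group_spans = (match.span(1), match.span(2))
print(group_spans)     # ((2, 5), (6, 8))
```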

match.re and match.string

The re attribute of a match object returns the regular expression object. Similarly, the string attribute returns the passed string.

>>> match.re
re.compile('(\\d{3}) (\\d{2})')

>>> match.string
'39801 356, 2102 1111'

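We have covered all commonly used methods defined in the re module. If you want to learn more, visit Python 3 re module.

Using r prefix before RegEx

When the r or R prefix is used before a regular expression, it means raw string. For example, '\n' is a new line whereas r'\n' means two characters: a backslash \ followed by n.

Backslash \ is used to escape various characters including all metacharacters. However, using the r prefix makes Python treat \ as a normal character, so the backslash reaches the RegEx engine unchanged.

Example 7: Raw string using r prefix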


import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']
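The difference between normal and raw strings matters most for sequences such as \b, which Python itself interprets as a backspace character in a normal string. The following sketch shows this; the sample text and variable names are illustrative.

```python
import re

# '\n' is one character (a newline); r'\n' is two (backslash and n).
normal_length = len('\n')
raw_length = len(r'\n')
print(normal_length, raw_length)   # 1 2

# Without the r prefix, '\b' is a backspace, not a word boundary.
without_raw = re.findall('\bcat\b', 'a cat sat')
print(without_raw)                 # []

# With the r prefix, \b reaches the RegEx engine as a word boundary.
with_raw = re.findall(r'\bcat\b', 'a cat sat')
print(with_raw)                    # ['cat']
```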
