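Using regex in C++

Introduction

Regular expressions (regex) are a very practical tool. Almost every language supports them, and the syntax differs only slightly from one language to the next. This article covers how to use the regex library in C++. Regular expressions are not specific to C++, so we assume you are already familiar with regular expression syntax.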

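Quick start

A simple match

Working with a regular expression involves two stages: compiling and matching. In the regex library, you first compile the pattern and then match it against a string. Note that in C++ these are always two separate operations, whereas some other languages also offer an interface that matches directly from a pattern string (Python, for example, where beginners usually choose direct matching).

So using a regular expression in C++ takes two steps:

Construct a regex object; this is the compilation mentioned above.
Match with regex_match / regex_search, or with one of the other matching facilities.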

#include <iostream>
#include <regex>
#include <string>
using std::cout;
using std::endl;

// Spell check: "ei" should not appear in a word unless the preceding letter is c.
void demo(const std::string &s) {
    static const std::regex pattern(R"(\w*[^c]ei\w*)");
    // Equivalent without a raw string: std::regex pattern("\\w*[^c]ei\\w*");
    // C++11: backslashes inside R"(...)" need no escaping.
    std::smatch match;
    // smatch goes with std::string; use cmatch for const char *.
    if (std::regex_search(s, match, pattern)) {  // returns bool; match receives the details
        // Print the first match.
        cout << match.str() << endl;
    } else {
        // No match found.
        cout << "None" << endl;
    }
}

int main() {
    demo("Hello, friend!");  // None
    demo("Hello, freind!");  // freind
    return 0;
}

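In this example we first compile the pattern \w*[^c]ei\w*, which matches a word containing this spelling mistake; \w is equivalent to [a-zA-Z0-9_]. The first call in main prints None and the second prints freind.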

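Regex compilation versus C++ compilation

"Compiling" a regular expression is not compilation in the C++ sense. It is a regular expression concept that refers to preprocessing the pattern. Regex compilation happens at run time; in C++ it is simply the construction of a std::regex object.

Separating compilation from matching makes better performance possible. If both happened inside a single call, the pattern would have to be recompiled on every call; with the two separated, a pattern can be compiled once and used many times, which is a very common situation in practice.

Finally, C++ provides no function that takes a pattern string and compiles and matches in one call; the closest you can get is passing a temporary std::regex to the matching function.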

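regex_match versus regex_search

regex_match matches against the entire string, while regex_search looks for a matching substring within the string. The second is usually the more commonly used.

For example, these two calls are equivalent for a simple pattern (if the pattern contains a top-level alternation |, wrap it in a group before adding the anchors):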

std::regex_match(s, match, std::regex("..."));
std::regex_search(s, match, std::regex("^...$"));
// where ^ matches the start of the string and $ matches the end.

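The match object that receives the result carries further information as well:

`match` member function	Purpose
`match.ready()`	Returns a bool: whether `match` has received the result of a matching call.
`match.str()`	Returns the matched string.
`match.position()`	Returns the start position of the match.
`match.length()`	Returns the length of the match.
`match.prefix()`	Returns the text before the match.
`match.suffix()`	Returns the text after the match.
`match.size()`	Returns the number of sub-matches, counting the full match as number 0; returns 0 if nothing matched. Sub-matches, and further related member functions, are covered in a later section.

A match object behaves like a view: it stores iterators into the string that was searched rather than a copy of the text. Do not keep a match object (or the regex_iterator introduced later) around after that string has been modified or destroyed.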

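Processing all matches

In the example above, regex_search finds the first match in the string. How do we find all of them?

One approach is to call regex_search in a loop, slicing the string each time. But C++ provides a facility for this: regex_iterator.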

#include <iostream>
#include <regex>
#include <string>
using std::cout;
using std::endl;

void demo(const std::string &s) {
    static const std::regex pattern(R"(\w*[^c]ei\w*)");
    // Like a stream iterator, a default-constructed sregex_iterator marks the end of the sequence.
    for (std::sregex_iterator it(s.begin(), s.end(), pattern), end_it; it != end_it; ++it) {
        // *it is an smatch object. position() and length() return a signed type,
        // so print with cout instead of printf("%lu").
        cout << '[' << it->position() << ':' << it->position() + it->length() << "]: "
             << it->str() << endl;
    }
}

int main() {
    demo("Hello, freind!\nHi, freind!");
    return 0;
}

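Options

regex offers a number of options, which fall into three groups: character or storage type, compilation options, and match flags.

Character or storage type

In C++ regular expressions, different character or storage types require different regex types: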

Character and storage type	basic_regex	match_results	regex_iterator
`const char *`	`std::regex`	`std::cmatch`	`std::cregex_iterator`
`std::string`	`std::regex`	`std::smatch`	`std::sregex_iterator`
`const wchar_t *`	`std::wregex`	`std::wcmatch`	`std::wcregex_iterator`
`std::wstring`	`std::wregex`	`std::wsmatch`	`std::wsregex_iterator`

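Any type mismatch results in an error. For example, if the string being matched is a const char *, you must use std::regex with std::cmatch; if it is a std::string, you must use std::regex with std::smatch. The pattern and the string being matched must also use the same character type (char or wchar_t).

Why is it done this way?

Some readers may ask directly: why not just use templates? In fact, these types are implemented with templates.

The types above are aliases for specializations of the class templates basic_regex / match_results / regex_iterator.

But why aliases? Before C++17, C++ could not deduce the template arguments of a class automatically, so the library provides these convenient names; wstring is likewise an alias for basic_string<wchar_t>. Functions such as regex_search need no aliases precisely because function template arguments are deduced automatically.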

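Compilation options

std::regex pat(pattern, flag=std::regex_constants::ECMAScript);
// where
// pattern is the pattern: a character array or a string
// flag is the set of option flags; its type is the bitmask type std::regex_constants::syntax_option_type

flag supports the following values: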

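Compilation flag (std::regex_constants)	Purpose
`icase`	Ignore case.
`nosubs`	Do not store sub-expression matches.
`optimize`	Optimize for matching (construction takes longer, matching gets faster).
`ECMAScript`	Use the regular expression grammar specified by `ECMA-262` (the default).
`basic` / `extended` / `awk` / `grep` / `egrep`	Use the `POSIX` basic / extended / `awk` / `grep` / `egrep` regular expression grammar.

(Only one of the six grammars may be selected at a time.)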

Match flags

std::regex_search(str, match, pattern, mft=std::regex_constants::match_default);
std::regex_search(begin, end, match, pattern, mft=std::regex_constants::match_default);
// where
// str is the string to search: a character array or a string
// begin / end are iterators, which can be used in place of str
// match is a match_results type: cmatch / smatch / wcmatch / wsmatch
// pattern is a basic_regex type: regex / wregex
// mft is the set of match flags; its type is the bitmask type std::regex_constants::match_flag_type

regex_iterator accepts mft as well.

std::sregex_iterator it(begin, end, pattern, mft=std::regex_constants::match_default);

The format and regex_replace functions introduced later also accept mft.

mft supports the following values:

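Match flag (std::regex_constants)	Purpose
`match_default` / `format_default`	Default matching / formatting behavior.
`match_not_bol` / `match_not_eol` / `match_not_bow` / `match_not_eow`	Do not treat the first / last character as the beginning / end of a line or word.
`match_any`	If more than one match is possible, `search` may return any of them instead of guaranteeing the first.
`format_sed`	Use the `POSIX` `sed` replacement rules.
`format_no_copy`	`replace` does not copy the parts of the input that did not match (the `prefix` and `suffix`) to the output.
`format_first_only`	`replace` replaces only the first match.

Only the commonly used flags are listed here, not every one.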

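Sub-matches: extraction and replacement

Sub-matches

Like most regular expression implementations, C++ regex supports sub-matches, which are also called groups in some contexts.

Anything enclosed in () is a sub-match. Because groups can nest, sub-matches form a tree; they are numbered in depth-first order, that is, by the position of their opening parenthesis from left to right. The full match is sub-match 0.

Consider the following example: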

// Match an ID card number.
std::regex pattern(
    "(\\d{6})[ \r\t\n\v\f]?((\\d{4})(\\d{2})(\\d{2}))[ \r\t\n\v\f]?(\\d{3}[0-9X])", 
    std::regex_constants::icase
);
std::cmatch match;
std::regex_search("ID: 33000020000101000X", match, pattern);
cout << match.str() << endl;
cout << match.str(1) << ' ' << match.str(2) << ' ' << match.str(6) << endl;
cout << match.str(3) << ' ' << match.str(4) << ' ' << match.str(5) << endl;
/* Output:
33000020000101000X
330000 20000101 000X
2000 01 01
*/

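Here we pass an argument to str() to select the corresponding sub-match. The following are also available for sub-matches:

Sub-match facility	Purpose
`match.size()`	Returns the number of sub-matches, counting the full match as number 0; returns 0 if nothing matched.
`match.str(i=0)`	Returns the $i$-th sub-match.
`match.position(i=0)`	Returns the start position of the $i$-th sub-match.
`match.length(i=0)`	Returns the length of the $i$-th sub-match.
`match.prefix()`	Returns the text before the full match, as a `sub_match`.
`match.suffix()`	Returns the text after the full match, as a `sub_match`.
`sub_match::operator string_type() const;`	A sub-match converts implicitly to `string`.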

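Extraction and replacement

Two more important facilities build on sub-matches: the format member function of match_results and the regex_replace function, which offer a more convenient interface. In both, $\$i$ in the format string refers to the $i$-th sub-match.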

std::regex pattern(
    "(\\d{6})[ \r\t\n\v\f]?((\\d{4})(\\d{2})(\\d{2}))[ \r\t\n\v\f]?(\\d{3}[0-9X])", 
    std::regex_constants::icase
);
std::cmatch match;
std::regex_search("ID: 33000020000101000X", match, pattern);
cout << match.format("$3/$4/$5") << endl;
/* Output:
2000/01/01
*/

std::regex pattern(
    "(\\d{6})[ \r\t\n\v\f]?((\\d{4})(\\d{2})(\\d{2}))[ \r\t\n\v\f]?(\\d{3}[0-9X])", 
    std::regex_constants::icase
);
cout << std::regex_replace("ID: 33000020000101000X", pattern, "$3/$4/$5") << endl;
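/* Output:
ID: 2000/01/01
(regex_replace copies the text outside the match unchanged;
pass std::regex_constants::format_no_copy to drop it)
*/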