Java: Character.isValidCodePoint, toCodePoint, codePointAt, codePointBefore

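I. Core concepts: Unicode and UTF-16

Code point: the unique number assigned to a Unicode character, in the range U+0000 to U+10FFFF.
UTF-16 encoding:
- BMP characters (U+0000 to U+FFFF): represented by 1 char (16 bits).
- Supplementary characters (U+10000 to U+10FFFF): represented by a surrogate pair, that is, 2 chars (a high surrogate followed by a low surrogate).

💡 One char does not necessarily correspond to one character! A supplementary character needs two chars.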
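To make the surrogate-pair rule concrete, the following minimal sketch splits a supplementary code point into its two UTF-16 chars by hand and compares the result with `Character.toChars`. The class name `SurrogateMath` and the method `encode` are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

final class SurrogateMath {
    /** Splits a supplementary code point into its high and low surrogates by hand. */
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00
        // The JDK encoder should agree with the hand-written arithmetic.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600)));
    }
}
```

In real code `Character.toChars` is the call to use; the manual version is only there to show where the two chars come from.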

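II. Method details

1. `Character.isValidCodePoint(int codePoint)`

Definition

public static boolean isValidCodePoint(int codePoint)

Purpose

Determines whether the given int is a valid Unicode code point.

Valid range

0x0000 ≤ codePoint ≤ 0x10FFFF
The surrogate range 0xD800 to 0xDFFF is not excluded: surrogates are valid code points, even though they are reserved for UTF-16 and are not characters on their own. To reject them as well, add a separate range check against Character.MIN_SURROGATE and Character.MAX_SURROGATE.

Example

System.out.println(Character.isValidCodePoint(0x41));        // true  ('A')
System.out.println(Character.isValidCodePoint(0x1F600));     // true  (😀, Grinning Face)
System.out.println(Character.isValidCodePoint(0x110000));    // false (out of range)
System.out.println(Character.isValidCodePoint(0xD800));      // true  (a surrogate is still a valid code point)

Use cases

Validating a code point supplied by the user
Running a sanity check before generating a Unicode character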
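As an illustration of the first use case, here is a small sketch that parses user input such as "U+1F600" and accepts it only if it is an in-range code point that is not a surrogate. The class `CodePointInput`, the method `parse`, and the accepted input format are assumptions made for this example.

```java
import java.util.OptionalInt;

final class CodePointInput {
    /** Parses text such as "U+1F600" or "1f600"; empty if it is not a usable code point. */
    static OptionalInt parse(String text) {
        String hex = text.trim();
        if (hex.regionMatches(true, 0, "U+", 0, 2)) {
            hex = hex.substring(2); // drop the optional "U+" prefix
        }
        int cp;
        try {
            cp = Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
        boolean inRange = Character.isValidCodePoint(cp);
        // isValidCodePoint accepts surrogates, so they are rejected separately here.
        boolean surrogate = cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
        return (inRange && !surrogate) ? OptionalInt.of(cp) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("U+1F600"));  // OptionalInt[128512]
        System.out.println(parse("D800"));     // OptionalInt.empty
        System.out.println(parse("110000"));   // OptionalInt.empty
    }
}
```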

2. `Character.toCodePoint(char high, char low)`

Definition

public static int toCodePoint(char high, char low)

Purpose

Converts a surrogate pair into the corresponding Unicode code point (an int).

Preconditions

high must be a high surrogate: 0xD800 ≤ high ≤ 0xDBFF
low must be a low surrogate: 0xDC00 ≤ low ≤ 0xDFFF
The method does not validate its arguments; checking them is the caller's responsibility.

Example

// Build the surrogate pair for 😀 by hand
char high = '\uD83D'; // high surrogate
char low  = '\uDE00'; // low surrogate

int codePoint = Character.toCodePoint(high, low);
System.out.println("Code Point: " + Integer.toHexString(codePoint)); // Code Point: 1f600
System.out.println("Code Point: U+" + Integer.toHexString(codePoint).toUpperCase()); // Code Point: U+1F600

Tip

// Safe conversion (validate first)
if (Character.isHighSurrogate(high) && Character.isLowSurrogate(low)) {
    int cp = Character.toCodePoint(high, low);
}
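The arithmetic behind the conversion can be sketched as follows: strip the surrogate base values, join the two 10-bit halves, and add 0x10000. The class `ToCodePointByHand` and the method `combine` are hypothetical; the sketch assumes its arguments already form a valid pair.

```java
final class ToCodePointByHand {
    /** Recombines a valid surrogate pair using the UTF-16 decoding arithmetic. */
    static int combine(char high, char low) {
        int top10    = high - 0xD800;  // upper 10 bits of the offset
        int bottom10 = low - 0xDC00;   // lower 10 bits of the offset
        return (top10 << 10) + bottom10 + 0x10000;
    }

    public static void main(String[] args) {
        char high = (char) 0xD83D;
        char low  = (char) 0xDE00;
        System.out.println(Integer.toHexString(combine(high, low)));               // 1f600
        System.out.println(combine(high, low) == Character.toCodePoint(high, low)); // true
    }
}
```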

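3. `Character.codePointAt(CharSequence seq, int index)`

Definition

public static int codePointAt(CharSequence seq, int index)

Purpose

Reads one complete Unicode code point from a CharSequence (such as a String), starting at the char position index.

Behavior

If seq.charAt(index) is an ordinary BMP char, its code point is returned.
If it is a high surrogate and the next char is a low surrogate, the two are combined and the supplementary code point is returned.
Otherwise (a low surrogate, or a high surrogate with no low surrogate after it), the surrogate's own value is returned unchanged, which usually means the index landed in the middle of a pair or the text is malformed.

Example

String text = "A😀B"; // A + 😀 + B
// char indices: A = 0, 😀 = 1 and 2, B = 3

System.out.println(Character.codePointAt(text, 0)); // 65     ('A')
System.out.println(Character.codePointAt(text, 1)); // 128512 (😀, U+1F600)
System.out.println(Character.codePointAt(text, 3)); // 66     ('B')

⚠️ Note: 😀 occupies two char positions (indices 1 and 2), so B is at index 3.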
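The surrogate edge cases are easy to miss, so here is a short sketch that classifies what `codePointAt` returns at a given index. The class `CodePointAtEdgeCases`, the method `kindAt`, and its three labels are hypothetical names chosen for this example; the strings use Unicode escapes so the file encoding does not matter.

```java
final class CodePointAtEdgeCases {
    /** Classifies what Character.codePointAt finds at the given char index. */
    static String kindAt(CharSequence seq, int index) {
        int cp = Character.codePointAt(seq, index);
        if (Character.isSupplementaryCodePoint(cp)) {
            return "SUPPLEMENTARY";       // a well-formed surrogate pair started here
        }
        if (Character.isSurrogate((char) cp)) {
            return "UNPAIRED_SURROGATE";  // the raw surrogate value came back
        }
        return "BMP";
    }

    public static void main(String[] args) {
        String text = "A\uD83D\uDE00B";             // A, U+1F600, B
        System.out.println(kindAt(text, 0));         // BMP
        System.out.println(kindAt(text, 1));         // SUPPLEMENTARY
        System.out.println(kindAt(text, 2));         // UNPAIRED_SURROGATE (index is mid-pair)
        System.out.println(kindAt("\uD83Dx", 0));    // UNPAIRED_SURROGATE (no low surrogate follows)
        System.out.println(kindAt("x\uD83D", 1));    // UNPAIRED_SURROGATE (high surrogate at the end)
    }
}
```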

Tip: iterating over every code point in a string

String str = "Hello 😀!";
for (int i = 0; i < str.length(); ) {
    int cp = Character.codePointAt(str, i);
    System.out.println("Code Point: U+" + Integer.toHexString(cp).toUpperCase());
    i += Character.charCount(cp); // advance by 1 or 2 chars
}

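4. `Character.codePointBefore(CharSequence seq, int index)`

Definition

public static int codePointBefore(CharSequence seq, int index)

Purpose

Reads the complete code point that ends immediately before index, that is, the one whose last char is at index - 1. If that char is a low surrogate and the char at index - 2 is a high surrogate, the supplementary code point is returned; otherwise the char at index - 1 is returned as is.

Preconditions

index must satisfy 1 ≤ index ≤ seq.length()
Intended for traversing a string backwards.

Example

String text = "A😀B";

System.out.println(Character.codePointBefore(text, 1)); // 65 ('A')
System.out.println(Character.codePointBefore(text, 3)); // 128512 (😀)
System.out.println(Character.codePointBefore(text, 4)); // 66 ('B')

Tip: reverse traversal

String str = "😀World";
for (int i = str.length(); i > 0; ) {
    int cp = Character.codePointBefore(str, i);
    System.out.println("Code Point: U+" + Integer.toHexString(cp).toUpperCase());
    i -= Character.charCount(cp); // step back by 1 or 2 chars
}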
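One place the backward walk is useful is reversing text without splitting surrogate pairs. The sketch below is one possible way to do that; `ReverseByCodePoint` and `reverse` are hypothetical names for this example.

```java
final class ReverseByCodePoint {
    /** Reverses a sequence code point by code point, keeping surrogate pairs intact. */
    static String reverse(CharSequence seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = seq.length(); i > 0; ) {
            int cp = Character.codePointBefore(seq, i);
            out.appendCodePoint(cp);        // writes 1 or 2 chars as needed
            i -= Character.charCount(cp);   // step back over what was just read
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String reversed = reverse("A\uD83D\uDE00B");                // A, U+1F600, B
        System.out.println(reversed.equals("B\uD83D\uDE00A"));      // true
    }
}
```

For plain reversal, `StringBuilder.reverse()` already treats surrogate pairs as single units; the explicit loop is mainly of interest when each code point needs extra processing along the way.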

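III. Common mistakes

❌ 1. Ignoring surrogate pairs and miscounting characters

String emoji = "😀";
System.out.println(emoji.length());     // 2 (number of chars)
System.out.println(emoji.codePointCount(0, emoji.length())); // 1 (number of code points)

❌ 2. Iterating without skipping over surrogate pairs

// Wrong: splits a surrogate pair into two separate items
for (int i = 0; i < str.length(); i++) {
    char c = str.charAt(i);
    // ❌ If c is a high surrogate, the low surrogate that follows is handled as a separate "character"
}

// Right: use codePointAt + charCount
for (int i = 0; i < str.length(); ) {
    int cp = Character.codePointAt(str, i);
    i += Character.charCount(cp); // advances past both halves of a pair
}

❌ 3. Passing an invalid surrogate pair

// Converting without validating first
int cp = Character.toCodePoint('A', 'B'); // no exception is thrown; the result is a meaningless value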
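A guarded wrapper avoids this mistake. The following is a minimal sketch, assuming an empty `OptionalInt` is an acceptable way to report bad input; `SafeSurrogates` and `toCodePointChecked` are hypothetical names.

```java
import java.util.OptionalInt;

final class SafeSurrogates {
    /** Returns the combined code point, or empty if the two chars are not a surrogate pair. */
    static OptionalInt toCodePointChecked(char high, char low) {
        return Character.isSurrogatePair(high, low)
                ? OptionalInt.of(Character.toCodePoint(high, low))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(toCodePointChecked('A', 'B').isPresent());                  // false
        System.out.println(toCodePointChecked((char) 0xD83D, (char) 0xDE00).isPresent()); // true
        // Without the guard, toCodePoint silently returns a value outside the valid range.
        System.out.println(Character.isValidCodePoint(Character.toCodePoint('A', 'B'))); // false
    }
}
```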

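IV. Notes

Item	Description
🧩 Surrogate pair handling	These methods exist to support supplementary characters
🔤 CharSequence support	`codePointAt`/`codePointBefore` accept `String`, `StringBuilder`, `CharBuffer`, and other `CharSequence` implementations
⚠️ Index bounds	`index` must be in range (0 to length - 1 for `codePointAt`, 1 to length for `codePointBefore`); otherwise `IndexOutOfBoundsException` is thrown
✅ Null handling	A `null` `CharSequence` throws `NullPointerException`
📏 Performance	Each call does a constant amount of work; the methods are small enough that the JIT can usually inline them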
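The notes on CharSequence support and index bounds can be checked with a short sketch like the one below. `CharSequenceInputs`, `first`, and `outOfRange` are hypothetical helper names introduced for this example.

```java
import java.nio.CharBuffer;

final class CharSequenceInputs {
    /** Reads the first code point of any CharSequence implementation. */
    static int first(CharSequence seq) {
        return Character.codePointAt(seq, 0);
    }

    /** Returns true if codePointAt rejects the index with an IndexOutOfBoundsException. */
    static boolean outOfRange(CharSequence seq, int index) {
        try {
            Character.codePointAt(seq, index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00!"; // U+1F600 followed by '!'
        System.out.println(first(s));                        // 128512
        System.out.println(first(new StringBuilder(s)));     // 128512
        System.out.println(first(CharBuffer.wrap(s)));       // 128512
        System.out.println(outOfRange(s, s.length()));       // true
        try {
            Character.codePointAt((CharSequence) null, 0);   // cast picks the CharSequence overload
        } catch (NullPointerException e) {
            System.out.println("null input rejected");
        }
    }
}
```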

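V. Best practices

✅ 1. Count code points rather than chars

int codePointCount = str.codePointCount(0, str.length());

✅ 2. Iterate safely over every character (code point)

// requires: import java.util.function.IntConsumer;
public static void forEachCodePoint(String str, IntConsumer consumer) {
    for (int i = 0; i < str.length(); ) {
        int cp = Character.codePointAt(str, i);
        consumer.accept(cp);
        i += Character.charCount(cp); // advance by 1 or 2 chars
    }
}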
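As an alternative to the hand-written loop, Java 8 and later also provide `CharSequence.codePoints()`, which returns the same sequence of code points as an `IntStream`. The sketch below is one way to use it; `CodePointStreamDemo` and `labels` are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CodePointStreamDemo {
    /** Formats every code point of the string as a U+XXXX label. */
    static List<String> labels(String str) {
        return str.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(labels("A\uD83D\uDE00")); // [U+0041, U+1F600]
    }
}
```

The explicit loop is still the better fit when the char index of each code point is needed, since the stream does not expose it.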

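✅ 3. Validate code points

if (Character.isValidCodePoint(cp)) {
    // In range; note that surrogates (0xD800 to 0xDFFF) also pass this check
}

✅ 4. Build a supplementary character

int cp = 0x1F600;
if (Character.isValidCodePoint(cp)) {
    String emoji = new String(Character.toChars(cp));
}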
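When several code points need to be assembled into one string, `StringBuilder.appendCodePoint` avoids the intermediate char arrays. The following is a sketch under that assumption; `BuildFromCodePoints` and `of` are hypothetical names.

```java
final class BuildFromCodePoints {
    /** Builds a string from code points, rejecting any value outside the Unicode range. */
    static String of(int... codePoints) {
        StringBuilder sb = new StringBuilder(codePoints.length);
        for (int cp : codePoints) {
            if (!Character.isValidCodePoint(cp)) {
                throw new IllegalArgumentException("invalid code point: " + cp);
            }
            sb.appendCodePoint(cp); // appends 1 char for BMP, 2 for supplementary
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = of(0x48, 0x69, 0x1F600);                     // 'H', 'i', U+1F600
        System.out.println(s.length());                         // 4
        System.out.println(s.codePointCount(0, s.length()));    // 3
    }
}
```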

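VI. Performance

✅ codePointAt and the related methods are small and cheap; the JIT can usually inline them.
✅ When text may contain supplementary characters, iterate by code point instead of treating each charAt(i) result as a whole character. charAt itself is not slow; the concern is correctness.
⚠️ Use codePointCount() rather than length() when you need the number of code points. Keep in mind that codePointCount() scans the string (O(n)) while length() is O(1), so avoid calling it repeatedly inside a loop.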

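VII. Summary

Method	Purpose	Key point
`isValidCodePoint(int)`	Checks whether a code point is valid	Range `0x0` to `0x10FFFF`; surrogates `0xD800` to `0xDFFF` are included
`toCodePoint(high, low)`	Surrogate pair → code point	Caller must ensure `high` and `low` are valid surrogates
`codePointAt(seq, index)`	Reads the code point starting at an index	Core of forward traversal
`codePointBefore(seq, index)`	Reads the code point ending before an index	Core of backward traversal

✅ In one sentence:
Character.isValidCodePoint, toCodePoint, codePointAt, and codePointBefore are core tools for handling Unicode supplementary characters (such as emoji) in Java. They address the fact that a single char cannot represent every Unicode character, and let you correctly iterate over, count, and convert strings that contain emoji or rare scripts. When processing internationalized text, prefer these methods over treating charAt() results as whole characters.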