Matching Chinese Characters and Chinese Punctuation in Strings with Regular Expressions

悠悠楠杉

2025-06-04

1. Prepare the environment

2. Write the regular expression

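To match Chinese characters and Chinese punctuation, we can use the following regular expressions:
- Chinese characters: [\u4e00-\u9fa5]+ (the commonly used range of CJK Unified Ideographs in Unicode; rare characters in the extension blocks fall outside it)
- Chinese punctuation: [\u3000-\u303F\uFF00-\uFFEF]+ (the CJK Symbols and Punctuation block plus the Halfwidth and Fullwidth Forms block; this covers common marks such as 。 、 ， ！ ？ but also matches fullwidth letters and digits)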
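To see what each class actually matches, the two patterns can be tried on a short string. This is a minimal sketch: the names `HANZI` and `CJK_PUNCT` and the sample strings are made up for illustration and are not part of the original article.

```python
import re

# Commonly used CJK Unified Ideographs (the range quoted above).
HANZI = re.compile(r'[\u4e00-\u9fa5]+')
# CJK Symbols and Punctuation plus Halfwidth and Fullwidth Forms.
CJK_PUNCT = re.compile(r'[\u3000-\u303F\uFF00-\uFFEF]+')

# Hypothetical sample mixing ASCII, Chinese characters, and fullwidth punctuation.
sample = "Hello，世界！Python 3 很好用。"

hanzi_runs = HANZI.findall(sample)
punct_runs = CJK_PUNCT.findall(sample)
print(hanzi_runs)   # ['世界', '很好用']
print(punct_runs)   # ['，', '！', '。']

# Fullwidth letters and digits also fall inside \uFF00-\uFFEF.
fullwidth_runs = CJK_PUNCT.findall("ＡＢＣ１２３")
print(fullwidth_runs)   # ['ＡＢＣ１２３']

# Curly quotes (U+201C, U+201D) sit in General Punctuation, so they are missed.
quote_runs = CJK_PUNCT.findall("“你好”")
print(quote_runs)   # []
```

If curly quotes, the long dash, or the ellipsis matter for your text, those code points would have to be added to the character class explicitly.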

3. Write the Python script

```python
import re


def extract_chinese_and_punctuation(text):
    # Match Chinese characters and Chinese punctuation.
    # findall returns a list of contiguous runs, not single characters.
    chinese_with_punctuation = re.findall(
        r'[\u4e00-\u9fa5\u3000-\u303F\uFF00-\uFFEF]+', text)
    return chinese_with_punctuation


def generate_markdown(text):
    # Extract the Chinese content and Chinese punctuation.
    chinese_with_punctuation = extract_chinese_and_punctuation(text)
    # Assume the first 5 runs form the title, the next 5 the keywords,
    # and the next 10 the description (adjust as needed).
    title = "".join(chinese_with_punctuation[:5])
    keywords = chinese_with_punctuation[5:10]
    description = "".join(chinese_with_punctuation[10:20])
    # The remainder is the body, truncated to 1000 characters.
    content = "\n".join(chinese_with_punctuation[20:])[:1000]
    # Build the Markdown text.
    markdown = f"""# Title: {title}
Keywords: {', '.join(keywords)}
Description: {description}
Body: {content}"""
    return markdown


# Read and process a text file (example.txt is assumed here).
with open("example.txt", "r", encoding="utf-8") as file:
    text = file.read()
markdown_output = generate_markdown(text)
print(markdown_output)
```
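Note: the script above simply splits the Chinese content into a title, keywords, a description, and a body, and truncates the body to 1000 characters. In practice you will probably need to adjust how these parts are divided and how long each may be. The example also assumes that the input is already a single complete Chinese paragraph or article. To extract information from more complex structures, such as multiple paragraphs or multiple documents, you will need additional logic to identify and split that content.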
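For input with several paragraphs, one possible approach is to split on blank lines first and run the pattern on each paragraph separately. The sketch below assumes paragraphs are separated by blank lines; the helper `extract_by_paragraph` and the sample text are hypothetical and not part of the original script.

```python
import re

# Same character class as the script above, compiled once for reuse.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5\u3000-\u303F\uFF00-\uFFEF]+')


def extract_by_paragraph(text):
    """Return one list of Chinese runs per paragraph that contains any."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r'\n\s*\n', text)
    results = []
    for paragraph in paragraphs:
        runs = CHINESE_RUN.findall(paragraph)
        if runs:  # skip paragraphs with no Chinese content
            results.append(runs)
    return results


sample = "第一段，abc 内容。\n\nEnglish only\n\n第二段！"
print(extract_by_paragraph(sample))
# [['第一段，', '内容。'], ['第二段！']]
```

Keeping the runs grouped by paragraph lets later steps pick a title or description from a specific paragraph instead of from fixed positions in one flat list.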

Copyright: 至尊技术网

Article link: https://www.zzwws.cn/archives/28672/ (please credit the source and include this link when reposting)

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
