In Python, there are several ways to split a paragraph into sentences while keeping the ending punctuation. This post walks through a few common approaches; the examples were collected from online sources and modified.

1 Regex approach

For pure Chinese text, regular expressions are usually enough. Two regex-based splitting examples follow.

1.1 Basic split

Regular expressions are the most common tool for sentence splitting. For example:

```python
import re

def split_paragraph_to_sentences(paragraph):
    """Split a paragraph into sentences, keeping the ending punctuation
    (supports Chinese and English)."""
    # Split right after a sentence terminator (。！？；.!?;), swallowing any
    # whitespace that follows it; trailing close quotes could be added here too
    pattern = r'(?<=[。！？；.!?;])\s*'
    sentences = re.split(pattern, paragraph.strip())
    # Drop empty strings
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

# Example
paragraph = "这是一个测试段落。这是第二句话！这是第三句话？让我们继续。结尾标点.最后一句。"
sentences = split_paragraph_to_sentences(paragraph)
for i, sentence in enumerate(sentences, 1):
    print(f"句子{i}: {sentence}")
```

Output:

```
句子1: 这是一个测试段落。
句子2: 这是第二句话！
句子3: 这是第三句话？
句子4: 让我们继续。
句子5: 结尾标点.
句子6: 最后一句。
```

1.2 More precise regex split

A more careful pattern can skip special cases such as abbreviations and decimal points:

```python
import re

def split_sentences_advanced(text):
    """More precise sentence splitting that handles special cases."""
    # Don't split inside decimals ("$12.50") or after capitalized
    # abbreviations ("Dr."); otherwise split on terminator + whitespace
    pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|。|！|？|;|；)\s'
    sentences = re.split(pattern, text)
    # Trim whitespace left over after splitting
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

# Example
paragraph = ("Dr. Smith went to the store. He bought apples, oranges, etc. "
             "The total was $12.50. Was that expensive?")
sentences = split_sentences_advanced(paragraph)
for i, sentence in enumerate(sentences, 1):
    print(f"句子{i}: {sentence}")
```

Output:

```
句子1: Dr. Smith went to the store.
句子2: He bought apples, oranges, etc.
句子3: The total was $12.50.
句子4: Was that expensive?
```

2 NLTK approach

The NLTK library works well for splitting English documents; the punkt resource must be downloaded first:

```python
import nltk
# Download the punkt resource the first time you use it
# nltk.download('punkt')

def split_sentences_nltk(text):
    """Sentence splitting with NLTK (mainly for English)."""
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)

# Example
english_paragraph = "Hello world! This is a test. How are you? I'm fine, thank you."
sentences = split_sentences_nltk(english_paragraph)
for i, sentence in enumerate(sentences, 1):
    print(f"句子{i}: {sentence}")
```

Output:

```
句子1: Hello world!
句子2: This is a test.
句子3: How are you?
句子4: I'm fine, thank you.
```

3 Combined approaches

The following variants handle mixed Chinese/English text and other special cases.

3.1 Multi-level splitting

If a paragraph mixes Chinese and English, a two-pass split can be used:

```python
import re

def split_mixed_language_paragraph(paragraph):
    """Split a paragraph that mixes Chinese and English."""
    # First pass: split on Chinese and English terminators, but not when
    # the terminator is immediately followed by an ASCII letter or digit
    pattern = r'(?<=[。！？；.!?;])\s*(?![a-zA-Z0-9])'
    sentences = re.split(pattern, paragraph)
    # Second pass: long fragments that still contain English terminators
    # get a more precise English-style split
    refined_sentences = []
    for sentence in sentences:
        if sentence.strip():
            if re.search(r'[.!?]', sentence) and len(sentence) > 50:
                sub_sentences = re.split(r'(?<=[.!?])\s(?=[A-Z])', sentence)
                refined_sentences.extend([s.strip() for s in sub_sentences if s.strip()])
            else:
                refined_sentences.append(sentence.strip())
    return refined_sentences

# Example
mixed_paragraph = "这是一个测试。Hello world! 这是中文句子。How are you? 我很好！"
sentences = split_mixed_language_paragraph(mixed_paragraph)
for i, sentence in enumerate(sentences, 1):
    print(f"句子{i}: {sentence}")
```

Output:

```
句子1: 这是一个测试。Hello world!
句子2: 这是中文句子。How are you?
句子3: 我很好！
```
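Since the whole point of these splitters is to keep the ending punctuation, a quick round-trip check helps catch a pattern that silently drops characters. The sketch below is an addition of mine rather than one of the collected examples; it assumes `split_mixed_language_paragraph` from section 3.1 is in scope, and compares the rejoined sentences with the source text after removing whitespace.

```python
import re

def splits_losslessly(paragraph, sentences):
    """Return True if the sentences, rejoined, contain every non-whitespace
    character of the paragraph in order (i.e. no punctuation was lost)."""
    squash = lambda s: re.sub(r'\s+', '', s)
    return squash(''.join(sentences)) == squash(paragraph)

# Check against the section 3.1 splitter
paragraph = "这是一个测试。Hello world! 这是中文句子。How are you? 我很好！"
print(splits_losslessly(paragraph, split_mixed_language_paragraph(paragraph)))
# Expected: True, since re.split only consumes the whitespace separator
```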
3.2 Protecting abbreviations with a marker

Going a step further, known abbreviations can be masked with a temporary marker so they never trigger a split:

```python
import re

class SentenceSplitter:
    def __init__(self):
        # Common abbreviations, to avoid splitting after their dots
        self.abbreviations = {
            'mr.', 'mrs.', 'ms.', 'dr.', 'prof.', 'rev.', 'hon.',
            'st.', 'ave.', 'blvd.', 'rd.', 'ln.',
            'etc.', 'e.g.', 'i.e.', 'vs.',
            'jan.', 'feb.', 'mar.', 'apr.', 'jun.', 'jul.',
            'aug.', 'sep.', 'oct.', 'nov.', 'dec.'
        }

    def split(self, text):
        """Main entry point."""
        if not text.strip():
            return []
        # Pre-process: mask the dots of abbreviations that could be
        # mistaken for sentence ends
        text = self._protect_abbreviations(text)
        # Split on terminator + whitespace
        pattern = r'(?<=[。！？.!?])\s'
        sentences = re.split(pattern, text)
        # Restore the protected abbreviations
        sentences = [self._restore_abbreviations(s.strip()) for s in sentences if s.strip()]
        return sentences

    def _protect_abbreviations(self, text):
        """Replace '.' with a [DOT] placeholder inside known abbreviations."""
        def replace_abbr(match):
            abbr = match.group(0).lower()
            if abbr in self.abbreviations:
                return match.group(0).replace('.', '[DOT]')
            return match.group(0)
        # Candidate abbreviations: a run of letters ending with a dot
        pattern = r'\b[a-z]+\.'
        return re.sub(pattern, replace_abbr, text, flags=re.IGNORECASE)

    def _restore_abbreviations(self, text):
        """Undo the [DOT] placeholder."""
        return text.replace('[DOT]', '.')

# Usage example
splitter = SentenceSplitter()
paragraph = "Dr. Smith met Mr. Jones at 5 p.m. They discussed the project. It was great!"
sentences = splitter.split(paragraph)
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")
```

Output:

```
1. Dr. Smith met Mr. Jones at 5 p.m.
2. They discussed the project.
3. It was great!
```

3.3 spaCy example

spaCy can also do sentence splitting and works best on pure English text:

```python
# Install first: pip install spacy
# Download a model: python -m spacy download en_core_web_sm
import spacy

def split_sentences_spacy(text, language='en'):
    """Sentence splitting with spaCy."""
    if language == 'en':
        nlp = spacy.load('en_core_web_sm')
    else:
        # Chinese needs the Chinese model:
        # python -m spacy download zh_core_web_sm
        nlp = spacy.load('zh_core_web_sm')
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

# Example
text = "This is the first sentence. This is the second one! And here's the third?"
sentences = split_sentences_spacy(text)
for i, sent in enumerate(sentences, 1):
    print(f"句子{i}: {sent}")
```

3.4 Combined example

For mixed Chinese/English input, the methods above can be wrapped in a single function that lets callers choose the splitting method and language:

```python
import re

def split_paragraph(paragraph, method='auto', language='mixed'):
    """Combined sentence-splitting function.

    Args:
        paragraph: input paragraph text
        method: one of 'auto', 'regex', 'nltk', 'spacy'
        language: one of 'zh', 'en', 'mixed'

    Returns:
        A list of sentences.
    """
    if not paragraph or not paragraph.strip():
        return []

    if method == 'auto':
        # Pick a method automatically based on the language
        if language == 'en':
            try:
                from nltk.tokenize import sent_tokenize
                return sent_tokenize(paragraph)
            except ImportError:
                method = 'regex'
        else:
            method = 'regex'

    if method == 'regex':
        if language == 'zh':
            pattern = r'(?<=[。！？])\s*'
        elif language == 'en':
            pattern = r'(?<=[.!?])\s(?=[A-Z])'
        else:  # mixed
            pattern = r'(?<=[。！？；.!?;])\s*'
        sentences = re.split(pattern, paragraph.strip())
        return [s.strip() for s in sentences if s.strip()]
    elif method == 'nltk':
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(paragraph)
    elif method == 'spacy':
        import spacy
        nlp = spacy.load('en_core_web_sm' if language == 'en' else 'zh_core_web_sm')
        doc = nlp(paragraph)
        return [sent.text.strip() for sent in doc.sents]
    return []

# Usage example
paragraph = "这是一个测试。Hello world! 第二句话结束。"
sentences = split_paragraph(paragraph, method='regex', language='mixed')
print("分割结果:", sentences)
```
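As a closing sketch of mine (not part of the collected examples), the different back ends can be compared on one paragraph. It assumes `split_paragraph` from section 3.4 is defined in the same module; the `nltk` and `spacy` branches raise if the corresponding packages or models are missing, so the loop guards each call.

```python
# Compare the back ends on the same mixed paragraph (illustrative only;
# assumes split_paragraph from section 3.4 is already defined)
sample = "Dr. Smith arrived. 这是中文句子。It was great!"
for method in ('regex', 'nltk', 'spacy'):
    try:
        print(method, '->', split_paragraph(sample, method=method, language='mixed'))
    except Exception as exc:  # missing package or model
        print(method, 'skipped:', exc)
```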