A Guide to Tokenization and Part-of-Speech Tagging with NLTK

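Natural language processing (NLP) is a field of computer science concerned with understanding and generating human language. Tokenization and part-of-speech (POS) tagging are two of its foundational tasks: they let a program identify the words in a text and determine the grammatical category of each one.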

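NLTK (Natural Language Toolkit) is a popular Python library with extensive support for tokenization and POS tagging. This guide shows how to use NLTK for both tasks and provides code examples.

## Tokenization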

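Tokenization is the process of breaking text into words or other tokens. NLTK provides the `word_tokenize()` function, which first splits the text into sentences with the Punkt tokenizer and then splits each sentence into words with a Treebank-style tokenizer built on regular expressions. The following example shows how to use it:

```python
import nltk

# One-time setup: word_tokenize() needs the Punkt models, e.g. nltk.download('punkt').
# The resource name varies by NLTK version (newer releases use 'punkt_tab').
text = "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence."
tokens = nltk.word_tokenize(text)
print(tokens)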
```
Output:
```
['Natural', 'language', 'processing', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.']
```
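When the pretrained models are not available, or when token boundaries need to be controlled directly, a regular-expression tokenizer is one option. The sketch below uses NLTK's `RegexpTokenizer`. The pattern and the sample sentence are illustrative assumptions and are not what `word_tokenize()` uses internally.

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern (an assumption, not NLTK's default): decimal numbers
# first, then runs of word characters, then any single punctuation mark.
tokenizer = RegexpTokenizer(r"\d+\.\d+|\w+|[^\w\s]")

sample = "Tokenization splits text into words, numbers like 3.14, and punctuation!"
regex_tokens = tokenizer.tokenize(sample)
print(regex_tokens)
```

Because the alternatives are tried from left to right, `3.14` is kept as one token instead of being split at the decimal point.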
## Part-of-Speech Tagging

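POS tagging is the process of identifying the part of speech of each word, such as noun, verb, or adjective. NLTK provides the `pos_tag()` function, which uses a pretrained model to assign a tag to every token. For English, the tags follow the Penn Treebank tagset. The following example shows how to use it:

```python
# One-time setup: pos_tag() needs a pretrained tagger model, e.g.
# nltk.download('averaged_perceptron_tagger'); the name varies by NLTK version.
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)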
```
Output (the exact tags can differ between NLTK versions and tagger models):
```
[('Natural', 'NN'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('linguistics', 'NN'), (',', ','), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('and', 'CC'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.')]
```
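In this example, every token is paired with its POS tag. For instance, "Natural" and "processing" are both tagged as nouns (NN), while "is" is tagged as a third-person singular present-tense verb (VBZ).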
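Tagged output like this is an ordinary list of `(word, tag)` tuples, so it can be summarized with plain Python. The following sketch copies the list from the output shown above and counts the tags, then collects the words tagged as nouns. The variable names `tag_counts` and `nouns` are chosen here for illustration only.

```python
from collections import Counter

# Tagged tokens copied from the example output shown above.
tagged_tokens = [
    ('Natural', 'NN'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'),
    ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('linguistics', 'NN'),
    (',', ','), ('computer', 'NN'), ('science', 'NN'), (',', ','),
    ('and', 'CC'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.'),
]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_tokens)

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged_tokens if tag.startswith('NN')]

print(tag_counts)
print(nouns)
```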
## Custom POS Tagger

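NLTK lets you build your own POS tagger to fit a particular domain or language. The following example sketches a simple classifier-based tagger. Its features are the word itself and whether the word starts with an uppercase letter, which lets it learn to tag capitalized words as proper nouns (NNP):

```python
import nltk

class CustomTagger:
    def __init__(self, train_sents):
        # Build (featureset, tag) training pairs from the tagged sentences.
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = self.features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # Assumption: a maximum-entropy classifier, which accepts trace=0 to
        # silence training output. Any NLTK classifier with train() and
        # classify() could be used here instead.
        self.classifier = nltk.MaxentClassifier.train(train_set, trace=0)

    def features(self, sentence, i, history):
        # history (the tags assigned so far) is available for richer features.
        word = sentence[i]
        return {'word': word, 'isupper': word[0].isupper()}

    def tag(self, sentence):
        # Classify each word in order and remember the tags assigned so far.
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))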
```
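For the narrower goal of tagging every capitalized word as a proper noun (NNP), a rule-based tagger can be enough and needs no training data. The sketch below uses NLTK's `RegexpTagger`. The two patterns and the sample sentence are illustrative assumptions, and the fallback tag NN is an arbitrary default.

```python
from nltk.tag import RegexpTagger

# Patterns are tried in order; the first one that matches a token wins.
rule_tagger = RegexpTagger([
    (r'^[A-Z]', 'NNP'),  # token starts with an uppercase letter
    (r'.*', 'NN'),       # fallback for everything else (arbitrary default)
])

rule_tagged = rule_tagger.tag(['Alice', 'visited', 'Paris', 'in', 'spring', '.'])
print(rule_tagged)
```

A rule like this also labels sentence-initial words such as "The" as NNP, which is one reason a trained tagger is usually preferable.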
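To use the `CustomTagger` class, follow these steps:
1. Train the tagger by passing a set of tagged sentences to its constructor.
2. Call the `tag()` method to assign POS tags to a new, already tokenized sentence.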
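The two steps above can be sketched end to end as follows. `SketchTagger` is a hypothetical, compact stand-in for the custom tagger so that the example runs on its own. The training sentences are invented, and Naive Bayes is assumed only because it trains quickly on tiny data.

```python
import nltk

class SketchTagger:
    """Hypothetical compact tagger: classifies each word from two features."""

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            words = nltk.tag.untag(tagged_sent)
            for i, (_, tag) in enumerate(tagged_sent):
                train_set.append((self.features(words, i), tag))
        # Assumption: Naive Bayes, picked for fast training on a toy corpus.
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def features(self, sentence, i):
        word = sentence[i]
        return {'word': word, 'isupper': word[0].isupper()}

    def tag(self, sentence):
        tags = [self.classifier.classify(self.features(sentence, i))
                for i in range(len(sentence))]
        return list(zip(sentence, tags))

# Step 1: train the tagger on tagged sentences (invented toy data).
train_sents = [
    [('Alice', 'NNP'), ('likes', 'VBZ'), ('tea', 'NN'), ('.', '.')],
    [('Bob', 'NNP'), ('drinks', 'VBZ'), ('coffee', 'NN'), ('.', '.')],
]
tagger = SketchTagger(train_sents)

# Step 2: tag a new sentence that has already been tokenized.
tagged = tagger.tag(['Carol', 'likes', 'coffee', '.'])
print(tagged)
```

With so little training data the predicted tags are not reliable. A realistic tagger would be trained on a tagged corpus and evaluated on held-out sentences.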
## Conclusion