NLP: Lemmatisation and Stemming with NLTK (pos_tag, word_tokenize)

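Lemmatisation reduces a word to its general (dictionary) form, which still carries the full meaning; the method is relatively complex. Stemming extracts the stem or root of a word, which is not necessarily a meaningful word on its own; the method is relatively simple.
Stemming:
Based on the rules of the language, for example the rules for forming English noun plurals. Because it is rule-based, words that fall outside the rules come out wrong.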

    # Porter Stemmer, based on the Porter stemming algorithm
    from nltk.stem.porter import PorterStemmer
    porter_stemmer = PorterStemmer()
    porter_stemmer.stem('leaves')

    # Output: 'leav'
    # The form we actually want is the noun 'leaf'
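To see why a rule-based approach can produce a non-word such as 'leav', here is a minimal sketch of a suffix stripper. The rule list and the name `toy_stem` are assumptions made up for illustration; this is not the Porter algorithm, which applies several ordered steps with additional conditions.

```python
# Toy suffix-stripping rules, checked in order; the first match wins.
# These rules are illustrative assumptions, NOT the Porter algorithm.
TOY_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # ponies  -> pony
    ("es", ""),      # boxes   -> box
    ("s", ""),       # cats    -> cat
]


def toy_stem(word):
    """Strip the first matching suffix; return the word unchanged otherwise."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ["cats", "boxes", "ponies", "leaves"]:
    print(w, "->", toy_stem(w))
```

The regular plurals come out fine, but 'leaves' becomes 'leav': the rules only know about suffixes, not about the irregular noun 'leaf'.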

NLTK mainly provides the following stemmers:

    # Based on the Porter stemming algorithm
    from nltk.stem.porter import PorterStemmer
    porter_stemmer = PorterStemmer()
    porter_stemmer.stem('maximum')

    # Based on the Lancaster stemming algorithm
    from nltk.stem.lancaster import LancasterStemmer
    lancaster_stemmer = LancasterStemmer()
    lancaster_stemmer.stem('maximum')

    # Based on the Snowball stemming algorithm
    from nltk.stem import SnowballStemmer
    snowball_stemmer = SnowballStemmer("english")
    snowball_stemmer.stem('maximum')

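Lemmatisation:
Based on a dictionary mapping. In NLTK you have to state the part of speech yourself (the lemmatiser assumes a noun by default), otherwise the result may be wrong. The usual order is therefore tokenisation, then POS tagging, then lemmatisation.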

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('leaves')

    # Output: 'leaf'
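The dependence on the part of speech is easier to see with a toy lookup table. The sketch below is not WordNet: the table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical names with a handful of hand-written entries, and the only thing borrowed from NLTK is the convention that the part of speech defaults to noun.

```python
# Toy lookup table keyed by (word, part of speech).
# The entries are hand-written for illustration; WordNet is far larger.
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("loves", "n"): "love",
    ("loves", "v"): "love",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("leaves"))           # read as a noun by default
print(toy_lemmatize("leaves", pos="v"))  # same spelling, read as a verb
print(toy_lemmatize("better"))           # no noun entry, returned unchanged
print(toy_lemmatize("better", pos="a"))  # found once the adjective tag is given
```

The same spelling maps to different lemmas depending on the tag, and a word looked up under the wrong tag silently comes back unchanged, which is why tagging normally comes before lemmatisation.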

The full process:

    # Requires NLTK data for the tokeniser, the POS tagger and WordNet (see nltk.download)
    from nltk import pos_tag, word_tokenize
    from nltk.stem import WordNetLemmatizer

    # Step 1: tokenise
    word_tokenize("apples % , I've loves green")

    # Step 2: tag each token with its part of speech
    pos_tag(word_tokenize("apples % , I've loves green"))

    # Step 3: lemmatise with an explicit part of speech
    wnl = WordNetLemmatizer()
    wnl.lemmatize('apples', pos='n')

    # All three steps together: yield one lemma per token
    def lemmatize_all(sentence):
        wnl = WordNetLemmatizer()
        for word, tag in pos_tag(word_tokenize(sentence)):
            if tag.startswith('NN'):
                yield wnl.lemmatize(word, pos='n')
            elif tag.startswith('VB'):
                yield wnl.lemmatize(word, pos='v')
            elif tag.startswith('JJ'):
                yield wnl.lemmatize(word, pos='a')
            elif tag.startswith('RB'):  # adverbs only; 'R' alone would also match RP (particle)
                yield wnl.lemmatize(word, pos='r')
            else:
                yield word

    # train_feature and test_feature are lists of raw sentences defined elsewhere
    train_f = [' '.join(lemmatize_all(s)) for s in train_feature]
    test_f = [' '.join(lemmatize_all(s)) for s in test_feature]
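The tag-to-POS branching above can also be written as a small lookup table. The following is a sketch under assumptions: `PENN_TO_WORDNET`, `penn_to_wordnet`, `lemmatize_tagged` and `fake_lemmatize` are hypothetical names, the tagged input is written by hand rather than produced by `pos_tag`, and a stand-in lemmatiser is used so the example runs without NLTK data. The letters 'n', 'v', 'a' and 'r' are the WordNet part-of-speech codes already used in the code above.

```python
# First two letters of a Penn Treebank tag -> WordNet POS letter.
PENN_TO_WORDNET = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}


def penn_to_wordnet(tag):
    """Return 'n', 'v', 'a' or 'r' for a Penn tag, or None if there is no match."""
    return PENN_TO_WORDNET.get(tag[:2])


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatise (word, tag) pairs with any callable lemmatize(word, pos=...)."""
    result = []
    for word, tag in tagged_words:
        pos = penn_to_wordnet(tag)
        # Tokens without a WordNet part of speech are passed through unchanged.
        result.append(lemmatize(word, pos=pos) if pos else word)
    return result


# Stand-in lemmatiser (hypothetical) so the sketch runs without NLTK data.
def fake_lemmatize(word, pos="n"):
    table = {("apples", "n"): "apple", ("loves", "v"): "love"}
    return table.get((word, pos), word)


# Hand-written (word, tag) pairs; real input would come from a POS tagger.
tagged = [("apples", "NNS"), ("loves", "VBZ"), ("green", "JJ"), (",", ",")]
print(lemmatize_tagged(tagged, fake_lemmatize))
```

With NLTK installed, the same function should accept `WordNetLemmatizer().lemmatize` as the callable and the output of `pos_tag(word_tokenize(sentence))` as the tagged input. Passing the lemmatiser in as an argument keeps the mapping logic testable on its own.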

NLTK POS tags:

    Tag   Meaning                                       Examples
    CC    conjunction                                   and, or, but
    CD    numeral                                       twenty-four, fourth, 1991, 14:24
    DT    determiner                                    the, a, some, most, every, no
    EX    existential there                             there, there's
    FW    foreign word                                  dolce, ersatz, esprit, quo, maitre
    IN    preposition or subordinating conjunction      on, of, at, with, by, into, under, if, while, although
    JJ    adjective                                     new, good, high, special, big, local
    JJR   adjective, comparative                        bleaker, braver, breezier, briefer, brighter, brisker
    JJS   adjective, superlative                        calmest, cheapest, choicest, classiest, cleanest, clearest
    LS    list item marker                              A A. B B. C C. D E F First G H I J K
    MD    modal verb                                    can, cannot, could, couldn't
    NN    noun                                          year, home, costs, time, education
    NNS   noun, plural                                  undergraduates, scotches
    NNP   proper noun                                   Alison, Africa, April, Washington
    NNPS  proper noun, plural                           Americans, Americas, Amharas, Amityvilles
    PDT   predeterminer                                 all, both, half, many
    POS   possessive marker                             ' 's
    PRP   personal pronoun                              hers, herself, him, himself, hisself
    PRP$  possessive pronoun                            her, his, mine, my, our, ours
    RB    adverb                                        occasionally, unabatingly, maddeningly
    RBR   adverb, comparative                           further, gloomier, grander
    RBS   adverb, superlative                           best, biggest, bluntest, earliest
    RP    particle                                      aboard, about, across, along, apart
    SYM   symbol                                        % & ' '' ''. ) )
    TO    the word to                                   to
    UH    interjection                                  Goodbye, Goody, Gosh, Wow
    VB    verb, base form                               ask, assemble, assess
    VBD   verb, past tense                              dipped, pleaded, swiped
    VBG   verb, present participle                      telegraphing, stirring, focusing
    VBN   verb, past participle                         multihulled, dilapidated, aerosolized
    VBP   verb, present tense, not 3rd person singular  predominate, wrap, resort, sue
    VBZ   verb, present tense, 3rd person singular      bases, reconstructs, marks
    WDT   wh-determiner                                 which, what, that, whatever
    WP    wh-pronoun                                    who, what, whatever
    WP$   wh-pronoun, possessive                        whose
    WRB   wh-adverb                                     when, where, how
    # Look up the description of a tag (requires the NLTK tagsets data package)
    import nltk
    nltk.help.upenn_tagset('JJ')
