6.6 Taking the word types into account

Published 2024-01-30 22:34:09

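So far, we have simply treated words as independent of each other so that the bag-of-words approach could be applied. Intuition suggests, however, that neutral tweets probably contain a higher proportion of nouns, while positive or negative tweets are more colorful and need more adjectives and verbs. What if we could use this linguistic information from the tweets as well? If we could find out how many words in a tweet are nouns, verbs, adjectives, and so on, the classifier could take that information into account too.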

6.6.1 Determining the word types

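Determining the word types is the job of part-of-speech tagging (POS tagging). A POS tagger parses a complete sentence with the goal of arranging it into a dependency tree, in which each node corresponds to a word and the parent-child relationships determine which word it depends on. With this tree, it can make more informed decisions, for example whether the word "book" is a noun ("This is a good book") or a verb ("Could you please book the flight?").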

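As you may have guessed, NLTK plays a role here as well. Indeed, it ships with a variety of parsers and taggers. The POS tagger we will use, nltk.pos_tag(), is actually a full-blown classifier trained on manually annotated sentences from the Penn Treebank Project (http://www.cis.upenn.edu/~treebank). It takes a list of word tokens as input and outputs a list of tuples, where each tuple contains a token of the original sentence and its part-of-speech tag: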

    >>> import nltk
    >>> nltk.pos_tag(nltk.word_tokenize("This is a good book."))
    [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
    >>> nltk.pos_tag(nltk.word_tokenize("Could you please book the flight?"))
    [('Could', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('book', 'NN'), ('the', 'DT'), ('flight', 'NN'), ('?', '.')]

The POS tag abbreviations come from the Penn Treebank Project (adapted from http://americannationalcorpus.org/OANC/penn.html):

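POS tag	Description	Example
CC	Coordinating conjunction	or
CD	Cardinal number	2, second
DT	Determiner	the
EX	Existential there	there are
FW	Foreign word	kindergarten
IN	Preposition/subordinating conjunction	on, of, like
JJ	Adjective	cool
JJR	Adjective, comparative	cooler
JJS	Adjective, superlative	coolest
LS	List marker	1)
MD	Modal	could, will
NN	Noun, singular or mass	book
NNS	Noun, plural	books
NNP	Proper noun, singular	Sean
NNPS	Proper noun, plural	Vikings
PDT	Predeterminer	both the boys
POS	Possessive ending	friend's
PRP	Personal pronoun	I, he, it
PRP$	Possessive pronoun	my, his
RB	Adverb	however, usually, naturally, here, good
RBR	Adverb, comparative	better
RBS	Adverb, superlative	best
RP	Particle	give up
TO	to	to go, to him
UH	Interjection	uhhuhhuhh
VB	Verb, base form	take
VBD	Verb, past tense	took
VBG	Verb, gerund/present participle	taking
VBN	Verb, past participle	taken
VBP	Verb, non-3rd person singular, present tense	take
VBZ	Verb, 3rd person singular, present tense	takes
WDT	Wh-determiner	which
WP	Wh-pronoun	who, what
WP$	Possessive wh-pronoun	whose
WRB	Wh-adverb	where, when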

With these tags, it is easy to filter the desired ones out of the output of pos_tag(). We simply count the words whose tags start with NN for nouns, VB for verbs, JJ for adjectives, and RB for adverbs.
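The counting step can be sketched in a few lines. This is a minimal illustration that assumes the tagger output is already available as a list of (word, tag) tuples like the ones shown earlier; the helper name count_word_types is hypothetical and not part of NLTK.

```python
from collections import Counter

# Map Penn Treebank tag prefixes to coarse word types.
PREFIX_TO_TYPE = {"NN": "nouns", "VB": "verbs", "JJ": "adjectives", "RB": "adverbs"}


def count_word_types(tagged):
    """Count nouns, verbs, adjectives and adverbs in pos_tag()-style output."""
    counts = Counter({name: 0 for name in PREFIX_TO_TYPE.values()})
    for _word, tag in tagged:
        # The first two characters identify the tag family (NN, NNS, NNP, ...).
        word_type = PREFIX_TO_TYPE.get(tag[:2])
        if word_type is not None:
            counts[word_type] += 1
    return dict(counts)


# Tagger output copied from the "This is a good book." example.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
          ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
word_type_counts = count_word_types(tagged)
```

Dividing each count by the number of tokens turns these raw counts into proportions, which are easier to compare across tweets of different lengths.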

6.6.2 Successfully cheating using SentiWordNet

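The linguistic information discussed above will most likely help us, but there is something even better we can harvest: SentiWordNet (http://sentiwordnet.isti.cnr.it). Simply put, it is a 13 MB file that assigns most English words a positive and a negative score. More precisely, it records both a positive and a negative sentiment value for every synonym set (synset). Here are some examples: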

POS	ID	PosScore	NegScore	SynsetTerms	Description
a	03311354	0.25	0.125	studious#1	Marked by care and effort; "made a studious attempt to fix the television set"
a	00311663	0	0.5	careless#1	Marked by lack of attention or consideration or forethought or thoroughness; not careful
n	03563710	0	0	implant#1	A prosthesis placed permanently in tissue
v	00362128	0	0	kink#2 curve#5 curl#1	Form a curl, curve, or kink; "the cigar smoke curled up at the ceiling"

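With the information in the POS column, we can distinguish the noun "book" from the verb "book". PosScore and NegScore together let us determine the neutrality of a word, which is 1 - PosScore - NegScore. SynsetTerms lists all words in the synset. The ID and Description columns can safely be ignored.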
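As a small worked example of the neutrality formula, the following sketch applies it to three of the rows above. The function name neutrality is chosen here for illustration only.

```python
def neutrality(pos_score, neg_score):
    """Return how neutral a synset is: 1 - PosScore - NegScore."""
    return 1.0 - pos_score - neg_score


# (term, PosScore, NegScore) taken from the example rows above.
rows = [("studious", 0.25, 0.125), ("careless", 0.0, 0.5), ("implant", 0.0, 0.0)]
neutral_by_term = {term: neutrality(p, n) for term, p, n in rows}
```

A term without any sentiment, such as "implant", ends up with a neutrality of 1, while "careless" is only half neutral.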

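Each synset term has a number appended, because some words occur in several different synsets. For example, "fantasize" conveys two quite different meanings, which also leads to different scores: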

POS	ID	PosScore	NegScore	SynsetTerms	Description
v	01636859	0.375	0	fantasize#2 fantasise#2	Portray in the mind; "he is fantasizing the ideal wife"
v	01637368	0	0.125	fantasy#1 fantasize#1 fantasise#1	Indulge in fantasies; "he is fantasizing when he says that he plans to start his own company"

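To find out which of the synsets applies, we would have to really understand the meaning of the tweets, which is beyond the scope of this chapter. The field of research that focuses on this challenge is called word sense disambiguation. For now, we take the easy route and simply average the scores over all synsets in which a term occurs. For "fantasize", this gives a PosScore of 0.1875 and a NegScore of 0.0625.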
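The averaging can be checked by hand with the two "fantasize" rows. The sketch below uses only the standard library and hard-codes those two rows; it is meant to illustrate the arithmetic, not to replace a loader for the full file.

```python
from collections import defaultdict

# (POS, PosScore, NegScore, SynsetTerms) for the two "fantasize" rows.
rows = [
    ("v", 0.375, 0.0, "fantasize#2 fantasise#2"),
    ("v", 0.0, 0.125, "fantasy#1 fantasize#1 fantasise#1"),
]

# Collect every (PosScore, NegScore) pair under a "POS/word" key.
scores = defaultdict(list)
for pos, pos_score, neg_score, synset_terms in rows:
    for term in synset_terms.split(" "):
        word = term.split("#")[0]  # drop the sense number
        scores[f"{pos}/{word}"].append((pos_score, neg_score))

# Average the pairs per key.
averaged = {}
for key, pairs in scores.items():
    avg_pos = sum(p for p, _ in pairs) / len(pairs)
    avg_neg = sum(n for _, n in pairs) / len(pairs)
    averaged[key] = (avg_pos, avg_neg)
```

"fantasize" occurs in both synsets, so its scores are averaged over two entries, whereas "fantasy" occurs only once and keeps the scores of its single synset.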

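The following function, load_sent_word_net(), does all of that for us and returns a dictionary whose keys are strings of the form "word type/word", for example "n/implant", and whose values are the positive and negative scores: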

    import collections
    import csv
    import os

    import numpy as np

    # DATA_DIR is expected to be defined elsewhere and to point at the
    # directory that contains the SentiWordNet file.
    def load_sent_word_net():
        sent_scores = collections.defaultdict(list)

        with open(os.path.join(DATA_DIR,
                  "SentiWordNet_3.0.0_20130122.txt"), "r") as csvfile:
            reader = csv.reader(csvfile, delimiter='\t', quotechar='"')
            for line in reader:
                # skip comment lines and rows without columns
                if line[0].startswith("#"):
                    continue
                if len(line) == 1:
                    continue

                POS, ID, PosScore, NegScore, SynsetTerms, Gloss = line
                if len(POS) == 0 or len(ID) == 0:
                    continue
                # print(POS, PosScore, NegScore, SynsetTerms)
                for term in SynsetTerms.split(" "):
                    # drop the sense number after each term
                    term = term.split("#")[0]
                    term = term.replace("-", " ").replace("_", " ")
                    key = "%s/%s" % (POS, term)
                    sent_scores[key].append((float(PosScore), float(NegScore)))

        # average the scores over all synsets a term occurs in
        for key, value in sent_scores.items():
            sent_scores[key] = np.mean(value, axis=0)

        return sent_scores

6.6.3 Our first estimator

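Now we have everything in place to create our first estimator. The most convenient way to do this is to inherit from the BaseEstimator class. It requires us to implement the following three methods.

get_feature_names(): This returns a list of strings naming the features that transform() will return.

fit(document, y=None): As we are not implementing a classifier, we can ignore this one and simply return self.

transform(documents): This returns a numpy.array() of shape (len(documents), len(get_feature_names())). That is, for every document in documents, it returns one value for every feature name in get_feature_names().

Let's now implement these methods: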

    import nltk
    import numpy as np
    from sklearn.base import BaseEstimator

    sent_word_net = load_sent_word_net()


    class LinguisticVectorizer(BaseEstimator):

        def get_feature_names(self):
            return np.array(['sent_neut', 'sent_pos', 'sent_neg',
                             'nouns', 'adjectives', 'verbs', 'adverbs',
                             'allcaps', 'exclamation', 'question',
                             'hashtag', 'mentioning'])

        # we don't fit here, but need to return the reference
        # so that it can be used like fit(d).transform(d)
        def fit(self, documents, y=None):
            return self

        def _get_sentiments(self, d):
            sent = tuple(d.split())
            tagged = nltk.pos_tag(sent)

            pos_vals = []
            neg_vals = []

            nouns = 0.
            adjectives = 0.
            verbs = 0.
            adverbs = 0.

            for w, t in tagged:
                p, n = 0, 0
                sent_pos_type = None
                if t.startswith("NN"):
                    sent_pos_type = "n"
                    nouns += 1
                elif t.startswith("JJ"):
                    sent_pos_type = "a"
                    adjectives += 1
                elif t.startswith("VB"):
                    sent_pos_type = "v"
                    verbs += 1
                elif t.startswith("RB"):
                    sent_pos_type = "r"
                    adverbs += 1

                # look up the sentiment scores for "type/word", if known
                if sent_pos_type is not None:
                    sent_word = "%s/%s" % (sent_pos_type, w)
                    if sent_word in sent_word_net:
                        p, n = sent_word_net[sent_word]

                pos_vals.append(p)
                neg_vals.append(n)

            l = len(sent)
            avg_pos_val = np.mean(pos_vals)
            avg_neg_val = np.mean(neg_vals)
            return [1 - avg_pos_val - avg_neg_val, avg_pos_val, avg_neg_val,
                    nouns / l, adjectives / l, verbs / l, adverbs / l]

        def transform(self, documents):
            obj_val, pos_val, neg_val, nouns, adjectives, verbs, adverbs = \
                np.array([self._get_sentiments(d) for d in documents]).T

            allcaps = []
            exclamation = []
            question = []
            hashtag = []
            mentioning = []

            for d in documents:
                # number of all-uppercase tokens longer than two characters
                allcaps.append(np.sum([t.isupper()
                                       for t in d.split() if len(t) > 2]))
                exclamation.append(d.count("!"))
                question.append(d.count("?"))
                hashtag.append(d.count("#"))
                mentioning.append(d.count("@"))

            result = np.array([obj_val, pos_val, neg_val, nouns, adjectives,
                               verbs, adverbs, allcaps, exclamation,
                               question, hashtag, mentioning]).T

            return result
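The surface-level counts computed in transform() are easy to try out on their own. The following is a minimal sketch that mirrors that part of the code for a single tweet, using only the standard library; surface_features is a hypothetical helper, not a method of the class, and the sample tweet is invented.

```python
def surface_features(tweet):
    """Return [allcaps, exclamation, question, hashtag, mentioning] for one tweet."""
    tokens = tweet.split()
    # Count fully upper-case tokens longer than two characters.
    allcaps = sum(1 for t in tokens if len(t) > 2 and t.isupper())
    return [allcaps, tweet.count("!"), tweet.count("?"),
            tweet.count("#"), tweet.count("@")]


features = surface_features("WOW this is GREAT!!! Thanks @nltk #nlp")
```

Note that "GREAT!!!" still counts as an all-caps token, because str.isupper() only looks at the cased characters of a string.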

6.6.4 Putting everything together

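Nevertheless, using these linguistic features in isolation, without the words themselves, will not take us very far. Therefore, we have to combine the TfidfVectorizer with the linguistic features. This can be done with scikit-learn's FeatureUnion class. It is initialized the same way as Pipeline, but instead of evaluating the estimators in sequence, each passing its output on to the next one, FeatureUnion runs them in parallel and joins the output vectors afterwards: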

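    import re

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import FeatureUnion, Pipeline

    # emo_repl_order, emo_repl and re_repl are replacement tables that are
    # expected to be defined elsewhere; they are not shown in this section.
    def create_union_model(params=None):
        def preprocessor(tweet):
            tweet = tweet.lower()

            for k in emo_repl_order:
                tweet = tweet.replace(k, emo_repl[k])
            for r, repl in re_repl.items():
                tweet = re.sub(r, repl, tweet)

            return tweet.replace("-", " ").replace("_", " ")

        tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                       analyzer="word")
        ling_stats = LinguisticVectorizer()
        # run both featurizers on the same input and join their outputs
        all_features = FeatureUnion([('ling', ling_stats),
                                     ('tfidf', tfidf_ngrams)])
        clf = MultinomialNB()
        pipeline = Pipeline([('all', all_features), ('clf', clf)])

        if params:
            pipeline.set_params(**params)

        return pipeline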
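To make the difference from a sequential pipeline concrete, here is a dependency-free sketch of the idea behind a feature union: every featurizer sees the same documents, and the resulting rows are joined side by side. It is a conceptual illustration only, not scikit-learn's implementation, and all function names in it are hypothetical.

```python
def feature_union(featurizers, documents):
    """Apply every featurizer to the same documents and join the rows side by side."""
    blocks = [featurize(documents) for featurize in featurizers]
    combined = []
    for i in range(len(documents)):
        row = []
        for block in blocks:
            row.extend(block[i])  # append this featurizer's columns
        combined.append(row)
    return combined


def punctuation_stats(documents):
    # Two columns: number of "!" and number of "?".
    return [[d.count("!"), d.count("?")] for d in documents]


def length_stats(documents):
    # One column: number of whitespace-separated tokens.
    return [[len(d.split())] for d in documents]


docs = ["great phone!", "why is it so slow?"]
combined = feature_union([punctuation_stats, length_stats], docs)
```

Each row of combined has three columns: two from punctuation_stats followed by one from length_stats. A sequential pipeline would instead feed the output of the first step into the second.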

Training and testing on the combined featurizers gives another 0.6 percent improvement on the positive versus negative classification:

    == Pos vs. neg ==
    0.808   0.016   0.892   0.010
    == Pos/neg vs. irrelevant/neutral ==
    0.794   0.009   0.707   0.033
    == Pos vs. rest ==
    0.886   0.006   0.533   0.026
    == Neg vs. rest ==
    0.881   0.012   0.629   0.037

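With these results, we probably do not want to use the "negative vs. rest" and "positive vs. rest" classifiers. Instead, we would first use the classifier that determines whether a tweet contains sentiment at all (positive/negative vs. irrelevant/neutral) and then, if it does, use the positive vs. negative classifier to determine the actual sentiment.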
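Such a two-stage decision could be wired up roughly as follows. This is a sketch under stated assumptions: the two classifiers are passed in as plain callables, and the keyword-based stand-ins below are toy placeholders invented for illustration, not the trained models from this chapter.

```python
def classify_tweet(tweet, has_sentiment, is_positive):
    """Two-stage cascade: detect sentiment first, then decide its polarity."""
    if not has_sentiment(tweet):
        return "neutral/irrelevant"
    return "positive" if is_positive(tweet) else "negative"


# Toy stand-ins for the two trained classifiers (assumptions for illustration).
POSITIVE_WORDS = {"love", "great"}
NEGATIVE_WORDS = {"hate", "awful"}


def toy_has_sentiment(tweet):
    words = set(tweet.lower().split())
    return bool(words & (POSITIVE_WORDS | NEGATIVE_WORDS))


def toy_is_positive(tweet):
    words = set(tweet.lower().split())
    return len(words & POSITIVE_WORDS) >= len(words & NEGATIVE_WORDS)


tweets = ["I love this phone", "I hate the battery", "New phone released today"]
labels = [classify_tweet(t, toy_has_sentiment, toy_is_positive) for t in tweets]
```

With real models, has_sentiment and is_positive would wrap the predict() calls of the "positive/negative vs. irrelevant/neutral" and "positive vs. negative" pipelines, respectively.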