
Word Segmentation Analysis of 全职高手 (The King's Avatar) with Python

2018-09-26 23:12:31

The King's Avatar

Among male-oriented web-novel authors, my two favorites are 虫爹 and 烽火. Of 虫爹's works I love 全职高手 the most; of 烽火's, 雪中悍刀行. Recently 虫爹 has started 王者时刻 and is filling in 天醒之路, while 烽火 is writing 剑来; throw in 老猫 as well, since 大道朝天 is worth following. I still remember last year, when there weren't nearly this many books to follow, I reread 全职高手 a dozen-odd times. My love for 全职 is higher than the mountains and deeper than the sea, and my fondness for 徐凤年 could bear witness before sun, moon, heaven and earth. (Ha, even 叶修 only manages second place.)

Full-text word segmentation of 全职高手

A while back, when I was running word-segmentation statistics on user-imported books and search queries, I also put 全职高手 and its reader comments through the pipeline. The comments were meant for a positive-tag feature, but that strategy isn't worked out yet, so for now let's just play with the full text of 全职高手.

What have I done?

Loaded a custom butterfly dictionary containing a large number of 全职 character names, weapon names, skill names, in-game avatar names, team names and so on, so jieba doesn't split them incorrectly, plus a stopwords file (a sketch of both file formats follows these notes).

Did word-frequency statistics and output a word cloud.

Gave word2vec a try for similar-word recall.
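As a quick sketch of the two input files: jieba.load_userdict expects one entry per line in the form "word [frequency] [POS tag]" (frequency and tag are optional), and the stopwords file is simply one word per line. The entries below are illustrative examples only, not the actual contents of butterfly.txt:

import jieba

# Hypothetical sample entries in the jieba userdict format: "word freq pos"
sample_dict = """叶修 1000 nr
苏沐橙 1000 nr
千机伞 500 n
兴欣 500 nt
"""
with open("sample_userdict.txt", "w", encoding="utf-8") as f:
    f.write(sample_dict)

jieba.load_userdict("sample_userdict.txt")
# With the custom entries loaded, the names stay whole instead of being split
print("/".join(jieba.cut("叶修拿着千机伞加入了兴欣战队")))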

from wordcloud import WordCloud, ImageColorGenerator
import matplotlib
import matplotlib.pyplot as plt
from scipy.misc import imread   # only needed for a word-cloud mask image; removed from newer SciPy
import jieba
import jieba.analyse
import operator
import numpy as np
import pandas
from pandas import read_csv, read_table
from pandas import Series, DataFrame
from collections import Counter
from gensim.models import word2vec

# Load the custom dictionary so character/skill/team names are kept whole
jieba.load_userdict("c:\\users\\liuhuizhu\\desktop\\butterfly.txt")

def stop_lines(filepath):
    # read the stopword list, one word per line
    stopwords = [line.strip() for line in open(filepath, 'r').readlines()]
    return stopwords

# Load the stopwords once, outside the loop
stopwords = stop_lines('c:\\users\\liuhuizhu\\desktop\\stopwords.txt')

# Output file for the segmentation result of 全职高手
quanzhigaoshou_fenci_data = open('C:\\Users\\liuhuizhu\\Downloads\\quanzhigaoshou_fenci_data.txt', 'w')
# Full text of 全职高手
quanzhigaoshou_data = open('C:\\Users\\liuhuizhu\\Downloads\\quanzhigaoshou_perfect.txt', 'r')
lines = quanzhigaoshou_data.readlines()

for line in lines:
    cut_data = jieba.cut(line)
    new_str_gaoshou = ""
    for word in cut_data:
        # keep CJK words of length >= 2 that are not stopwords
        if word not in stopwords and len(word) >= 2 and ('\u4e00' <= word <= '\u9fff'):
            if word != '\t':
                new_str_gaoshou += word
                new_str_gaoshou += " "
    # write one space-separated line of tokens per original line
    quanzhigaoshou_fenci_data.write(new_str_gaoshou + '\n')

quanzhigaoshou_fenci_data.close()
quanzhigaoshou_data.close()
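The jieba.analyse import above never actually gets used; for reference, a minimal sketch of what it offers: TF-IDF keyword extraction over the full text, which gives a quick keyword list without the manual counting in the next step. It reads the same novel file as above.

# Sketch: TF-IDF keyword extraction with jieba.analyse over the whole novel
with open('C:\\Users\\liuhuizhu\\Downloads\\quanzhigaoshou_perfect.txt', 'r') as f:
    full_text = f.read()

# topK keywords together with their TF-IDF weights
for keyword, weight in jieba.analyse.extract_tags(full_text, topK=30, withWeight=True):
    print(keyword, weight)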

# Count word frequencies from the segmented file
infile = open('C:\\Users\\liuhuizhu\\Downloads\\quanzhigaoshou_fenci_data.txt', 'r')
line_list = []
word_dict = {}
for line in infile.read().split('\n'):
    line_list.append(line.split(' '))
infile.close()

for item in line_list:
    for item2 in item:
        # only count CJK tokens
        if item2 not in word_dict and ('\u4e00' <= item2 <= '\u9fff'):
            word_dict[item2] = 1
        elif item2 in word_dict and ('\u4e00' <= item2 <= '\u9fff'):
            word_dict[item2] += 1

# Export the frequencies, sorted from most to least frequent
a1 = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
df = DataFrame(a1, columns=['word', 'num'])
df.to_csv("c:\\users\\liuhuizhu\\desktop\\quanzhi_allcipin_num.csv")
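Since collections.Counter is already imported and otherwise unused, here is an equivalent, more compact sketch of the same frequency count over the segmented file:

# Sketch: the same word-frequency count done with collections.Counter
with open('C:\\Users\\liuhuizhu\\Downloads\\quanzhigaoshou_fenci_data.txt', 'r') as f:
    words = [w for w in f.read().split()
             if '\u4e00' <= w <= '\u9fff']   # same rough CJK filter as above
word_counter = Counter(words)
print(word_counter.most_common(20))          # top 20 words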

# Build the word cloud
cloud = WordCloud(
    width=1600,
    height=800,
    # set a CJK font, otherwise the characters render as garbage
    font_path="C:\\Windows\\Fonts\\STFANGSO.ttf",
    #font_path=path.join(d,'simsun.ttc'),
    # background color
    background_color='white',
    # word cloud shape
    #mask=color_mask,
    # maximum number of words
    max_words=800,
    # largest font size
    max_font_size=80
)

cloud.generate_from_frequencies(word_dict)
plt.imshow(cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
cloud.to_file('c:\\users\\liuhuizhu\\desktop\\test.jpg')
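The ImageColorGenerator and imread imports only matter for the commented-out mask option. A hedged sketch of how a shaped, image-colored word cloud would look, assuming a silhouette image at the hypothetical mask.png path (np.array(Image.open(...)) stands in for scipy.misc.imread, which newer SciPy no longer ships):

# Sketch: shaped word cloud colored from a mask image (hypothetical file path)
from PIL import Image

color_mask = np.array(Image.open('c:\\users\\liuhuizhu\\desktop\\mask.png'))
cloud_masked = WordCloud(
    width=1600, height=800,
    font_path="C:\\Windows\\Fonts\\STFANGSO.ttf",
    background_color='white',
    mask=color_mask,       # words are drawn only inside the mask shape
    max_words=800,
    max_font_size=80
)
cloud_masked.generate_from_frequencies(word_dict)
cloud_masked.recolor(color_func=ImageColorGenerator(color_mask))  # reuse the image's colors
cloud_masked.to_file('c:\\users\\liuhuizhu\\desktop\\test_masked.jpg')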

# word2vec: sg=1 selects skip-gram, which suits a small training corpus;
# vectors are spread over 200 dimensions
sentences = word2vec.Text8Corpus("C:\\Users\\liuhuizhu\\Downloads\\quanzhigaoshou_fenci_data.txt")  # load the corpus
model = word2vec.Word2Vec(sentences, sg=1, size=200)
print(model)

# the 20 words most similar to 苏沐橙
y2 = model.most_similar(u"苏沐橙", topn=20)
print(u"Words most similar to 【苏沐橙】:\n")
for item in y2:
    print(item[0], item[1])
print("-----\n")
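A couple of follow-up queries on the trained model, as a sketch. Note that in gensim 4.x the Word2Vec argument size is renamed vector_size and the lookup methods live on model.wv; the model save path below is hypothetical:

# Sketch: save the model and run a few more similarity queries via model.wv
model.save("C:\\Users\\liuhuizhu\\Downloads\\quanzhi_w2v.model")   # hypothetical path

print(model.wv.similarity(u"叶修", u"苏沐橙"))      # cosine similarity between two names
for word, score in model.wv.most_similar(u"叶修", topn=10):
    print(word, score)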

Earlier, in another scenario with a b2v-type algorithm, I had also verified that skip-gram trains a bit more accurately on small amounts of data. After all, this single machine of mine can't possibly carry any grand project, so this is just me amusing myself.