如何用python对一个文件夹下的多个txt文本进行去停用词

 我来答

1个回答

#热议# 为什么说不要把裤子提到肚脐眼？

从空去听8
2017-09-13 · TA获得超过7439个赞

知道大有可为答主

回答量：6907

采纳率：93%

帮助的人：5507万

我也去答题访问个人页

关注

展开全部

在用 for 循环去停用词的部分，出错，仅去掉了 stopwords 中的部分停用词，且相同停用词只去除了一次。求大神告知错误之处，贴上代码再好不过！！
#encoding=utf-8
import sys
import re
import codecs
import os
import shutil
import jieba
import jieba.analyse

#导入自定义词典
#jieba.load_userdict("dict_baidu.txt")

#Read file and cut
def read_file_cut():

#create path
stopwords = {}.fromkeys([ line.strip() for line in open('stopword.txt') ])
path = "Lon\\"
respath = "Lon_Result\\"
if os.path.isdir(respath): #如果respath这个路径存在
shutil.rmtree(respath, True) #则递归移除这个路径
os.makedirs(respath) #重新建立一个respath目录

num = 1
while num<=20:
name = "%d" % num
fileName = path + str(name) + ".txt"
resName = respath + str(name) + ".txt"
source = open(fileName, 'r')
if os.path.exists(resName):
os.remove(resName)
result = codecs.open(resName, 'w', 'utf-8')
line = source.readline()
line = line.rstrip('\n')

while line!="":
line = unicode(line, "utf-8")
output=''
seglist = jieba.cut(line,cut_all=False)
for seg in seglist:
seg=seg.encode('utf-8')
if seg not in stopwords:
output+=seg
output = ' '.join(list(seglist))#空格拼接
print output
result.write(output + '\r\n')
line = source.readline()
else:
print 'End file: ' + str(num)
source.close()
result.close()
num = num + 1
else:
print 'End All'

#Run function
if __name__ == '__main__':
read_file_cut()

我觉得是这样啦:
...
seglist = jieba.cut(line,cut_all=False)
seglist = (seg.encode('utf-8') for seg in seglist)
seglist = [seg for seg in seglist if seg not in stopwords]
output = ' '.join(seglist)
print output
...

不太懂你这两行的意思:
output+=seg
output = ' '.join(list(seglist))#空格拼接

每次 output 都会被设定成 ' '.join(list(seglist)) 那 output+=seg 好像就没有意义了。

已赞过 已踩过<

评论收起

推荐律师服务：若未解决您的问题，请您详细描述您的问题，通过百度律临进行免费专业咨询

如何用python对一个文件夹下的多个txt文本进行去停用词

其他类似问题

为你推荐：