Building a co-word matrix and social network graph from segmented Mafengwo travelogue text, filtered by word distance ...

2024-2-1 11:16 | Posted by: Fuller

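Abstract: In a notebook, use the segmentation result and co-word matrix Excel tables exported from the GooSeeker text segmentation and sentiment analysis software; use Python to compute the distance between co-occurring words, and draw a graph from the co-word matrix table that takes distance into account ...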

1 Background

1.1 Purpose of the experiment

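Two days ago we published the notebook "How to introduce a word-distance factor into the co-word matrix of segmented Zhihu topic text". It answered the following question from students who use the GooSeeker text segmentation and sentiment analysis software for their graduation projects:

When the co-word matrix is generated, can the positional distance between co-occurring words be taken into account? For example, can two words count as co-occurring only when their distance is below some value?

Texts such as Zhihu answers vary greatly in length, so it is well worth filtering co-word relations further by the distance between words. Long texts are also common in content analysis tasks (policy documents, for example), where the unit of analysis can be the paragraph or the sentence instead of the whole article. That usually requires splitting the text by hand. Filtering automatically by word distance avoids this tedious manual work and is a good alternative.

There is one more important step. During segmentation, the GooSeeker text segmentation and sentiment analysis software cuts a long text into several segments, usually no more than 10,000 characters each. This changes the co-occurrence counts. Suppose a pair of words co-occurs in the first segment and again in the second. Because the text has been cut in two, the pair is treated as co-occurring in two documents, for a count of two, although it really co-occurs in only one document. This notebook therefore first merges the segments back together to correct the double counting. [Note] The GooSeeker text segmentation and sentiment analysis software does not do this merging itself; the feature is planned for the next version. For now (31 January 2024) the merging depends on this notebook.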
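
To make the double counting concrete, here is a minimal sketch with made-up data. The words and the `doc_id` field are hypothetical and are not taken from the exported tables. It compares counting a word pair once per segment with counting it once per merged document.

```python
# Toy illustration (hypothetical data): one long document that was split into two segments.
segments = [
    {"doc_id": 1, "words": {"黄鹤楼", "东湖"}},
    {"doc_id": 1, "words": {"黄鹤楼", "东湖", "热干面"}},
]
pair = {"黄鹤楼", "东湖"}

# Treating each segment as its own document counts the pair twice.
count_per_segment = sum(1 for seg in segments if pair.issubset(seg["words"]))

# Merging the segments by doc_id first counts the pair once per original document.
merged = {}
for seg in segments:
    merged.setdefault(seg["doc_id"], set()).update(seg["words"])
count_per_document = sum(1 for words in merged.values() if pair.issubset(words))

print(count_per_segment, count_per_document)  # 2 1
```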

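Below we show how to filter automatically by word distance when analyzing travelogue text. The rough procedure is: in a notebook, use the segmentation result and co-word matrix Excel tables exported from the GooSeeker text segmentation and sentiment analysis software; use Python to compute the distance between co-occurring words and modify the co-word matrix Excel table; save the modified table under a new name in the data/processed directory; then draw a graph from the co-word matrix table that takes distance into account.

The effect is easy to see: introducing the distance factor removes many spurious co-occurrence relations, and the resulting network graph looks cleaner. Looking ahead, since Gephi offers powerful statistics, filtering and network layout features, we will write another tutorial that imports the modified co-word matrix table into Gephi for further inspection.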

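1.1.1 Can the word-selection matrix and word-selection result table also take distance into account?

Using a distance parameter to filter co-occurrence relations is feasible. The earlier notebook "How to introduce a word-distance factor into the co-word matrix of segmented Zhihu topic text" did exactly that by filtering and pruning the co-word matrix exported from the GooSeeker text segmentation and sentiment analysis software.

Can the word-selection matrix and the word-selection result table also be modified with the distance factor? The question matters because analysis and research need a set of data that has been filtered by the same criteria.

However, the method described in that earlier notebook is somewhat crude. When it decides whether two words lie within the required distance, it checks only the first co-occurrence. If the same pair occurs several times in one text, with the first occurrence far apart and a later one within the required distance, the pair is wrongly filtered out. This notebook aims to implement a more precise algorithm that starts from the word-selection matrix (and also modifies the word-selection match table). This yields a precise word-selection matrix, word-selection match table and co-word matrix, because the co-word matrix is obtained by multiplying the word-selection matrix with itself; see "How are co-word relations obtained in co-word analysis?".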

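Figure 1. Export results of the GooSeeker segmentation and sentiment analysis software

Figure 2. Word-selection match table

Figure 3. Word-selection matrix table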

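1.1.2 Why consider word distance

Automatic segmentation of the raw text produces a very large number of words. Many of them have no analytical value: they make processing harder and distort the analysis results. When generating the co-occurrence matrix, the GooSeeker segmentation tool lets you select words manually according to your research goal, and it can also filter quickly by part of speech or word frequency. If two words have already been filtered out during manual selection or filtering, they are not counted as co-occurring words.

However, the texts being analyzed may vary widely in length. Some answers on Zhihu, for example, are full-length articles. In such long texts words are very likely to co-occur, yet many of them belong to different paragraphs and express different topics. One solution is to fix the granularity of the analysis, for example to use paragraphs or sentences consistently instead of whole articles, and to split the text by hand. This notebook instead sets a word distance, which removes distant, unrelated word pairs automatically.

Figure 4. Word selection

The purpose of today's experiment is to use Python to modify the word-selection match table and word-selection matrix table generated and exported by the GooSeeker segmentation tool so that the positional distance between two words is taken into account: given a preset distance, word relations within that distance are kept and those beyond it are dropped.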

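1.2 How to use this notebook

The basic sequence of steps is:

Create an analysis task in the GooSeeker text segmentation and sentiment analysis software and import the Excel file containing the raw content.
After automatic segmentation finishes, select words manually, then export the word-selection match table (选词匹配), the word-selection matrix table (选词矩阵) and the segmentation result table (分词效果).
Put the exported tables in this notebook's data/raw folder.
Run the cells of this notebook from top to bottom. The notebook uses Python to modify the exported word-selection match table and word-selection matrix table based on a preset word-distance value and saves each to a new Excel file. Co-occurring words farther apart than the preset value are removed from the co-word matrix and the word-selection results; co-occurring words at a distance less than or equal to the preset value are kept together with their values.
Two new files, "选词匹配已引入距离值_xxxxxx.xlsx" and "选词矩阵已引入距离值_xxxxxx.xlsx", are generated in this notebook's data/processed folder. The modified tables have the same structure as the original tables in the raw directory, but the values in the matrix have been modified and the word list is shorter.

Note: every notebook project directory is laid out in advance; see "Jupyter Notebook project directory layout reference" (Jupyter Notebook项目目录规划参考). If you run several analysis projects, make a separate copy of the whole template directory for each one.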

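1.3 Source of the experimental data

We used the quick-collection tool for Mafengwo travelogue details (马蜂窝游记详情采集) to crawl travelogues about Wuhan and exported them to Excel. That Excel file was imported into the GooSeeker text segmentation and sentiment analysis software. After automatic segmentation, manual word selection and filtering, and co-word matching, we exported the word-selection match table, the word-selection matrix table and the segmentation result table.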

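2 Third-party libraries

This series of notebooks uses several third-party libraries; see the import code below. If a library turns out to be missing at run time, install it into the Python environment used by the notebook first. We do not list them all here and take pandas as the example.

If pandas is not installed, install it with the command below before running this notebook. [Note] A computer may have several Python environments, so make sure the package goes into the one this notebook uses. For example, if you run the notebook in Jupyter installed as part of the Anaconda distribution, run the command in the Anaconda Prompt window (see the figure below). On macOS you can also use the conda command instead of pip.

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas  # within mainland China the Tsinghua mirror is faster

Figure 5. Installing with pip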
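
As a quick check before installing anything, the following sketch prints the interpreter that the notebook kernel is running and lists which packages cannot be found. The `required` list is an assumption based on the imports in section 3.1; `openpyxl` is included because pandas usually relies on it to read and write .xlsx files.

```python
import importlib.util
import sys

# The interpreter that runs this notebook; packages must be installed for this one.
print(sys.executable)

# Assumed package list: the imports of section 3.1 plus openpyxl for .xlsx support.
required = ["pandas", "numpy", "networkx", "matplotlib", "openpyxl"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("Missing packages:", missing)
```

If something is missing, running `%pip install pandas` (with the package name swapped in) in a notebook cell installs it into the environment of the running kernel.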

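3 Prepare the program environment

Import the required Python packages and set the variables that hold the names of the files to analyze. Throughout this series of notebooks, the following variables correspond to the GooSeeker segmentation result tables:

file_word_freq: word frequency table
file_seg_effect: segmentation result table
file_word_choice_matrix: word-selection matrix table
file_word_choice_match: word-selection match table
file_word_choice_result: word-selection result table
file_co_word_matrix: co-word matrix table
file_co_occ_matrix_modified: co-word matrix table after word distance has been applied
file_word_choice_matrix_modified: word-selection matrix table after word distance has been applied
file_word_choice_match_modified: word-selection match table after word distance has been applied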

3.1 Import packages and declare variables

import pandas as pd
import os
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import pylab
from time import time
import datetime

%xmode Verbose
import warnings
warnings.filterwarnings('ignore')
from IPython.core.interactiveshell import InteractiveShell
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Show every output produced in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"
# Directory that holds the raw data
raw_data_dir = os.path.join(os.getcwd(), '..', '..', 'data', 'raw')
# Directory that holds the processed data
processed_data_dir = os.path.join(os.getcwd(), '..', '..', 'data', 'processed')
# File-name keywords of the exported tables: 0: '分词效果', 1: '选词匹配', 2: '选词矩阵'
filename_temp = pd.Series(['分词效果','选词匹配','选词矩阵'])

The output is:

Exception reporting mode: Verbose

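3.2 Check whether data\raw contains the segmentation result table

The demonstration below uses the Excel tables generated by the GooSeeker segmentation and text analysis software. Put the exported segmentation result, word-selection match and word-selection matrix tables in this notebook's data/raw folder. If a table is missing, the code below reports that it does not exist.

# Index into filename_temp: 0: '分词效果', 1: '选词匹配', 2: '选词矩阵'
file_seg_effect = ''
print(raw_data_dir + '\r\n')
for item_filename in os.listdir(raw_data_dir):
    # Keep the name of a file whose name contains the keyword
    if filename_temp[0] in item_filename:
        file_seg_effect = item_filename
if file_seg_effect:
    print("Segmentation result Excel table:", "data\\raw\\", file_seg_effect)
else:
    print("Segmentation result Excel table: does not exist")

The output is:

C:\Users\work\workspacexxxxxxxxxx\notebook\eda\..\..\data\raw
Segmentation result Excel table: data\raw\ 分词效果_202401291719572350.xlsx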

3.3 Check whether data\raw contains the word-selection match table

# Index into filename_temp: 0: '分词效果', 1: '选词匹配', 2: '选词矩阵'
file_word_choice_match = ''
file_word_choice_match_modified = ''
print(raw_data_dir + '\r\n')
for item_filename in os.listdir(raw_data_dir):
    # Keep the name of a file whose name contains the keyword
    if filename_temp[1] in item_filename:
        file_word_choice_match = item_filename
if file_word_choice_match:
    print(filename_temp[1], "Excel table:", "data\\raw\\", file_word_choice_match)
    # Name of the output file that will hold the distance-filtered table
    file_word_choice_match_modified = file_word_choice_match.replace('选词匹配','选词匹配已引入距离值')
else:
    print(filename_temp[1], "Excel table: does not exist")

The output is:

C:\Users\work\workspacexxxxxxxxxxxx\notebook\eda\..\..\data\raw
选词匹配 Excel table: data\raw\ 选词匹配_202401291719572370.xlsx

3.4 Check whether data\raw contains the word-selection matrix table

# Index into filename_temp: 0: '分词效果', 1: '选词匹配', 2: '选词矩阵'
file_word_choice_matrix = ''
file_word_choice_matrix_modified = ''
print(raw_data_dir + '\r\n')
for item_filename in os.listdir(raw_data_dir):
    # Keep the name of a file whose name contains the keyword
    if filename_temp[2] in item_filename:
        file_word_choice_matrix = item_filename
if file_word_choice_matrix:
    print(filename_temp[2], "Excel table:", "data\\raw\\", file_word_choice_matrix)
    # Name of the output file that will hold the distance-filtered table
    file_word_choice_matrix_modified = file_word_choice_matrix.replace('选词矩阵','选词矩阵已引入距离值')
else:
    print(filename_temp[2], "Excel table: does not exist")

The output is:

C:\Users\work\workspacexxxxxxxxxxxxx\notebook\eda\..\..\data\raw
选词矩阵 Excel table: data\raw\ 选词矩阵_202401291720061040.xlsx
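
The three detection cells in 3.2 to 3.4 follow the same pattern, so they could be folded into one helper. The sketch below is one way to do it. `find_table` is a hypothetical name, and sorting the file names (so that the last match is the one with the largest timestamp suffix) is an assumption that goes slightly beyond the original loops, which keep whichever match `os.listdir` happens to return last.

```python
import os
import tempfile

def find_table(directory, keyword):
    """Return the last file name (in sorted order) containing keyword, or '' if none."""
    found = ''
    for name in sorted(os.listdir(directory)):
        if keyword in name:
            found = name
    return found

# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['分词效果_1.xlsx', '选词匹配_1.xlsx']:
        open(os.path.join(demo_dir, name), 'w').close()
    print(find_table(demo_dir, '选词匹配'))        # 选词匹配_1.xlsx
    print(repr(find_table(demo_dir, '选词矩阵')))  # ''
```

In the notebook this would be called as, for example, `file_seg_effect = find_table(raw_data_dir, filename_temp[0])`.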

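4 Define the word-distance threshold

If the positional distance between two words in a text exceeds this threshold, the pair is not counted as co-occurring. Mafengwo travelogues are usually fairly well-formed and their sentences tend to be longer than on social media, so we set the distance parameter somewhat larger. Setting it to 10, as below, probably filters at roughly sentence granularity. You can also try larger values to filter across sentences.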

k_word_distance = 10

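5 Read and inspect the segmentation result table

5.1 Read the segmentation result table

df_file_seg_effect = pd.read_excel(os.path.join(raw_data_dir, file_seg_effect))

5.2 View the first 10 rows

Use df.head() to view the first n rows of the DataFrame. The data has 5 columns: 原数据 (raw data), 分词数据 (segmented data), 关键词 (keywords), 序号 (serial number) and 发布时间 (publication time).

The 分词数据 column holds the most complete word sequence produced by Chinese word segmentation.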

df_file_seg_effect.head(10)

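5.3 Merge records that share a serial number

Among the first 10 rows above, several rows share the same serial number. This happens because the raw text was too long, so the segmentation tool split it into several segments. To make later processing easier and more accurate, we merge the rows that belong to one serial number into a single record.

# Concatenate the text fields of all rows that share a serial number
df_file_seg_effect_agg = df_file_seg_effect.groupby('序号').agg({
    '原数据': lambda s: ''.join(str(v) for v in s.values),
    '分词数据': lambda s: ' '.join(str(v) for v in s.values),
    '关键词': lambda s: ','.join(str(v) for v in s.values),
})
df_file_seg_effect_agg = df_file_seg_effect_agg.reset_index()
df_file_seg_effect_agg.head(10)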
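
To see what the merge does, here is a minimal sketch on a made-up three-row table in which document 1 was split into two segments. The column names follow the export; the rows are hypothetical. Because every toy value is already a string, `''.join` and `' '.join` can be passed directly, whereas the notebook wraps values in `str()` to guard against empty cells.

```python
import pandas as pd

# Hypothetical miniature of the segmentation result table.
toy = pd.DataFrame({
    '序号': [1, 1, 2],
    '原数据': ['武汉很大', '东湖很美', '热干面好吃'],
    '分词数据': ['武汉 很 大', '东湖 很 美', '热干面 好吃'],
})

# One output row per serial number; segments are joined in their original order.
toy_agg = (
    toy.groupby('序号')
       .agg({'原数据': ''.join, '分词数据': ' '.join})
       .reset_index()
)
print(toy_agg)
```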

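6 Read and inspect the word-selection match table

6.1 Read the word-selection match table

df_file_word_choice_match = pd.read_excel(os.path.join(raw_data_dir, file_word_choice_match))

6.2 View the first 10 rows

Use df.head() to view the first n rows of the DataFrame. The data has 3 columns: 序号 (serial number), 原数据 (raw data) and 打标词 (tag words).

The 原数据 column holds each row of raw text supplied when the segmentation task was created, and the 打标词 column holds the word sequence that resulted from manual word selection.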

df_file_word_choice_match.head(10)

6.3 Merge records that share a serial number

# Concatenate the raw text and the tag words of all rows that share a serial number
df_file_word_choice_match_agg = df_file_word_choice_match.groupby('序号').agg({
    '原数据': lambda s: ''.join(str(v) for v in s.values),
    '打标词': lambda s: ', '.join(str(v) for v in s.values),
})
# df_file_word_choice_match_temp = df_file_word_choice_match.groupby('序号').agg({'打标词':lambda x:', '.join(str(x) for x in x.values)})
# df_file_word_choice_match_agg = pd.concat([df_file_word_choice_match_agg, df_file_word_choice_match_temp], axis=1)
df_file_word_choice_match_agg = df_file_word_choice_match_agg.reset_index()
df_file_word_choice_match_agg.head(10)

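7 Read and inspect the word-selection matrix table

7.1 Read the word-selection matrix table

df_file_word_choice_matrix = pd.read_excel(os.path.join(raw_data_dir, file_word_choice_matrix))

7.2 View the first 10 rows

Use df.head() to view the first n rows of the DataFrame. The Excel table stores the serial number (序号), the original text (正文) and all tag words (that is, all manually selected words), one column per tag word. Each value in the matrix is the number of times that tag word occurs in the given text.

df_file_word_choice_matrix.head(10)

7.3 Get the list of all tag words from the matrix header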

word_list = df_file_word_choice_matrix.columns.values[2:]
word_list

7.4 Merge records that share a serial number

# Within each serial number, concatenate the '正文' column and sum every tag-word column
agg_rules = {'正文': lambda s: ''.join(str(v) for v in s.values)}
agg_rules.update({column: 'sum' for column in word_list})
df_file_word_choice_matrix_agg = df_file_word_choice_matrix.groupby('序号').agg(agg_rules)
df_file_word_choice_matrix_agg = df_file_word_choice_matrix_agg.reset_index()
df_file_word_choice_matrix_agg.head(10)

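8 Modify the co-word matrix table based on word distance

[Note] The code below aims only at functional correctness and has not been optimized for performance. Modifying the raw tables takes a long time when you run it, but this does not affect the correctness of the final result.

8.1 Create the function cal_word_distance(text, word1, word2)

What the function does:

Finds the positions of word1 and word2 in the input text and computes the distance between those positions.

Return value:

-1: the two words do not co-occur in the text
An integer greater than or equal to 0: the positional distance between the two words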

def cal_word_distance(text, word1, word2):
    # text is the space-separated segmented text; positions are token indexes
    text_array = text.split(' ')
    word_distance = -1
    word1_pos = -1
    word2_pos = -1
    # Cheap substring pre-check before scanning the tokens
    if word1 in text and word2 in text:
        for i, item in enumerate(text_array):
            # Record the first position at which each word appears
            if item == word1 and word1_pos == -1:
                word1_pos = i
            if item == word2 and word2_pos == -1:
                word2_pos = i
            # Stop as soon as both words have been located
            if word1_pos > -1 and word2_pos > -1:
                word_distance = abs(word1_pos - word2_pos)
                break
    return word_distance
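
The function above records only the first position of each word. If the goal is to keep a pair whenever any two of its occurrences are close enough, one possible variant is to track the most recent position of each word and keep the smallest gap seen so far. The following is a sketch under that assumption; `min_word_distance` is a hypothetical name and is not part of the original notebook.

```python
def min_word_distance(text, word1, word2):
    """Smallest token gap between any occurrence of word1 and any occurrence of word2.

    Returns -1 when the two words do not both occur as tokens in text.
    """
    last1 = last2 = -1   # most recent positions of word1 and word2
    best = -1            # smallest gap found so far, -1 while undefined
    for i, token in enumerate(text.split(' ')):
        if token == word1:
            last1 = i
        if token == word2:
            last2 = i
        if last1 > -1 and last2 > -1:
            gap = abs(last1 - last2)
            if best == -1 or gap < best:
                best = gap
    return best

# The first occurrences are 6 tokens apart, but a later pair is only 2 apart.
sample = '黄鹤楼 a b c d e 东湖 f 黄鹤楼'
print(min_word_distance(sample, '黄鹤楼', '东湖'))  # 2
```

Swapping this in for cal_word_distance would change which words survive the filter, so the counts reported later in this notebook would no longer apply as they stand.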

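8.2 Create the function is_within_distance(text, word, wordlist, distance_threshold)

What the function does:

Computes the positional distance in the input text between word and every word in wordlist. As soon as one pair is farther apart than distance_threshold, it returns False; otherwise it returns True.

Return value:

True: the distance between word and every word in wordlist is less than or equal to distance_threshold
False: at least one word in wordlist is farther than distance_threshold from word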

def is_within_distance(text, word, wordlist, distance_threshold):
    return_value = True
    for word2 in wordlist:
        # Skip comparing the word with itself
        if word == word2:
            continue
        # Compare word with every other word in wordlist; stop and return False
        # as soon as one distance exceeds distance_threshold
        if cal_word_distance(str(text), word, word2) > distance_threshold:
            return_value = False
            break
    return return_value

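A quick test (the sample text here is not space-separated, so no token matches and every distance is -1):

is_within_distance('懒惰致贫的才是真正少数大多数', '懒惰', ['才是','懒惰','大多数'],10)

The output is: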

True

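8.3 Copy the DataFrames

All modifications below are made on copies of the DataFrames. The copies drop _file_ from their names to tell them apart. Both versions are kept so that the graphs can be compared later.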

df_word_choice_match = df_file_word_choice_match_agg.copy(deep=True)
df_word_choice_matrix = df_file_word_choice_matrix_agg.copy(deep=True)

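8.4 Modify the word-selection match table and the word-selection matrix table

The modification rules are:

For all tag words of each text row in the word-selection match table, take one tag word at a time and check whether its distance to any other tag word exceeds the predefined threshold k_word_distance.
If the distance between the current word and some other word exceeds k_word_distance, remove the current word from that row's tag-word list.
If the distances between the current word and all other words are less than or equal to k_word_distance, keep the current word.
For each removed tag word, use the text's serial number to locate the word's value in the word-selection matrix and set it to 0.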

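print('Modification started', datetime.datetime.now())
# Iterate over the word-selection match table
for index, row in df_word_choice_match.iterrows():
    # The serial number is used to look up the segmentation result table
    # and to update the word-selection matrix and match tables
    sn = row['序号']
    # Tag-word field of the current row
    word_choice = row['打标词']
    # Split the tag-word string into a list of words
    word_choice_list = word_choice.split(', ')
    # Tag words that remain for the current row after filtering
    word_choice_updated = []
    if len(word_choice_list) > 0:
        # Segmented text for this serial number, looked up once per row
        seg_text = df_file_seg_effect_agg.loc[df_file_seg_effect_agg['序号'] == sn, '分词数据'].values[0]
        # Take the words of the current row's tag-word list one at a time
        for word1 in word_choice_list:
            # Is the current word within k_word_distance of every other tag word?
            if is_within_distance(seg_text, word1, word_choice_list, k_word_distance):
                # Yes: keep the current word in the tag-word list
                word_choice_updated.append(word1)
            else:
                # No: set the corresponding value in the word-selection matrix to 0
                df_word_choice_matrix.loc[df_word_choice_matrix['序号'] == sn, word1] = 0
    # Update the tag-word field of the current row in the word-selection match table
    df_word_choice_match.loc[df_word_choice_match['序号'] == sn, '打标词'] = ', '.join(word_choice_updated)
print('Modification finished', datetime.datetime.now())

The output is:

Modification started 2024-01-31 11:14:05.989960
Modification finished 2024-01-31 11:18:02.199262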
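
In the run shown above the loop took about four minutes. Much of that time likely goes into re-splitting and re-scanning the same text for every word pair. One possible speed-up, sketched below with hypothetical helper names, is to index the token positions once per text and read the distances from that index. `first_occurrence_distance` is meant to return the same value as cal_word_distance.

```python
from collections import defaultdict

def build_position_index(text):
    """Map each token of a space-separated text to the list of its positions."""
    index = defaultdict(list)
    for i, token in enumerate(text.split(' ')):
        index[token].append(i)
    return index

def first_occurrence_distance(index, word1, word2):
    """Distance between the first occurrences of two words, or -1 if either is absent."""
    # The `in` test does not create new keys in a defaultdict.
    if word1 not in index or word2 not in index:
        return -1
    return abs(index[word1][0] - index[word2][0])

# Build the index once per text, then query it for as many pairs as needed.
index = build_position_index('a b c a d')
print(first_occurrence_distance(index, 'a', 'd'))  # 4
print(first_occurrence_distance(index, 'a', 'z'))  # -1
```

Because the index keeps every position of every token, the same structure could also serve a minimum-distance rule if that were preferred over first occurrences.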

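8.5 Delete all-zero columns from the word-selection matrix

If, in the modified word-selection matrix, a word's value is 0 for every text, delete that column.

print('Deletion started', datetime.datetime.now())
# The first two columns are the serial number and the text; the rest are tag words
for column in df_word_choice_matrix.columns[2:]:
    if df_word_choice_matrix[column].sum() == 0:
        df_word_choice_matrix = df_word_choice_matrix.drop(column, axis=1)
print('Deletion finished', datetime.datetime.now())

The output is:

Deletion started 2024-01-31 11:18:02.207422
Deletion finished 2024-01-31 11:18:02.839763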
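
The same clean-up can be written without dropping columns one at a time. The sketch below uses a made-up two-row matrix whose layout (two leading non-word columns, then one column per word) mirrors the table described above; it collects the all-zero word columns first and drops them in a single call.

```python
import pandas as pd

# Hypothetical miniature of the word-selection matrix.
toy_matrix = pd.DataFrame({
    '序号': [1, 2],
    '正文': ['text a', 'text b'],
    '东湖': [2, 0],
    '黄鹤楼': [0, 0],
})

# Columns from position 2 onward are word columns.
word_columns = toy_matrix.columns[2:]
zero_columns = [c for c in word_columns if toy_matrix[c].sum() == 0]
toy_matrix = toy_matrix.drop(columns=zero_columns)
print(list(toy_matrix.columns))  # ['序号', '正文', '东湖']
```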

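8.6 View the modified word-selection matrix

Use head(10) to view the first 10 rows.

df_word_choice_matrix.head(10)

8.7 Save the modified word-selection matrix

Save it to a file for use in later analysis.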

df_word_choice_matrix=df_word_choice_matrix.set_index('序号')
df_word_choice_matrix.to_excel(os.path.join(processed_data_dir, file_word_choice_matrix_modified))
df_word_choice_matrix=df_word_choice_matrix.reset_index()

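8.8 View the modified word-selection match table

Use head(10) to view the first 10 rows.

df_word_choice_match.head(10)

8.9 Save the modified word-selection match table

Save it to a file for use in later analysis.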

df_word_choice_match=df_word_choice_match.set_index('序号')
df_word_choice_match.to_excel(os.path.join(processed_data_dir, file_word_choice_match_modified))
df_word_choice_match=df_word_choice_match.reset_index()

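9 Generate the co-word matrix

In the notebook "How are co-word relations obtained in co-word analysis?" we described how the word-selection matrix relates to the co-word matrix. We now use the modified word-selection matrix to generate a new co-word matrix.

9.1 Modify the values of the word-selection matrix

The values in the word-selection matrix currently represent word frequencies. Below we change them to 0 or 1 to indicate whether the word occurs.

9.1.1 Slice out the array of co-occurring words

It will be used when plotting, to label the nodes of the network graph with the Chinese words.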

column_names = df_word_choice_matrix.columns.values[2:]
print("There are ", len(column_names), " words")
column_names

9.1.2 Convert the DataFrame to an array

Each element of the array is a word frequency.

array_word_frequence = df_word_choice_matrix.values[:, 2:]
array_word_frequence

array_word_frequence.shape

The output is:

(2465, 91)

9.1.3 Change frequencies to 0 or 1 to indicate occurrence

array_word_occurrence = array_word_frequence.copy()
array_word_occurrence[array_word_occurrence > 0] = 1
array_word_occurrence

9.2 Multiply the matrices to get the co-word matrix

matrix_coword = np.dot(np.transpose(array_word_occurrence), array_word_occurrence)
matrix_coword

matrix_coword.shape

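The output is:

(91, 91)

There were originally 297 words and only 91 remain, which shows that many words are far apart and have no real co-occurrence relation.

# Set the values on the diagonal of the matrix to 0
for i in range(len(matrix_coword)):
    matrix_coword[i][i] = 0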
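
A small worked example may help to see why the product gives co-occurrence counts. The sketch below uses a made-up 0/1 occurrence matrix (3 texts by 3 words): entry (i, j) of the product counts the texts in which word i and word j both occur, and the diagonal, which only counts each word with itself, is then cleared. `np.fill_diagonal` is used here as an alternative to the explicit loop.

```python
import numpy as np

# Hypothetical 0/1 occurrence matrix: rows are texts, columns are words.
occurrence = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# Transposed matrix times the matrix itself gives word-by-word co-occurrence counts.
coword = occurrence.T @ occurrence

# A word co-occurring with itself carries no information, so clear the diagonal in place.
np.fill_diagonal(coword, 0)
print(coword)
# [[0 2 2]
#  [2 0 1]
#  [2 1 0]]
```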

9.3 Draw the network graph

9.3.1 Build a networkx graph from the NumPy array

Convert the data to floating point; otherwise plotting fails later.

matrix_coword = matrix_coword[:, :].astype(float)
matrix_coword

graph_matrix_coword = nx.from_numpy_array(matrix_coword)
#print(nx.info(graph_matrix_coword))
#graph_matrix_coword.edges(data=True)

9.3.2 Label the graph nodes with the Chinese words

coword_labels = nx.get_node_attributes(graph_matrix_coword,'labels')
# Map each integer node id to the word in the same column position
for idx, node in enumerate(graph_matrix_coword.nodes()):
    print("idx=", idx, "; node=", node)
    coword_labels[node] = column_names[idx]
graph_matrix_coword = nx.relabel_nodes(graph_matrix_coword, coword_labels)
sorted(graph_matrix_coword)
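
The two steps above (building the graph from the array, then renaming the integer nodes) can be tried on a tiny made-up matrix. In this sketch the words and the weight are hypothetical; it shows that the matrix values end up as the `weight` attribute of the edges.

```python
import networkx as nx
import numpy as np

# Hypothetical 2 x 2 co-word matrix: the two words co-occur in 2 texts.
toy_coword = np.array([[0.0, 2.0], [2.0, 0.0]])
toy_words = ['东湖', '黄鹤楼']

# Nodes are created as integers 0..n-1, one per row of the matrix.
g = nx.from_numpy_array(toy_coword)

# Rename node i to the i-th word.
g = nx.relabel_nodes(g, dict(enumerate(toy_words)))
print(list(g.edges(data='weight')))  # [('东湖', '黄鹤楼', 2.0)]
```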

9.3.3 Set a Chinese font

plt.rcParams['font.sans-serif']=['SimHei']
# The line above has no effect on macOS, so use the font below there instead
#plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

9.3.4 Draw the graph

pos = nx.spring_layout(graph_matrix_coword)
plt.figure(1,figsize=(30,30))
nx.draw(graph_matrix_coword, pos, node_size=10, with_labels=True, font_size=22, font_color="red")
plt.show()

The output is:

9.3.5 Remove isolated nodes

graph_matrix_coword.remove_nodes_from(list(nx.isolates(graph_matrix_coword)))
#pos = nx.circular_layout(graph_matrix_coword)
pos = nx.spring_layout(graph_matrix_coword)
plt.figure(1,figsize=(30,30))
nx.draw(graph_matrix_coword, pos, node_size=10, with_labels=True, font_size=22, font_color="red")
plt.show()

The output is: