Using gensim for Word2Vec Computation in Jupyter Notebook

2021-9-8 17:07 | Posted by: Fuller

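1, Background of this notebook

In several example papers introduced earlier, the researchers used the Word2Vec model for text analysis:

1. 《基于热点主题识别的突发事件次生衍生事件探测》 ("Detecting secondary and derivative events of emergencies based on hot-topic identification"). The authors collected 133,947 Weibo posts about the 2018 Changchun Changsheng vaccine incident. Following the four-stage theory of crisis communication, they extracted feature words from the posts in each life-cycle stage of the incident, used a word2vec model and K-means clustering to extract the topics of each stage, computed each topic's influence with the H-index to select the incident's hot topics, and then built decision rules to detect its secondary and derivative events.

2. 《基于语义网络的研究兴趣相似性度量方法》 ("A method for measuring research-interest similarity based on semantic networks"). The authors collected 2,791 journal papers indexed in the Chinese Social Sciences Citation Index (CSSCI), covering 2,104 authors and 4,725 keywords. To simplify computing the similarity of author-interest matrices, the same number of keywords was selected for each core author for word2vec training. When choosing keywords to represent an author's research interests, generic keywords that contribute little to analyzing interest similarity or field hot spots, such as "e-government" (电子政务, 电子政府), were removed. The word2vec model represents the authors' keywords as word vectors, that is, low-dimensional real-valued distributed representations at the semantic level. The semantic relatedness between keywords is then computed to build a keyword semantic network, and the JS distance is used to measure the similarity of the resulting author research-interest matrices.

So what is the Word2Vec model, and can we actually experiment with it in Python? This notebook explores these two questions. The Python code here is mainly meant to let us observe how Word2Vec is computed and to get a feel for the gensim library functions; a separate Jupyter Notebook will later run experiments on real data. Logging is turned on so that the computation can be observed.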

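1.1, What is Word2Vec?

The mind map that originally appeared here drew on the Zhihu article "深入浅出Word2Vec原理解析" (the image is not reproduced in this text).

After reading the CSDN article 《大白话讲解word2vec到底在做些什么》 ("A plain-language explanation of what word2vec actually does"), my understanding is as follows:

word2vec is a method for learning word embeddings (词向量, "word vectors"). Its job is to turn the words of natural language into dense vectors that a computer can work with.

word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram.

Here is an example.

Take the sentence "Hangzhou is a nice city". We want to build a mapping between contexts and target words, which is really the relationship between input and label.

Assume a sliding window of size 1, and look at how the two methods build this mapping:

1. CBOW produces the mappings: [Hangzhou, a] -> is, [is, nice] -> a, [a, city] -> nice

Read it this way: the first target word is "is", and the words before and after it form the context [Hangzhou, a]. Slide the window forward by 1 to reach "a", whose neighbours form the context [is, nice], and so on.

2. Skip-Gram produces the mappings: (is, Hangzhou), (is, a), (a, is), (a, nice), (nice, a), (nice, city)

Read it this way: the first target word is again "is", which is paired with the word before it and the word after it, giving (is, Hangzhou) and (is, a). Slide forward by 1 to reach "a", which gives (a, is) and (a, nice).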
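The window-sliding procedure described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `context_windows`, `cbow_pairs` and `skipgram_pairs` are made up for this example, and, like the listing above, the sketch skips the edge words "Hangzhou" and "city" as targets (real implementations such as gensim also train on edge words, using a truncated window).

```python
def context_windows(tokens, window=1):
    """Yield (target, context) for every token with a full window on both sides."""
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + 1 + window]
        yield tokens[i], context


def cbow_pairs(tokens, window=1):
    """CBOW: the whole context is the input, the target word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_pairs(tokens, window=1):
    """Skip-Gram: the target word is the input, each context word is a label."""
    return [
        (target, neighbour)
        for target, context in context_windows(tokens, window)
        for neighbour in context
    ]


tokens = "Hangzhou is a nice city".split()
cbow = cbow_pairs(tokens)
skipgram = skipgram_pairs(tokens)

print(cbow)
# [(['Hangzhou', 'a'], 'is'), (['is', 'nice'], 'a'), (['a', 'city'], 'nice')]
print(skipgram)
# [('is', 'Hangzhou'), ('is', 'a'), ('a', 'is'), ('a', 'nice'), ('nice', 'a'), ('nice', 'city')]
```

Both functions share the same window logic; the only difference is which side of the pair plays the role of input and which the role of label.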

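A commonly cited summary is that CBOW predicts the target word from its surrounding context, whereas Skip-Gram does the reverse and predicts the surrounding context from the target word. It is also often said that CBOW suits smaller datasets while Skip-Gram performs better on large corpora. Treat this as a rule of thumb rather than a settled result: other sources report that Skip-Gram copes well with small amounts of training data and rare words, while CBOW is faster to train.

As for the reasons and the underlying principles, readers are encouraged to look into them on their own.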

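1.2, Can we run Word2Vec experiments in Python?

I looked into it: the Python library Gensim provides methods for the word2vec model, and its official website gives examples.

A brief introduction to gensim:

Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim is designed to process raw, unstructured digital texts ("plain text").

The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Analysis (LSI/LSA, see LsiModel) and Latent Dirichlet Allocation (LDA, see LdaModel), discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text document (sentence, phrase, word...) can be succinctly expressed in the new semantic representation and queried for topical similarity against other documents (words, phrases...).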

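2, Third-party libraries

This notebook uses the gensim library to run the Word2Vec experiments.

If gensim is not installed yet, install it with the command below before running this notebook:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple gensim  # the Tsinghua mirror is fast for installs within mainland China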

3, Tests performed in this notebook

Based on test data and the tutorial on the Gensim website, we run word2vec model experiments with Python in Jupyter Notebook.

4, Embed matplotlib figures in the Jupyter Notebook

After the following line is executed, figures drawn with matplotlib are displayed directly in the Jupyter Notebook.

%matplotlib inline

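5, Turn on logging

Print the log messages produced during the experiment directly in the Jupyter Notebook, which makes it easy to observe how word2vec is computed.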

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


6, Import the gensim library

Import models, datapath and utils from the gensim library.

from gensim.test.utils import datapath

from gensim import utils

import gensim.models


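The following log output appears (the Dictionary messages come from importing gensim.test.utils, which builds a small test dictionary when it is loaded):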

2021-09-08 16:09:58,638 : INFO : adding document #0 to Dictionary(0 unique tokens: [])

2021-09-08 16:09:58,643 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)

2021-09-08 16:09:58,644 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-09-08T16:09:58.643202', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


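7, Define the MyCorpus corpus class

The gensim package ships with the "Lee Evaluation Corpus" (the file lee_background.cor in its test data), so we can run the experiment on this corpus.

class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # open the file in a with-block so it is closed after each pass
        with open(corpus_path, encoding='utf-8') as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)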


8, Create the sentences variable

sentences = MyCorpus()
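A note on why the corpus is a class with `__iter__` rather than a plain generator: Word2Vec reads the corpus more than once (the training log in section 9 shows one vocabulary scan followed by five epochs), so the object must be able to restart iteration. The sketch below illustrates the difference in plain Python; `RestartableCorpus` and `one_shot` are hypothetical names used only for this demonstration, and the simple `lower().split()` tokenization stands in for `utils.simple_preprocess`.

```python
class RestartableCorpus:
    """Each call to __iter__ starts a fresh pass over the data."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()


def one_shot(lines):
    """A plain generator: it can be consumed only once."""
    for line in lines:
        yield line.lower().split()


lines = ["Hangzhou is a nice city", "Word2Vec reads the corpus several times"]

corpus = RestartableCorpus(lines)
stream = one_shot(lines)

# Count the sentences seen on two consecutive passes over each object.
corpus_passes = [len(list(corpus)), len(list(corpus))]
stream_passes = [len(list(stream)), len(list(stream))]

print(corpus_passes)  # [2, 2]
print(stream_passes)  # [2, 0] -> the generator is empty on the second pass
```

Passing a one-shot generator to a model that needs several passes would leave the later passes with no data, which is why the tutorial wraps the file reading in a class.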

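9, Train the word2vec model on the sentences

When training, the corpus must be supplied through the sentences parameter. There are also several optional parameters, including:

min_count: default 5; only words that occur at least 5 times in the corpus are kept.

vector_size: the number of dimensions (N) of the N-dimensional space into which gensim Word2Vec maps words; the default is 100.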

model = gensim.models.Word2Vec(sentences=sentences)

The following log output appears:

2021-09-08 16:11:26,610 : INFO : collecting all words and their counts

2021-09-08 16:11:26,612 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

2021-09-08 16:11:26,737 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences

2021-09-08 16:11:26,737 : INFO : Creating a fresh vocabulary

2021-09-08 16:11:26,767 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.068041827818366%% of original 6981, drops 5231)', 'datetime': '2021-09-08T16:11:26.767168', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:11:26,769 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.83801073049938%% of original 58152, drops 8817)', 'datetime': '2021-09-08T16:11:26.769168', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:11:26,785 : INFO : deleting the raw counts dictionary of 6981 items

2021-09-08 16:11:26,787 : INFO : sample=0.001 downsamples 51 most-common words

2021-09-08 16:11:26,788 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 35935.33721568072 word corpus (72.8%% of prior 49335)', 'datetime': '2021-09-08T16:11:26.788156', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:11:26,827 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes

2021-09-08 16:11:26,828 : INFO : resetting layer weights

2021-09-08 16:11:26,851 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-09-08T16:11:26.851118', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'build_vocab'}

2021-09-08 16:11:26,852 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5', 'datetime': '2021-09-08T16:11:26.852121', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}

2021-09-08 16:11:26,977 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:26,981 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:26,983 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:26,984 : INFO : EPOCH - 1 : training on 58152 raw words (35896 effective words) took 0.1s, 278642 effective words/s

2021-09-08 16:11:27,084 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,086 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,094 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,096 : INFO : EPOCH - 2 : training on 58152 raw words (35990 effective words) took 0.1s, 329902 effective words/s

2021-09-08 16:11:27,195 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,202 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,205 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,207 : INFO : EPOCH - 3 : training on 58152 raw words (35921 effective words) took 0.1s, 339330 effective words/s

2021-09-08 16:11:27,312 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,313 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,318 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,319 : INFO : EPOCH - 4 : training on 58152 raw words (36054 effective words) took 0.1s, 329277 effective words/s

2021-09-08 16:11:27,420 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,421 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,430 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,432 : INFO : EPOCH - 5 : training on 58152 raw words (35870 effective words) took 0.1s, 337083 effective words/s

2021-09-08 16:11:27,433 : INFO : Word2Vec lifecycle event {'msg': 'training on 290760 raw words (179731 effective words) took 0.6s, 309329 effective words/s', 'datetime': '2021-09-08T16:11:27.433787', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}

2021-09-08 16:11:27,433 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec(vocab=1750, vector_size=100, alpha=0.025)', 'datetime': '2021-09-08T16:11:27.433787', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


10, Get the vector of a word

Get the vector of the word "king" from the trained model and print it.

vec_king = model.wv['king']

print(vec_king)


The output is:

[-6.92510232e-03  4.14995067e-02  1.59823261e-02  1.46031603e-02

  5.08679077e-03 -6.39730245e-02  4.07829806e-02  7.24346042e-02

 -1.78169943e-02 -8.31861142e-03 -5.16973110e-03 -6.55126572e-02

  5.94190601e-03  1.41455485e-02  7.18757603e-03 -7.89120328e-03

 -1.98559370e-03 -9.05822031e-03 -6.99098874e-03 -5.86852096e-02

  3.40853035e-02  2.03361809e-02  8.10676254e-03 -8.25211697e-04

 -1.60578378e-02  1.37015795e-02 -2.02752780e-02 -3.86285335e-02

 -2.95265168e-02  6.02840958e-03  2.31080819e-02 -2.63607632e-02

  3.32033597e-02 -3.91531475e-02  5.09816781e-03  3.13201882e-02

  1.86994895e-02 -9.04701278e-03 -6.88288175e-03 -3.53833996e-02

 -7.21966149e-03 -1.11886691e-02 -2.19836980e-02  7.16285687e-03

  2.69754194e-02 -2.45087147e-02 -2.37088483e-02 -3.23272659e-03

  1.52070969e-02  2.66285371e-02  1.97965223e-02 -2.19922252e-02

 -2.94247735e-02  4.79266094e-03 -1.58476236e-03  1.47990910e-02

  7.19935913e-03  1.92051902e-02 -2.50961687e-02  1.73416305e-02

 -3.54305084e-04  6.87906425e-03  8.16148333e-03 -1.34457452e-02

 -4.21114005e-02  4.24223766e-02 -2.52795685e-03  2.80112531e-02

 -2.31643170e-02  3.85041460e-02 -1.70745552e-02  5.71923843e-03

  5.32592610e-02 -1.80432275e-02  3.00998446e-02  2.71892790e-02

 -8.73102620e-03 -2.51155049e-02 -3.51627842e-02 -4.78182407e-03

 -2.93990150e-02  3.29859066e-03 -3.37860733e-02  4.85829152e-02

  6.55127733e-05 -1.12926299e-02  5.20297652e-03  4.24825326e-02

  3.99986207e-02  1.23887025e-02  3.60009037e-02  3.50367576e-02

  3.11715566e-02  7.25049153e-03  8.06821436e-02  2.94752847e-02

  2.93809213e-02 -1.59397889e-02  1.09631000e-02 -1.44344661e-02]


11, View the first 10 words in the model

for index, word in enumerate(model.wv.index_to_key):

    if index == 10:

        break

    print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")


The output is:

word #0/1750 is the

word #1/1750 is to

word #2/1750 is of

word #3/1750 is in

word #4/1750 is and

word #5/1750 is he

word #6/1750 is is

word #7/1750 is for

word #8/1750 is on

word #9/1750 is said


12, Save the model

Save the trained model to disk so that it can be reused later or trained further on more data.

import tempfile

model_path = ''

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:

    temporary_filepath = tmp.name

    model.save(temporary_filepath)

    model_path = temporary_filepath

The following log output appears:

2021-09-08 16:11:43,410 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'C:\\Users\\work\\AppData\\Local\\Temp\\gensim-model-kpg71m3y', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-09-08T16:11:43.410988', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'saving'}

2021-09-08 16:11:43,412 : INFO : not storing attribute cum_table

2021-09-08 16:11:43,415 : INFO : saved C:\Users\work\AppData\Local\Temp\gensim-model-kpg71m3y


13, Load the model

Load the model from the saved file.

new_model = gensim.models.Word2Vec.load(model_path)


The following log output appears:

2021-09-08 16:11:48,920 : INFO : loading Word2Vec object from C:\Users\work\AppData\Local\Temp\gensim-model-kpg71m3y

2021-09-08 16:11:48,984 : INFO : loading wv recursively from C:\Users\work\AppData\Local\Temp\gensim-model-kpg71m3y.wv.* with mmap=None

2021-09-08 16:11:48,985 : INFO : setting ignored attribute cum_table to None

2021-09-08 16:11:49,022 : INFO : Word2Vec lifecycle event {'fname': 'C:\\Users\\work\\AppData\\Local\\Temp\\gensim-model-kpg71m3y', 'datetime': '2021-09-08T16:11:49.022784', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'loaded'}


14, Continue training

Starting from the loaded model, we can continue training with new corpus data.

more_sentences = [

    ['Advanced', 'users', 'can', 'load', 'a', 'model',

     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],

]

new_model.build_vocab(more_sentences, update=True)

new_model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)


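The following log output appears. The warnings "supplied example count (1) did not equal expected count (300)" arise because total_examples was taken from the original model (300 sentences) while only one new sentence was supplied. The message "added 0 new unique words" shows that none of the words in the new sentence reached min_count=5, so the vocabulary stays at 1750 words: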

2021-09-08 16:12:02,226 : INFO : collecting all words and their counts

2021-09-08 16:12:02,227 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

2021-09-08 16:12:02,227 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences

2021-09-08 16:12:02,228 : INFO : Updating model with new vocabulary

2021-09-08 16:12:02,236 : INFO : Word2Vec lifecycle event {'msg': 'added 0 new unique words (0.0%% of original 13) and increased the count of 0 pre-existing words (0.0%% of original 13)', 'datetime': '2021-09-08T16:12:02.236508', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:12:02,237 : INFO : deleting the raw counts dictionary of 13 items

2021-09-08 16:12:02,238 : INFO : sample=0.001 downsamples 0 most-common words

2021-09-08 16:12:02,238 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 0 word corpus (0.0%% of prior 0)', 'datetime': '2021-09-08T16:12:02.238507', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:12:02,255 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes

2021-09-08 16:12:02,256 : INFO : updating layer weights

2021-09-08 16:12:02,257 : INFO : Word2Vec lifecycle event {'update': True, 'trim_rule': 'None', 'datetime': '2021-09-08T16:12:02.257496', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'build_vocab'}

2021-09-08 16:12:02,258 : WARNING : Effective 'alpha' higher than previous training cycles

2021-09-08 16:12:02,258 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5', 'datetime': '2021-09-08T16:12:02.258496', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}

2021-09-08 16:12:02,263 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,265 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,266 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,267 : INFO : EPOCH - 1 : training on 13 raw words (5 effective words) took 0.0s, 1180 effective words/s

2021-09-08 16:12:02,268 : WARNING : EPOCH - 1 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,272 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,272 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,273 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,274 : INFO : EPOCH - 2 : training on 13 raw words (6 effective words) took 0.0s, 2247 effective words/s

2021-09-08 16:12:02,274 : WARNING : EPOCH - 2 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,280 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,281 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,282 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,284 : INFO : EPOCH - 3 : training on 13 raw words (5 effective words) took 0.0s, 1003 effective words/s

2021-09-08 16:12:02,285 : WARNING : EPOCH - 3 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,288 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,289 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,290 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,290 : INFO : EPOCH - 4 : training on 13 raw words (5 effective words) took 0.0s, 2002 effective words/s

2021-09-08 16:12:02,291 : WARNING : EPOCH - 4 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,297 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,298 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,299 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,301 : INFO : EPOCH - 5 : training on 13 raw words (6 effective words) took 0.0s, 1440 effective words/s

2021-09-08 16:12:02,302 : WARNING : EPOCH - 5 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,303 : INFO : Word2Vec lifecycle event {'msg': 'training on 65 raw words (27 effective words) took 0.0s, 612 effective words/s', 'datetime': '2021-09-08T16:12:02.303471', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}


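15, An application example

The experiments above follow the word2vec examples on the gensim website. The gensim website also provides an example that uses a pretrained model to print the similarity between car and each of minivan, bicycle, airplane, cereal and communism. The model file is about 2 GB and has to be downloaded from servers outside China, so we do not run this experiment in a mainland China environment.

The code is below; readers who have the means can try it themselves.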

"""

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

pairs = [

    ('car', 'minivan'),   # a minivan is a kind of car

    ('car', 'bicycle'),   # still a wheeled vehicle

    ('car', 'airplane'),  # ok, no wheels, but still a vehicle

    ('car', 'cereal'),    # ... and so on

    ('car', 'communism'),

]

for w1, w2 in pairs:

    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

"""


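Because the code is wrapped in a triple-quoted string, it is not executed; the cell simply echoes the string: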

"\nimport gensim.downloader as api\nwv = api.load('word2vec-google-news-300')\n\npairs = [\n    ('car', 'minivan'),   # a minivan is a kind of car\n    ('car', 'bicycle'),   # still a wheeled vehicle\n    ('car', 'airplane'),  # ok, no wheels, but still a vehicle\n    ('car', 'cereal'),    # ... and so on\n    ('car', 'communism'),\n]\nfor w1, w2 in pairs:\n    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))\n"

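The `wv.similarity(w1, w2)` call in the code above returns the cosine similarity of the two word vectors. Since the pretrained model is not downloaded here, the following NumPy sketch only illustrates the formula on hand-made vectors; the function name `cosine_similarity` and the sample vectors are chosen for this example and are not real word vectors.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(round(same_direction, 6))  # 1.0  (same direction, length is ignored)
print(round(orthogonal, 6))      # 0.0  (unrelated directions)
print(round(opposite, 6))        # -1.0 (opposite directions)
```

The score depends only on the direction of the vectors, not on their length, so values closer to 1 indicate words whose vectors point the same way.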

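16, Follow-up experiments

Since the examples on the gensim website use an English corpus, the next step is to run word2vec training experiments on a Chinese corpus.

17, Download this notebook

To download the source code, go to: Using gensim for Word2Vec computation (使用gensim做Word2Vec计算)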

