pythainlp.word_vector
The pythainlp.word_vector module contains functions that make use of pre-trained, publicly available word vector data.
The module is a valuable resource for working with pre-trained word vectors. These word vectors are trained on large corpora and can be used for various natural language processing tasks, such as word similarity, document similarity, and more.
Dependencies
Installation of numpy and gensim is required. Before using this module, ensure that both libraries are installed in your environment; they are essential for loading and working with the pre-trained word vectors.
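If either library is missing, it can typically be installed with pip install numpy gensim. A quick sanity check from the Python prompt:
>>> import numpy   # array operations on word and sentence vectors
>>> import gensim  # provides the KeyedVectors models this module loads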
Classes
- class pythainlp.word_vector.WordVector(model_name: str = 'thai2fit_wv')[source]
Word Vector class
- Parameters:
model_name (str) – model name
- Options for model_name
thai2fit_wv (default) - word vector from thai2fit
ltw2v - word vector from LTW2V: The Large Thai Word2Vec v0.1
ltw2v_v1.0_15_window - word vector from LTW2V v1.0, trained with a window size of 15
ltw2v_v1.0_5_window - word vector from LTW2V v1.0, trained with a window size of 5
The WordVector class encapsulates word vector operations and functions. It provides a convenient interface for loading models, finding word similarities, and generating sentence vectors.
- __init__(model_name: str = 'thai2fit_wv') None [source]
Word Vector class
- Parameters:
model_name (str) – model name
- Options for model_name
thai2fit_wv (default) - word vector from thai2fit
ltw2v - word vector from LTW2V: The Large Thai Word2Vec v0.1
ltw2v_v1.0_15_window - word vector from LTW2V v1.0, trained with a window size of 15
ltw2v_v1.0_5_window - word vector from LTW2V v1.0, trained with a window size of 5
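A minimal instantiation sketch based on the signature above (model files are fetched by PyThaiNLP on first use if not already present):
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()                          # default thai2fit_wv model
>>> wv_ltw2v = WordVector(model_name='ltw2v')  # LTW2V model instead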
- load_wordvector(model_name: str)[source]
Load word vector model.
- Parameters:
model_name (str) – model name
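For example, an existing instance can be pointed at a different pre-trained model; a sketch using the model names listed above:
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> wv.load_wordvector('ltw2v_v1.0_5_window')  # replace the currently loaded model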
- get_model() KeyedVectors [source]
Get word vector model.
- Returns:
gensim word2vec model
- Return type:
gensim.models.keyedvectors.Word2VecKeyedVectors
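The returned object is a regular gensim KeyedVectors instance, so the full gensim API is available on it. For example, inspecting the dimensionality (300 for the default thai2fit_wv model, per the sentence_vectorizer() description below):
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> model = wv.get_model()
>>> model.vector_size
300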
- doesnt_match(words: List[str]) str [source]
This function returns the one word that is least related to the other words in the list. It uses the doesnt_match() function from gensim.
- Parameters:
words (list) – a list of words
- Raises:
KeyError – if there is any word in positive or negative that is not in the vocabulary of the model.
- Returns:
the word that is least related to the others in the list
- Return type:
str
- Note:
If a word in words is not in the vocabulary, KeyError will be raised.
- Example:
Pick the word “พริกไทย” (pepper, a food item) out of a list of meals (“อาหารเช้า”, “อาหารเที่ยง”, “อาหารเย็น”).
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> words = ['อาหารเช้า', 'อาหารเที่ยง', 'อาหารเย็น', 'พริกไทย']
>>> wv.doesnt_match(words)
พริกไทย
Pick the word “เรือ” (name of vehicle) out of the list of words related to occupation (“ดีไซน์เนอร์”, “พนักงานเงินเดือน”, “หมอ”).
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> words = ['ดีไซน์เนอร์', 'พนักงานเงินเดือน', 'หมอ', 'เรือ']
>>> wv.doesnt_match(words)
เรือ
- most_similar_cosmul(positive: List[str], negative: List[str]) List[Tuple[str, float]] [source]
This function finds the top-10 words that are most similar with respect to two lists of words labeled as positive and negative. The top-10 most similar words are obtained using multiplication combination objective from Omer Levy and Yoav Goldberg [OmerLevy_YoavGoldberg_2014].
We use the most_similar_cosmul() function from gensim.
- Parameters:
positive (list) – a list of words contributing positively to the similarity
negative (list) – a list of words contributing negatively to the similarity
- Raises:
KeyError – if there is any word in positive or negative that is not in the vocabulary of the model.
- Returns:
list of the top-10 most similar words together with their similarity scores
- Return type:
list[tuple[str, float]]
- Note:
With a single word in the positive list, it will find the words most similar to that word (similar to gensim.most_similar()). If a word in positive or negative is not in the vocabulary, KeyError will be raised.
- Example:
Find the top-10 most similar words to the word: “แม่น้ำ”.
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> list_positive = ['แม่น้ำ']
>>> list_negative = []
>>> wv.most_similar_cosmul(list_positive, list_negative)
[('ลำน้ำ', 0.8206598162651062), ('ทะเลสาบ', 0.775945782661438), ('ลุ่มน้ำ', 0.7490593194961548), ('คลอง', 0.7471904754638672), ('ปากแม่น้ำ', 0.7354257106781006), ('ฝั่งแม่น้ำ', 0.7120099067687988), ('ทะเล', 0.7030453681945801), ('ริมแม่น้ำ', 0.7015200257301331), ('แหล่งน้ำ', 0.6997432112693787), ('ภูเขา', 0.6960948705673218)]
Find the top-10 most similar words to the words: “นายก”, “รัฐมนตรี”, and “ประเทศ”.
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> list_positive = ['นายก', 'รัฐมนตรี', 'ประเทศ']
>>> list_negative = []
>>> wv.most_similar_cosmul(list_positive, list_negative)
[('รองนายกรัฐมนตรี', 0.2730445861816406), ('เอกอัครราชทูต', 0.26500266790390015), ('นายกรัฐมนตรี', 0.2649088203907013), ('ผู้ว่าราชการจังหวัด', 0.25119125843048096), ('ผู้ว่าการ', 0.2510434687137604), ('เลขาธิการ', 0.24824175238609314), ('ผู้ว่า', 0.2453523576259613), ('ประธานกรรมการ', 0.24147476255893707), ('รองประธาน', 0.24123257398605347), ('สมาชิกวุฒิสภา', 0.2405330240726471)]
Find the top-10 most similar words, first with only a positive list and then with both positive and negative lists.
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> list_positive = ['ประเทศ', 'ไทย', 'จีน', 'ญี่ปุ่น']
>>> list_negative = []
>>> wv.most_similar_cosmul(list_positive, list_negative)
[('ประเทศจีน', 0.22022421658039093), ('เกาหลี', 0.2196873426437378), ('สหรัฐอเมริกา', 0.21660110354423523), ('ประเทศญี่ปุ่น', 0.21205860376358032), ('ประเทศไทย', 0.21159221231937408), ('เกาหลีใต้', 0.20321202278137207), ('อังกฤษ', 0.19610872864723206), ('ฮ่องกง', 0.1928885132074356), ('ฝรั่งเศส', 0.18383873999118805), ('พม่า', 0.18369348347187042)]
>>>
>>> list_positive = ['ประเทศ', 'ไทย', 'จีน', 'ญี่ปุ่น']
>>> list_negative = ['อเมริกา']
>>> wv.most_similar_cosmul(list_positive, list_negative)
[('ประเทศไทย', 0.3278159201145172), ('เกาหลี', 0.3201899230480194), ('ประเทศจีน', 0.31755179166793823), ('พม่า', 0.30845439434051514), ('ประเทศญี่ปุ่น', 0.306713730096817), ('เกาหลีใต้', 0.3003999888896942), ('ลาว', 0.2995176911354065), ('คนไทย', 0.2885020673274994), ('เวียดนาม', 0.2878379821777344), ('ชาวไทย', 0.28480708599090576)]
The function raises KeyError when a term, here “เมนูอาหารไทย”, is not in the vocabulary.
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> list_positive = ['เมนูอาหารไทย']
>>> list_negative = []
>>> wv.most_similar_cosmul(list_positive, list_negative)
KeyError: "word 'เมนูอาหารไทย' not in vocabulary"
- similarity(word1: str, word2: str) float [source]
This function computes cosine similarity between two words.
- Parameters:
word1 (str) – first word
word2 (str) – second word
- Raises:
KeyError – if either word1 or word2 is not in the vocabulary of the model.
- Returns:
the cosine similarity between the two word vectors
- Return type:
float
- Note:
If word1 or word2 is not in the vocabulary, KeyError will be raised.
- Example:
Compute the cosine similarity between the two words “รถไฟ” and “รถไฟฟ้า” (train and electric train).
>>> from pythainlp.word_vector import WordVector
>>> wv = WordVector()
>>> wv.similarity('รถไฟ', 'รถไฟฟ้า')
0.43387136
Compute the cosine similarity between the two words “เสือดาว” and “รถไฟฟ้า” (leopard and electric train).
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> wv.similarity('เสือดาว', 'รถไฟฟ้า')
0.04300258
- sentence_vectorizer(text: str, use_mean: bool = True) ndarray [source]
This function converts a Thai sentence into a vector. Specifically, it first tokenizes the text and maps each tokenized word to its word vector from the model. The word vectors are then aggregated into a single 300-dimensional vector by taking either the mean or the sum of all word vectors.
- Parameters:
text (str) – text input
use_mean (bool) – if True, use the mean of all word vectors; otherwise use their summation
- Returns:
a 300-dimension vector representing the given sentence, in the form of a numpy array
- Return type:
numpy.ndarray((1,300))
- Example:
Vectorize the sentence, “อ้วนเสี้ยวเข้ายึดแคว้นกิจิ๋ว ในปี พ.ศ. 735”, into one sentence vector with two aggregation methods: mean and summation.
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> sentence = 'อ้วนเสี้ยวเข้ายึดแคว้นกิจิ๋ว ในปี พ.ศ. 735'
>>> wv.sentence_vectorizer(sentence, use_mean=True)
array([[-0.00421414, -0.08881307, 0.05081136, -0.05632929, -0.06607185, 0.03059357, -0.113882, -0.00074836, 0.05035743, 0.02914307, ..., 0.02893357, 0.11327957, 0.04562086, -0.05015393, 0.11641257, 0.32304936, -0.05054322, 0.03639471, -0.06531371, 0.05048079]])
>>>
>>> wv.sentence_vectorizer(sentence, use_mean=False)
array([[-0.05899798, -1.24338295, 0.711359, -0.78861002, -0.92500597, 0.42831, -1.59434797, -0.01047703, 0.705004, 0.40800299, ..., 0.40506999, 1.58591403, 0.63869202, -0.702155, 1.62977601, 4.52269109, -0.70760502, 0.50952601, -0.914392, 0.70673105]])
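Because sentence_vectorizer() returns a fixed-size vector, the cosine similarity between two such vectors can serve as a simple document-similarity score, one of the tasks mentioned in the module overview. A minimal sketch using numpy, with two illustrative sentences:
>>> import numpy as np
>>> from pythainlp.word_vector import WordVector
>>>
>>> wv = WordVector()
>>> v1 = wv.sentence_vectorizer('อาหารเช้า')
>>> v2 = wv.sentence_vectorizer('อาหารเย็น')
>>> score = np.dot(v1, v2.T).item() / (np.linalg.norm(v1) * np.linalg.norm(v2))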
References
[OmerLevy_YoavGoldberg_2014] Omer Levy and Yoav Goldberg (2014). Linguistic Regularities in Sparse and Explicit Word Representations. https://www.aclweb.org/anthology/W14-1618/ This work introduces the multiplicative combination objective used by most_similar_cosmul().