pythainlp.word_vector
The word_vector
contains functions that makes use of a pre-trained vector public data.
Dependencies
Installation of numpy
and gensim
is required.
Modules
- class pythainlp.word_vector.WordVector(model_name: str = 'thai2fit_wv')[source]
Word Vector class
- Parameters
model_name (str) – model name
- Options for model_name
thai2fit_wv (default) - word vector from thai2fit
ltw2v - word vector from LTW2V: The Large Thai Word2Vec
- __init__(model_name: str = 'thai2fit_wv') None [source]
Word Vector class
- Parameters
model_name (str) – model name
- Options for model_name
thai2fit_wv (default) - word vector from thai2fit
ltw2v - word vector from LTW2V: The Large Thai Word2Vec
- load_wordvector(model_name: str)[source]
Load word vector model.
- Parameters
model_name (str) – model name
- get_model() KeyedVectors [source]
Get word vector model.
- Returns
gensim word2vec model
- Return type
gensim.models.keyedvectors.Word2VecKeyedVectors
- doesnt_match(words: List[str]) str [source]
This function returns one word that is mostly unrelated to other words in the list. We use the function
doesnt_match()
fromgensim
.- Parameters
words (list) – a list of words
- Raises
KeyError – if there is any word in positive or negative not in the vocabulary of the model.
- Returns
the word that mostly unrelated
- Return type
strt
- Note
If a word in words is not in the vocabulary,
KeyError
will be raised.
- Example
Pick the word “พริกไทย” (name of food) out of the list of meals (“อาหารเช้า”, “อาหารเที่ยง”, “อาหารเย็น”). >>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> words = [‘อาหารเช้า’, ‘อาหารเที่ยง’, ‘อาหารเย็น’, ‘พริกไทย’] >>> wv.doesnt_match(words) พริกไทย
Pick the word “เรือ” (name of vehicle) out of the list of words related to occupation (“ดีไซน์เนอร์”, “พนักงานเงินเดือน”, “หมอ”).
>>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> words = ['ดีไซน์เนอร์', 'พนักงานเงินเดือน', 'หมอ', 'เรือ'] >>> wv.doesnt_match(words) เรือ
- most_similar_cosmul(positive: List[str], negative: List[str]) List[Tuple[str, float]] [source]
This function find the top-10 words that are most similar with respect to from two lists of words labeled as positive and negative. The top-10 most similar words are obtained using multiplication combination objective from Omer Levy and Yoav Goldberg [OmerLevy_YoavGoldberg_2014].
We use the function
gensim.most_similar_cosmul()
directly fromgensim
.- Parameters
- Raises
KeyError – if there is any word in positive or negative not in the vocabulary of the model.
- Returns
list of top-10 most similar words and its similarity score
- Return type
- Note
With a single word in the positive list, it will find the most similar words to the word given (similar to
gensim.most_similar()
)If a word in positive or negative is not in the vocabulary,
KeyError
will be raised.
- Example
Find the top-10 most similar words to the word: “แม่น้ำ”.
>>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> list_positive = ['แม่น้ำ'] >>> list_negative = [] >>> wv.most_similar_cosmul(list_positive, list_negative) [('ลำน้ำ', 0.8206598162651062), ('ทะเลสาบ', 0.775945782661438), ('ลุ่มน้ำ', 0.7490593194961548), ('คลอง', 0.7471904754638672), ('ปากแม่น้ำ', 0.7354257106781006), ('ฝั่งแม่น้ำ', 0.7120099067687988), ('ทะเล', 0.7030453681945801), ('ริมแม่น้ำ', 0.7015200257301331), ('แหล่งน้ำ', 0.6997432112693787), ('ภูเขา', 0.6960948705673218)]
Find the top-10 most similar words to the words: “นายก”, “รัฐมนตรี”, and “ประเทศ”.
>>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> list_positive = ['นายก', 'รัฐมนตรี', 'ประเทศ'] >>> list_negative = [] >>> wv.most_similar_cosmul(list_positive, list_negative) [('รองนายกรัฐมนตรี', 0.2730445861816406), ('เอกอัครราชทูต', 0.26500266790390015), ('นายกรัฐมนตรี', 0.2649088203907013), ('ผู้ว่าราชการจังหวัด', 0.25119125843048096), ('ผู้ว่าการ', 0.2510434687137604), ('เลขาธิการ', 0.24824175238609314), ('ผู้ว่า', 0.2453523576259613), ('ประธานกรรมการ', 0.24147476255893707), ('รองประธาน', 0.24123257398605347), ('สมาชิกวุฒิสภา', 0.2405330240726471)]
Find the top-10 most similar words when having only positive list and both positive and negative lists.
>>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> list_positive = ['ประเทศ', 'ไทย', 'จีน', 'ญี่ปุ่น'] >>> list_negative = [] >>> wv.most_similar_cosmul(list_positive, list_negative) [('ประเทศจีน', 0.22022421658039093), ('เกาหลี', 0.2196873426437378), ('สหรัฐอเมริกา', 0.21660110354423523), ('ประเทศญี่ปุ่น', 0.21205860376358032), ('ประเทศไทย', 0.21159221231937408), ('เกาหลีใต้', 0.20321202278137207), ('อังกฤษ', 0.19610872864723206), ('ฮ่องกง', 0.1928885132074356), ('ฝรั่งเศส', 0.18383873999118805), ('พม่า', 0.18369348347187042)] >>> >>> list_positive = ['ประเทศ', 'ไทย', 'จีน', 'ญี่ปุ่น'] >>> list_negative = ['อเมริกา'] >>> wv.most_similar_cosmul(list_positive, list_negative) [('ประเทศไทย', 0.3278159201145172), ('เกาหลี', 0.3201899230480194), ('ประเทศจีน', 0.31755179166793823), ('พม่า', 0.30845439434051514), ('ประเทศญี่ปุ่น', 0.306713730096817), ('เกาหลีใต้', 0.3003999888896942), ('ลาว', 0.2995176911354065), ('คนไทย', 0.2885020673274994), ('เวียดนาม', 0.2878379821777344), ('ชาวไทย', 0.28480708599090576)]
The function return
KeyError
when the term “เมนูอาหารไทย” is not in the vocabulary.>>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> list_positive = ['เมนูอาหารไทย'] >>> list_negative = [] >>> wv.most_similar_cosmul(list_positive, list_negative) KeyError: "word 'เมนูอาหารไทย' not in vocabulary"
- similarity(word1: str, word2: str) float [source]
This function computae cosine similarity between two words.
- Parameters
- Raises
KeyError – if either word1 or word2 is not in the vocabulary of the model.
- Returns
the cosine similarity between the two word vectors
- Return type
- Note
If a word in word1 or word2 is not in the vocabulary,
KeyError
will be raised.
- Example
Compute consine similarity between two words: “รถไฟ” and “รถไฟฟ้า” (train and electric train).
>>> from pythainlp.word_vector import WordVector >>> wv = WordVector() >>> wv.similarity('รถไฟ', 'รถไฟฟ้า') 0.43387136
Compute consine similarity between two words: “เสือดาว” and “รถไฟฟ้า” (leopard and electric train).
>>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> wv.similarity('เสือดาว', 'รถไฟฟ้า') 0.04300258
- sentence_vectorizer(text: str, use_mean: bool = True) ndarray [source]
This function convert a Thai sentence into vector. Specifically, it first tokenize that text and map each tokenized words with the word vectors from the model. Then, word vectors are aggregatesd into one vector of 300 dimension by calulating either mean, or summation of all word vectors.
- Parameters
- Returns
300-dimension vector representing the given sentence in form of
numpy
array- Return type
numpy.ndarray((1,300))
- Example
Vectorize the sentence, “อ้วนเสี้ยวเข้ายึดแคว้นกิจิ๋ว ในปี พ.ศ. 735”, into one sentence vector with two aggregation meanthods: mean
and summation.
>>> from pythainlp.word_vector import WordVector >>> >>> wv = WordVector() >>> sentence = 'อ้วนเสี้ยวเข้ายึดแคว้นกิจิ๋ว ในปี พ.ศ. 735' >>> wv.sentence_vectorizer(sentence, use_mean=True) array([[-0.00421414, -0.08881307, 0.05081136, -0.05632929, -0.06607185, 0.03059357, -0.113882 , -0.00074836, 0.05035743, 0.02914307, ... 0.02893357, 0.11327957, 0.04562086, -0.05015393, 0.11641257, 0.32304936, -0.05054322, 0.03639471, -0.06531371, 0.05048079]]) >>> >>> wv.sentence_vectorizer(sentence, use_mean=False) array([[-0.05899798, -1.24338295, 0.711359 , -0.78861002, -0.92500597, 0.42831 , -1.59434797, -0.01047703, 0.705004 , 0.40800299, ... 0.40506999, 1.58591403, 0.63869202, -0.702155 , 1.62977601, 4.52269109, -0.70760502, 0.50952601, -0.914392 , 0.70673105]])
References
- 1
Omer Levy and Yoav Goldberg (2014). Linguistic Regularities in Sparse and Explicit Word Representations.