pythainlp.corpus
The pythainlp.corpus module provides access to the corpora that come with PyThaiNLP.
Modules
- pythainlp.corpus.countries() FrozenSet[str] [source]
Return a frozenset of country names in Thai such as “แคนาดา”, “โรมาเนีย”, “แอลจีเรีย”, and “ลาว”.
(See: dev/pythainlp/corpus/countries_th.txt)
- pythainlp.corpus.get_corpus(filename: str, as_is: bool = False) Union[frozenset, list] [source]
Read corpus data from file and return a frozenset or a list.
Each line in the file will be a member of the set or the list.
By default, a frozenset is returned, with whitespace stripped and empty values and duplicates removed.
If as_is is True, a list is returned, with member values and their order left unmodified.
- Parameters
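filename (str) – filename of the corpus to be read
as_is (bool) – if True, return the lines as a list without modification (default: False)
- Returns
frozenset or list of the lines in the file
- Return type
Union[frozenset, list]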
- Example
from pythainlp.corpus import get_corpus

get_corpus('negations_th.txt')
# output:
# frozenset({'แต่', 'ไม่'})

get_corpus('ttc_freq.txt')
# output:
# frozenset({'โดยนัยนี้\t1',
#            'ตัวบท\t10',
#            'หยิบยื่น\t3',
#            ...})
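The difference between the default mode and as_is can be sketched without touching the file system. The helper lines_to_corpus below is hypothetical and not part of PyThaiNLP; it only mimics the behaviour described for get_corpus on an in-memory string, assuming the file content is split on line breaks.

```python
from typing import FrozenSet, List, Union


def lines_to_corpus(text: str, as_is: bool = False) -> Union[FrozenSet[str], List[str]]:
    """Hypothetical stand-in that mimics the documented get_corpus behaviour."""
    lines = text.splitlines()
    if as_is:
        # Keep every line exactly as read, in file order.
        return lines
    # Strip whitespace and drop empty values; the frozenset removes duplicates.
    return frozenset(line.strip() for line in lines if line.strip())


sample = "ไม่\n  แต่ \n\nไม่\n"
default_result = lines_to_corpus(sample)            # unordered, cleaned, unique
raw_result = lines_to_corpus(sample, as_is=True)    # ordered, untouched
```

The as_is form is the one to choose when line order or duplicate lines carry meaning.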
- pythainlp.corpus.get_corpus_db(url: str) requests.models.Response [source]
Get corpus catalog from server.
- Parameters
url (str) – URL of the corpus catalog
- pythainlp.corpus.get_corpus_db_detail(name: str, version: Optional[str] = None) dict [source]
Get details about a corpus, using information from local catalog.
- pythainlp.corpus.get_corpus_default_db(name: str, version: Optional[str] = None) Optional[str] [source]
Get model path from default_db.json
- Parameters
name (str) – corpus name
- Returns
path to the corpus, or None if the corpus doesn’t exist on the device
- Return type
Optional[str]
To change default_db.json, edit the file at pythainlp/corpus/default_db.json.
- pythainlp.corpus.get_corpus_path(name: str, version: Optional[str] = None) Optional[str] [source]
Get corpus path.
- Parameters
name (str) – corpus name
- Returns
path to the corpus, or None if the corpus doesn’t exist on the device
- Return type
Optional[str]
- Example
(For the filename, see the corpus catalog file.)
If the corpus already exists:
from pythainlp.corpus import get_corpus_path

print(get_corpus_path('ttc'))
# output: /root/pythainlp-data/ttc_freq.txt
If the corpus has not been downloaded yet:
from pythainlp.corpus import download, get_corpus_path

print(get_corpus_path('wiki_lm_lstm'))
# output: None

download('wiki_lm_lstm')
# output:
# Download: wiki_lm_lstm
# wiki_lm_lstm 0.32
# thwiki_lm.pth?dl=1: 1.05GB [00:25, 41.5MB/s]
# /root/pythainlp-data/thwiki_model_lstm.pth

print(get_corpus_path('wiki_lm_lstm'))
# output: /root/pythainlp-data/thwiki_model_lstm.pth
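The check-then-download pattern in the two examples above can be wrapped in one helper. This is a minimal sketch: ensure_corpus and the two fake functions are hypothetical names introduced here, and the fakes stand in for get_corpus_path and download so that the snippet runs without network access.

```python
from typing import Callable, Dict, Optional


def ensure_corpus(
    name: str,
    get_path: Callable[[str], Optional[str]],
    fetch: Callable[[str], bool],
) -> Optional[str]:
    """Return the local path of a corpus, downloading it first if needed."""
    path = get_path(name)
    if path is None and fetch(name):
        # The download reported success, so look the path up again.
        path = get_path(name)
    return path


# Stand-ins for get_corpus_path and download (assumptions, for offline use).
_fake_store: Dict[str, str] = {}


def fake_get_path(name: str) -> Optional[str]:
    return _fake_store.get(name)


def fake_download(name: str) -> bool:
    _fake_store[name] = "/tmp/pythainlp-data/" + name
    return True


resolved = ensure_corpus("wiki_lm_lstm", fake_get_path, fake_download)
```

With PyThaiNLP installed, the equivalent call would presumably be ensure_corpus('wiki_lm_lstm', get_corpus_path, download), relying only on the documented return values (None for a missing corpus, True for a successful download).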
- pythainlp.corpus.download(name: str, force: bool = False, url: Optional[str] = None, version: Optional[str] = None) bool [source]
Download corpus.
The available corpus names can be seen in this file: https://pythainlp.github.io/pythainlp-corpus/db.json
- Parameters
- Returns
True if the corpus is found and successfully downloaded; otherwise, False.
- Return type
bool
- Example
from pythainlp.corpus import download

download('wiki_lm_lstm', force=True)
# output:
# Corpus: wiki_lm_lstm
# - Downloading: wiki_lm_lstm 0.1
# thwiki_lm.pth: 26%|██▌ | 114k/434k [00:00<00:00, 690kB/s]
By default, downloaded corpora and models are saved in $HOME/pythainlp-data/ (e.g. /Users/bact/pythainlp-data/wiki_lm_lstm.pth).
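As a rough illustration of the default location, the path can be built with the standard library. The helpers default_data_dir and corpus_file_path are hypothetical; the sketch assumes $HOME resolves the same way as os.path.expanduser("~") and ignores any override mechanism the library may offer.

```python
import os


def default_data_dir() -> str:
    """Hypothetical helper: the documented default, $HOME/pythainlp-data."""
    return os.path.join(os.path.expanduser("~"), "pythainlp-data")


def corpus_file_path(filename: str) -> str:
    """Join a corpus filename onto the default data directory."""
    return os.path.join(default_data_dir(), filename)


model_path = corpus_file_path("wiki_lm_lstm.pth")
```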
- pythainlp.corpus.remove(name: str) bool [source]
Remove corpus
- Parameters
name (str) – corpus name
- Returns
True if the corpus is found and successfully removed; otherwise, False.
- Return type
bool
- Example
from pythainlp.corpus import remove, get_corpus_path, get_corpus

print(remove('ttc'))
# output: True

print(get_corpus_path('ttc'))
# output: None

get_corpus('ttc')
# output:
# FileNotFoundError: [Errno 2] No such file or directory:
# '/usr/local/lib/python3.6/dist-packages/pythainlp/corpus/ttc'
- pythainlp.corpus.provinces(details: bool = False) Union[FrozenSet[str], List[str]] [source]
Return a frozenset of Thailand province names in Thai such as “กระบี่”, “กรุงเทพมหานคร”, “กาญจนบุรี”, and “อุบลราชธานี”.
(See: dev/pythainlp/corpus/thailand_provinces_th.txt)
- pythainlp.corpus.thai_stopwords() FrozenSet[str] [source]
Return a frozenset of Thai stopwords such as “มี”, “ไป”, “ไง”, “ขณะ”, “การ”, and “ประการหนึ่ง”.
(See: dev/pythainlp/corpus/stopwords_th.txt)
The stopword list is taken from the thesis by เพ็ญศิริ ลี้ตระกูล.
- See Also
เพ็ญศิริ ลี้ตระกูล . การเลือกประโยคสำคัญในการสรุปความภาษาไทยโดยใช้แบบจำลองแบบลำดับชั้น. กรุงเทพมหานคร : มหาวิทยาลัยธรรมศาสตร์; 2551.
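A typical use of a stopword frozenset is filtering a token list. The sketch below uses a tiny hand-made subset of the stopwords listed above instead of calling thai_stopwords(), and the input tokens are a made-up, already tokenized example; drop_stopwords is a hypothetical helper.

```python
from typing import FrozenSet, Iterable, List

# A small hand-made subset of the stopwords quoted in this section.
SAMPLE_STOPWORDS: FrozenSet[str] = frozenset({"มี", "ไป", "ไง", "ขณะ", "การ"})


def drop_stopwords(tokens: Iterable[str], stopwords: FrozenSet[str]) -> List[str]:
    """Keep only the tokens that are not stopwords, preserving their order."""
    return [token for token in tokens if token not in stopwords]


filtered = drop_stopwords(["ฉัน", "มี", "แมว"], SAMPLE_STOPWORDS)
```

Membership tests on a frozenset take constant time on average, which is presumably why the corpus functions return frozensets and not lists.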
- pythainlp.corpus.thai_words() FrozenSet[str] [source]
Return a frozenset of Thai words such as “กติกา”, “กดดัน”, “พิษ”, and “พิษภัย”.
(See: dev/pythainlp/corpus/words_th.txt)
- pythainlp.corpus.thai_syllables() FrozenSet[str] [source]
Return a frozenset of Thai syllables such as “กรอบ”, “ก็”, “๑”, “โมบ”, “โมน”, “โม่ง”, “กา”, “ก่า”, and “ก้า”.
(See: dev/pythainlp/corpus/syllables_th.txt)
The Thai syllable list is taken from KUCut.
- pythainlp.corpus.thai_negations() FrozenSet[str] [source]
Return a frozenset of Thai negation words including “ไม่” and “แต่”.
(See: dev/pythainlp/corpus/negations_th.txt)
- pythainlp.corpus.thai_family_names() FrozenSet[str] [source]
Return a frozenset of Thai family names
(See: dev/pythainlp/corpus/family_names_th.txt)
- pythainlp.corpus.thai_female_names() FrozenSet[str] [source]
Return a frozenset of Thai female names
(See: dev/pythainlp/corpus/person_names_female_th.txt)
- pythainlp.corpus.thai_male_names() FrozenSet[str] [source]
Return a frozenset of Thai male names
(See: dev/pythainlp/corpus/person_names_male_th.txt)
ConceptNet
ConceptNet is an open, multilingual knowledge graph. See: https://github.com/commonsense/conceptnet5/wiki/API
- pythainlp.corpus.conceptnet.edges(word: str, lang: str = 'th')[source]
Get edges from ConceptNet API. ConceptNet is a public semantic network, designed to help computers understand the meanings of words that people use.
For example, the term “ConceptNet” is a “knowledge graph”, and a “knowledge graph” has “common sense knowledge”, which is a part of “artificial intelligence”. Also, “ConceptNet” is used for “natural language understanding”, which is a part of “artificial intelligence”.
“ConceptNet” –is a–> “knowledge graph” –has–> “common sense” –a part of–> “artificial intelligence”
“ConceptNet” –used for–> “natural language understanding” –a part of–> “artificial intelligence”
This illustration shows the relationships (represented as Edge) between the terms (represented as Node).
- Parameters
- Returns
return edges of the given word according to the ConceptNet network.
- Return type
- Example
from pythainlp.corpus.conceptnet import edges

edges('hello', lang='en')
# output:
# [{
#   '@id': '/a/[/r/IsA/,/c/en/hello/,/c/en/greeting/]',
#   '@type': 'Edge',
#   'dataset': '/d/conceptnet/4/en',
#   'end': {'@id': '/c/en/greeting',
#           '@type': 'Node',
#           'label': 'greeting',
#           'language': 'en',
#           'term': '/c/en/greeting'},
#   'license': 'cc:by/4.0',
#   'rel': {'@id': '/r/IsA', '@type': 'Relation', 'label': 'IsA'},
#   'sources': [
#     {
#       '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/bmsacr/]',
#       '@type': 'Source',
#       'activity': '/s/activity/omcs/vote',
#       'contributor': '/s/contributor/omcs/bmsacr'
#     },
#     {
#       '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/test/]',
#       '@type': 'Source',
#       'activity': '/s/activity/omcs/vote',
#       'contributor': '/s/contributor/omcs/test'}
#   ],
#   'start': {'@id': '/c/en/hello',
#             '@type': 'Node',
#             'label': 'Hello',
#             'language': 'en',
#             'term': '/c/en/hello'},
#   'surfaceText': '[[Hello]] is a kind of [[greeting]]',
#   'weight': 3.4641016151377544
# }, ...]

edges('สวัสดี', lang='th')
# output:
# [{
#   '@id': '/a/[/r/RelatedTo/,/c/th/สวัสดี/n/,/c/en/prosperity/]',
#   '@type': 'Edge',
#   'dataset': '/d/wiktionary/en',
#   'end': {'@id': '/c/en/prosperity',
#           '@type': 'Node',
#           'label': 'prosperity',
#           'language': 'en',
#           'term': '/c/en/prosperity'},
#   'license': 'cc:by-sa/4.0',
#   'rel': {
#     '@id': '/r/RelatedTo', '@type': 'Relation',
#     'label': 'RelatedTo'},
#   'sources': [{
#     '@id': '/and/[/s/process/wikiparsec/2/,/s/resource/wiktionary/en/]',
#     '@type': 'Source',
#     'contributor': '/s/resource/wiktionary/en',
#     'process': '/s/process/wikiparsec/2'}],
#   'start': {'@id': '/c/th/สวัสดี/n',
#             '@type': 'Node',
#             'label': 'สวัสดี',
#             'language': 'th',
#             'sense_label': 'n',
#             'term': '/c/th/สวัสดี'},
#   'surfaceText': None,
#   'weight': 1.0
# }, ...]
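Each edge in the output is a nested dict, so the node-relation-node reading described earlier can be recovered by picking three labels. The sketch below works on a trimmed copy of the first 'hello' edge from the sample output; edge_to_triple is a hypothetical helper, and it assumes every edge carries 'start', 'rel', and 'end' entries with a 'label' key, as the sample edges do.

```python
from typing import Any, Dict, Tuple


def edge_to_triple(edge: Dict[str, Any]) -> Tuple[str, str, str]:
    """Reduce one edge dict to (start label, relation label, end label)."""
    return (edge["start"]["label"], edge["rel"]["label"], edge["end"]["label"])


# Trimmed copy of the first edge in the 'hello' sample output.
sample_edge = {
    "@type": "Edge",
    "start": {"@id": "/c/en/hello", "label": "Hello", "language": "en"},
    "rel": {"@id": "/r/IsA", "label": "IsA"},
    "end": {"@id": "/c/en/greeting", "label": "greeting", "language": "en"},
    "weight": 3.4641016151377544,
}

triple = edge_to_triple(sample_edge)
```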
TNC
- pythainlp.corpus.tnc.word_freqs() List[Tuple[str, int]] [source]
Get word frequency from Thai National Corpus (TNC)
(See: dev/pythainlp/corpus/tnc_freq.txt)
Credit: Korakot Chaovavanich https://bit.ly/3wSkZsF
- pythainlp.corpus.tnc.unigram_word_freqs() collections.defaultdict [source]
Get unigram word frequency from Thai National Corpus (TNC)
- pythainlp.corpus.tnc.bigram_word_freqs() collections.defaultdict [source]
Get bigram word frequency from Thai National Corpus (TNC)
- pythainlp.corpus.tnc.trigram_word_freqs() collections.defaultdict [source]
Get trigram word frequency from Thai National Corpus (TNC)
TTC
- pythainlp.corpus.ttc.word_freqs() List[Tuple[str, int]] [source]
Get word frequency from Thai Textbook Corpus (TTC)
- pythainlp.corpus.ttc.unigram_word_freqs() collections.defaultdict [source]
Get unigram word frequency from Thai Textbook Corpus (TTC)
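The two return shapes used by the frequency functions, a list of (word, count) tuples and a defaultdict, can be illustrated with the tab-separated lines shown in the get_corpus('ttc_freq.txt') output. This is a sketch under assumptions: parse_word_freqs and to_unigram_freqs are hypothetical helpers, and the use of int as the default factory is a guess, not something the source states.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple


def parse_word_freqs(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Turn 'word<TAB>count' lines into (word, count) tuples."""
    freqs = []
    for line in lines:
        word, count = line.split("\t")
        freqs.append((word, int(count)))
    return freqs


def to_unigram_freqs(pairs: Iterable[Tuple[str, int]]) -> DefaultDict[str, int]:
    """Build a word -> count mapping that yields 0 for unseen words."""
    table: DefaultDict[str, int] = defaultdict(int)
    for word, count in pairs:
        table[word] += count
    return table


pairs = parse_word_freqs(["โดยนัยนี้\t1", "ตัวบท\t10", "หยิบยื่น\t3"])
unigrams = to_unigram_freqs(pairs)
```

The list form suits sorting and iteration, while the mapping form suits direct lookups of individual words.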
OSCAR
- pythainlp.corpus.oscar.word_freqs() List[Tuple[str, int]] [source]
Get word frequency from OSCAR Corpus (tokenized with the ICU word tokenizer)
- pythainlp.corpus.oscar.unigram_word_freqs() collections.defaultdict [source]
Get unigram word frequency from OSCAR Corpus (tokenized with the ICU word tokenizer)
Util
- pythainlp.corpus.util.find_badwords(tokenize: Callable[[str], List[str]], training_data: Iterable[Iterable[str]]) Set[str] [source]
Find words that do not work well with the tokenize function for the provided training_data.
- pythainlp.corpus.util.revise_wordset(tokenize: Callable[[str], List[str]], orig_words: Iterable[str], training_data: Iterable[Iterable[str]]) Set[str] [source]
Revise a set of words to improve the tokenization performance of a dictionary-based tokenize function.
orig_words will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.
- Parameters
tokenize (Callable[[str], List[str]]) – a tokenize function, can be any function that takes a string as input and returns a List[str]
orig_words (Iterable[str]) – words used by the tokenize function, to be used as a base for revision
training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set
- Returns
the revised word set: orig_words without the words considered to make tokenize perform poorly
- Return type
Set[str]
- Example:
from pythainlp.corpus import thai_words
from pythainlp.corpus.util import revise_wordset
from pythainlp.tokenize.longest import segment
from pythainlp.util import Trie

base_words = thai_words()
more_words = {
    "ถวิล อุดล", "ทองอินทร์ ภูริพัฒน์", "เตียง ศิริขันธ์", "จำลอง ดาวเรือง"
}
base_words = base_words.union(more_words)
dict_trie = Trie(base_words)

tokenize = lambda text: segment(text, dict_trie)

# placeholder: each inner list is one tokenized text
training_data = [
    [str, str, str, ...],
    [str, str, str, str, ...],
    ...
]

revised_words = revise_wordset(tokenize, base_words, training_data)
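The idea of removing words that hurt a dictionary-based tokenizer can be sketched with a toy longest-matching tokenizer over Latin letters. Everything here is an assumption made for illustration: spans, find_bad, and make_longest_match are hypothetical, and the rule "a word is bad if it is produced at a wrong position more often than at a right one" is one plausible criterion, not necessarily the one PyThaiNLP applies.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple


def spans(words: Iterable[str]) -> List[Tuple[int, int]]:
    """Character (start, end) offsets of each word in the joined text."""
    result, start = [], 0
    for word in words:
        result.append((start, start + len(word)))
        start += len(word)
    return result


def find_bad(tokenize: Callable[[str], List[str]],
             training_data: Iterable[Iterable[str]]) -> Set[str]:
    """Words the tokenizer emits at wrong boundaries more often than right ones."""
    right, wrong = Counter(), Counter()
    for gold in training_data:
        gold = list(gold)
        gold_spans = set(spans(gold))
        predicted = tokenize("".join(gold))
        for word, span in zip(predicted, spans(predicted)):
            (right if span in gold_spans else wrong)[word] += 1
    return {word for word, count in wrong.items() if count > right[word]}


def make_longest_match(words: Set[str]) -> Callable[[str], List[str]]:
    """Toy longest-matching tokenizer; unknown characters become single tokens."""
    def tokenize(text: str) -> List[str]:
        tokens, i = [], 0
        while i < len(text):
            candidates = [w for w in words if text.startswith(w, i)]
            token = max(candidates, key=len) if candidates else text[i]
            tokens.append(token)
            i += len(token)
        return tokens
    return tokenize


orig_words = {"ab", "abc", "cd"}
training_data = [["ab", "cd"], ["ab", "cd"]]
bad = find_bad(make_longest_match(orig_words), training_data)
revised = orig_words - bad                    # base set minus the bad words
retokenized = make_longest_match(revised)("abcd")
```

In this toy case the over-long entry "abc" makes the tokenizer split "abcd" as "abc" + "d"; once it is removed, the revised dictionary reproduces the training segmentation.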
- pythainlp.corpus.util.revise_newmm_default_wordset(training_data: Iterable[Iterable[str]]) Set[str] [source]
Revise a set of words to improve the tokenization performance of pythainlp.tokenize.newmm, a dictionary-based tokenizer and the default tokenizer for PyThaiNLP.
Words from pythainlp.corpus.thai_words() will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.
WordNet
The PyThaiNLP WordNet API is an exact copy of the NLTK WordNet API. See: https://www.nltk.org/howto/wordnet.html
- pythainlp.corpus.wordnet.synsets(word: str, pos: Optional[str] = None, lang: str = 'tha')[source]
This function returns the synonym sets for all lemmas of the given word, with an optional argument to constrain the part of speech of the word.
- Parameters
- Returns
Synset for all lemmas of the word, constrained by the argument pos.
- Return type
list[Synset]
- Example
>>> from pythainlp.corpus.wordnet import synsets
>>>
>>> synsets("ทำงาน")
[Synset('function.v.01'), Synset('work.v.02'),
 Synset('work.v.01'), Synset('work.v.08')]
>>>
>>> synsets("บ้าน", lang="tha")
[Synset('duplex_house.n.01'), Synset('dwelling.n.01'),
 Synset('house.n.01'), Synset('family.n.01'), Synset('home.n.03'),
 Synset('base.n.14'), Synset('home.n.01'),
 Synset('houseful.n.01'), Synset('home.n.07')]
The part of speech can also be constrained. For example, the word “แรง” could be interpreted as force (n.) or hard (adj.).
>>> from pythainlp.corpus.wordnet import synsets
>>> # By default, accept all parts of speech
>>> synsets("แรง", lang="tha")
>>>
>>> # only Noun
>>> synsets("แรง", pos="n", lang="tha")
[Synset('force.n.03'), Synset('force.n.02')]
>>>
>>> # only Adjective
>>> synsets("แรง", pos="a", lang="tha")
[Synset('hard.s.10'), Synset('strong.s.02')]
- pythainlp.corpus.wordnet.synset(name_synsets)[source]
This function returns the synonym set (synset) given the name of the synset (e.g. ‘dog.n.01’, ‘chase.v.01’).
- Parameters
name_synsets (str) – name of the synset
- Returns
Synset of the given name
- Return type
Synset
- Example
>>> from pythainlp.corpus.wordnet import synset
>>>
>>> difficult = synset('difficult.a.01')
>>> difficult
Synset('difficult.a.01')
>>>
>>> difficult.definition()
'not easy; requiring great physical or mental effort to accomplish or comprehend or endure'
- pythainlp.corpus.wordnet.all_lemma_names(pos: Optional[str] = None, lang: str = 'tha')[source]
This function returns all lemma names for all synsets for the given part of speech tag and language. If the part of speech tag is not specified, synsets for all parts of speech will be used.
- Parameters
- Returns
Synset of lemma names given the pos and language
- Return type
list[Synset]
- Example
>>> from pythainlp.corpus.wordnet import all_lemma_names
>>>
>>> all_lemma_names()
['อเมริโก_เวสปุชชี', 'เมืองชีย์เอนเน', 'การรับเลี้ยงบุตรบุญธรรม',
 'ผู้กัด', 'ตกแต่งเรือด้วยธง', 'จิโอวานนิ_เวอร์จินิโอ',...]
>>>
>>> len(all_lemma_names())
80508
>>>
>>> all_lemma_names(pos="a")
['ซึ่งไม่มีแอลกอฮอล์', 'ซึ่งตรงไปตรงมา', 'ที่เส้นศูนย์สูตร', 'ทางจิตใจ',...]
>>>
>>> len(all_lemma_names(pos="a"))
5277
- pythainlp.corpus.wordnet.all_synsets(pos: Optional[str] = None)[source]
This function iterates over all synsets constrained by given part of speech tag.
- Parameters
pos (str) – part of speech tag
- Returns
list of synsets constrained by given part of speech tag.
- Return type
Iterable[Synset]
- Example
>>> from pythainlp.corpus.wordnet import all_synsets
>>>
>>> generator = all_synsets(pos="n")
>>> next(generator)
Synset('entity.n.01')
>>> next(generator)
Synset('physical_entity.n.01')
>>> next(generator)
Synset('abstraction.n.06')
>>>
>>> generator = all_synsets()
>>> next(generator)
Synset('able.a.01')
>>> next(generator)
Synset('unable.a.01')
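Because the function returns an iterable, the first few items can be taken without materializing the whole sequence. The sketch below uses itertools.islice on a stand-in generator; fake_all_synsets is hypothetical and yields plain strings (the names from the pos="n" example) in place of Synset objects.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def take(iterable: Iterable[str], n: int) -> List[str]:
    """Return the first n items of an iterable as a list."""
    return list(islice(iterable, n))


def fake_all_synsets() -> Iterator[str]:
    """Stand-in generator yielding the names seen in the pos="n" example."""
    yield from ("entity.n.01", "physical_entity.n.01", "abstraction.n.06")


first_two = take(fake_all_synsets(), 2)
```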
- pythainlp.corpus.wordnet.langs()[source]
This function returns a set of ISO-639 language codes.
- Returns
ISO-639 language codes
- Return type
- Example
>>> from pythainlp.corpus.wordnet import langs
>>> langs()
['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan',
 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb',
 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob',
 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha',
 'zsm']
- pythainlp.corpus.wordnet.lemmas(word: str, pos: Optional[str] = None, lang: str = 'tha')[source]
This function returns all lemmas of the given word, with an optional argument to constrain the part of speech of the word.
- Parameters
- Returns
Lemma objects for the word, constrained by the argument pos.
- Return type
list[Lemma]
- Example
>>> from pythainlp.corpus.wordnet import lemmas
>>>
>>> lemmas("โปรด")
[Lemma('like.v.03.โปรด'), Lemma('like.v.02.โปรด')]
>>>
>>> print(lemmas("พระเจ้า"))
[Lemma('god.n.01.พระเจ้า'), Lemma('godhead.n.01.พระเจ้า'),
 Lemma('father.n.06.พระเจ้า'), Lemma('god.n.03.พระเจ้า')]
When the part of speech tag is specified:
>>> from pythainlp.corpus.wordnet import lemmas
>>>
>>> lemmas("ม้วน")
[Lemma('roll.v.18.ม้วน'), Lemma('roll.v.17.ม้วน'),
 Lemma('roll.v.08.ม้วน'), Lemma('curl.v.01.ม้วน'),
 Lemma('roll_up.v.01.ม้วน'), Lemma('wind.v.03.ม้วน'),
 Lemma('roll.n.11.ม้วน')]
>>>
>>> # only lemmas with Noun as the part of speech
>>> lemmas("ม้วน", pos="n")
[Lemma('roll.n.11.ม้วน')]
- pythainlp.corpus.wordnet.lemma(name_synsets)[source]
This function returns a lemma object given its name.
Note
Supports only the English language (eng).
- Parameters
name_synsets (str) – name of the synset
- Returns
lemma object with the given name
- Return type
Lemma
- Example
>>> from pythainlp.corpus.wordnet import lemma
>>>
>>> lemma('practice.v.01.exercise')
Lemma('practice.v.01.exercise')
>>>
>>> lemma('drill.v.03.exercise')
Lemma('drill.v.03.exercise')
>>>
>>> lemma('exercise.n.01.exercise')
Lemma('exercise.n.01.exercise')
- pythainlp.corpus.wordnet.lemma_from_key(key)[source]
This function returns a lemma object given the lemma key. It is similar to lemma(), but takes the key of the lemma instead of its name.
Note
Supports only the English language (eng).
- Parameters
key (str) – key of the lemma object
- Returns
lemma object with the given key
- Return type
Lemma
- Example
>>> from pythainlp.corpus.wordnet import lemma, lemma_from_key
>>>
>>> practice = lemma('practice.v.01.exercise')
>>> practice.key()
exercise%2:41:00::
>>> lemma_from_key(practice.key())
Lemma('practice.v.01.exercise')
- pythainlp.corpus.wordnet.path_similarity(synsets1, synsets2)[source]
This function returns the similarity between two synsets based on the shortest path distance, using the following equation.
\[path\_similarity = \frac{1}{shortest\_path\_distance(synsets1, synsets2) + 1}\]
The shortest path distance is calculated over the connections in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1. A path similarity of 1 indicates identity.
- Parameters
synsets1 (Synset) – first synset supplied to measure the path similarity
synsets2 (Synset) – second synset supplied to measure the path similarity
- Returns
path similarity between two synsets
- Return type
- Example
>>> from pythainlp.corpus.wordnet import path_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> path_similarity(entity, obj)
0.3333333333333333
>>> path_similarity(entity, cat)
0.07142857142857142
>>> path_similarity(obj, cat)
0.08333333333333333
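The equation for path similarity can be checked by hand against the sample scores. The helper below is hypothetical and takes the shortest path distance directly, with no WordNet lookup. The distances 2, 13, and 11 are inferred by inverting the formula on the scores in the example, not read from the taxonomy.

```python
def path_similarity_from_distance(shortest_path_distance: int) -> float:
    """Apply the documented formula: 1 / (distance + 1)."""
    return 1 / (shortest_path_distance + 1)


# Distance 0 means the two synsets are the same node.
scores = {d: path_similarity_from_distance(d) for d in (0, 2, 11, 13)}
```

Under this formula a distance of 2 gives 1/3 (entity vs. object), 13 gives 1/14 (entity vs. cat), and 11 gives 1/12 (object vs. cat), which agrees with the three outputs in the example.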
- pythainlp.corpus.wordnet.lch_similarity(synsets1, synsets2)[source]
This function returns the Leacock-Chodorow (LCH) similarity between two synsets, based on the shortest path distance and the maximum depth of the taxonomy. The equation to calculate LCH similarity is shown below:
\[lch\_similarity = -\log\left(\frac{shortest\_path\_distance(synsets1, synsets2)}{2 \times taxonomy\_depth}\right)\]
- Parameters
synsets1 (Synset) – first synset supplied to measure the LCH similarity
synsets2 (Synset) – second synset supplied to measure the LCH similarity
- Returns
LCH similarity between two synsets
- Return type
- Example
>>> from pythainlp.corpus.wordnet import lch_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> lch_similarity(entity, obj)
2.538973871058276
>>> lch_similarity(entity, cat)
0.9985288301111273
>>> lch_similarity(obj, cat)
1.1526795099383855
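The LCH equation can be tried on plain numbers as well. The sketch below is hypothetical and rests on two assumptions that are inferred from the sample output and not stated in this document: the path length is counted as the shortest path distance plus one, and the noun taxonomy depth is 19. With those values the three scores in the example are reproduced.

```python
import math

ASSUMED_NOUN_DEPTH = 19  # assumption inferred from the sample scores


def lch_from_path(path_length: int, taxonomy_depth: int) -> float:
    """-log(path_length / (2 * taxonomy_depth)), using the natural logarithm."""
    return -math.log(path_length / (2 * taxonomy_depth))


# Path length taken as distance + 1 (assumption), with distances 2, 13 and 11.
entity_obj = lch_from_path(2 + 1, ASSUMED_NOUN_DEPTH)
entity_cat = lch_from_path(13 + 1, ASSUMED_NOUN_DEPTH)
obj_cat = lch_from_path(11 + 1, ASSUMED_NOUN_DEPTH)
```

Unlike path similarity, the LCH score is not bounded by 1; it grows as the path gets shorter relative to the depth of the taxonomy.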
- pythainlp.corpus.wordnet.wup_similarity(synsets1, synsets2)[source]
This function returns Wu-Palmer similarity (WUP) between two synsets, based on the depth of the two senses in the taxonomy and their Least Common Subsumer (most specific ancestor node).
- Parameters
synsets1 (Synset) – first synset supplied to measure the WUP similarity
synsets2 (Synset) – second synset supplied to measure the WUP similarity
- Returns
WUP similarity between two synsets
- Return type
- Example
>>> from pythainlp.corpus.wordnet import wup_similarity, synset
>>>
>>> entity = synset('entity.n.01')
>>> obj = synset('object.n.01')
>>> cat = synset('cat.n.01')
>>>
>>> wup_similarity(entity, obj)
0.5
>>> wup_similarity(entity, cat)
0.13333333333333333
>>> wup_similarity(obj, cat)
0.35294117647058826
- pythainlp.corpus.wordnet.morphy(form, pos: Optional[str] = None)[source]
This function finds a possible base form for the given form, with the given part of speech.
- Parameters
- Returns
base form of the given form
- Return type
- Example
>>> from pythainlp.corpus.wordnet import morphy
>>>
>>> morphy("dogs")
'dog'
>>>
>>> morphy("thieves")
'thief'
>>>
>>> morphy("mixed")
'mix'
>>>
>>> morphy("calculated")
'calculate'
- pythainlp.corpus.wordnet.custom_lemmas(tab_file, lang: str)[source]
This function reads a custom tab file (see: http://compling.hss.ntu.edu.sg/omw/) containing mappings of lemmas in the given language.
- Parameters
tab_file – Tab file as a file or file-like object
lang (str) – abbreviation of the language (e.g. eng, tha).
Definition
- Synset
a set of synonyms that share a common meaning.