pythainlp.corpus

The pythainlp.corpus module provides access to various Thai language corpora and resources that come bundled with PyThaiNLP. These resources are essential for natural language processing tasks in the Thai language.

Modules

countries

pythainlp.corpus.countries() → FrozenSet[str]

Return a frozenset of country names in Thai such as “แคนาดา”, “โรมาเนีย”, “แอลจีเรีย”, and “ลาว”.

(See: dev/pythainlp/corpus/countries_th.txt)

Returns:

frozenset containing country names in Thai

Return type:

frozenset

find_synonym

get_corpus

pythainlp.corpus.get_corpus(filename: str, comments: bool = True) → frozenset

Read corpus data from file and return a frozenset.

Each line in the file will be a member of the set.

Whitespace is stripped, and empty values and duplicates are removed.

If comments is False, any text after a ‘#’ character in each line will be discarded.

Parameters:
  • filename (str) – filename of the corpus to be read

  • comments (bool) – keep comments

Returns:

frozenset consisting of lines in the file

Return type:

frozenset

Example:

from pythainlp.corpus import get_corpus

# input file (negations_th.txt):
# แต่
# ไม่

get_corpus("negations_th.txt")
# output:
# frozenset({'แต่', 'ไม่'})

# input file (ttc_freq.txt):
# ตัวบท<tab>10
# โดยนัยนี้<tab>1

get_corpus("ttc_freq.txt")
# output:
# frozenset({'โดยนัยนี้\t1',
#    'ตัวบท\t10',
#     ...})

# input file (icubrk_th.txt):
# # Thai Dictionary for ICU BreakIterator
# กก
# กกขนาก

get_corpus("icubrk_th.txt")
# output:
# frozenset({'กกขนาก',
#     '# Thai Dictionary for ICU BreakIterator',
#     'กก',
#     ...})

get_corpus("icubrk_th.txt", comments=False)
# output:
# frozenset({'กกขนาก',
#     'กก',
#     ...})
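The stripping and comment-handling behavior above can be sketched in plain Python. This is an illustrative reimplementation, not PyThaiNLP's actual code, and read_corpus_lines is a hypothetical name:

```python
def read_corpus_lines(lines, comments=True):
    """Mimic the documented get_corpus() behavior on an iterable of lines."""
    members = set()
    for line in lines:
        if not comments:
            # comments=False: discard everything after a '#' character
            line = line.split("#", 1)[0]
        line = line.strip()  # strip whitespace
        if line:             # drop empty values; the set drops duplicates
            members.add(line)
    return frozenset(members)

read_corpus_lines(["แต่", "ไม่", ""])
# frozenset({'แต่', 'ไม่'})

read_corpus_lines(["# Thai Dictionary for ICU BreakIterator", "กก"], comments=False)
# frozenset({'กก'})
```

Note that with comments=True (the default), a comment line survives whole, which matches the icubrk_th.txt output above.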

get_corpus_as_is

pythainlp.corpus.get_corpus_as_is(filename: str) → list

Read corpus data from a file, as-is, and return a list.

Each line in the file will be a member of the list.

Member values and their order are not modified.

If stripping or comment removal is needed, use get_corpus() instead.

Parameters:

filename (str) – filename of the corpus to be read

Returns:

list consisting of lines in the file

Return type:

list

Example:

from pythainlp.corpus import get_corpus_as_is

# input file (negations_th.txt):
# แต่
# ไม่

get_corpus_as_is("negations_th.txt")
# output:
# ['แต่', 'ไม่']

get_corpus_db

pythainlp.corpus.get_corpus_db(url: str)

Get corpus catalog from server.

Parameters:

url (str) – URL of the corpus catalog

get_corpus_db_detail

pythainlp.corpus.get_corpus_db_detail(name: str, version: str = '') → dict

Get details about a corpus, using information from local catalog.

Parameters:
  • name (str) – name of the corpus

  • version (str) – version of the corpus

Returns:

details about corpus

Return type:

dict

get_corpus_default_db

pythainlp.corpus.get_corpus_default_db(name: str, version: str = '') → str | None

Get model path from default_db.json

Parameters:

name (str) – corpus name

Returns:

path to the corpus or None if the corpus doesn’t exist on the device

Return type:

str | None

To edit the default corpus catalog, modify pythainlp/corpus/default_db.json.

get_corpus_path

pythainlp.corpus.get_corpus_path(name: str, version: str = '', force: bool = False) → str | None

Get corpus path.

Parameters:
  • name (str) – corpus name

  • version (str) – version

  • force (bool) – force downloading

Returns:

path to the corpus or None if the corpus doesn’t exist on the device

Return type:

str | None

Example:


If the corpus already exists:

from pythainlp.corpus import get_corpus_path

print(get_corpus_path('ttc'))
# output: /root/pythainlp-data/ttc_freq.txt

If the corpus has not been downloaded yet:

from pythainlp.corpus import download, get_corpus_path

print(get_corpus_path('wiki_lm_lstm'))
# output: None

download('wiki_lm_lstm')
# output:
# Download: wiki_lm_lstm
# wiki_lm_lstm 0.32
# thwiki_lm.pth?dl=1: 1.05GB [00:25, 41.5MB/s]
# /root/pythainlp-data/thwiki_model_lstm.pth

print(get_corpus_path('wiki_lm_lstm'))
# output: /root/pythainlp-data/thwiki_model_lstm.pth

download

pythainlp.corpus.download(name: str, force: bool = False, url: str = '', version: str = '') → bool

Download corpus.

The available corpus names can be seen in this file: https://pythainlp.org/pythainlp-corpus/db.json

Parameters:
  • name (str) – corpus name

  • force (bool) – force downloading

  • url (str) – URL of the corpus catalog

  • version (str) – version of the corpus

Returns:

True if the corpus is found and successfully downloaded. Otherwise, it returns False.

Return type:

bool

Example:

from pythainlp.corpus import download

download("wiki_lm_lstm", force=True)
# output:
# Corpus: wiki_lm_lstm
# - Downloading: wiki_lm_lstm 0.1
# thwiki_lm.pth:  26%|██▌       | 114k/434k [00:00<00:00, 690kB/s]

By default, downloaded corpora and models will be saved in $HOME/pythainlp-data/ (e.g. /Users/bact/pythainlp-data/wiki_lm_lstm.pth).

remove

pythainlp.corpus.remove(name: str) → bool

Remove a corpus.

Parameters:

name (str) – corpus name

Returns:

True if the corpus is found and successfully removed. Otherwise, it returns False.

Return type:

bool

Example:

from pythainlp.corpus import remove, get_corpus_path, get_corpus

print(remove("ttc"))
# output: True

print(get_corpus_path("ttc"))
# output: None

get_corpus("ttc")
# output:
# FileNotFoundError: [Errno 2] No such file or directory:
# '/usr/local/lib/python3.6/dist-packages/pythainlp/corpus/ttc'

provinces

pythainlp.corpus.provinces(details: bool = False) → FrozenSet[str] | List[dict]

Return a frozenset of Thailand province names in Thai such as “กระบี่”, “กรุงเทพมหานคร”, “กาญจนบุรี”, and “อุบลราชธานี”.

(See: dev/pythainlp/corpus/thailand_provinces_th.txt)

Parameters:

details (bool) – return details of provinces or not

Returns:

frozenset containing province names of Thailand (if details is False), or list of dicts with province names and details such as [{'name_th': 'นนทบุรี', 'abbr_th': 'นบ', 'name_en': 'Nonthaburi', 'abbr_en': 'NBI'}] (if details is True)

Return type:

frozenset or list
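The details=True record format lends itself to building lookup tables. A minimal sketch, assuming a single record in the documented shape (a real program would get records from provinces(details=True)):

```python
# One record shaped like the documented provinces(details=True) output
records = [
    {"name_th": "นนทบุรี", "abbr_th": "นบ", "name_en": "Nonthaburi", "abbr_en": "NBI"},
]

# Map Thai province names to their English abbreviations
abbr_en_by_name_th = {rec["name_th"]: rec["abbr_en"] for rec in records}

abbr_en_by_name_th["นนทบุรี"]
# 'NBI'
```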

thai_dict

pythainlp.corpus.thai_dict() → dict

Return a Thai dictionary with definitions from Wiktionary.

(See: thai_dict)

Returns:

Thai words with part-of-speech type and definition

Return type:

dict

thai_stopwords

pythainlp.corpus.thai_stopwords() → FrozenSet[str]

Return a frozenset of Thai stopwords such as “มี”, “ไป”, “ไง”, “ขณะ”, “การ”, and “ประการหนึ่ง”.

(See: dev/pythainlp/corpus/stopwords_th.txt)

We use the stopword list from the thesis of เพ็ญศิริ ลี้ตระกูล.

See Also:

เพ็ญศิริ ลี้ตระกูล. Selection of Important Sentences in Thai Text Summarization Using a Hierarchical Model. Bangkok: Thammasat University; 2008 (B.E. 2551).

Returns:

frozenset containing stopwords

Return type:

frozenset
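A common use of this set is stopword filtering after tokenization. A minimal sketch with a stand-in set (in practice, thai_stopwords() would supply the real one):

```python
stopwords = {"มี", "ไป", "การ"}  # stand-in; use thai_stopwords() in practice

tokens = ["ฉัน", "มี", "หนังสือ", "ไป", "โรงเรียน"]
content_tokens = [t for t in tokens if t not in stopwords]
# ['ฉัน', 'หนังสือ', 'โรงเรียน']
```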

thai_words

pythainlp.corpus.thai_words() → FrozenSet[str]

Return a frozenset of Thai words such as “กติกา”, “กดดัน”, “พิษ”, and “พิษภัย”.

(See: dev/pythainlp/corpus/words_th.txt)

Returns:

frozenset containing words in the Thai language

Return type:

frozenset

thai_wsd_dict

pythainlp.corpus.thai_wsd_dict() → dict

Return a Thai word sense disambiguation (WSD) dictionary with definitions from Wiktionary.

(See: thai_dict)

Returns:

Thai words with part-of-speech type and definition

Return type:

dict

thai_orst_words

pythainlp.corpus.thai_orst_words() → FrozenSet[str]

Return a frozenset of Thai words from the Royal Society of Thailand.

(See: dev/pythainlp/corpus/thai_orst_words.txt)

Returns:

frozenset containing words in the Thai language

Return type:

frozenset

thai_synonyms

pythainlp.corpus.thai_synonyms() → dict

Return Thai synonyms.

(See: thai_synonym)

Returns:

Thai words with part-of-speech type and synonyms

Return type:

dict

thai_syllables

pythainlp.corpus.thai_syllables() → FrozenSet[str]

Return a frozenset of Thai syllables such as “กรอบ”, “ก็”, “๑”, “โมบ”, “โมน”, “โม่ง”, “กา”, “ก่า”, and “ก้า”.

(See: dev/pythainlp/corpus/syllables_th.txt)

We use the Thai syllable list from KUCut.

Returns:

frozenset containing syllables in the Thai language

Return type:

frozenset

thai_negations

pythainlp.corpus.thai_negations() → FrozenSet[str]

Return a frozenset of Thai negation words including “ไม่” and “แต่”.

(See: dev/pythainlp/corpus/negations_th.txt)

Returns:

frozenset containing negations in the Thai language

Return type:

frozenset
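Negation lists like this are often used as simple cue-word checks, for instance in sentiment pipelines. A minimal sketch with a stand-in set (in practice, thai_negations() would supply the real one):

```python
negations = {"ไม่", "แต่"}  # stand-in; use thai_negations() in practice

def has_negation(tokens):
    """True if any token is a negation cue word."""
    return any(t in negations for t in tokens)

has_negation(["ฉัน", "ไม่", "ชอบ"])  # True
has_negation(["ฉัน", "ชอบ"])         # False
```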

thai_family_names

pythainlp.corpus.thai_family_names() → FrozenSet[str]

Return a frozenset of Thai family names.

(See: dev/pythainlp/corpus/family_names_th.txt)

Returns:

frozenset containing Thai family names

Return type:

frozenset

thai_female_names

pythainlp.corpus.thai_female_names() → FrozenSet[str]

Return a frozenset of Thai female names.

(See: dev/pythainlp/corpus/person_names_female_th.txt)

Returns:

frozenset containing Thai female names

Return type:

frozenset

thai_male_names

pythainlp.corpus.thai_male_names() → FrozenSet[str]

Return a frozenset of Thai male names.

(See: dev/pythainlp/corpus/person_names_male_th.txt)

Returns:

frozenset containing Thai male names

Return type:

frozenset

pythainlp.corpus.th_en_translit.get_transliteration_dict

pythainlp.corpus.th_en_translit.get_transliteration_dict() → defaultdict

Get Thai to English transliteration dictionary.

The returned dict is in dict[str, dict[List[str], List[Optional[bool]]]] format.

ConceptNet

ConceptNet is an open, multilingual knowledge graph used for various natural language understanding tasks. For more information, refer to the ConceptNet documentation.

pythainlp.corpus.conceptnet.edges

pythainlp.corpus.conceptnet.edges(word: str, lang: str = 'th')

Get edges from the ConceptNet API. ConceptNet is a public semantic network designed to help computers understand the meanings of the words people use.

For example, the term “ConceptNet” is a “knowledge graph”, and a “knowledge graph” has “common sense knowledge”, which is a part of “artificial intelligence”. Also, “ConceptNet” is used for “natural language understanding”, which is a part of “artificial intelligence”.

“ConceptNet” --is a--> “knowledge graph” --has--> “common sense” --a part of--> “artificial intelligence”
“ConceptNet” --used for--> “natural language understanding” --a part of--> “artificial intelligence”

This illustration shows the relationships (represented as Edges) between terms (represented as Nodes).

This function requires an internet connection to access the ConceptNet API. Please use it considerately. It will timeout after 10 seconds.

Parameters:
  • word (str) – word to be sent to ConceptNet API

  • lang (str) – abbreviation of language (e.g. th for Thai, en for English, or ja for Japanese). By default, it is th (Thai).

Returns:

return edges of the given word according to the ConceptNet network.

Return type:

list[dict]

Example:

from pythainlp.corpus.conceptnet import edges

edges('hello', lang='en')
# output:
# [{
#   '@id': '/a/[/r/IsA/,/c/en/hello/,/c/en/greeting/]',
#   '@type': 'Edge',
#   'dataset': '/d/conceptnet/4/en',
#   'end': {'@id': '/c/en/greeting',
#   '@type': 'Node',
#   'label': 'greeting',
#   'language': 'en',
#   'term': '/c/en/greeting'},
#   'license': 'cc:by/4.0',
#   'rel': {'@id': '/r/IsA', '@type': 'Relation', 'label': 'IsA'},
#   'sources': [
#   {
#   '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/bmsacr/]',
#   '@type': 'Source',
#   'activity': '/s/activity/omcs/vote',
#   'contributor': '/s/contributor/omcs/bmsacr'
#   },
#   {
#     '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/test/]',
#     '@type': 'Source',
#     'activity': '/s/activity/omcs/vote',
#     'contributor': '/s/contributor/omcs/test'}
#   ],
#   'start': {'@id': '/c/en/hello',
#   '@type': 'Node',
#   'label': 'Hello',
#   'language': 'en',
#   'term': '/c/en/hello'},
#   'surfaceText': '[[Hello]] is a kind of [[greeting]]',
#   'weight': 3.4641016151377544
# }, ...]

edges('สวัสดี', lang='th')
# output:
# [{
#  '@id': '/a/[/r/RelatedTo/,/c/th/สวัสดี/n/,/c/en/prosperity/]',
#  '@type': 'Edge',
#  'dataset': '/d/wiktionary/en',
#  'end': {'@id': '/c/en/prosperity',
#  '@type': 'Node',
#  'label': 'prosperity',
#  'language': 'en',
#  'term': '/c/en/prosperity'},
#  'license': 'cc:by-sa/4.0',
#  'rel': {
#      '@id': '/r/RelatedTo', '@type': 'Relation',
#      'label': 'RelatedTo'},
#  'sources': [{
#  '@id': '/and/[/s/process/wikiparsec/2/,/s/resource/wiktionary/en/]',
#  '@type': 'Source',
#  'contributor': '/s/resource/wiktionary/en',
#  'process': '/s/process/wikiparsec/2'}],
#  'start': {'@id': '/c/th/สวัสดี/n',
#  '@type': 'Node',
#  'label': 'สวัสดี',
#  'language': 'th',
#  'sense_label': 'n',
#  'term': '/c/th/สวัสดี'},
#  'surfaceText': None,
#  'weight': 1.0
# }, ...]

TNC (Thai National Corpus)

The Thai National Corpus (TNC) is a collection of text data in the Thai language. This module provides access to word frequency data from the TNC corpus.

pythainlp.corpus.tnc.word_freqs

pythainlp.corpus.tnc.word_freqs() → List[Tuple[str, int]]

Get word frequency from Thai National Corpus (TNC)

(See: dev/pythainlp/corpus/tnc_freq.txt)

pythainlp.corpus.tnc.unigram_word_freqs

pythainlp.corpus.tnc.unigram_word_freqs() → dict[str, int]

Get unigram word frequency from Thai National Corpus (TNC)

pythainlp.corpus.tnc.bigram_word_freqs

pythainlp.corpus.tnc.bigram_word_freqs() → dict[Tuple[str, str], int]

Get bigram word frequency from Thai National Corpus (TNC)

pythainlp.corpus.tnc.trigram_word_freqs

pythainlp.corpus.tnc.trigram_word_freqs() → dict[Tuple[str, str, str], int]

Get trigram word frequency from Thai National Corpus (TNC)
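Frequency tables like these can be derived from any tokenized text with collections.Counter. An illustrative sketch of the n-gram counting idea (not how PyThaiNLP stores the TNC data):

```python
from collections import Counter

def ngram_freqs(sentences, n):
    """Count n-grams over an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        # zip shifted copies of the token list to form n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return dict(counts)

sents = [["กิน", "ข้าว", "กิน"], ["กิน", "ข้าว"]]
ngram_freqs(sents, 1)  # {('กิน',): 3, ('ข้าว',): 2}
ngram_freqs(sents, 2)  # {('กิน', 'ข้าว'): 2, ('ข้าว', 'กิน'): 1}
```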

TTC (Thai Textbook Corpus)

The Thai Textbook Corpus (TTC) is a collection of Thai language text data, primarily sourced from textbooks.

pythainlp.corpus.ttc.word_freqs

pythainlp.corpus.ttc.word_freqs() → List[Tuple[str, int]]

Get word frequency from Thai Textbook Corpus (TTC)

(See: dev/pythainlp/corpus/ttc_freq.txt)

pythainlp.corpus.ttc.unigram_word_freqs

pythainlp.corpus.ttc.unigram_word_freqs() → dict[str, int]

Get unigram word frequency from Thai Textbook Corpus (TTC)

OSCAR

OSCAR is a multilingual corpus that includes Thai text data. This module provides access to word frequency data from the OSCAR corpus.

pythainlp.corpus.oscar.word_freqs

pythainlp.corpus.oscar.word_freqs() → List[Tuple[str, int]]

Get word frequency from OSCAR Corpus (words tokenized using ICU)

pythainlp.corpus.oscar.unigram_word_freqs

pythainlp.corpus.oscar.unigram_word_freqs() → dict[str, int]

Get unigram word frequency from OSCAR Corpus (words tokenized using ICU)

Util

Utilities for working with the corpus data.

pythainlp.corpus.util.find_badwords

pythainlp.corpus.util.find_badwords(tokenize: Callable[[str], List[str]], training_data: Iterable[Iterable[str]]) → Set[str]

Find words that do not work well with the tokenize function for the provided training_data.

Parameters:
  • tokenize (Callable[[str], List[str]]) – a tokenize function

  • training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set

Returns:

words that are considered to make tokenize perform badly

Return type:

Set[str]
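The idea can be illustrated with a toy version: re-tokenize each gold-segmented sentence and collect the gold words the tokenizer fails to reproduce. This is only a conceptual sketch, not the library's actual algorithm, and find_suspect_words is a hypothetical name:

```python
from typing import Callable, Iterable, List, Set

def find_suspect_words(
    tokenize: Callable[[str], List[str]],
    training_data: Iterable[Iterable[str]],
) -> Set[str]:
    """Collect gold tokens that the tokenizer does not reproduce."""
    bad: Set[str] = set()
    for gold in training_data:
        gold = list(gold)
        pred = tokenize("".join(gold))  # re-tokenize the joined sentence
        if pred != gold:
            bad.update(set(gold) - set(pred))
    return bad

# Toy tokenizer that always splits after the first character
toy_tokenize = lambda text: [text[:1], text[1:]] if len(text) > 1 else [text]

find_suspect_words(toy_tokenize, [["ab", "c"]])
# {'ab', 'c'}  (gold ["ab", "c"] was re-tokenized as ["a", "bc"])
```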

pythainlp.corpus.util.revise_wordset

pythainlp.corpus.util.revise_wordset(tokenize: Callable[[str], List[str]], orig_words: Iterable[str], training_data: Iterable[Iterable[str]]) → Set[str]

Revise a set of words that could improve the tokenization performance of a dictionary-based tokenize function.

orig_words will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.

Parameters:
  • tokenize (Callable[[str], List[str]]) – a tokenize function, can be any function that takes a string as input and returns a List[str]

  • orig_words (Iterable[str]) – words used by the tokenize function; will be used as a base for revision

  • training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set

Returns:

words from orig_words that remain after removing those that make tokenize perform badly

Return type:

Set[str]

Example:

from pythainlp.corpus import thai_words
from pythainlp.corpus.util import revise_wordset
from pythainlp.tokenize.longest import segment
from pythainlp.util import Trie

base_words = thai_words()
more_words = {
    "ถวิล อุดล", "ทองอินทร์ ภูริพัฒน์", "เตียง ศิริขันธ์", "จำลอง ดาวเรือง"
}
base_words = base_words.union(more_words)
dict_trie = Trie(base_words)

tokenize = lambda text: segment(text, dict_trie)

# training_data is an iterable of tokenized sentences:
training_data = [
    [str, str, str, ...],
    [str, str, str, str, ...],
    ...
]

revised_words = revise_wordset(tokenize, base_words, training_data)

pythainlp.corpus.util.revise_newmm_default_wordset

pythainlp.corpus.util.revise_newmm_default_wordset(training_data: Iterable[Iterable[str]]) → Set[str]

Revise a set of words that could improve the tokenization performance of pythainlp.tokenize.newmm, a dictionary-based tokenizer and the default tokenizer for PyThaiNLP.

Words from pythainlp.corpus.thai_words() will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.

Parameters:

training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set

Returns:

words from pythainlp.corpus.thai_words() that remain after removing those that make the tokenizer perform badly

Return type:

Set[str]

WordNet

PyThaiNLP API includes the WordNet module, which is an exact copy of NLTK’s WordNet API for the Thai language. WordNet is a lexical database for English and other languages.

For more details on WordNet, refer to the NLTK WordNet documentation.

pythainlp.corpus.wordnet.synsets

pythainlp.corpus.wordnet.synset

pythainlp.corpus.wordnet.all_lemma_names

pythainlp.corpus.wordnet.all_synsets

pythainlp.corpus.wordnet.langs

pythainlp.corpus.wordnet.lemmas

pythainlp.corpus.wordnet.lemma

pythainlp.corpus.wordnet.lemma_from_key

pythainlp.corpus.wordnet.path_similarity

pythainlp.corpus.wordnet.lch_similarity

pythainlp.corpus.wordnet.wup_similarity

pythainlp.corpus.wordnet.morphy

pythainlp.corpus.wordnet.custom_lemmas

Definition

Synset

A synset is a set of synonyms that share a common meaning. The WordNet module provides functionality to work with these synsets.

This documentation is designed to help you navigate and use the various resources and modules available in the pythainlp.corpus package effectively. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support.

We hope you find this documentation helpful for your natural language processing tasks in the Thai language.