pythainlp.corpus
The pythainlp.corpus module provides access to various Thai language corpora and resources that come bundled with PyThaiNLP. These resources are essential for natural language processing tasks in the Thai language.
Modules
countries
- pythainlp.corpus.countries() -> FrozenSet[str]
Return a frozenset of country names in Thai such as “แคนาดา”, “โรมาเนีย”, “แอลจีเรีย”, and “ลาว”.
(See: dev/pythainlp/corpus/countries_th.txt)
find_synonym
get_corpus
- pythainlp.corpus.get_corpus(filename: str, comments: bool = True) -> frozenset
Read corpus data from file and return a frozenset.
Each line in the file will be a member of the set.
Whitespace is stripped, and empty values and duplicates are removed.
If comments is False, any text at any position after the character ‘#’ in each line will be discarded.
- Parameters:
- Returns:
frozenset consisting of lines in the file
- Return type:
- Example:
    from pythainlp.corpus import get_corpus

    # input file (negations_th.txt):
    # แต่
    # ไม่

    get_corpus("negations_th.txt")
    # output:
    # frozenset({'แต่', 'ไม่'})

    # input file (ttc_freq.txt):
    # ตัวบท<tab>10
    # โดยนัยนี้<tab>1

    get_corpus("ttc_freq.txt")
    # output:
    # frozenset({'โดยนัยนี้\t1',
    #            'ตัวบท\t10',
    #            ...})

    # input file (icubrk_th.txt):
    # # Thai Dictionary for ICU BreakIterator
    # กก
    # กกขนาก

    get_corpus("icubrk_th.txt")
    # output:
    # frozenset({'กกขนาก',
    #            '# Thai Dictionary for ICU BreakIterator',
    #            'กก',
    #            ...})

    get_corpus("icubrk_th.txt", comments=False)
    # output:
    # frozenset({'กกขนาก',
    #            'กก',
    #            ...})
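The line-handling rules described for get_corpus() can be illustrated without any corpus file. The sketch below is only an approximation of the documented behavior, not the library's implementation; lines_to_frozenset is a hypothetical helper name, and the sample lines are taken from the icubrk_th.txt example above plus an empty line and a padded duplicate added for illustration.

```python
from typing import Iterable


def lines_to_frozenset(lines: Iterable[str], comments: bool = True) -> frozenset:
    """Hypothetical helper that mimics the rules described for get_corpus()."""
    members = set()
    for line in lines:
        if not comments:
            # Discard everything from the first '#' onward.
            line = line.split("#", 1)[0]
        line = line.strip()  # strip surrounding whitespace
        if line:  # skip empty values; the set removes duplicates
            members.add(line)
    return frozenset(members)


# Sample lines: one comment line, two words, an empty line, a padded duplicate.
sample = ["# Thai Dictionary for ICU BreakIterator", "กก", "กกขนาก", "", "กก  "]

with_comments = lines_to_frozenset(sample)
without_comments = lines_to_frozenset(sample, comments=False)
```

With comments left at True, the comment line is kept as an ordinary member; with comments=False it becomes empty after truncation and is dropped.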
get_corpus_as_is
- pythainlp.corpus.get_corpus_as_is(filename: str) -> list
Read corpus data from file, as it is, and return a list.
Each line in the file will be a member of the list.
No modifications in member values and their orders.
If whitespace stripping or comment removal is needed, use get_corpus() instead.
- Parameters:
filename (str) – filename of the corpus to be read
- Returns:
list consisting of lines in the file
- Return type:
- Example:
    from pythainlp.corpus import get_corpus_as_is

    # input file (negations_th.txt):
    # แต่
    # ไม่

    get_corpus_as_is("negations_th.txt")
    # output:
    # ['แต่', 'ไม่']
get_corpus_db
get_corpus_db_detail
get_corpus_default_db
- pythainlp.corpus.get_corpus_default_db(name: str, version: str = '') -> str | None
Get model path from default_db.json.
- Parameters:
name (str) – corpus name
- Returns:
path to the corpus or None if the corpus doesn’t exist on the device
- Return type:
The default_db.json file is located at pythainlp/corpus/default_db.json; edit it there if you need to change the defaults.
get_corpus_path
- pythainlp.corpus.get_corpus_path(name: str, version: str = '', force: bool = False) -> str | None
Get corpus path.
- Parameters:
- Returns:
path to the corpus or None if the corpus doesn’t exist on the device
- Return type:
- Example:
(Please see the filename in this file
If the corpus already exists:
    from pythainlp.corpus import get_corpus_path

    print(get_corpus_path('ttc'))
    # output: /root/pythainlp-data/ttc_freq.txt
If the corpus has not been downloaded yet:
    from pythainlp.corpus import download, get_corpus_path

    print(get_corpus_path('wiki_lm_lstm'))
    # output: None

    download('wiki_lm_lstm')
    # output:
    # Download: wiki_lm_lstm
    # wiki_lm_lstm 0.32
    # thwiki_lm.pth?dl=1: 1.05GB [00:25, 41.5MB/s]
    # /root/pythainlp-data/thwiki_model_lstm.pth

    print(get_corpus_path('wiki_lm_lstm'))
    # output: /root/pythainlp-data/thwiki_model_lstm.pth
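The two cases above (path already available, or None until the corpus is downloaded) suggest a small "look up, then fetch if missing" pattern. The following is a minimal sketch of that pattern, not part of PyThaiNLP: ensure_corpus is a hypothetical helper, and fake_get_path and fake_download are offline stand-ins that only imitate the documented return values of get_corpus_path() and download().

```python
from typing import Callable, Dict, Optional


def ensure_corpus(
    name: str,
    get_path: Callable[[str], Optional[str]],
    fetch: Callable[[str], bool],
) -> Optional[str]:
    """Return the local path of a corpus, fetching it first if it is missing."""
    path = get_path(name)
    if path is None and fetch(name):
        # The fetch reported success, so look the path up again.
        path = get_path(name)
    return path


# Offline stand-ins (assumptions for illustration only).
_local: Dict[str, str] = {"ttc": "/root/pythainlp-data/ttc_freq.txt"}
_remote: Dict[str, str] = {"wiki_lm_lstm": "/root/pythainlp-data/thwiki_model_lstm.pth"}


def fake_get_path(name: str) -> Optional[str]:
    return _local.get(name)


def fake_download(name: str) -> bool:
    if name in _remote:
        _local[name] = _remote[name]
        return True
    return False


existing = ensure_corpus("ttc", fake_get_path, fake_download)
fetched = ensure_corpus("wiki_lm_lstm", fake_get_path, fake_download)
missing = ensure_corpus("no_such_corpus", fake_get_path, fake_download)
```

With PyThaiNLP installed, the same helper could presumably be called as ensure_corpus("wiki_lm_lstm", get_corpus_path, download), since both functions accept the corpus name as their first argument.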
download
- pythainlp.corpus.download(name: str, force: bool = False, url: str = '', version: str = '') -> bool
Download corpus.
The available corpus names can be seen in this file: https://pythainlp.org/pythainlp-corpus/db.json
- Parameters:
- Returns:
True if the corpus is found and successfully downloaded. Otherwise, it returns False.
- Return type:
- Example:
    from pythainlp.corpus import download

    download("wiki_lm_lstm", force=True)
    # output:
    # Corpus: wiki_lm_lstm
    # - Downloading: wiki_lm_lstm 0.1
    # thwiki_lm.pth:  26%|██▌       | 114k/434k [00:00<00:00, 690kB/s]
By default, downloaded corpora and models will be saved in $HOME/pythainlp-data/ (e.g. /Users/bact/pythainlp-data/wiki_lm_lstm.pth).
remove
- pythainlp.corpus.remove(name: str) -> bool
Remove corpus.
- Parameters:
name (str) – corpus name
- Returns:
True if the corpus is found and successfully removed. Otherwise, it returns False.
- Return type:
- Example:
    from pythainlp.corpus import remove, get_corpus_path, get_corpus

    print(remove("ttc"))
    # output: True

    print(get_corpus_path("ttc"))
    # output: None

    get_corpus("ttc")
    # output:
    # FileNotFoundError: [Errno 2] No such file or directory:
    # '/usr/local/lib/python3.6/dist-packages/pythainlp/corpus/ttc'
provinces
thai_dict
thai_stopwords
- pythainlp.corpus.thai_stopwords() -> FrozenSet[str]
Return a frozenset of Thai stopwords such as “มี”, “ไป”, “ไง”, “ขณะ”, “การ”, and “ประการหนึ่ง”.
(See: dev/pythainlp/corpus/stopwords_th.txt)
The stopword list comes from the thesis by เพ็ญศิริ ลี้ตระกูล.
- See Also:
เพ็ญศิริ ลี้ตระกูล. การเลือกประโยคสำคัญในการสรุปความภาษาไทยโดยใช้แบบจำลองแบบลำดับชั้น. กรุงเทพมหานคร: มหาวิทยาลัยธรรมศาสตร์; 2551.
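A typical use of a stopword frozenset is filtering a token list. The sketch below assumes the tokens are already segmented; the stopwords value is a small stand-in built from the examples quoted above (in real use it would be the result of thai_stopwords()), and remove_stopwords is a hypothetical helper name.

```python
from typing import FrozenSet, List

# Stand-in for thai_stopwords(); these entries are the examples quoted above.
stopwords: FrozenSet[str] = frozenset(
    {"มี", "ไป", "ไง", "ขณะ", "การ", "ประการหนึ่ง"}
)


def remove_stopwords(tokens: List[str], stopwords: FrozenSet[str]) -> List[str]:
    """Keep only the tokens that are not stopwords, preserving their order."""
    return [token for token in tokens if token not in stopwords]


# Pre-segmented tokens, made up for illustration.
tokens = ["ฉัน", "มี", "แมว", "ไป", "ตลาด"]
content_tokens = remove_stopwords(tokens, stopwords)
```

Because the stopwords are held in a frozenset, each membership check takes constant time on average, which matters when filtering long documents.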
thai_words
- pythainlp.corpus.thai_words() -> FrozenSet[str]
Return a frozenset of Thai words such as “กติกา”, “กดดัน”, “พิษ”, and “พิษภัย”.
(See: dev/pythainlp/corpus/words_th.txt)
thai_wsd_dict
thai_orst_words
- pythainlp.corpus.thai_orst_words() -> FrozenSet[str]
Return a frozenset of Thai words from the Royal Society of Thailand.
(See: dev/pythainlp/corpus/thai_orst_words.txt)
thai_synonyms
- pythainlp.corpus.thai_synonyms() -> dict
Return Thai synonyms.
(See: thai_synonym)
- Returns:
Thai words with part-of-speech type and synonym
- Return type:
dict
thai_syllables
- pythainlp.corpus.thai_syllables() -> FrozenSet[str]
Return a frozenset of Thai syllables such as “กรอบ”, “ก็”, “๑”, “โมบ”, “โมน”, “โม่ง”, “กา”, “ก่า”, and “ก้า”.
(See: dev/pythainlp/corpus/syllables_th.txt)
We use the Thai syllable list from KUCut.
thai_negations
- pythainlp.corpus.thai_negations() -> FrozenSet[str]
Return a frozenset of Thai negation words including “ไม่” and “แต่”.
(See: dev/pythainlp/corpus/negations_th.txt)
thai_family_names
- pythainlp.corpus.thai_family_names() -> FrozenSet[str]
Return a frozenset of Thai family names.
(See: dev/pythainlp/corpus/family_names_th.txt)
thai_female_names
- pythainlp.corpus.thai_female_names() -> FrozenSet[str]
Return a frozenset of Thai female names.
(See: dev/pythainlp/corpus/person_names_female_th.txt)
thai_male_names
- pythainlp.corpus.thai_male_names() -> FrozenSet[str]
Return a frozenset of Thai male names.
(See: dev/pythainlp/corpus/person_names_male_th.txt)
pythainlp.corpus.th_en_translit.get_transliteration_dict
- pythainlp.corpus.th_en_translit.get_transliteration_dict() -> defaultdict
Get Thai to English transliteration dictionary.
The returned dict is in dict[str, dict[List[str], List[Optional[bool]]]] format.
ConceptNet
ConceptNet is an open, multilingual knowledge graph used for various natural language understanding tasks. For more information, refer to the ConceptNet documentation.
pythainlp.corpus.conceptnet.edges
- pythainlp.corpus.conceptnet.edges(word: str, lang: str = 'th')
Get edges from ConceptNet API. ConceptNet is a public semantic network, designed to help computers understand the meanings of words that people use.
For example, the term “ConceptNet” is a “knowledge graph”, and “knowledge graph” has “common sense knowledge” which is a part of “artificial intelligence”. Also, “ConceptNet” is used for “natural language understanding” which is a part of “artificial intelligence”.
“ConceptNet” –is a–> “knowledge graph” –has–> “common sense” –a part of–> “artificial intelligence”
“ConceptNet” –used for–> “natural language understanding” –a part of–> “artificial intelligence”
This illustration shows relationships (represented as Edge) between the terms (represented as Node).
This function requires an internet connection to access the ConceptNet API. Please use it considerately. It will time out after 10 seconds.
- Parameters:
- Returns:
edges of the given word according to the ConceptNet network
- Return type:
- Example:
    from pythainlp.corpus.conceptnet import edges

    edges('hello', lang='en')
    # output:
    # [{
    #   '@id': '/a/[/r/IsA/,/c/en/hello/,/c/en/greeting/]',
    #   '@type': 'Edge',
    #   'dataset': '/d/conceptnet/4/en',
    #   'end': {'@id': '/c/en/greeting',
    #           '@type': 'Node',
    #           'label': 'greeting',
    #           'language': 'en',
    #           'term': '/c/en/greeting'},
    #   'license': 'cc:by/4.0',
    #   'rel': {'@id': '/r/IsA', '@type': 'Relation', 'label': 'IsA'},
    #   'sources': [
    #     {
    #       '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/bmsacr/]',
    #       '@type': 'Source',
    #       'activity': '/s/activity/omcs/vote',
    #       'contributor': '/s/contributor/omcs/bmsacr'
    #     },
    #     {
    #       '@id': '/and/[/s/activity/omcs/vote/,/s/contributor/omcs/test/]',
    #       '@type': 'Source',
    #       'activity': '/s/activity/omcs/vote',
    #       'contributor': '/s/contributor/omcs/test'}
    #   ],
    #   'start': {'@id': '/c/en/hello',
    #             '@type': 'Node',
    #             'label': 'Hello',
    #             'language': 'en',
    #             'term': '/c/en/hello'},
    #   'surfaceText': '[[Hello]] is a kind of [[greeting]]',
    #   'weight': 3.4641016151377544
    # }, ...]

    edges('สวัสดี', lang='th')
    # output:
    # [{
    #   '@id': '/a/[/r/RelatedTo/,/c/th/สวัสดี/n/,/c/en/prosperity/]',
    #   '@type': 'Edge',
    #   'dataset': '/d/wiktionary/en',
    #   'end': {'@id': '/c/en/prosperity',
    #           '@type': 'Node',
    #           'label': 'prosperity',
    #           'language': 'en',
    #           'term': '/c/en/prosperity'},
    #   'license': 'cc:by-sa/4.0',
    #   'rel': {
    #     '@id': '/r/RelatedTo', '@type': 'Relation',
    #     'label': 'RelatedTo'},
    #   'sources': [{
    #     '@id': '/and/[/s/process/wikiparsec/2/,/s/resource/wiktionary/en/]',
    #     '@type': 'Source',
    #     'contributor': '/s/resource/wiktionary/en',
    #     'process': '/s/process/wikiparsec/2'}],
    #   'start': {'@id': '/c/th/สวัสดี/n',
    #             '@type': 'Node',
    #             'label': 'สวัสดี',
    #             'language': 'th',
    #             'sense_label': 'n',
    #             'term': '/c/th/สวัสดี'},
    #   'surfaceText': None,
    #   'weight': 1.0
    # }, ...]
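The edge dictionaries shown above are verbose, so it can help to reduce each one to a (start, relation, end, weight) tuple. The sketch below works on a hand-trimmed copy of the two sample edges, so it runs without network access; summarize_edges is a hypothetical helper, and it assumes every edge carries the 'start', 'rel', 'end', and 'weight' keys seen in the sample output.

```python
from typing import Any, Dict, List, Tuple

Edge = Dict[str, Any]


def summarize_edges(edge_list: List[Edge]) -> List[Tuple[str, str, str, float]]:
    """Reduce each edge to (start label, relation label, end label, weight)."""
    summary = [
        (
            edge["start"]["label"],
            edge["rel"]["label"],
            edge["end"]["label"],
            edge["weight"],
        )
        for edge in edge_list
    ]
    # Heaviest (most strongly supported) edges first.
    summary.sort(key=lambda item: item[3], reverse=True)
    return summary


# Trimmed copies of the sample edges above; only the fields used here are kept.
sample_edges: List[Edge] = [
    {
        "start": {"label": "สวัสดี", "language": "th"},
        "rel": {"label": "RelatedTo"},
        "end": {"label": "prosperity", "language": "en"},
        "weight": 1.0,
    },
    {
        "start": {"label": "Hello", "language": "en"},
        "rel": {"label": "IsA"},
        "end": {"label": "greeting", "language": "en"},
        "weight": 3.4641016151377544,
    },
]

summary = summarize_edges(sample_edges)
```

In real use, the list returned by edges() would take the place of sample_edges.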
TNC (Thai National Corpus)
The Thai National Corpus (TNC) is a collection of text data in the Thai language. This module provides access to word frequency data from the TNC corpus.
pythainlp.corpus.tnc.word_freqs
- pythainlp.corpus.tnc.word_freqs() -> List[Tuple[str, int]]
Get word frequencies from the Thai National Corpus (TNC).
(See: dev/pythainlp/corpus/tnc_freq.txt)
Credit: Korakot Chaovavanich https://www.facebook.com/groups/thainlp/posts/434330506948445
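Since word_freqs() is documented as returning List[Tuple[str, int]], the result can be turned into a lookup table or ranked directly. The sketch below uses a made-up stand-in list (the words and counts are illustrative and are not real TNC figures), and relative_frequency is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Illustrative stand-in for the list returned by word_freqs(); counts are made up.
sample_freqs: List[Tuple[str, int]] = [("ตัวบท", 10), ("โดยนัยนี้", 1), ("กติกา", 4)]

# A dict gives constant-time lookup by word.
freq_by_word: Dict[str, int] = dict(sample_freqs)
total = sum(count for _, count in sample_freqs)


def relative_frequency(word: str) -> float:
    """Share of all counted occurrences that belong to `word` (0.0 if unseen)."""
    return freq_by_word.get(word, 0) / total


# The two most frequent entries, highest count first.
top_two = sorted(sample_freqs, key=lambda pair: pair[1], reverse=True)[:2]
```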
pythainlp.corpus.tnc.unigram_word_freqs
pythainlp.corpus.tnc.bigram_word_freqs
pythainlp.corpus.tnc.trigram_word_freqs
- pythainlp.corpus.tnc.trigram_word_freqs() -> dict[Tuple[str, str, str], int]
Get trigram word frequencies from the Thai National Corpus (TNC).
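One way to use a dict[Tuple[str, str, str], int] of trigram counts is to look up the most frequent third word for a given pair of preceding words. This is a minimal sketch under that assumption; the trigram counts are invented for illustration, and most_likely_next is a hypothetical helper, not a PyThaiNLP function.

```python
from typing import Dict, Optional, Tuple

Trigram = Tuple[str, str, str]

# Invented counts, shaped like the documented return type.
sample_trigrams: Dict[Trigram, int] = {
    ("ฉัน", "ไป", "ตลาด"): 3,
    ("ฉัน", "ไป", "โรงเรียน"): 5,
    ("เขา", "ไป", "ตลาด"): 2,
}


def most_likely_next(
    first: str, second: str, trigram_freqs: Dict[Trigram, int]
) -> Optional[str]:
    """Return the most frequent third word after (first, second), or None."""
    candidates = {
        trigram[2]: count
        for trigram, count in trigram_freqs.items()
        if trigram[:2] == (first, second)
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


prediction = most_likely_next("ฉัน", "ไป", sample_trigrams)
unseen = most_likely_next("แมว", "กิน", sample_trigrams)
```

Scanning every key is fine for a small table; for a full corpus, an index keyed by the first two words would avoid the linear pass.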
TTC (Thai Textbook Corpus)
The Thai Textbook Corpus (TTC) is a collection of Thai language text data, primarily sourced from textbooks.
pythainlp.corpus.ttc.word_freqs
pythainlp.corpus.ttc.unigram_word_freqs
OSCAR
OSCAR is a multilingual corpus that includes Thai text data. This module provides access to word frequency data from the OSCAR corpus.
pythainlp.corpus.oscar.word_freqs
pythainlp.corpus.oscar.unigram_word_freqs
Util
Utilities for working with the corpus data.
pythainlp.corpus.util.find_badwords
pythainlp.corpus.util.revise_wordset
- pythainlp.corpus.util.revise_wordset(tokenize: Callable[[str], List[str]], orig_words: Iterable[str], training_data: Iterable[Iterable[str]]) -> Set[str]
Revise a set of words that could improve tokenization performance of a dictionary-based tokenize function.
orig_words will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.
- Parameters:
tokenize (Callable[[str], List[str]]) – a tokenize function, can be any function that takes a string as input and returns a List[str]
orig_words (Iterable[str]) – words that are used by the tokenize function, will be used as a base for revision
training_data (Iterable[Iterable[str]]) – tokenized text, to be used as a training set
- Returns:
the revised word set: orig_words without the words that are considered to make tokenize perform badly
- Return type:
Set[str]
- Example:
    from pythainlp.corpus import thai_words
    from pythainlp.corpus.util import revise_wordset
    from pythainlp.tokenize.longest import segment
    from pythainlp.util import Trie

    base_words = thai_words()
    more_words = {
        "ถวิล อุดล", "ทองอินทร์ ภูริพัฒน์", "เตียง ศิริขันธ์", "จำลอง ดาวเรือง"
    }
    # base dictionary plus the extra entries
    wordlist = base_words.union(more_words)
    dict_trie = Trie(wordlist)

    # tokenizer that segments text with the custom dictionary
    tokenize = lambda text: segment(text, dict_trie)

    # placeholder: replace with real tokenized text, one list of tokens per sentence
    training_data = [
        [str, str, str, ...],
        [str, str, str, str, ...],
        ...
    ]

    revised_words = revise_wordset(tokenize, wordlist, training_data)
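To make the idea of "words that do not perform well" concrete, here is a self-contained toy version. It is a simplified sketch under stated assumptions, not the library's implementation: a predicted token counts as right when its character span matches a span in the gold tokenization, and a word is dropped when it is wrong more often than right. The names spans, longest_match, and revise are hypothetical, and ASCII words stand in for Thai so the behavior is easy to follow.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple


def spans(tokens: Iterable[str]) -> List[Tuple[int, int]]:
    """Character (start, end) offsets of each token in the joined text."""
    result, start = [], 0
    for token in tokens:
        result.append((start, start + len(token)))
        start += len(token)
    return result


def longest_match(text: str, words: Set[str]) -> List[str]:
    """Toy greedy longest-match tokenizer; unknown characters become tokens."""
    tokens, i = [], 0
    max_len = max(map(len, words), default=1)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in words or size == 1:
                tokens.append(text[i:i + size])
                i += size
                break
    return tokens


def revise(
    tokenize: Callable[[str], List[str]],
    words: Iterable[str],
    training_data: Iterable[Iterable[str]],
) -> Set[str]:
    """Drop words that are predicted at wrong boundaries more often than right."""
    right, wrong = Counter(), Counter()
    for gold in training_data:
        gold = list(gold)
        gold_spans = set(spans(gold))
        predicted = tokenize("".join(gold))
        for token, span in zip(predicted, spans(predicted)):
            (right if span in gold_spans else wrong)[token] += 1
    bad = {word for word, count in wrong.items() if count > right[word]}
    return set(words) - bad


base_words = {"cat", "cats", "sit", "it"}
training = [["cat", "sit"], ["it", "sit"]]
revised = revise(lambda text: longest_match(text, base_words), base_words, training)
```

In this toy data, "cats" swallows the boundary in "catsit" and is never right, so it is removed; "it" is wrong once and right once, so it stays.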
pythainlp.corpus.util.revise_newmm_default_wordset
- pythainlp.corpus.util.revise_newmm_default_wordset(training_data: Iterable[Iterable[str]]) -> Set[str]
Revise a set of words that could improve tokenization performance of pythainlp.tokenize.newmm, a dictionary-based tokenizer and the default tokenizer for PyThaiNLP.
Words from pythainlp.corpus.thai_words() will be used as a base set for the dictionary. Words that do not perform well with training_data will be removed. The remaining words will be returned.
WordNet
The PyThaiNLP API includes the WordNet module, which is an exact copy of NLTK’s WordNet API for the Thai language. WordNet is a lexical database for English and other languages.
For more details on WordNet, refer to the NLTK WordNet documentation.
pythainlp.corpus.wordnet.synsets
pythainlp.corpus.wordnet.synset
pythainlp.corpus.wordnet.all_lemma_names
pythainlp.corpus.wordnet.all_synsets
pythainlp.corpus.wordnet.langs
pythainlp.corpus.wordnet.lemmas
pythainlp.corpus.wordnet.lemma
pythainlp.corpus.wordnet.lemma_from_key
pythainlp.corpus.wordnet.path_similarity
pythainlp.corpus.wordnet.lch_similarity
pythainlp.corpus.wordnet.wup_similarity
pythainlp.corpus.wordnet.morphy
pythainlp.corpus.wordnet.custom_lemmas
Definition
Synset
A synset is a set of synonyms that share a common meaning. The WordNet module provides functionality to work with these synsets.
This documentation is designed to help you navigate and use the various resources and modules available in the pythainlp.corpus package effectively. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support.
We hope you find this documentation helpful for your natural language processing tasks in the Thai language.