pythainlp.tokenize
The pythainlp.tokenize module contains multiple functions for tokenizing a chunk of Thai text into the desired units.
Modules
- pythainlp.tokenize.clause_tokenize(doc: List[str]) -> List[List[str]]
Clause tokenizer (also called clause segmentation).
Tokenizes a running list of words into a list of clauses, each clause being a list of strings. Clause boundaries are predicted by a CRF model trained on the LST20 Corpus.
- Parameters
doc (List[str]) – list of words to be split into clauses
- Returns
list of clauses
- Return type
List[List[str]]
- Example
Clause tokenizer:
from pythainlp.tokenize import clause_tokenize

clause_tokenize(["ฉัน","นอน","และ","คุณ","เล่น","มือถือ","ส่วน","น้อง","เขียน","โปรแกรม"])
# [['ฉัน', 'นอน'],
#  ['และ', 'คุณ', 'เล่น', 'มือถือ'],
#  ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]
- pythainlp.tokenize.sent_tokenize(text: str, engine: str = 'crfcut', keep_whitespace: bool = True) -> List[str]
Sentence tokenizer.
Tokenizes running text into “sentences”.
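- Parameters
text (str) – text to be tokenized
engine (str) – name of the sentence tokenizer to be used
keep_whitespace (bool) – True to keep whitespaces; otherwise, whitespaces are omitted
- Returns
list of split sentences
- Return type
List[str]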
- Options for engine
crfcut - (default) split by CRF trained on TED dataset
whitespace+newline - split by whitespaces and newline.
whitespace - split by whitespaces; specifically, with the regex pattern r" +"
tltk - split by TLTK
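The difference between the two whitespace-based engines can be illustrated with the standard library alone. This is a minimal sketch, not the library's code: the regex r" +" comes from the option description above, while the use of str.split() for the whitespace+newline behaviour and both function names are assumptions made for illustration.

```python
import re
from typing import List


def split_on_spaces(text: str) -> List[str]:
    # Split on runs of the space character only; newlines stay inside segments.
    return re.split(r" +", text)


def split_on_spaces_and_newlines(text: str) -> List[str]:
    # Assumption: str.split() with no argument splits on any whitespace run,
    # including newlines, and drops the separators.
    return text.split()


sample = "first part \nsecond part"
print(split_on_spaces(sample))               # ['first', 'part', '\nsecond', 'part']
print(split_on_spaces_and_newlines(sample))  # ['first', 'part', 'second', 'part']
```

With the space-only pattern, a newline survives as part of the following segment, which is why the two options can give different results on multi-line text.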
- Example
Split the text based on whitespace:
from pythainlp.tokenize import sent_tokenize

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

sent_tokenize(sentence_2, engine="whitespace")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']
Split the text based on whitespace and newline:
sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace+newline")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

sent_tokenize(sentence_2, engine="whitespace+newline")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']
Split the text using CRF trained on TED dataset:
sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="crfcut")
# output: ['ฉันไปประชุมเมื่อวันที่ 11 มีนาคม']

sent_tokenize(sentence_2, engine="crfcut")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ ',
#   'และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค']
- pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc', keep_whitespace: bool = True) -> List[str]
Subword tokenizer. A subword can be smaller than a syllable.
Tokenizes text into inseparable units of contiguous Thai characters, namely Thai Character Clusters (TCCs). TCCs are units based on Thai spelling features that cannot be split any further, such as ‘ก็’, ‘จะ’, ‘ไม่’, and ‘ฝา’. If such a unit were split, its parts could not be spelled out. This function applies the TCC rules to tokenize the text into the smallest units.
For example, the word ‘ขนมชั้น’ would be tokenized into ‘ข’, ‘น’, ‘ม’, and ‘ชั้น’.
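- Parameters
text (str) – text to be tokenized
engine (str) – name of the subword tokenizer to be used
keep_whitespace (bool) – True to keep whitespaces; otherwise, whitespaces are omitted
- Returns
list of subwords
- Return type
List[str]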
- Options for engine
tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)
etcc - Enhanced Thai Character Cluster (Inrut et al. 2001)
wangchanberta - SentencePiece from wangchanberta model.
dict - newmm word tokenizer with a syllable dictionary
ssg - CRF syllable segmenter for Thai
tltk - syllable tokenizer from tltk
- Example
Tokenize text into subword based on tcc:
from pythainlp.tokenize import subword_tokenize

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='tcc')
# output: ['ยุ', 'ค', 'เริ่ม', 'แร', 'ก',
#   'ข', 'อ', 'ง', ' ', 'รา', 'ช', 'ว', 'ง',
#   'ศ', '์', 'ห', 'มิ', 'ง']

subword_tokenize(text_2, engine='tcc')
# output: ['ค', 'วา', 'ม', 'แป', 'ล', 'ก', 'แย', 'ก',
#   'และ', 'พัฒ','นา', 'กา', 'ร']
Tokenize text into subword based on etcc:
text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='etcc')
# output: ['ยุคเริ่มแรกของ ราชวงศ์หมิง']

subword_tokenize(text_2, engine='etcc')
# output: ['ความแปลกแยกและ', 'พัฒ', 'นาการ']
Tokenize text into subword based on wangchanberta:
text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='wangchanberta')
# output: ['▁', 'ยุค', 'เริ่มแรก', 'ของ', '▁', 'ราชวงศ์', 'หมิง']

subword_tokenize(text_2, engine='wangchanberta')
# output: ['▁ความ', 'แปลก', 'แยก', 'และ', 'พัฒนาการ']
- pythainlp.tokenize.syllable_tokenize(text: str, engine: str = 'dict', keep_whitespace: bool = True) -> List[str]
Syllable tokenizer.
syllable_tokenize is deprecated; use subword_tokenize instead.
Tokenizes text into syllables (Thai: พยางค์), units of pronunciation having one vowel sound. For example, the word ‘รถไฟ’ contains two syllables: ‘รถ’ and ‘ไฟ’.
Under the hood, this function uses pythainlp.tokenize.word_tokenize() with newmm as the tokenizer. The function tokenizes the text first with the dictionary of Thai words from pythainlp.corpus.common.thai_words() and then with the dictionary of Thai syllables from pythainlp.corpus.common.thai_syllables(). As a result, only syllables are obtained.
- Parameters
text (str) – text to be tokenized
engine (str) – name of the syllable tokenizer to be used
keep_whitespace (bool) – True to keep whitespaces; otherwise, whitespaces are omitted
- Returns
list of syllables, including any whitespaces in the text
- Return type
List[str]
- Options for engine
dict (default) - newmm word tokenizer with a syllable dictionary
ssg - CRF syllable segmenter for Thai
- Example
from pythainlp.tokenize import syllable_tokenize

text = 'รถไฟสมัยใหม่จะใช้กำลังจากหัวรถจักรดีเซล หรือจากไฟฟ้า'
syllable_tokenize(text)
# output: ['รถ', 'ไฟ', 'สมัย', 'ใหม่', 'ใช้', 'กำ', 'ลัง', 'จาก', 'หัว',
#   'รถ', 'จักร', 'ดี', 'เซล', ' ', 'หรือ', 'จาก', 'ไฟ', 'ฟ้า']
- pythainlp.tokenize.word_tokenize(text: str, custom_dict: Optional[pythainlp.util.trie.Trie] = None, engine: str = 'newmm', keep_whitespace: bool = True) -> List[str]
Word tokenizer.
Tokenizes running text into words (list of strings).
- Parameters
text (str) – text to be tokenized
engine (str) – name of the tokenizer to be used
custom_dict (pythainlp.util.Trie) – dictionary trie
keep_whitespace (bool) – True to keep whitespaces, a common mark for end of phrase in Thai. Otherwise, whitespaces are omitted.
- Returns
list of words
- Return type
List[str]
- Options for engine
newmm (default) - dictionary-based, Maximum Matching + Thai Character Cluster
newmm-safe - newmm, with a mechanism to help avoid long processing time for text with continuous ambiguous breaking points
nlpo3 - Python binding for nlpO3, an implementation of the newmm engine in Rust
longest - dictionary-based, Longest Matching
icu - wrapper for ICU (International Components for Unicode, using PyICU), dictionary-based
attacut - wrapper for AttaCut, learning-based approach
deepcut - wrapper for DeepCut, learning-based approach
nercut - dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named entity
sefr_cut - wrapper for SEFR CUT
tltk - wrapper for TLTK
oskut - wrapper for OSKut
- Note
The parameter custom_dict can be provided as an argument only for the newmm, longest, and deepcut engines.
- Example
Tokenize text with different tokenizer:
from pythainlp.tokenize import word_tokenize

text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

word_tokenize(text, engine="newmm")
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

word_tokenize(text, engine='attacut')
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
Tokenize text by omitting whitespaces:
text = "วรรณกรรม ภาพวาด และการแสดงงิ้ว "

word_tokenize(text, engine="newmm")
# output:
# ['วรรณกรรม', ' ', 'ภาพวาด', ' ', 'และ', 'การแสดง', 'งิ้ว', ' ']

word_tokenize(text, engine="newmm", keep_whitespace=False)
# output: ['วรรณกรรม', 'ภาพวาด', 'และ', 'การแสดง', 'งิ้ว']
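The effect of keep_whitespace=False can be pictured as a filtering step applied to the token list. The sketch below is only an illustration under that assumption; drop_whitespace_tokens is a hypothetical helper, not a PyThaiNLP function, and the library may implement the option differently.

```python
from typing import List


def drop_whitespace_tokens(tokens: List[str]) -> List[str]:
    # Hypothetical helper: discard tokens made up entirely of space characters.
    return [token for token in tokens if token.strip(" ")]


tokens = ['วรรณกรรม', ' ', 'ภาพวาด', ' ', 'และ', 'การแสดง', 'งิ้ว', ' ']
print(drop_whitespace_tokens(tokens))
# ['วรรณกรรม', 'ภาพวาด', 'และ', 'การแสดง', 'งิ้ว']
```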
Tokenize with default and custom dictionary:
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie

text = 'ชินโซ อาเบะ เกิด 21 กันยายน'

word_tokenize(text, engine="newmm")
# output:
# ['ชิน', 'โซ', ' ', 'อา', 'เบะ', ' ',
#  'เกิด', ' ', '21', ' ', 'กันยายน']

# extend the default word list with two Japanese names
custom_dict_japanese_name = set(thai_words())
custom_dict_japanese_name.add('ชินโซ')
custom_dict_japanese_name.add('อาเบะ')

trie = dict_trie(dict_source=custom_dict_japanese_name)

word_tokenize(text, engine="newmm", custom_dict=trie)
# output:
# ['ชินโซ', ' ', 'อาเบะ',
#  ' ', 'เกิด', ' ', '21', ' ', 'กันยายน']
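To show why a trie is a convenient container for a custom dictionary, here is a minimal sketch of a prefix tree that reports every dictionary word starting at the beginning of a text. MiniTrie and its methods are hypothetical names written for this illustration; they are not the pythainlp.util.Trie implementation.

```python
from typing import Dict, Iterable, List

_END = "$end"  # marker key; never collides with a single-character key


class MiniTrie:
    """Hypothetical, simplified prefix tree for dictionary lookups."""

    def __init__(self, words: Iterable[str]) -> None:
        self.root: Dict = {}
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True  # mark that a complete word ends at this node

    def prefixes(self, text: str) -> List[str]:
        # Walk the tree along `text`, collecting every complete word passed.
        found: List[str] = []
        node = self.root
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if _END in node:
                found.append(text[: i + 1])
        return found


trie = MiniTrie(['ชิน', 'ชินโซ', 'อาเบะ'])
print(trie.prefixes('ชินโซ อาเบะ'))  # ['ชิน', 'ชินโซ']
```

A dictionary-based tokenizer can call such a lookup at each position to list the candidate words that begin there, which is why adding 'ชินโซ' to the dictionary makes the longer token available.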
- class pythainlp.tokenize.Tokenizer(custom_dict: Optional[Union[pythainlp.util.trie.Trie, Iterable[str], str]] = None, engine: str = 'newmm', keep_whitespace: bool = True)
Tokenizer class, for a custom tokenizer.
This class allows users to predefine a custom dictionary along with a tokenizer and encapsulate them in a single object. It is a wrapper for two functions: pythainlp.tokenize.word_tokenize() and pythainlp.util.dict_trie().
- Example
Tokenizer object instantiated with pythainlp.util.Trie:
from pythainlp.tokenize import Tokenizer
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie

custom_words_list = set(thai_words())
custom_words_list.add('อะเฟเซีย')
custom_words_list.add('Aphasia')
trie = dict_trie(dict_source=custom_words_list)

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=trie, engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#  'ผิดปกติ', 'ของ', 'การ', 'พูด']
Tokenizer object instantiated with a list of words:
text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=list(thai_words()), engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะ', 'เฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#  'ผิดปกติ', 'ของ', 'การ', 'พูด']
Tokenizer object instantiated with a file path containing a list of words separated by newlines, with a new tokenizer engine explicitly set after instantiation:
PATH_TO_CUSTOM_DICTIONARY = './custom_dictionary.txt'

# write a file
with open(PATH_TO_CUSTOM_DICTIONARY, 'w', encoding='utf-8') as f:
    f.write('อะเฟเซีย\nAphasia\nผิด\nปกติ')

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"

# initiate an object from file with `attacut` as tokenizer
_tokenizer = Tokenizer(custom_dict=PATH_TO_CUSTOM_DICTIONARY,
                       engine='attacut')

_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ', 'ผิด',
#  'ปกติ', 'ของ', 'การ', 'พูด']

# change tokenizer to `newmm`
_tokenizer.set_tokenizer_engine(engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็นอาการ', 'ผิด',
#  'ปกติ', 'ของการพูด']
Tokenization Engines
Sentence level
crfcut
CRFCut - Thai sentence segmenter.
Thai sentence segmentation using a conditional random field (CRF); the default model is trained on the TED dataset.
Performance:
- ORCHID - space-correct accuracy 87% vs 95% state-of-the-art (Zhou et al, 2016; https://www.aclweb.org/anthology/C16-1031.pdf)
- TED dataset - space-correct accuracy 82%
See development notebooks at https://github.com/vistec-AI/ted_crawler. POS features are not used because the available POS tagging is unreliable.
Word level
attacut
Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai
deepcut
Wrapper for deepcut Thai word segmentation. deepcut is a Thai word segmentation library using a 1D convolutional neural network.
Users need to install deepcut (and its dependency, tensorflow) themselves.
multi_cut
Multi cut – Thai word segmentation with maximum matching. Original code from Korakot Chaovavanich.
- pythainlp.tokenize.multi_cut.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) -> List[str]
Dictionary-based maximum matching word segmentation.
nlpo3
- pythainlp.tokenize.nlpo3.load_dict(file_path: str, dict_name: str) -> bool
Load a dictionary file into an in-memory dictionary collection.
The loaded dictionary will be accessible through the assigned dict_name. This function does not override an existing dict name.
- Parameters
file_path (str) – path to the dictionary file
dict_name (str) – name to assign to the loaded dictionary
- Return type
bool
- pythainlp.tokenize.nlpo3.segment(text: str, custom_dict: str = '_67a47bf9', safe_mode: bool = False, parallel_mode: bool = False) -> List[str]
Break text into tokens.
Python binding for nlpO3, an implementation of the newmm engine in Rust.
- Parameters
text (str) – text to be tokenized
custom_dict (str) – dictionary name, as assigned with load_dict(), defaults to pythainlp/corpus/common/words_th.txt
safe_mode (bool) – reduce chance for long processing time in long text with many ambiguous breaking points, defaults to False
parallel_mode (bool) – Use multithread mode, defaults to False
- Returns
list of tokens
- Return type
List[str]
longest
Dictionary-based longest-matching Thai word segmentation. Implementation based on the code from Patorn Utenpattanun.
- pythainlp.tokenize.longest.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) -> List[str]
Dictionary-based longest matching word segmentation.
- Parameters
text (str) – text to be tokenized to words
custom_dict (pythainlp.util.Trie) – dictionary for tokenization
- Returns
list of words, tokenized from the text
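As an illustration of dictionary-based longest matching, the following sketch greedily takes the longest dictionary word at each position. It is a simplified stand-in, not the code of pythainlp.tokenize.longest: the function name, the plain set used as the dictionary, and the fallback of emitting a single unknown character are all assumptions.

```python
from typing import List, Set


def longest_match_segment(text: str, words: Set[str]) -> List[str]:
    # Hypothetical greedy longest-matching segmenter over a set of words.
    tokens: List[str] = []
    max_len = max((len(w) for w in words), default=1)
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in words:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[pos])
            pos += 1
    return tokens


words = {"รถ", "รถไฟ", "ไฟ", "ฟ้า", "ไฟฟ้า"}
print(longest_match_segment("รถไฟฟ้า", words))  # ['รถไฟ', 'ฟ้า']
```

Because the choice at each position is greedy, 'รถไฟ' wins over 'รถ' even though 'รถ' followed by 'ไฟฟ้า' is also a valid split.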
pyicu
Wrapper for PyICU word segmentation. This wrapper module uses icu.BreakIterator with Thai as icu.Locale to locate boundaries between words in the text.
nercut
nercut 0.2
Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.
Code by Wannaphong Phatthiyaphaibun
- pythainlp.tokenize.nercut.segment(text: str, taglist: Iterable[str] = ['ORGANIZATION', 'PERSON', 'PHONE', 'EMAIL', 'DATE', 'TIME']) -> List[str]
Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.
- Parameters
text (str) – text to be tokenized to words
taglist (Iterable[str]) – a list of named-entity tags to be used
- Returns
list of words, tokenized from the text
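The step of combining tokens that belong to the same named entity can be sketched as follows. This is an illustration under stated assumptions, not the nercut code: the input is assumed to be already tokenized and tagged with IOB-style labels such as B-PERSON and I-PERSON, and merge_entities is a hypothetical helper name.

```python
from typing import Iterable, List, Optional, Tuple


def merge_entities(tagged: List[Tuple[str, str]], taglist: Iterable[str]) -> List[str]:
    # Hypothetical helper: glue together tokens of one entity whose label is in taglist.
    wanted = set(taglist)
    merged: List[str] = []
    current: Optional[str] = None  # label of the entity being merged, if any
    for word, tag in tagged:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current:
            # Continuation of the entity opened by the preceding B- tag.
            merged[-1] += word
        else:
            merged.append(word)
            current = label if prefix == "B" and label in wanted else None
    return merged


tagged = [
    ("ทดสอบ", "O"),
    ("นาย", "B-PERSON"),
    ("วรรณพงษ์", "I-PERSON"),
    (" ", "I-PERSON"),
    ("ภัททิยไพบูลย์", "I-PERSON"),
]
print(merge_entities(tagged, ["PERSON"]))
# ['ทดสอบ', 'นายวรรณพงษ์ ภัททิยไพบูลย์']
```

Entities whose label is not in taglist are left as separate tokens, which matches the role the taglist parameter plays in the signature above.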
sefr_cut
Wrapper for SEFR CUT Thai word segmentation. SEFR CUT is a Thai word segmentation model using a stacked ensemble.
oskut
Wrapper for OSKut (Out-of-domain StacKed cut for Word Segmentation). OSKut handles cross- and out-of-domain samples in Thai word segmentation using a stacked ensemble framework, with DeepCut as the baseline model (ACL 2021 Findings).
newmm
The default word tokenization engine.
Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries.
The code is based on the notebooks created by Korakot Chaovavanich, with a heuristic graph size limit added to avoid exponential wait time.
- pythainlp.tokenize.newmm.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>, safe_mode: bool = False) -> List[str]
Maximal-matching word segmentation, Thai Character Cluster constrained.
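A dictionary-based word segmentation using a maximal matching algorithm, constrained to Thai Character Cluster boundaries.
A custom dictionary can be supplied.
- Parameters
text (str) – text to be tokenized
custom_dict (pythainlp.util.Trie) – dictionary for tokenization
safe_mode (bool) – reduce the chance of long processing time on long text with many ambiguous breaking points, defaults to False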
- Returns
list of tokens
- Return type
List[str]
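The idea of maximal matching can be sketched as choosing, among all dictionary-consistent splits, one with the fewest tokens. The code below is a simplified illustration only: it treats "maximal matching" as minimizing the token count (an assumption), uses a plain set as the dictionary, lets unknown single characters pass through, and omits both the TCC boundary constraint and the graph size limit described above.

```python
from typing import List, Optional, Set


def maximal_match_segment(text: str, words: Set[str]) -> List[str]:
    # Hypothetical dynamic-programming segmenter that minimizes the token count.
    n = len(text)
    # best[i] holds the shortest token list found so far for text[:i].
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prev = best[start]
            piece = text[start:end]
            # Unknown single characters are allowed so any text can be segmented.
            if prev is None or not (piece in words or end - start == 1):
                continue
            candidate = best[end]
            if candidate is None or len(prev) + 1 < len(candidate):
                best[end] = prev + [piece]
    return best[n] or []


words = {"ab", "abcd", "cdef", "e", "f"}
print(maximal_match_segment("abcdef", words))  # ['ab', 'cdef']
```

A greedy longest-first strategy would start with 'abcd' on this input and end up with three tokens, whereas considering all splits finds the two-token result.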
Subword level
tcc
An implementation of a tokenizer according to the Thai Character Cluster (TCC) rules proposed by Theeramunkong et al. 2000.
- Credits:
TCC: Jakkrit TeCho
Grammar: Wittawat Jitkrittum
Python code: Korakot Chaovavanich
etcc
Segments text into Enhanced Thai Character Clusters (ETCCs). Python implementation by Wannaphong Phatthiyaphaibun.
This implementation relies on a dictionary of ETCC created from etcc.txt in pythainlp/corpus.
Notebook: https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ
- See Also
Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.
- pythainlp.tokenize.etcc.segment(text: str) -> List[str]
Segmenting text into ETCCs.
Enhanced Thai Character Cluster (ETCC) is a kind of subword unit. The concept was presented in Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.
- Parameters
text (str) – text to be tokenized to character clusters
- Returns
list of clusters, tokenized from the text
- Return type
List[str]