pythainlp.tokenize
The pythainlp.tokenize module contains a comprehensive set of functions and classes for tokenizing Thai text into various units, such as sentences, words, subwords, and more. This module is a fundamental component of the PyThaiNLP library, providing tools for natural language processing in the Thai language.
Modules
- pythainlp.tokenize.sent_tokenize(text: str | List[str], engine: str = 'crfcut', keep_whitespace: bool = True) → List[str]
Sentence tokenizer.
Tokenizes running text into “sentences”. Supports both string and list of strings.
- Parameters:
text – the text (string) or list of words (list of strings) to be tokenized
engine (str) – choose among ‘crfcut’, ‘whitespace’, ‘whitespace+newline’
- Returns:
list of split sentences
- Return type:
List[str]
- Options for engine
crfcut - (default) split by CRF trained on TED dataset
thaisum - the implementation of the sentence segmenter from Nakhun Chumpolsathien, 2020
tltk - split by TLTK
wtp - split by wtpsplit. It supports several model sizes: wtp uses the mini model, wtp-tiny uses the wtp-bert-tiny model (default), wtp-mini uses the wtp-bert-mini model, wtp-base uses the wtp-canine-s-1l model, and wtp-large uses the wtp-canine-s-12l model.
whitespace+newline - split by whitespace and newline
whitespace - split by whitespace, specifically with the regex pattern r" +"
- Example:
Split the text based on whitespace:
    from pythainlp.tokenize import sent_tokenize

    sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
    sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

    sent_tokenize(sentence_1, engine="whitespace")
    # output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

    sent_tokenize(sentence_2, engine="whitespace")
    # output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
    #   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']
Split the text based on whitespace and newline:
    sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
    sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

    sent_tokenize(sentence_1, engine="whitespace+newline")
    # output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

    sent_tokenize(sentence_2, engine="whitespace+newline")
    # output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
    #   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']
Split the text using CRF trained on TED dataset:
    sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
    sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค"

    sent_tokenize(sentence_1, engine="crfcut")
    # output: ['ฉันไปประชุมเมื่อวันที่ 11 มีนาคม']

    sent_tokenize(sentence_2, engine="crfcut")
    # output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ ',
    #   'และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค']
Splits Thai text into sentences. This function identifies sentence boundaries, which is essential for text segmentation and analysis.
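The whitespace engine is described above as splitting on the regex pattern r" +". A minimal stand-alone sketch of that idea follows; the helper name split_on_spaces is hypothetical and this is not PyThaiNLP's own code, only an illustration of what splitting on runs of spaces does.

```python
import re
from typing import List


def split_on_spaces(text: str) -> List[str]:
    """Split text on runs of one or more space characters.

    Hypothetical helper, written only to illustrate the r" +" pattern.
    """
    # r" +" matches spaces only, so tabs and newlines stay inside the pieces.
    # Empty pieces (from leading or trailing spaces) are dropped.
    return [piece for piece in re.split(r" +", text) if piece]


print(split_on_spaces("ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"))
# -> ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']
```

Because the pattern matches only the space character, a newline that follows a space remains attached to the next piece.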
- pythainlp.tokenize.paragraph_tokenize(text: str, engine: str = 'wtp-mini', paragraph_threshold: float = 0.5, style: str = 'newline') → List[List[str]]
Paragraph tokenizer.
Tokenizes text into paragraphs.
- Parameters:
- Returns:
list of paragraphs
- Return type:
List[List[str]]
- Options for engine
wtp - split by wtpsplit. It supports several model sizes: wtp uses the mini model, wtp-tiny uses the wtp-bert-tiny model (default), wtp-mini uses the wtp-bert-mini model, wtp-base uses the wtp-canine-s-1l model, and wtp-large uses the wtp-canine-s-12l model.
- Example:
Split the text based on wtp:
    from pythainlp.tokenize import paragraph_tokenize

    sent = (
        "(1) บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต"
        + " มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด"
        + " จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ณ ที่นี้"
    )

    paragraph_tokenize(sent)
    # output: [
    #   ['(1) '],
    #   [
    #     'บทความนี้ผู้เขียนสังเคราะห์ขึ้นมาจากผลงานวิจัยที่เคยทำมาในอดีต ',
    #     'มิได้ทำการศึกษาค้นคว้าใหม่อย่างกว้างขวางแต่อย่างใด ',
    #     'จึงใคร่ขออภัยในความบกพร่องทั้งปวงมา ',
    #     'ณ ที่นี้'
    #   ]]
Segments text into paragraphs, which can be valuable for document-level analysis or summarization.
- pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc', keep_whitespace: bool = True) → List[str]
Subword tokenizer for tokenizing text into units smaller than syllables.
Tokenizes text into inseparable units of contiguous Thai characters, namely Thai Character Clusters (TCCs). TCCs are units based on Thai spelling features that cannot be separated any further, such as ‘ก็’, ‘จะ’, ‘ไม่’, and ‘ฝา’. If these units were split, they could no longer be spelled out. This function applies TCC rules to tokenize the text into the smallest units.
For example, the word ‘ขนมชั้น’ would be tokenized into ‘ข’, ‘น’, ‘ม’, and ‘ชั้น’.
- Parameters:
- Returns:
list of subwords
- Return type:
List[str]
- Options for engine
dict - newmm word tokenizer with a syllable dictionary
etcc - Enhanced Thai Character Cluster (Inrut et al. 2001)
han_solo - CRF syllable segmenter for Thai that can work in the Thai social media domain. See PyThaiNLP/Han-solo.
ssg - CRF syllable segmenter for Thai. See ponrawee/ssg.
tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)
tcc_p - Thai Character Cluster + improved rules that are used in newmm
tltk - syllable tokenizer from tltk. See tltk.
wangchanberta - SentencePiece from wangchanberta model
- Example:
Tokenize text into subwords based on tcc:
    from pythainlp.tokenize import subword_tokenize

    text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
    text_2 = "ความแปลกแยกและพัฒนาการ"

    subword_tokenize(text_1, engine='tcc')
    # output: ['ยุ', 'ค', 'เริ่ม', 'แร', 'ก',
    #   'ข', 'อ', 'ง', ' ', 'รา', 'ช', 'ว', 'ง',
    #   'ศ', '์', 'ห', 'มิ', 'ง']

    subword_tokenize(text_2, engine='tcc')
    # output: ['ค', 'วา', 'ม', 'แป', 'ล', 'ก', 'แย', 'ก',
    #   'และ', 'พัฒ', 'นา', 'กา', 'ร']
Tokenize text into subwords based on etcc:
    text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
    text_2 = "ความแปลกแยกและพัฒนาการ"

    subword_tokenize(text_1, engine='etcc')
    # output: ['ยุคเริ่มแรกของ ราชวงศ์หมิง']

    subword_tokenize(text_2, engine='etcc')
    # output: ['ความแปลกแยกและ', 'พัฒ', 'นาการ']
Tokenize text into subwords based on wangchanberta:
    text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
    text_2 = "ความแปลกแยกและพัฒนาการ"

    subword_tokenize(text_1, engine='wangchanberta')
    # output: ['▁', 'ยุค', 'เริ่มแรก', 'ของ', '▁', 'ราชวงศ์', 'หมิง']

    subword_tokenize(text_2, engine='wangchanberta')
    # output: ['▁ความ', 'แปลก', 'แยก', 'และ', 'พัฒนาการ']
Tokenizes text into subwords, which can be helpful for various NLP tasks, including subword embeddings.
- pythainlp.tokenize.syllable_tokenize(text: str, engine: str = 'han_solo', keep_whitespace: bool = True) → List[str]
Syllable tokenizer
Tokenizes text into inseparable units of Thai syllables.
- Parameters:
- Returns:
list of subwords
- Return type:
List[str]
- Options for engine
dict - newmm word tokenizer with a syllable dictionary
han_solo - CRF syllable segmenter for Thai that can work in the Thai social media domain. See PyThaiNLP/Han-solo.
ssg - CRF syllable segmenter for Thai. See ponrawee/ssg.
tltk - syllable tokenizer from tltk. See tltk.
Divides text into syllables, allowing you to work with individual Thai language phonetic units.
- pythainlp.tokenize.word_tokenize(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>, engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True) → List[str]
Word tokenizer.
Tokenizes running text into words (list of strings).
- Parameters:
text (str) – text to be tokenized
engine (str) – name of the tokenizer to be used
custom_dict (pythainlp.util.Trie) – dictionary trie (some engines may not support it)
keep_whitespace (bool) – True to keep whitespace, a common mark for end of phrase in Thai. Otherwise, whitespace is omitted.
join_broken_num (bool) – True to rejoin formatted numeric strings (e.g. time, decimals, IP addresses) that the tokenizer may have wrongly separated. Otherwise, they are left separated.
- Returns:
list of words
- Return type:
List[str]
- Options for engine
attacut - wrapper for AttaCut, learning-based approach
deepcut - wrapper for DeepCut, learning-based approach
icu - wrapper for a word tokenizer in PyICU, from ICU (International Components for Unicode), dictionary-based
longest - dictionary-based, longest matching
mm - “multi-cut”, dictionary-based, maximum matching
nercut - dictionary-based, maximal matching, constrained by Thai Character Cluster (TCC) boundaries, combining tokens that are parts of the same named entity
newmm (default) - “new multi-cut”, dictionary-based, maximum matching, constrained by Thai Character Cluster (TCC) boundaries with improved TCC rules
newmm-safe - newmm, with a mechanism to avoid long processing time for text with continuously ambiguous breaking points
nlpo3 - wrapper for a word tokenizer in nlpO3, an adaptation of newmm in Rust (2.5x faster)
oskut - wrapper for OSKut, Out-of-domain StacKed cut for Word Segmentation
sefr_cut - wrapper for SEFR CUT, Stacked Ensemble Filter and Refine for Word Segmentation
tltk - wrapper for TLTK, maximum collocation approach
- Note:
The custom_dict parameter only works for deepcut, longest, newmm, and newmm-safe engines.
- Example:
Tokenize text with different tokenizers:
    from pythainlp.tokenize import word_tokenize

    text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

    word_tokenize(text, engine="newmm")
    # output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

    word_tokenize(text, engine='attacut')
    # output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
Tokenize text with whitespace omitted:
    text = "วรรณกรรม ภาพวาด และการแสดงงิ้ว "

    word_tokenize(text, engine="newmm")
    # output:
    # ['วรรณกรรม', ' ', 'ภาพวาด', ' ', 'และ', 'การแสดง', 'งิ้ว', ' ']

    word_tokenize(text, engine="newmm", keep_whitespace=False)
    # output: ['วรรณกรรม', 'ภาพวาด', 'และ', 'การแสดง', 'งิ้ว']
Join broken formatted numeric (e.g. time, decimals, IP addresses):
    text = "เงิน1,234บาท19:32น 127.0.0.1"

    word_tokenize(text, engine="attacut", join_broken_num=False)
    # output:
    # ['เงิน', '1', ',', '234', 'บาท', '19', ':', '32น', ' ',
    #  '127', '.', '0', '.', '0', '.', '1']

    word_tokenize(text, engine="attacut", join_broken_num=True)
    # output:
    # ['เงิน', '1,234', 'บาท', '19:32น', ' ', '127.0.0.1']
Tokenize with default and custom dictionaries:
    from pythainlp.corpus.common import thai_words
    from pythainlp.util import dict_trie

    text = 'ชินโซ อาเบะ เกิด 21 กันยายน'

    word_tokenize(text, engine="newmm")
    # output:
    # ['ชิน', 'โซ', ' ', 'อา', 'เบะ', ' ',
    #  'เกิด', ' ', '21', ' ', 'กันยายน']

    # start from the default word list, then add the two names
    custom_dict_japanese_name = set(thai_words())
    custom_dict_japanese_name.add('ชินโซ')
    custom_dict_japanese_name.add('อาเบะ')

    trie = dict_trie(dict_source=custom_dict_japanese_name)

    word_tokenize(text, engine="newmm", custom_dict=trie)
    # output:
    # ['ชินโซ', ' ', 'อาเบะ', ' ',
    #  'เกิด', ' ', '21', ' ', 'กันยายน']
Splits text into words. This function is a fundamental tool for Thai language text analysis.
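To make the effect of join_broken_num more concrete, here is a simplified stand-alone sketch of the rejoining idea. It is an assumption-laden illustration, not PyThaiNLP's actual post-processing: the helper name rejoin_numbers and the separator set are hypothetical, and the sketch only merges digit tokens that are linked by a single ",", "." or ":" token.

```python
import re
from typing import List

# Hypothetical separator set chosen for this illustration only.
_SEPARATORS = {",", ".", ":"}


def _is_digits(token: str) -> bool:
    """Return True if the token consists only of digit characters."""
    return re.fullmatch(r"\d+", token) is not None


def rejoin_numbers(tokens: List[str]) -> List[str]:
    """Merge runs like '1', ',', '234' back into '1,234'."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if _is_digits(tok):
            # Absorb each following "<separator><digits>" pair.
            while (
                i + 2 < len(tokens)
                and tokens[i + 1] in _SEPARATORS
                and _is_digits(tokens[i + 2])
            ):
                tok += tokens[i + 1] + tokens[i + 2]
                i += 2
        out.append(tok)
        i += 1
    return out


broken = ["เงิน", "1", ",", "234", "บาท", " ", "127", ".", "0", ".", "0", ".", "1"]
print(rejoin_numbers(broken))
# -> ['เงิน', '1,234', 'บาท', ' ', '127.0.0.1']
```

This sketch does not handle a case such as '19', ':', '32น' from the example above, where the trailing token mixes digits and letters.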
- pythainlp.tokenize.word_detokenize(segments: List[List[str]] | List[str], output: str = 'str') → List[str] | str
Word detokenizer.
This function will detokenize the list of words in each sentence into text.
- Parameters:
- Returns:
the Thai text
- Return type:
List[str] | str
- Example:
    from pythainlp.tokenize import word_detokenize

    print(word_detokenize(["เรา", "เล่น"]))
    # output: เราเล่น
Reverses the tokenization process, reconstructing text from tokenized units. Useful for text generation tasks.
- class pythainlp.tokenize.Tokenizer(custom_dict: Trie | Iterable[str] | str = [], engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True)
Tokenizer class for a custom tokenizer.
This class allows users to pre-define a custom dictionary along with a tokenizer and encapsulate them in a single object. It is a wrapper for two functions: pythainlp.tokenize.word_tokenize() and pythainlp.util.dict_trie().
- Example:
Tokenizer object instantiated with pythainlp.util.Trie:
    from pythainlp.tokenize import Tokenizer
    from pythainlp.corpus.common import thai_words
    from pythainlp.util import dict_trie

    custom_words_list = set(thai_words())
    custom_words_list.add('อะเฟเซีย')
    custom_words_list.add('Aphasia')
    trie = dict_trie(dict_source=custom_words_list)

    text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
    _tokenizer = Tokenizer(custom_dict=trie, engine='newmm')
    _tokenizer.word_tokenize(text)
    # output:
    # ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
    #  'ผิดปกติ', 'ของ', 'การ', 'พูด']
Tokenizer object instantiated with a list of words:
    text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
    _tokenizer = Tokenizer(custom_dict=list(thai_words()), engine='newmm')
    _tokenizer.word_tokenize(text)
    # output:
    # ['อะ', 'เฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
    #  'ผิดปกติ', 'ของ', 'การ', 'พูด']
Tokenizer object instantiated with a file path containing a list of words separated with newline and explicitly setting a new tokenizer after initiation:
    PATH_TO_CUSTOM_DICTIONARY = './custom_dictionary.txt'

    # write a file
    with open(PATH_TO_CUSTOM_DICTIONARY, 'w', encoding='utf-8') as f:
        f.write('อะเฟเซีย\nAphasia\nผิด\nปกติ')

    text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"

    # initiate an object from file with `attacut` as tokenizer
    _tokenizer = Tokenizer(custom_dict=PATH_TO_CUSTOM_DICTIONARY,
                           engine='attacut')
    _tokenizer.word_tokenize(text)
    # output:
    # ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ', 'ผิด',
    #  'ปกติ', 'ของ', 'การ', 'พูด']

    # change tokenizer to `newmm`
    _tokenizer.set_tokenizer_engine(engine='newmm')
    _tokenizer.word_tokenize(text)
    # output:
    # ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็นอาการ', 'ผิด',
    #  'ปกติ', 'ของการพูด']
The Tokenizer class is a versatile tool for customizing tokenization processes and managing tokenization models. It provides various methods and attributes to fine-tune tokenization according to your specific needs.
- __init__(custom_dict: Trie | Iterable[str] | str = [], engine: str = 'newmm', keep_whitespace: bool = True, join_broken_num: bool = True)
Initialize tokenizer object.
- Parameters:
custom_dict (Trie | Iterable[str] | str) – a file path, a list of words to be used to create a trie, or an instantiated pythainlp.util.Trie object
engine (str) – choose between different options of tokenizer engines (e.g. newmm, mm, longest, deepcut)
keep_whitespace (bool) – True to keep whitespace, a common mark for end of phrase in Thai
- pythainlp.tokenize.display_cell_tokenize(text: str) → List[str]
Display cell tokenizer.
Tokenizes Thai text into display cells without splitting tone marks.
- Parameters:
text (str) – text to be tokenized
- Returns:
list of display cells
- Return type:
List[str]
- Example:
Tokenize Thai text into display cells:
    from pythainlp.tokenize import display_cell_tokenize

    text = "แม่น้ำอยู่ที่ไหน"
    display_cell_tokenize(text)
    # output: ['แ', 'ม่', 'น้ํ', 'า', 'อ', 'ยู่', 'ที่', 'ไ', 'ห', 'น']
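The idea of a display cell can be approximated by attaching each non-spacing mark (Unicode category Mn, which covers Thai upper and lower vowels and tone marks) to the character before it. The sketch below is a rough stand-alone approximation under that assumption, not the library's implementation; the helper name display_cells is hypothetical. In particular, it does not decompose sara am (ำ) into two parts the way the output above does.

```python
import unicodedata
from typing import List


def display_cells(text: str) -> List[str]:
    """Group each base character with the non-spacing marks that follow it.

    Hypothetical helper for illustration; it relies only on Unicode
    general categories and knows nothing Thai-specific.
    """
    cells: List[str] = []
    for ch in text:
        if cells and unicodedata.category(ch) == "Mn":
            # Non-spacing mark: it is drawn on the previous cell.
            cells[-1] += ch
        else:
            # Any other character starts a new cell.
            cells.append(ch)
    return cells


print(display_cells("อยู่ที่ไหน"))
# -> ['อ', 'ยู่', 'ที่', 'ไ', 'ห', 'น']
```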
Tokenization Engines
This module offers multiple tokenization engines designed for different levels of text analysis.
Sentence level
crfcut
thaisumcut
The implementation of the sentence segmenter from Nakhun Chumpolsathien, 2020. The original code is from: https://github.com/nakhunchumpolsathien/ThaiSum
Cite:
- @mastersthesis{chumpolsathien_2020,
title={Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization},
author={Chumpolsathien, Nakhun},
year={2020},
school={Beijing Institute of Technology}
}
A sentence tokenizer based on a maximum entropy model. It’s a great choice for sentence boundary detection in Thai text.
Word level
attacut
deepcut
multi_cut
Multi cut – Thai word segmentation with maximum matching. Original code from Korakot Chaovavanich.
An ensemble tokenizer that combines multiple tokenization strategies for improved word segmentation.
- class pythainlp.tokenize.multi_cut.LatticeString(value, multi=None, in_dict=True)
String that keeps possible tokenizations
- pythainlp.tokenize.multi_cut.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) → List[str]
Dictionary-based maximum matching word segmentation.
- pythainlp.tokenize.multi_cut.find_all_segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) → List[str]
Get all possible segment variations.
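What "all possible segment variations" means can be shown with a small recursive enumeration over a toy dictionary. This is a sketch of the concept only: the function name all_segmentations and the four-word dictionary are hypothetical, a plain set stands in for the Trie, and the result is a list of token lists rather than the List[str] that find_all_segment is documented to return.

```python
from typing import List, Set


def all_segmentations(text: str, words: Set[str]) -> List[List[str]]:
    """Enumerate every way to split text into dictionary words."""
    if not text:
        # One way to segment the empty string: no tokens at all.
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Combine this head with every segmentation of the rest.
            for tail in all_segmentations(text[end:], words):
                results.append([head] + tail)
    return results


toy_dict = {"ตา", "กลม", "ตาก", "ลม"}  # hypothetical toy dictionary
print(all_segmentations("ตากลม", toy_dict))
# -> [['ตา', 'กลม'], ['ตาก', 'ลม']]
```

The plain recursion is exponential in the worst case, which is acceptable for a short illustration but not for long inputs.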
nlpo3
longest
Dictionary-based longest-matching Thai word segmentation. The implementation is based on code from Patorn Utenpattanun.
A tokenizer that identifies word boundaries by selecting the longest possible words in a text.
- pythainlp.tokenize.longest.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) → List[str]
Dictionary-based longest matching word segmentation.
- Parameters:
text (str) – text to be tokenized into words
custom_dict (pythainlp.util.Trie) – dictionary for tokenization
- Returns:
list of words, tokenized from the text
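Longest matching can be sketched as a greedy left-to-right scan that always takes the longest dictionary word starting at the current position. The code below is a minimal illustration under stated assumptions, not the library's implementation: longest_match is a hypothetical name, a plain set stands in for the Trie, and unknown characters are emitted one at a time.

```python
from typing import List, Set


def longest_match(text: str, words: Set[str]) -> List[str]:
    """Greedy longest-matching segmentation over a set of words."""
    tokens: List[str] = []
    max_len = max((len(w) for w in words), default=1)
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            if text[pos:pos + size] in words:
                break
        else:
            # No dictionary word starts here: emit a single character.
            size = 1
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens


toy_dict = {"ตา", "กลม", "ตาก", "ลม"}  # hypothetical toy dictionary
print(longest_match("ตากลม", toy_dict))
# -> ['ตาก', 'ลม']
```

The greedy choice commits to the three-character word first, so the alternative reading ['ตา', 'กลม'] is never considered.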
pyicu
nercut
sefr_cut
oskut
newmm (Default)
Dictionary-based maximal matching word segmentation, constrained by Thai Character Cluster (TCC) boundaries with improved rules.
The code is based on the notebooks created by Korakot Chaovavanich, with a heuristic graph size limit added to avoid exponential waiting time.
The default word tokenization engine that provides a balance between accuracy and efficiency for most use cases.
- pythainlp.tokenize.newmm.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>, safe_mode: bool = False) → List[str]
Maximal-matching word segmentation constrained by Thai Character Cluster.
A dictionary-based word segmentation using maximal matching algorithm, constrained by Thai Character Cluster boundaries.
A custom dictionary can be supplied.
- Parameters:
- Returns:
list of tokens
- Return type:
List[str]
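One common reading of maximal matching is "choose the segmentation with the fewest tokens". The sketch below illustrates that reading with dynamic programming over a toy dictionary. It rests on several assumptions: fewest_tokens is a hypothetical name, a plain set stands in for the Trie, single characters are allowed as a fallback, and the TCC boundary constraint and graph size limit described above are left out entirely.

```python
from typing import List, Optional, Set


def fewest_tokens(text: str, words: Set[str]) -> List[str]:
    """Return a segmentation of text that uses the fewest tokens."""
    n = len(text)
    # best[i] holds the shortest token list found for text[:i].
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for start in range(n):
        prefix = best[start]
        if prefix is None:
            continue
        for end in range(start + 1, n + 1):
            piece = text[start:end]
            # Accept dictionary words, plus single characters as fallback.
            if piece in words or end == start + 1:
                candidate = prefix + [piece]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    result = best[n]
    return result if result is not None else []


toy_dict = {"ไป", "หา", "หาม", "เห", "สี", "มเหสี"}  # hypothetical
print(fewest_tokens("ไปหามเหสี", toy_dict))
# -> ['ไป', 'หา', 'มเหสี']
```

A greedy longest-first scan over the same toy dictionary would take หาม after ไป and end up with four tokens, which is why the two strategies can disagree.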
Subword level
tcc
The implementation of tokenizer according to Thai Character Clusters (TCCs) rules proposed by Theeramunkong et al. 2000.
- Credits:
TCC: Jakkrit TeCho
Grammar: Wittawat Jitkrittum (link to the source file)
Python code: Korakot Chaovavanich
Tokenizes text into Thai Character Clusters (TCCs), a subword level representation.
- pythainlp.tokenize.tcc.tcc(text: str) → str
TCC generator which generates Thai Character Clusters
tcc+
The implementation of a tokenizer according to the Thai Character Cluster (TCC) rules proposed by Theeramunkong et al. 2000, with the improved rules that are used in newmm.
- Credits:
TCC: Jakkrit TeCho
Grammar: Wittawat Jitkrittum (link to the source file)
Python code: Korakot Chaovavanich
A subword tokenizer that includes additional rules for more precise subword segmentation.
- pythainlp.tokenize.tcc_p.tcc(text: str) → str
TCC generator which generates Thai Character Clusters
etcc
Segmenting text into Enhanced Thai Character Clusters (ETCCs). Python implementation by Wannaphong Phatthiyaphaibun.
This implementation relies on a dictionary of ETCC created from etcc.txt in pythainlp/corpus.
Notebook: https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ
- See Also:
Jeeragone Inrut, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.
Enhanced Thai Character Clusters (eTCC) tokenizer for subword-level analysis.
- pythainlp.tokenize.etcc.segment(text: str) → List[str]
Segmenting text into ETCCs.
Enhanced Thai Character Cluster (ETCC) is a kind of subword unit. The concept was presented in Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.
- Parameters:
text (str) – text to be tokenized into character clusters
- Returns:
list of clusters, tokenized from the text
- Return type:
List[str]
han_solo