pythainlp.tokenize¶
The pythainlp.tokenize module
contains multiple functions for tokenizing a chunk of Thai text into desired units.
Modules¶
-
pythainlp.tokenize.
sent_tokenize
(text: str, engine: str = 'whitespace+newline') → List[str][source]¶ This function does not yet automatically recognize where a sentence actually ends. Rather, it splits the text wherever whitespace or a newline is found.
- Parameters
- Returns
list of split sentences
- Return type
- Options for engine
whitespace+newline (default) - split by whitespace tokens and newlines.
whitespace - split by whitespace tokens only. Specifically, with the regex pattern r" +"
- Example
Split the text based on whitespace:
from pythainlp.tokenize import sent_tokenize

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

sent_tokenize(sentence_2, engine="whitespace")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']
Split the text based on whitespace and newline:
sent_tokenize(sentence_1, engine="whitespace+newline")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

sent_tokenize(sentence_2, engine="whitespace+newline")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']
-
pythainlp.tokenize.
word_tokenize
(text: str, custom_dict: Optional[pythainlp.tokenize.trie.Trie] = None, engine: str = 'newmm', keep_whitespace: bool = True) → List[str][source]¶ This function tokenizes running text into words.
- Parameters
- Returns
list of words
- Return type
- Options for engine
newmm (default) - dictionary-based, Maximum Matching + Thai Character Cluster
newmm-safe - newmm, with a mechanism to avoid long processing time for some long continuous text without spaces
longest - dictionary-based, Longest Matching
icu - wrapper for ICU (International Components for Unicode, using PyICU), dictionary-based
attacut - wrapper for AttaCut, a learning-based approach
deepcut - wrapper for DeepCut, a learning-based approach
Warning
The engine option ulmfit has been deprecated since PyThaiNLP version 2.1.
- Note
The parameter custom_dict can be provided as an argument only for the newmm, longest, and attacut engines.
- Example
Tokenize text with different tokenizers:
from pythainlp.tokenize import word_tokenize

text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

word_tokenize(text, engine="newmm")
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

word_tokenize(text, engine='attacut')
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
Tokenize text, omitting whitespace:
text = "วรรณกรรม ภาพวาด และการแสดงงิ้ว " word_tokenize(text, engine="newmm") # output: # ['วรรณกรรม', ' ', 'ภาพวาด', ' ', 'และ', 'การแสดง', 'งิ้ว', ' '] word_tokenize(text, engine="newmm", keep_whitespace=False) # output: ['วรรณกรรม', 'ภาพวาด', 'และ', 'การแสดง', 'งิ้ว']
Tokenize with default and custom dictionary:
from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import dict_trie

text = 'ชินโซ อาเบะ เกิด 21 กันยายน'

word_tokenize(text, engine="newmm")
# output:
# ['ชิน', 'โซ', ' ', 'อา', 'เบะ', ' ',
#  'เกิด', ' ', '21', ' ', 'กันยายน']

custom_dict_japanese_name = set(thai_words())
custom_dict_japanese_name.add('ชินโซ')
custom_dict_japanese_name.add('อาเบะ')
trie = dict_trie(dict_source=custom_dict_japanese_name)

word_tokenize(text, engine="newmm", custom_dict=trie)
# output:
# ['ชินโซ', ' ', 'อาเบะ',
#  ' ', 'เกิด', ' ', '21', ' ', 'กันยายน']
-
pythainlp.tokenize.
syllable_tokenize
(text: str, engine: str = 'default') → List[str][source]¶ This function tokenizes text into syllables (Thai: พยางค์), units of pronunciation having one vowel sound. For example, the word ‘รถไฟ’ contains two syllables: ‘รถ’ and ‘ไฟ’. Under the hood, this function uses
pythainlp.tokenize.word_tokenize()
with newmm as the tokenizer. It tokenizes the text first with the dictionary of Thai words from pythainlp.corpus.common.thai_words()
and then with the dictionary of Thai syllables from pythainlp.corpus.common.thai_syllables()
. As a result, only syllables are obtained.
- Parameters
- Returns
list of syllables where whitespaces in the text are included
- Return type
- Options for engine
default
ssg - CRF syllable segmenter for Thai.
- Example:
from pythainlp.tokenize import syllable_tokenize

text = 'รถไฟสมัยใหม่จะใช้กำลังจากหัวรถจักรดีเซล หรือจากไฟฟ้า'

syllable_tokenize(text)
# output: ['รถ', 'ไฟ', 'สมัย', 'ใหม่', 'ใช้', 'กำ', 'ลัง', 'จาก',
#   'หัว', 'รถ', 'จักร', 'ดี', 'เซล', ' ', 'หรือ', 'จาก', 'ไฟ', 'ฟ้า']
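The ssg engine can be selected the same way. A minimal sketch, assuming the optional ssg package (the CRF syllable segmenter backing this engine) is installed; no output is shown because it depends on the installed model:

from pythainlp.tokenize import syllable_tokenize

# engine="ssg" requires the optional `ssg` dependency to be installed
syllable_tokenize('รถไฟสมัยใหม่', engine="ssg")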
-
pythainlp.tokenize.
subword_tokenize
(text: str, engine: str = 'tcc') → List[str][source]¶ This function tokenizes text into inseparable units of Thai contiguous characters, namely Thai Character Clusters (TCCs). TCCs are units based on Thai spelling features that cannot be separated any further, such as ‘ก็’, ‘จะ’, ‘ไม่’, and ‘ฝา’. If these units were split apart, they could not be spelled out. This function applies the TCC rules to tokenize the text into the smallest possible units.
For example, the word ‘ขนมชั้น’ would be tokenized into ‘ข’, ‘น’, ‘ม’, and ‘ชั้น’.
- Parameters
- Returns
list of subwords
- Return type
- Options for engine
tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)
ssg - CRF syllable segmenter for Thai.
etcc - Enhanced Thai Character Cluster (Inrut et al. 2001) [In development]
- Example
Tokenize text into subword based on tcc:
from pythainlp.tokenize import subword_tokenize

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='tcc')
# output: ['ยุ', 'ค', 'เริ่ม', 'แร', 'ก',
#   'ข', 'อ', 'ง', ' ', 'รา', 'ช', 'ว', 'ง',
#   'ศ', '์', 'ห', 'มิ', 'ง']

subword_tokenize(text_2, engine='tcc')
# output: ['ค', 'วา', 'ม', 'แป', 'ล', 'ก', 'แย', 'ก', 'และ',
#   'พัฒ', 'นา', 'กา', 'ร']
Tokenize text into subword based on etcc (Work In Progress):
text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='etcc')
# output: ['ยุคเริ่มแรกของ ราชวงศ์หมิง']

subword_tokenize(text_2, engine='etcc')
# output: ['ความแปลกแยกและ', 'พัฒ', 'นาการ']
-
pythainlp.tokenize.
dict_trie
(dict_source: Union[str, Iterable[str], pythainlp.tokenize.trie.Trie]) → pythainlp.tokenize.trie.Trie[source]¶ Create a dictionary trie to be used with the word_tokenize() function.
- Parameters
dict_source (str|Iterable[str]|pythainlp.tokenize.Trie) – a path to dictionary file or a list of words or a pythainlp.tokenize.Trie object
- Returns
a trie object created from a dictionary input
- Return type
pythainlp.tokenize.Trie
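A minimal sketch of the accepted forms of dict_source; the added word is illustrative and the commented file path is hypothetical:

from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import dict_trie

# from an iterable of words: the default Thai word list plus one custom word
words = set(thai_words())
words.add('อะเฟเซีย')
trie = dict_trie(dict_source=words)

# from a path to a dictionary file with one word per line (hypothetical path)
# trie = dict_trie(dict_source='./my_dictionary.txt')

# an existing pythainlp.tokenize.Trie object is also accepted
trie = dict_trie(dict_source=trie)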
-
class
pythainlp.tokenize.
Tokenizer
(custom_dict: Optional[Union[pythainlp.tokenize.trie.Trie, Iterable[str], str]] = None, engine: str = 'newmm')[source]¶ This class allows users to pre-define a custom dictionary along with a tokenizer and encapsulate them into a single object. It is a wrapper for the two functions
pythainlp.tokenize.word_tokenize()
and pythainlp.tokenize.dict_trie()
- Example
Tokenizer object instantiated with
pythainlp.tokenize.Trie
:

from pythainlp.tokenize import Tokenizer, dict_trie
from pythainlp.corpus.common import thai_words

custom_words_list = set(thai_words())
custom_words_list.add('อะเฟเซีย')
custom_words_list.add('Aphasia')
trie = dict_trie(dict_source=custom_words_list)

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"

_tokenizer = Tokenizer(custom_dict=trie, engine='newmm')
_tokenizer.word_tokenize(text)
# output: ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#   'ผิดปกติ', 'ของ', 'การ', 'พูด']
Tokenizer object instantiated with a list of words:
text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด" _tokenizer = Tokenizer(custom_dict=list(thai_words()), engine='newmm') _tokenizer.word_tokenize(text) # output: # ['อะ', 'เฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ', # 'ผิดปกติ', 'ของ', 'การ', 'พูด']
Tokenizer object instantiated with a path to a file containing a list of words separated by newlines, with a new tokenizer explicitly set after initiation:
PATH_TO_CUSTOM_DICTIONARY = './custom_dictionary.txt'

# write a dictionary file, one word per line
with open(PATH_TO_CUSTOM_DICTIONARY, 'w', encoding='utf-8') as f:
    f.write('อะเฟเซีย\nAphasia\nผิด\nปกติ')

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"

# initiate an object from the file with `attacut` as tokenizer
_tokenizer = Tokenizer(custom_dict=PATH_TO_CUSTOM_DICTIONARY,
                       engine='attacut')

_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ', 'ผิด',
#  'ปกติ', 'ของ', 'การ', 'พูด']

# change tokenizer to `newmm`
_tokenizer.set_tokenize_engine(engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็นอาการ', 'ผิด',
#  'ปกติ', 'ของการพูด']
Tokenization Engines¶
newmm¶
Dictionary-based Thai word segmentation using the maximal matching algorithm and Thai Character Clusters (TCC). The code is based on the notebooks created by Korakot Chaovavanich.
- See Also
-
pythainlp.tokenize.newmm.
segment
(text: str, custom_dict: pythainlp.tokenize.trie.Trie = <pythainlp.tokenize.trie.Trie object>, safe_mode: bool = False) → List[str][source]¶ Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster boundaries.
- Parameters
text (str) – text to be tokenized into words
custom_dict (pythainlp.trie.Trie) – dictionary for tokenization
safe_mode (bool) – True to help avoid a long wait for text with long and continuous ambiguous breaking points. A long wait may still occur. Default is False.
- Returns
list of words, tokenized from the text
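A minimal usage sketch of the low-level segment() function; most users will call word_tokenize() with engine="newmm" instead. The sentence and expected output are taken from the word_tokenize example above:

from pythainlp.tokenize.newmm import segment

segment("โอเคบ่พวกเรารักภาษาบ้านเกิด")
# expected: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

# safe_mode=True helps avoid a long wait on long, space-free text
# with many ambiguous breaking points
segment("โอเคบ่พวกเรารักภาษาบ้านเกิด", safe_mode=True)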
longest¶
Dictionary-based longest-matching Thai word segmentation. Implementation based on the code from Patorn Utenpattanun.
- See Also
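The longest engine is normally reached through word_tokenize(); a minimal sketch (the sentence is illustrative and no output is claimed):

from pythainlp.tokenize import word_tokenize

# dictionary-based longest matching
word_tokenize("โอเคบ่พวกเรารักภาษาบ้านเกิด", engine="longest")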
multi_cut¶
Multi cut – Thai word segmentation with maximum matching. The original source code is from Korakot Chaovavanich.
- See Also
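A hedged sketch of the module-level interface, assuming pythainlp.tokenize.multi_cut exposes segment() and find_all_segment() as in the library source; check the module if unsure:

from pythainlp.tokenize.multi_cut import segment, find_all_segment

text = "ผมรักคุณ"  # illustrative sentence

# maximum-matching segmentation (assumed to mirror the other engines' segment())
segment(text)

# find_all_segment is assumed to enumerate alternative segmentations
find_all_segment(text)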
pyicu¶
Wrapper for PyICU word segmentation. This wrapper module uses
icu.BreakIterator
with Thai as icu.Locale
to locate boundaries between words from the text.
- See Also
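A minimal sketch through the top-level API; the PyICU package must be installed for the icu engine to be available, and the sentence is illustrative:

from pythainlp.tokenize import word_tokenize

# requires the PyICU package
word_tokenize("ฉันรักภาษาไทย", engine="icu")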
deepcut¶
Wrapper for deepcut Thai word segmentation. deepcut is a Thai word segmentation library using a 1D convolutional neural network.
Users need to install deepcut (and its dependency, tensorflow) themselves.
- See Also
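A minimal sketch, assuming deepcut and tensorflow have already been installed (for example with pip install deepcut); the sentence is illustrative:

# pip install deepcut   (pulls in tensorflow)
from pythainlp.tokenize import word_tokenize

word_tokenize("ฉันรักภาษาไทย", engine="deepcut")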
tcc¶
The implementation of a tokenizer according to the Thai Character Cluster (TCC) rules proposed by Theeramunkong et al. 2000.
- Credits:
TCC: Jakkrit TeCho
Grammar: Wittawat Jitkrittum (link to the source file)
Python code: Korakot Chaovavanich
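TCC segmentation is also reachable through subword_tokenize(engine='tcc') shown above. The direct call below assumes the module exposes a segment() function, as the other engine modules do; the expected output is copied from the subword_tokenize example:

from pythainlp.tokenize.tcc import segment

segment("ความแปลกแยกและพัฒนาการ")
# expected: ['ค', 'วา', 'ม', 'แป', 'ล', 'ก', 'แย', 'ก', 'และ',
#   'พัฒ', 'นา', 'กา', 'ร']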
etcc¶
Enhanced Thai Character Cluster (ETCC) (In progress) Python implementation by Wannaphong Phatthiyaphaibun (19 June 2017)
- See Also
Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.