pythainlp.tokenize

The pythainlp.tokenize module contains multiple functions for tokenizing a chunk of Thai text into desirable units.

Modules

pythainlp.tokenize.clause_tokenize(doc: List[str]) → List[List[str]]

Clause tokenizer (also known as clause segmentation).

Tokenizes a running list of words into a list of clauses (each clause is a list of strings). Clause boundaries are predicted by a CRF model trained on the LST20 Corpus.

Parameters

doc (List[str]) – list of words to be split into clauses

Returns

list of clauses

Return type

list[list[str]]

Example

from pythainlp.tokenize import clause_tokenize

clause_tokenize(["ฉัน", "นอน", "และ", "คุณ", "เล่น", "มือถือ", "ส่วน", "น้อง", "เขียน", "โปรแกรม"])
# output: [['ฉัน', 'นอน'], ['และ', 'คุณ', 'เล่น', 'มือถือ'],
#   ['ส่วน', 'น้อง', 'เขียน', 'โปรแกรม']]

pythainlp.tokenize.sent_tokenize(text: str, engine: str = 'crfcut', keep_whitespace: bool = True) → List[str]

Sentence tokenizer.

Tokenizes running text into “sentences”.

Parameters
  • text (str) – the text to be tokenized

  • engine (str) – choose among ‘crfcut’, ‘whitespace’, ‘whitespace+newline’

Returns

list of split sentences

Return type

list[str]

Options for engine
  • crfcut - (default) split by CRF trained on TED dataset

  • whitespace+newline - split by whitespaces and newline.

  • whitespace - split by whitespaces. Specifically, with regex pattern r" +"
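
The difference between the two whitespace-based engines can be illustrated with a minimal, self-contained sketch. It does not call pythainlp. The pattern r" +" comes from the description above, while the use of str.split() for the newline-aware variant, the helper names, and the sample text are assumptions made for illustration.

```python
import re


def split_on_spaces(text: str) -> list:
    """Split on runs of the space character only (pattern r" +")."""
    return re.split(r" +", text)


def split_on_whitespace_and_newline(text: str) -> list:
    """Split on any run of whitespace, newlines included (assumed behaviour)."""
    return text.split()


# Hypothetical sample: two spaces-separated gaps and one newline.
sample = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม\nแล้วกลับบ้าน"

spaces_only = split_on_spaces(sample)
with_newline = split_on_whitespace_and_newline(sample)

print(len(spaces_only))   # the newline stays inside the last piece
print(len(with_newline))  # the newline also acts as a separator
```

With a space-only pattern, a newline is kept inside the surrounding piece; splitting on all whitespace breaks the text there as well.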

Example

Split the text based on whitespace:

from pythainlp.tokenize import sent_tokenize

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']

sent_tokenize(sentence_2, engine="whitespace")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']

Split the text based on whitespace and newline:

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="whitespace+newline")
# output: ['ฉันไปประชุมเมื่อวันที่', '11', 'มีนาคม']
sent_tokenize(sentence_2, engine="whitespace+newline")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ',
#   '\nและได้รับมอบหมายให้ประจำในระดับภูมิภาค']

Split the text using CRF trained on TED dataset:

sentence_1 = "ฉันไปประชุมเมื่อวันที่ 11 มีนาคม"
sentence_2 = "ข้าราชการได้รับการหมุนเวียนเป็นระยะ \
และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค"

sent_tokenize(sentence_1, engine="crfcut")
# output: ['ฉันไปประชุมเมื่อวันที่ 11 มีนาคม']

sent_tokenize(sentence_2, engine="crfcut")
# output: ['ข้าราชการได้รับการหมุนเวียนเป็นระยะ ',
#   'และเขาได้รับมอบหมายให้ประจำในระดับภูมิภาค']

pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc', keep_whitespace: bool = True) → List[str]

Subword tokenizer. A subword can be smaller than a syllable.

Tokenizes text into inseparable units of contiguous Thai characters, namely Thai Character Clusters (TCCs). TCCs are units based on Thai spelling features that cannot be split any further, such as ‘ก็’, ‘จะ’, ‘ไม่’, and ‘ฝา’. If such a unit were separated, it could no longer be spelled out. This function applies the TCC rules to tokenize the text into the smallest units.

For example, the word ‘ขนมชั้น’ would be tokenized into ‘ข’, ‘น’, ‘ม’, and ‘ชั้น’.

Parameters
  • text (str) – text to be tokenized

  • engine (str) – name of the subword tokenizer

Returns

list of subwords

Return type

list[str]

Options for engine
  • tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)

  • etcc - Enhanced Thai Character Cluster (Inrut et al. 2001)

  • wangchanberta - SentencePiece from wangchanberta model.

Example

Tokenize text into subwords based on tcc:

from pythainlp.tokenize import subword_tokenize

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='tcc')
# output: ['ยุ', 'ค', 'เริ่ม', 'แร', 'ก',
#   'ข', 'อ', 'ง', ' ', 'รา', 'ช', 'ว', 'ง',
#   'ศ', '์', 'ห', 'มิ', 'ง']

subword_tokenize(text_2, engine='tcc')
# output: ['ค', 'วา', 'ม', 'แป', 'ล', 'ก', 'แย', 'ก',
#   'และ', 'พัฒ', 'นา', 'กา', 'ร']

Tokenize text into subwords based on etcc:

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='etcc')
# output: ['ยุคเริ่มแรกของ ราชวงศ์หมิง']

subword_tokenize(text_2, engine='etcc')
# output: ['ความแปลกแยกและ', 'พัฒ', 'นาการ']

Tokenize text into subwords based on wangchanberta:

text_1 = "ยุคเริ่มแรกของ ราชวงศ์หมิง"
text_2 = "ความแปลกแยกและพัฒนาการ"

subword_tokenize(text_1, engine='wangchanberta')
# output: ['▁', 'ยุค', 'เริ่มแรก', 'ของ', '▁', 'ราชวงศ์', 'หมิง']

subword_tokenize(text_2, engine='wangchanberta')
# output: ['▁ความ', 'แปลก', 'แยก', 'และ', 'พัฒนาการ']

pythainlp.tokenize.syllable_tokenize(text: str, engine: str = 'dict', keep_whitespace: bool = True) → List[str]

Syllable tokenizer.

Tokenizes text into syllables (Thai: พยางค์), units of pronunciation having one vowel sound. For example, the word ‘รถไฟ’ contains two syllables: ‘รถ’ and ‘ไฟ’.

Under the hood, this function uses pythainlp.tokenize.word_tokenize() with newmm as the tokenizer. The function first tokenizes the text with the dictionary of Thai words from pythainlp.corpus.common.thai_words(), and then with the dictionary of Thai syllables from pythainlp.corpus.common.thai_syllables(). As a result, only syllables are obtained.

Parameters
  • text (str) – input string to be tokenized

  • engine (str) – name of the syllable tokenizer

Returns

list of syllables where whitespaces in the text are included

Return type

list[str]

Options for engine
  • dict (default) - newmm word tokenizer with a syllable dictionary

  • ssg - CRF syllable segmenter for Thai

Example:

from pythainlp.tokenize import syllable_tokenize

text = 'รถไฟสมัยใหม่จะใช้กำลังจากหัวรถจักรดีเซล หรือจากไฟฟ้า'
syllable_tokenize(text)
# output: ['รถ', 'ไฟ', 'สมัย', 'ใหม่', 'ใช้', 'กำ', 'ลัง', 'จาก', 'หัว',
#   'รถ', 'จักร', 'ดี', 'เซล', ' ', 'หรือ', 'จาก', 'ไฟ', 'ฟ้า']

pythainlp.tokenize.word_tokenize(text: str, custom_dict: Optional[pythainlp.util.trie.Trie] = None, engine: str = 'newmm', keep_whitespace: bool = True) → List[str]

Word tokenizer.

Tokenizes running text into words (list of strings).

Parameters
  • text (str) – text to be tokenized

  • engine (str) – name of the tokenizer to be used

  • custom_dict (pythainlp.util.Trie) – dictionary trie

  • keep_whitespace (bool) – True to keep whitespaces, a common mark for end of phrase in Thai. Otherwise, whitespaces are omitted.

Returns

list of words

Return type

list[str]

Options for engine
  • newmm (default) - dictionary-based, Maximum Matching + Thai Character Cluster

  • newmm-safe - newmm, with a mechanism to help avoid long processing time for text with continuous ambiguous breaking points

  • longest - dictionary-based, Longest Matching

  • icu - wrapper for ICU (International Components for Unicode, using PyICU), dictionary-based

  • attacut - wrapper for AttaCut, learning-based approach

  • deepcut - wrapper for DeepCut, learning-based approach

  • nercut - Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.

Note
  • The parameter custom_dict can be provided as an argument only for the newmm, longest, and attacut engines.
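
The effect of keep_whitespace=False can be pictured as a post-processing filter over the token list. The sketch below is a standalone illustration under that assumption; the helper name drop_whitespace_tokens is hypothetical and the code does not reproduce the library's internals.

```python
def drop_whitespace_tokens(tokens):
    """Remove tokens that are empty or made up only of whitespace."""
    return [token for token in tokens if token.strip()]


# Token list as a word tokenizer might return it with whitespace kept.
tokens = ["วรรณกรรม", " ", "ภาพวาด", " ", "และ", "การแสดง", "งิ้ว", " "]

kept = drop_whitespace_tokens(tokens)
print(kept)
```

Filtering after tokenization keeps the segmentation itself unchanged; only the whitespace-only tokens are dropped from the result.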

Example

Tokenize text with different tokenizers:

from pythainlp.tokenize import word_tokenize

text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"

word_tokenize(text, engine="newmm")
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

word_tokenize(text, engine='attacut')
# output: ['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']

Tokenize text by omitting whitespaces:

text = "วรรณกรรม ภาพวาด และการแสดงงิ้ว "

word_tokenize(text, engine="newmm")
# output:
# ['วรรณกรรม', ' ', 'ภาพวาด', ' ', 'และ', 'การแสดง', 'งิ้ว', ' ']

word_tokenize(text, engine="newmm", keep_whitespace=False)
# output: ['วรรณกรรม', 'ภาพวาด', 'และ', 'การแสดง', 'งิ้ว']

Tokenize with default and custom dictionary:

from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie

text = 'ชินโซ อาเบะ เกิด 21 กันยายน'

word_tokenize(text, engine="newmm")
# output:
# ['ชิน', 'โซ', ' ', 'อา', 'เบะ', ' ',
#  'เกิด', ' ', '21', ' ', 'กันยายน']

custom_dict_japanese_name = set(thai_words())
custom_dict_japanese_name.add('ชินโซ')
custom_dict_japanese_name.add('อาเบะ')

trie = dict_trie(dict_source=custom_dict_japanese_name)

word_tokenize(text, engine="newmm", custom_dict=trie)
# output:
# ['ชินโซ', ' ', 'อาเบะ',
#   ' ', 'เกิด', ' ', '21', ' ', 'กันยายน']

class pythainlp.tokenize.Tokenizer(custom_dict: Optional[Union[pythainlp.util.trie.Trie, Iterable[str], str]] = None, engine: str = 'newmm', keep_whitespace: bool = True)

Tokenizer class, for a custom tokenizer.

This class allows users to predefine a custom dictionary along with a tokenizer and encapsulate them into a single object. It is a wrapper for two functions: pythainlp.tokenize.word_tokenize() and pythainlp.util.dict_trie().

Example

Tokenizer object instantiated with pythainlp.util.Trie:

from pythainlp.tokenize import Tokenizer
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie

custom_words_list = set(thai_words())
custom_words_list.add('อะเฟเซีย')
custom_words_list.add('Aphasia')
trie = dict_trie(dict_source=custom_words_list)

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=trie, engine='newmm')
_tokenizer.word_tokenize(text)
# output: ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#   'ผิดปกติ', 'ของ', 'การ', 'พูด']

Tokenizer object instantiated with a list of words:

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"
_tokenizer = Tokenizer(custom_dict=list(thai_words()), engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะ', 'เฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ',
#   'ผิดปกติ', 'ของ', 'การ', 'พูด']

Tokenizer object instantiated with a path to a file containing a newline-separated list of words, with a new tokenizer engine explicitly set after instantiation:

PATH_TO_CUSTOM_DICTIONARY = './custom_dictionary.txt'

# write a file
with open(PATH_TO_CUSTOM_DICTIONARY, 'w', encoding='utf-8') as f:
    f.write('อะเฟเซีย\nAphasia\nผิด\nปกติ')

text = "อะเฟเซีย (Aphasia) เป็นอาการผิดปกติของการพูด"

# initiate an object from file with `attacut` as tokenizer
_tokenizer = Tokenizer(custom_dict=PATH_TO_CUSTOM_DICTIONARY,
    engine='attacut')

_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็น', 'อาการ', 'ผิด',
#   'ปกติ', 'ของ', 'การ', 'พูด']

# change tokenizer to `newmm`
_tokenizer.set_tokenize_engine(engine='newmm')
_tokenizer.word_tokenize(text)
# output:
# ['อะเฟเซีย', ' ', '(', 'Aphasia', ')', ' ', 'เป็นอาการ', 'ผิด',
#   'ปกติ', 'ของการพูด']

set_tokenize_engine(engine: str) → None

Set the tokenizer’s engine.

Parameters

engine (str) – name of the engine to use for tokenization (e.g. newmm, longest, attacut)

word_tokenize(text: str) → List[str]

Main tokenization function.

Parameters

text (str) – text to be tokenized

Returns

list of words, tokenized from the text

Return type

list[str]

Tokenization Engines

Sentence level

crfcut

CRFCut - Thai sentence segmenter.

Thai sentence segmentation using a conditional random field (CRF). The default model is trained on the TED dataset.

Performance:
  • ORCHID - space-correct accuracy 87% vs 95% state-of-the-art

  • TED dataset - space-correct accuracy 82%

See development notebooks at https://github.com/vistec-AI/ted_crawler. POS features are not used because the available POS tagging is unreliable.

pythainlp.tokenize.crfcut.extract_features(doc: List[str], window: int = 2, max_n_gram: int = 3) → List[List[str]]

Extract features for CRF by sliding n-grams of up to max_n_gram tokens over a span of +/- window tokens around the current token.

Parameters
  • doc (List[str]) – tokens from which features are to be extracted

  • window (int) – size of window before and after the current token

  • max_n_gram (int) – create n_grams from 1-gram to max_n_gram-gram within the window

Returns

list of lists of features to be fed to CRF
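
The sliding-window idea can be sketched in plain Python as follows. This is a simplified illustration, not the library's feature extractor: the padding token, the feature-name format, and the function name window_ngram_features are all assumptions.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token for positions outside the text


def window_ngram_features(
    tokens: List[str], window: int = 2, max_n_gram: int = 3
) -> List[List[str]]:
    """For each token, collect every n-gram (n = 1..max_n_gram) that lies
    entirely within +/- `window` positions of that token."""
    padded = [PAD] * window + tokens + [PAD] * window
    all_features = []
    for i in range(window, len(padded) - window):
        features = ["bias"]
        for n in range(1, max_n_gram + 1):
            # Start offsets chosen so the n-gram ends no later than i + window.
            for start in range(i - window, i + window - n + 2):
                gram = "|".join(padded[start:start + n])
                features.append(f"{n}gram[{start - i}]={gram}")
        all_features.append(features)
    return all_features


example = window_ngram_features(["ฉัน", "นอน", "และ"])
print(len(example), len(example[0]))
```

With window=2 and max_n_gram=3, each token gets 5 unigrams, 4 bigrams, and 3 trigrams in addition to the bias feature.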

pythainlp.tokenize.crfcut.segment(text: str) → List[str]

CRF-based sentence segmentation.

Parameters

text (str) – text to be tokenized to sentences

Returns

list of sentences, segmented from the text

Word level

attacut

Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai

class pythainlp.tokenize.attacut.AttacutTokenizer(model='attacut-sc')

deepcut

Wrapper for deepcut Thai word segmentation. deepcut is a Thai word segmentation library using a 1D convolutional neural network.

Users need to install deepcut (and its dependency, tensorflow) by themselves.

multi_cut

Multi cut – Thai word segmentation with maximum matching. The original source code is from Korakot Chaovavanich.

pythainlp.tokenize.multi_cut.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) → List[str]

Dictionary-based maximum matching word segmentation.

Parameters
  • text (str) – text to be tokenized to words

  • custom_dict (pythainlp.util.Trie) – dictionary for tokenization

Returns

list of words, tokenized from the text

pythainlp.tokenize.multi_cut.find_all_segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) → List[str]

Get all possible segment variations

Parameters

text (str) – input string to be tokenized

Returns

list of segment variations
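
What "all possible segment variations" means can be shown with a small recursive enumeration over a toy dictionary. This is a sketch of the general idea only: the function name all_segmentations, the plain set used as a dictionary, and the list-of-lists result format are assumptions and do not mirror the library's return format.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Combine this first word with every segmentation of the rest.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


toy_dictionary = {"รถ", "รถไฟ", "ไฟ", "ไฟฟ้า", "ฟ้า"}
variations = all_segmentations("รถไฟฟ้า", toy_dictionary)
for variation in variations:
    print("|".join(variation))
```

The number of variations can grow exponentially with text length, which is why practical tokenizers select one segmentation rather than enumerating all of them.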

longest

Dictionary-based longest-matching Thai word segmentation. Implementation based on the code from Patorn Utenpattanun.

pythainlp.tokenize.longest.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>) → List[str]

Dictionary-based longest matching word segmentation.

Parameters
  • text (str) – text to be tokenized to words

  • custom_dict (pythainlp.util.Trie) – dictionary for tokenization

Returns

list of words, tokenized from the text
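
Longest matching is a greedy strategy: at each position, take the longest dictionary word that starts there. The sketch below illustrates it with a plain Python set as the dictionary; the function name longest_matching and the fallback of emitting an unknown character as its own token are assumptions, not the library's behaviour.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]  # assumption: unknown character becomes a token
        tokens.append(match)
        i += len(match)
    return tokens


toy_dictionary = {"รถ", "รถไฟ", "ไฟ", "ไฟฟ้า", "ฟ้า"}
result = longest_matching("รถไฟฟ้า", toy_dictionary)
print(result)
```

Because the choice at each step is local, the greedy result depends on which long word happens to come first in the text.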

pyicu

Wrapper for PyICU word segmentation. This wrapper module uses icu.BreakIterator with Thai as icu.Locale to locate boundaries between words in the text.

nercut

nercut 0.1

Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.

Code by Wannaphong Phatthiyaphaibun

pythainlp.tokenize.nercut.segment(text: str, taglist: Iterable[str] = ['ORGANIZATION', 'PERSON', 'PHONE', 'EMAIL', 'DATE', 'TIME']) → List[str]

Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries, and combining tokens that are parts of the same named-entity.

Parameters

  • text (str) – text to be tokenized to words

  • taglist (list) – a list of named-entity tags to be used

Returns

list of words, tokenized from the text

newmm

The default word tokenization engine.

Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster (TCC) boundaries.

The code is based on the notebooks created by Korakot Chaovavanich, with heuristic graph size limit added to avoid exponential wait time.

pythainlp.tokenize.newmm.segment(text: str, custom_dict: pythainlp.util.trie.Trie = <pythainlp.util.trie.Trie object>, safe_mode: bool = False) → List[str]

Dictionary-based maximal matching word segmentation, constrained with Thai Character Cluster boundaries.

Parameters
  • text (str) – text to be tokenized to words

  • custom_dict (pythainlp.util.Trie) – dictionary for tokenization

  • safe_mode (bool) – True to help avoid a long wait for text with long and continuous ambiguous breaking points. A long wait may still occur. Default is False.

Returns

list of words, tokenized from the text
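
Maximal matching is commonly understood as choosing the segmentation with the fewest words, and a boundary constraint restricts where words may end. The sketch below shows this with a small dynamic program. It is an illustration under those assumptions, not the newmm implementation: the function name fewest_words is hypothetical, and allowed_ends is a hand-picked set standing in for cluster boundaries rather than real TCC output.

```python
def fewest_words(text, dictionary, allowed_ends=None):
    """Segment `text` into the fewest dictionary words.

    If `allowed_ends` is given, a word may only end at those positions.
    Returns None when no segmentation exists.
    """
    n = len(text)
    best = [None] * (n + 1)  # best[i]: fewest-word split of text[:i]
    best[0] = []
    for end in range(1, n + 1):
        if allowed_ends is not None and end not in allowed_ends:
            continue
        for start in range(end):
            if best[start] is None:
                continue
            word = text[start:end]
            if word in dictionary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


toy_dictionary = {"รถ", "รถไฟ", "ไฟ", "ไฟฟ้า", "ฟ้า"}
unconstrained = fewest_words("รถไฟฟ้า", toy_dictionary)
# Hypothetical constraint: words may only end at positions 4 and 7.
constrained = fewest_words("รถไฟฟ้า", toy_dictionary, allowed_ends={4, 7})
print(unconstrained, constrained)
```

Restricting the permitted end positions prunes candidate words, which both narrows the result and reduces the number of paths that need to be explored.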

Subword level

tcc

The implementation of a tokenizer according to the Thai Character Clusters (TCCs) rules proposed by Theeramunkong et al. 2000.

pythainlp.tokenize.tcc.segment(text: str) → List[str]

Subword segmentation

Parameters

text (str) – text to be tokenized to character clusters

Returns

list of subwords (character clusters), tokenized from the text

Return type

list[str]

pythainlp.tokenize.tcc.tcc(text: str) → str

TCC generator, generates Thai Character Clusters

Parameters

text (str) – text to be tokenized to character clusters

Returns

subwords (character clusters)

Return type

Iterator[str]

pythainlp.tokenize.tcc.tcc_pos(text: str) → Set[int]

TCC positions

Parameters

text (str) – text to be tokenized to character clusters

Returns

set of the end positions of subwords

Return type

set[int]
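
The relationship between a list of clusters and their end positions can be sketched as a running sum of cluster lengths. The example below is standalone and does not call pythainlp; the helper name cluster_end_positions is hypothetical, and the cluster list is taken from the ‘ขนมชั้น’ example in the subword_tokenize description.

```python
from itertools import accumulate


def cluster_end_positions(clusters):
    """Return the set of end offsets (in characters) of each cluster."""
    return set(accumulate(len(cluster) for cluster in clusters))


# Clusters of 'ขนมชั้น'; the last cluster has four code points.
clusters = ["ข", "น", "ม", "ชั้น"]
ends = cluster_end_positions(clusters)
print(sorted(ends))
```

A set of end positions like this is convenient for membership tests, for example when checking whether a candidate word boundary falls on a cluster boundary.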

etcc

Segmenting text into Enhanced Thai Character Clusters (ETCCs). Python implementation by Wannaphong Phatthiyaphaibun.

This implementation relies on a dictionary of ETCC created from etcc.txt in pythainlp/corpus.

Notebook: https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ

See Also

Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.

pythainlp.tokenize.etcc.segment(text: str) → List[str]

Segmenting text into ETCCs.

Enhanced Thai Character Cluster (ETCC) is a kind of subword unit. The concept was presented in Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. “Thai word segmentation using combination of forward and backward longest matching techniques.” In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.

Parameters

text (str) – text to be tokenized to character clusters

Returns

list of clusters, tokenized from the text

Return type

list[str]