pythainlp.tokenize¶
The pythainlp.tokenize module contains multiple functions for tokenizing a chunk of Thai text into desirable units.
Modules¶
pythainlp.tokenize.sent_tokenize(text: str, engine: str = 'whitespace+newline') → List[str][source]¶
This function does not yet automatically recognize where a sentence actually ends. Rather, it splits the text wherever whitespace or a newline is found.
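The default 'whitespace+newline' engine therefore behaves roughly like a regular-expression split. A minimal sketch of that behavior in plain Python (an approximation for illustration, not PyThaiNLP's actual implementation):

```python
import re

def sent_tokenize_sketch(text):
    """Split text at runs of whitespace/newlines, dropping empty pieces.

    Approximates the 'whitespace+newline' engine; the real
    pythainlp implementation may differ in detail.
    """
    return [piece for piece in re.split(r"\s+", text) if piece]

print(sent_tokenize_sketch("สวัสดีครับ\nยินดีต้อนรับ"))
```

Because Thai text is written without spaces between words, this engine only finds boundaries the author explicitly marked.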
pythainlp.tokenize.word_tokenize(text: str, custom_dict: Optional[marisa_trie.Trie] = None, engine: str = 'newmm', keep_whitespace: bool = True) → List[str][source]¶
- Parameters
text (str) – input string to be tokenized
custom_dict (marisa_trie.Trie, optional) – a custom dictionary trie (see dict_trie())
engine (str) – tokenization engine (see options below)
keep_whitespace (bool) – whether to keep whitespace tokens in the output
- Returns
list of words
- Options for engine
newmm (default) - dictionary-based, Maximum Matching + Thai Character Cluster
longest - dictionary-based, Longest Matching
deepcut - wrapper for deepcut, language-model-based https://github.com/rkcosmos/deepcut
icu - wrapper for ICU (International Components for Unicode, using PyICU), dictionary-based
ulmfit - for thai2fit
A custom_dict can be provided for newmm, longest, and deepcut.
- Example
>>> from pythainlp.tokenize import word_tokenize
>>> text = "โอเคบ่พวกเรารักภาษาบ้านเกิด"
>>> word_tokenize(text, engine="newmm")
['โอเค', 'บ่', 'พวกเรา', 'รัก', 'ภาษา', 'บ้านเกิด']
>>> word_tokenize(text, engine="icu")
['โอ', 'เค', 'บ่', 'พวก', 'เรา', 'รัก', 'ภาษา', 'บ้าน', 'เกิด']
pythainlp.tokenize.syllable_tokenize(text: str) → List[str][source]¶
- Parameters
text (str) – input string to be tokenized
- Returns
list of syllables
pythainlp.tokenize.subword_tokenize(text: str, engine: str = 'tcc') → List[str][source]¶
- Parameters
text (str) – input string to be tokenized
engine (str) – subword tokenization engine (see options below)
- Returns
list of subwords
- Options for engine
tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)
etcc - Enhanced Thai Character Cluster (Inrut et al. 2001) [In development]
pythainlp.tokenize.dict_trie(dict_source: Union[str, Iterable[str], marisa_trie.Trie]) → marisa_trie.Trie[source]¶
Create a dictionary trie to be used with the word_tokenize() function. For more information on the trie data structure, see: https://marisa-trie.readthedocs.io/en/latest/index.html
- Parameters
dict_source (string/list) – a list of vocabularies or a path to a source file
- Returns
a trie created from a dictionary input
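To see why a trie suits dictionary-based tokenization, here is a minimal pure-Python prefix trie (a sketch for illustration only; marisa_trie provides a far more compact implementation): it supports the "all dictionary words starting at this position" lookup that longest and maximal matching rely on.

```python
class TrieSketch:
    """A minimal prefix trie built from a list of words."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["#"] = True  # end-of-word marker

    def prefixes(self, text):
        """Return all dictionary words that are prefixes of `text`."""
        found, node = [], self.root
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if "#" in node:
                found.append(text[: i + 1])
        return found

# 'กา' (crow) and 'กาแฟ' (coffee) share a prefix; one walk finds both.
trie = TrieSketch(["กา", "กาแฟ", "แฟน"])
print(trie.prefixes("กาแฟดำ"))
```

A single walk down the trie enumerates every matching word at a position, instead of probing the dictionary once per candidate length.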
NEWMM¶
pythainlp.tokenize.newmm.segment(text: str, custom_dict: Optional[marisa_trie.Trie] = None) → List[str][source]¶
Dictionary-based word segmentation, using the maximal matching algorithm and Thai Character Clusters.
:param str text: text to be tokenized into words
:return: list of words, tokenized from the text
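The core idea of maximal matching can be sketched as a small dynamic program that picks the segmentation with the fewest tokens. This is a toy illustration under simplifying assumptions: the real newmm engine additionally restricts split points with Thai Character Cluster rules and handles substrings not in the dictionary.

```python
def maximal_matching(text, dictionary):
    """Toy maximal matching: segmentation with the fewest tokens.

    Sketch only; pythainlp's newmm also applies TCC constraints
    and copes with out-of-dictionary spans.
    """
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)  # best[i] = min tokens to segment text[:i]
    back = [-1] * (n + 1)   # back[i] = start index of the last token
    best[0] = 0
    for i in range(n):
        if best[i] == INF:
            continue
        for j in range(i + 1, n + 1):
            if text[i:j] in dictionary and best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = i
    if best[n] == INF:
        return None  # cannot segment with this dictionary
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

words = {"โอเค", "บ่", "พวกเรา", "พวก", "เรา", "รัก", "ภาษา", "บ้านเกิด"}
print(maximal_matching("โอเคบ่พวกเรารักภาษาบ้านเกิด", words))
```

Note how 'พวกเรา' is kept as one token: the six-token segmentation beats the seven-token one using 'พวก' + 'เรา', which is exactly the preference maximal matching encodes over greedy longest matching.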
TCC¶
Thai Character Cluster
pythainlp.tokenize.tcc.segment(text: str) → List[str][source]¶
Subword segmentation
:param str text: text to be tokenized into character clusters
:return: list of subwords (character clusters), tokenized from the text
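The intuition behind character clusters is that some Thai code points (above/below vowels, tone marks) can never stand alone and must stay attached to a base character. A crude approximation of that grouping, using Unicode combining-mark categories (this is not the actual TCC rule set from Theeramunkong et al. 2000, which also binds leading vowels and other patterns):

```python
import unicodedata

def cluster_sketch(text):
    """Attach each combining mark to the preceding cluster.

    Rough approximation of Thai Character Clusters; the real
    TCC rules cover more patterns than combining marks alone.
    """
    clusters = []
    for ch in text:
        # Nonspacing marks (category 'Mn') join the previous cluster.
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'ปู่' is one base consonant plus a vowel sign and a tone mark.
print(cluster_sketch("ปู่"))
```

Splitting inside such a cluster would produce unrenderable fragments, which is why TCC units are a safe lower bound for any word segmenter.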