nlpO3
nlpO3 is a Thai natural language processing library written in Rust, with Python and Node bindings. Like newmm, it provides a dictionary-based maximal-matching tokenizer that honors Thai character cluster boundaries. Unlike newmm, which is a pure Python implementation, nlpO3 is much faster. For a comparison, refer to Benchmark nlpo3.segment. Learn more about nlpO3 here.
In this tutorial, you will learn how to use nlpO3 to tokenize a text with a pre-prepared list of words serving as a custom dictionary.
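The tokenizer is described above as dictionary-based and maximal-matching. As a rough illustration of the dictionary-matching idea only, the following pure-Python sketch applies a greedy longest-match rule. It is not the actual algorithm of nlpO3 or newmm: it ignores Thai character cluster boundaries and does not search for a globally best segmentation. The function name longest_match_tokenize and the two-word dictionary are made up for this example.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation over a set of known words.

    Simplified illustration only: it works on raw code points and
    emits any character not covered by the dictionary on its own.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical two-word dictionary, for illustration only.
sample_words = {"ประเทศ", "ไทย"}
print(longest_match_tokenize("ประเทศไทย", sample_words))
```

A real tokenizer such as nlpO3 does considerably more work than this, which is why the rest of the tutorial relies on the library rather than on hand-written matching.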
Installation
We install the Python binding using pip.
[1]:
!pip install nlpo3
Collecting nlpo3
Successfully installed nlpo3-1.1.2
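To confirm from Python which version pip installed, one option is the standard library's importlib.metadata. The sketch below assumes the distribution name is nlpo3, as in the pip command above; the helper installed_version is a hypothetical name introduced here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if nlpo3 is installed, otherwise None.
print(installed_version("nlpo3"))
```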
PyThaiNLP dictionary
First, we segment a Thai sentence into a list of words without specifying a dictionary parameter, so the default PyThaiNLP dictionary is used.
[2]:
from nlpo3 import segment
[3]:
segment("ทดสอบตัดคำภาษาไทย")
[3]:
['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']
Custom dictionary
Now we enhance the tokenization with a pre-prepared list of countries in Thai, which will serve as a custom dictionary.
We use the wget command to download the list from GitHub. It's a plain text file containing one word per line.
[4]:
!wget https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
--2021-06-22 05:14:58-- https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt [following]
--2021-06-22 05:14:58-- https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7622 (7.4K) [text/plain]
Saving to: ‘countries_th.txt’
countries_th.txt 100%[===================>] 7.44K --.-KB/s in 0s
2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]
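Since the file holds one word per line, it can be inspected with plain Python before handing it to the tokenizer. The following is a minimal sketch, assuming the file is UTF-8 encoded and sits in the current working directory; read_word_list is a hypothetical helper, not part of nlpO3.

```python
from pathlib import Path


def read_word_list(path):
    """Read a one-word-per-line UTF-8 text file into a list of words."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in lines if line.strip()]


path = Path("countries_th.txt")
if path.exists():
    words = read_word_list(path)
    # Show how many entries were read and a few of them.
    print(len(words), words[:5])
```

This step is optional; it only helps to check that the download produced the expected word list.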
We use the load_dict function to load the contents of the downloaded file into a dictionary named countries.
[5]:
from nlpo3 import segment, load_dict
[6]:
load_dict("countries_th.txt", "countries")
Successful: dictionary name countries from file countries_th.txt has been successfully loaded
Finally, we call the segment function on a Thai sentence, passing the name of the countries dictionary as the second argument.
[7]:
segment("สวัสดีครับประเทศไทย เกาหลี", "countries")
[7]:
['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']
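A possible follow-up step is to keep only the tokens that appear in a word list, for example to pick country names out of a tokenized sentence. The sketch below is plain Python and does not call nlpO3. The helper known_words is hypothetical, the token list is a sample in the shape segment returns, and the small set is a hand-made stand-in rather than the actual contents of countries_th.txt.

```python
def known_words(tokens, vocabulary):
    """Return the tokens that appear in the vocabulary, in original order."""
    return [token for token in tokens if token in vocabulary]


# Sample token list in the shape that segment returns.
tokens = ["สวัสดีครับประเทศ", "ไทย", " ", "เกาหลี"]
# Hand-made stand-in for a word list; not the downloaded file's contents.
sample_countries = {"ไทย", "เกาหลี"}
print(known_words(tokens, sample_countries))
```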