Installation
Stable release:
pip install pythainlp
Development (pre-release) version:
pip install --upgrade --pre pythainlp
Some features (for example, named entity recognition) require additional optional dependencies. Install them using the extras syntax:
pip install pythainlp[extra1,extra2,…]
The extras can include:
- compact: install a stable and small subset of dependencies (recommended)
- full: install all optional dependencies (may introduce large dependencies and conflicts)
- abbreviation: abbreviation expansion utilities
- attacut: support for AttaCut (a fast and accurate tokenizer)
- budoux: support for BudouX text segmentation
- benchmarks: support for running benchmarks
- coreference_resolution: coreference resolution support
- dependency_parsing: dependency parsing support
- el: entity linking support
- esupar: ESuPAR parser support
- generate: support for text generation
- icu: support for ICU (International Components for Unicode), used in transliteration and tokenization
- ipa: support for IPA (International Phonetic Alphabet) in transliteration
- ml: support for ULMFiT models used in classification
- mt5: mT5 models for Thai text summarization
- nlpo3: nlpo3 Thai word tokenization support
- onnx: ONNX model support
- oskut: OSKUT support
- sefr_cut: SEFR CUT Thai word tokenization support
- spacy_thai: spaCy Thai language support
- spell: support for additional spell checkers (phunspell and symspellpy)
- ssg: support for the SSG syllable tokenizer
- textaugment: text augmentation utilities
- thai_nner: Thai named entity recognition support
- thai2fit: Thai word vectors (thai2fit)
- thai2rom: machine-learned romanization
- transformers_ud: transformers_ud engine support
- translate: machine translation support
- wangchanberta: WangchanBERTa models
- wangchanglm: WangChanGLM model support
- word_approximation: word approximation support
- wordnet: WordNet support
- wsd: word-sense disambiguation support (pythainlp.wsd)
- wtp: Where’s the Point text segmentation support
- wunsen: Wunsen spell checker support
For dependency details, see the project.optional-dependencies section in pyproject.toml.
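Because each extra pulls in separate third-party packages, it can be useful to check which ones are importable before relying on a feature. The following is a minimal sketch using only the standard library. The module names in the example call are illustrative assumptions: an extra's name does not always match the import name of the packages it installs, so check pyproject.toml for the actual packages.

```python
from importlib.util import find_spec


def available_modules(module_names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


# Hypothetical import names, chosen for illustration only.
for name, found in available_modules(["attacut", "icu", "budoux"]).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

A missing module here usually means the corresponding extra has not been installed in the current environment.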
Notes for Windows installation
Some features require the PyICU libraries on Windows. You have two options to install them.
Option 1 (recommended):
Download a pre-built wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/
Choose a wheel that matches your Python version and architecture (“win32” or “amd64”).
Install it with pip, for example:
pip install PyICU-xxx-cp36-cp36m-win32.whl
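To pick a matching wheel, you need to know the tags of the running interpreter. The sketch below derives them with the standard library, assuming CPython on x86 or x86-64 Windows (other platforms such as ARM64 are not handled). The function name wheel_tags is hypothetical. Note that 64-bit wheel filenames carry the tag win_amd64, while 32-bit ones carry win32.

```python
import struct
import sys


def wheel_tags(version_info=None, pointer_bits=None):
    """Return the (python_tag, platform_tag) pair to look for in a wheel filename."""
    if version_info is None:
        version_info = sys.version_info
    if pointer_bits is None:
        # Size of a C pointer in bits: 32 on 32-bit Python, 64 on 64-bit Python.
        pointer_bits = struct.calcsize("P") * 8
    python_tag = f"cp{version_info[0]}{version_info[1]}"
    platform_tag = "win_amd64" if pointer_bits == 64 else "win32"
    return python_tag, platform_tag


# Tags for the interpreter running this script.
print(wheel_tags())
```

For example, a 32-bit Python 3.6 yields the pair cp36 and win32, which corresponds to the filename shown in the pip command above.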
Option 2 (advanced):
Attempt to build from source using:
pip install pyicu
Building from source requires development toolchains (for example, Microsoft Visual C++ Build Tools) and may require setting environment variables such as ICU_VERSION. For example:
set ICU_VERSION=62.1
Building from source takes longer and requires technical knowledge, but produces a wheel optimized for your system.
Using PyThaiNLP in distributed environments
PyThaiNLP can be used in distributed computing environments such as Apache Spark. When using PyThaiNLP in these environments, you need to configure the data directory for each worker node.
Key considerations
- Set environment variables inside distributed functions: Environment variables must be set inside the function that will be distributed to executor nodes, not in the driver program.
- Use a writable local directory: The default data directory (~/pythainlp-data) may not be writable on executor nodes. Use a local directory like ./pythainlp-data instead.
- Set PYTHAINLP_DATA before data access: Always set the PYTHAINLP_DATA environment variable before the first call that reads or writes PyThaiNLP data on each worker. (PYTHAINLP_DATA_DIR is also accepted for backward compatibility but is deprecated.)
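The considerations above can be combined into a small helper that a distributed function calls before importing PyThaiNLP. This is a sketch, not part of PyThaiNLP: the name configure_data_dir is hypothetical, and falling back to the system temporary directory when the preferred location is not writable is an assumption that may not suit every cluster.

```python
import os
import tempfile


def configure_data_dir(preferred="./pythainlp-data"):
    """Point PYTHAINLP_DATA at a writable directory and return its path.

    Call this inside the distributed function, before importing pythainlp.
    """
    path = os.path.abspath(preferred)
    try:
        os.makedirs(path, exist_ok=True)
        if not os.access(path, os.W_OK):
            raise PermissionError(path)
    except OSError:
        # Assumption: the system temp directory is an acceptable fallback.
        path = os.path.join(tempfile.gettempdir(), "pythainlp-data")
        os.makedirs(path, exist_ok=True)
    os.environ["PYTHAINLP_DATA"] = path
    return path
```

Using an absolute path avoids surprises if the worker's current directory changes between tasks.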
Example usage with Apache Spark
Basic example using PySpark RDD:
from pyspark import SparkContext

sc = SparkContext("local[*]", "PyThaiNLP Example")
thai_texts = ["สวัสดีครับ", "ภาษาไทย"]
rdd = sc.parallelize(thai_texts)

def tokenize_thai(text):
    # Set the data directory on the executor, before PyThaiNLP is imported.
    import os
    os.environ['PYTHAINLP_DATA'] = './pythainlp-data'
    from pythainlp.tokenize import word_tokenize
    return word_tokenize(text)

tokenized_rdd = rdd.map(tokenize_thai)
results = tokenized_rdd.collect()  # one list of tokens per input text
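When the per-record setup becomes noticeable, one possible variation is to configure the environment once per partition and pass the function to PySpark's RDD.mapPartitions. The sketch below is an illustration under assumptions: tokenize_partition and its tokenizer parameter are hypothetical names, and the parameter exists only so the function can be exercised without PyThaiNLP installed.

```python
import os


def tokenize_partition(texts, tokenizer=None):
    """Tokenize every text in one partition, configuring PyThaiNLP only once."""
    # Set the data directory on the executor, before PyThaiNLP is imported.
    os.environ["PYTHAINLP_DATA"] = "./pythainlp-data"
    if tokenizer is None:
        # Imported lazily so that the import happens on the executor.
        from pythainlp.tokenize import word_tokenize as tokenizer
    for text in texts:
        yield tokenizer(text)


# With an RDD of strings (requires PySpark and PyThaiNLP):
# tokenized_rdd = rdd.mapPartitions(tokenize_partition)
```

Because the function is a generator, the environment variable is set when Spark starts consuming the partition, which is still before the first tokenization call.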
Example using PySpark DataFrame API:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("PyThaiNLP").getOrCreate()
df = spark.createDataFrame([(1, "สวัสดีครับ")], ["id", "text"])

@udf(returnType=ArrayType(StringType()))
def tokenize_udf(text):
    # Set the data directory on the executor, before PyThaiNLP is imported.
    import os
    os.environ['PYTHAINLP_DATA'] = './pythainlp-data'
    from pythainlp.tokenize import word_tokenize
    return word_tokenize(text)

result_df = df.withColumn("tokens", tokenize_udf(df.text))
For more comprehensive examples including error handling, production best practices, and advanced features, see the file examples/distributed_pyspark.py in the PyThaiNLP repository.
Thread safety considerations
PyThaiNLP’s core tokenization engines are thread-safe, which is important for distributed computing environments where multiple threads may process data concurrently. For detailed information about thread safety guarantees and best practices, see Thread safety.
Note that while the code itself is thread-safe, you still need to configure the data directory appropriately for distributed environments as described above.
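As a rough illustration of concurrent use, the sketch below runs a tokenizer across several threads with the standard library's ThreadPoolExecutor. The helper name tokenize_concurrently is hypothetical, and str.split is only a stand-in so the example runs without PyThaiNLP; in real use you would pass a thread-safe tokenizer such as pythainlp.tokenize.word_tokenize.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_concurrently(texts, tokenizer, max_workers=4):
    """Apply a thread-safe tokenizer to texts in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenizer, texts))


# Stand-in tokenizer (whitespace split) so the sketch runs without PyThaiNLP.
print(tokenize_concurrently(["a b", "c d e"], str.split))
```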
Runtime configurations
- PYTHAINLP_DATA
Specifies the location where downloaded data and the corpus database are stored. If the directory does not exist, PyThaiNLP will create it.
By default, this is a directory named pythainlp-data in the user’s home directory.
Run thainlp data path at the command line to display the current data directory.
- PYTHAINLP_DATA_DIR
Deprecated: use PYTHAINLP_DATA instead. Setting PYTHAINLP_DATA_DIR triggers a DeprecationWarning at runtime. If both PYTHAINLP_DATA and PYTHAINLP_DATA_DIR are set, PyThaiNLP raises ValueError.
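The rules for these two variables can be summarized in a short sketch. This is not PyThaiNLP's implementation; resolve_data_dir is a hypothetical function that mirrors the documented behavior, and treating an empty value as unset is an assumption.

```python
import os
import warnings


def resolve_data_dir(environ=None):
    """Return the data directory implied by the documented environment variables."""
    env = os.environ if environ is None else environ
    # Assumption: an empty string is treated the same as an unset variable.
    new, old = env.get("PYTHAINLP_DATA"), env.get("PYTHAINLP_DATA_DIR")
    if new and old:
        raise ValueError("Set only one of PYTHAINLP_DATA and PYTHAINLP_DATA_DIR")
    if old:
        warnings.warn(
            "PYTHAINLP_DATA_DIR is deprecated; use PYTHAINLP_DATA",
            DeprecationWarning,
            stacklevel=2,
        )
    # Documented default: pythainlp-data in the user's home directory.
    return new or old or os.path.join(os.path.expanduser("~"), "pythainlp-data")
```

Passing a plain dictionary as environ makes the logic easy to try out without modifying the real process environment.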
- PYTHAINLP_OFFLINE
When set to a truthy value (1, true, yes, on), PyThaiNLP operates in offline mode: automatic corpus downloads are disabled, and pythainlp.corpus.get_corpus_path() raises FileNotFoundError for any corpus that is not already cached locally.
Explicit calls to pythainlp.corpus.download() or thainlp data get still work normally, because those are deliberate user actions.
Use pythainlp.is_offline_mode() to check the current state programmatically.
This follows the same convention as HF_HUB_OFFLINE in huggingface_hub.
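For scripts that need to apply the same truthy-value convention themselves, a check along these lines may help. It is a sketch rather than PyThaiNLP's own parser: env_flag is a hypothetical name, and ignoring letter case and surrounding whitespace is an assumption. When PyThaiNLP is available, prefer pythainlp.is_offline_mode().

```python
import os

TRUTHY_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, environ=None):
    """Return True if the named environment variable holds a truthy value."""
    env = os.environ if environ is None else environ
    # Assumption: surrounding whitespace and letter case are ignored.
    return env.get(name, "").strip().lower() in TRUTHY_VALUES
```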
- PYTHAINLP_READ_ONLY
When set to a truthy value (1, true, yes, on), PyThaiNLP operates in read-only mode: implicit background writes to PyThaiNLP’s internal data directory are disabled.
What read-only mode blocks (implicit writes the user may not be aware of):
- Creating the PyThaiNLP data directory (~/pythainlp-data or as configured by PYTHAINLP_DATA).
- pythainlp.corpus.download(): corpus file downloads and catalog (db.json) updates.
- pythainlp.corpus.remove(): corpus file and catalog deletions.
What read-only mode does NOT block (explicit user-initiated writes):
- Saving trained models or vocabularies to a user-specified path (e.g., model.save("my_model.json"), tagger.train(..., save_loc="..."), tokenizer.save_vocabulary("my_dir/")), because the user explicitly provided the destination path.
- CLI output files written to a user-specified location (e.g., thainlp benchmark --save-details, thainlp misspell --output myfile.txt).
Use pythainlp.is_read_only_mode() to check the current state programmatically.
Note: To disable only automatic background downloads while keeping explicit download() calls working, use PYTHAINLP_OFFLINE instead.
If both PYTHAINLP_READ_ONLY and PYTHAINLP_READ_MODE are set at the same time, PyThaiNLP raises ValueError.
- PYTHAINLP_READ_MODE
Deprecated: use PYTHAINLP_READ_ONLY instead. Setting PYTHAINLP_READ_MODE triggers a DeprecationWarning at runtime. If both PYTHAINLP_READ_ONLY and PYTHAINLP_READ_MODE are set, PyThaiNLP raises ValueError. PYTHAINLP_READ_MODE=1 is equivalent to PYTHAINLP_READ_ONLY=1.
Interaction between environment variables
The table below shows how PYTHAINLP_OFFLINE and PYTHAINLP_READ_ONLY
affect the two main corpus operations:
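| Operation | PYTHAINLP_OFFLINE | PYTHAINLP_READ_ONLY |
|---|---|---|
| get_corpus_path(), corpus already cached | Succeeds (returns path) | Succeeds (no write needed) |
| get_corpus_path(), corpus not cached | Fails (FileNotFoundError) | Fails |
| download(), corpus already cached | Succeeds (download is an explicit user action) | Fails (returns False) |
| download(), corpus not cached | Succeeds (downloads the corpus) | Fails (returns False) |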
Key differences:
- PYTHAINLP_OFFLINE blocks only automatic downloads. Explicit calls to download() (or the thainlp data get CLI command) still work, because those are deliberate user actions.
- PYTHAINLP_READ_ONLY is more restrictive: it blocks all writes to the data directory, including explicit download() calls. Use this when the data directory is on a read-only file system (e.g., a read-only Docker volume or a shared cluster mount).
- PYTHAINLP_DATA sets the path of the data directory used by both modes. In read-only mode, the directory is not created if it does not already exist.
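The difference between the two modes can be expressed as a small decision function. This is a sketch of the rules as described above, not code from PyThaiNLP; download_allowed and its parameters are hypothetical names.

```python
def download_allowed(explicit, offline=False, read_only=False):
    """Return True if a corpus download may proceed under the given modes.

    explicit: True for a direct download() call, False for an automatic download.
    """
    if read_only:
        # Read-only mode blocks every write to the data directory.
        return False
    if offline and not explicit:
        # Offline mode blocks only automatic downloads.
        return False
    return True
```

Read-only mode is checked first because it overrides the explicit/automatic distinction that offline mode relies on.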
Typical use cases:
- Offline laptop / air-gapped system: set PYTHAINLP_OFFLINE=1 after downloading all required corpora. You can still call download() manually if you have network access.
- Read-only container image with pre-bundled corpora: set PYTHAINLP_READ_ONLY=1 so that no writes occur at all. Any attempt to download a corpus that is missing from the image will return False instead of raising a permission error.
Installation FAQ
Q: How do I set environment variables on each executor node in a distributed environment?
A: When using PyThaiNLP in distributed computing environments like Apache Spark, you need to set the PYTHAINLP_DATA environment variable inside the function that will be distributed to executor nodes. For example:
def tokenize_thai(text):
    import os
    os.environ['PYTHAINLP_DATA'] = './pythainlp-data'
    from pythainlp.tokenize import word_tokenize
    return word_tokenize(text)

rdd.map(tokenize_thai)
This ensures that each executor node uses a local data directory instead of the default home directory, which may not be writable on executor nodes.
For detailed examples including PySpark DataFrame API and production best practices, see examples/distributed_pyspark.py.
For more discussion, see PermissionError: [Errno 13] Permission denied: /home/pythainlp-data.
Q: How do I enable read-only mode for PyThaiNLP?
A: Set the environment variable PYTHAINLP_READ_ONLY to 1.
The legacy PYTHAINLP_READ_MODE=1 is still accepted but deprecated.