# Thread safety

## Summary
PyThaiNLP's core word tokenization engines are designed with thread safety
in mind. Internal implementations (`mm`, `newmm`, `newmm-safe`,
`longest`, `icu`) are thread-safe.

For engines that wrap external libraries (`attacut`, `budoux`, `deepcut`,
`nercut`, `nlpo3`, `oskut`, `sefr_cut`, `tltk`, `wtsplit`), the
wrapper code is thread-safe, but we cannot guarantee the thread safety of the
underlying external libraries themselves.
## Thread safety implementation

Internal implementations (fully thread-safe):
- `mm`, `newmm`, `newmm-safe`: stateless implementations; all data is local
- `longest`: uses a lock-protected check-then-act for managing the global cache shared across threads
- `icu`: each thread gets its own `BreakIterator` instance
External library wrappers (wrapper code is thread-safe):
- `attacut`: uses a lock-protected check-then-act for managing the global cache; underlying library thread-safety not guaranteed
- `budoux`: uses lock-protected lazy initialization of the parser; underlying library thread-safety not guaranteed
- `deepcut`, `nercut`, `nlpo3`, `tltk`: stateless wrappers; underlying library thread-safety not guaranteed
- `oskut`, `sefr_cut`, `wtsplit`: use lock-protected model loading when switching models/engines; underlying library thread-safety not guaranteed
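The lock-protected check-then-act pattern mentioned above can be sketched in isolation. This is a minimal illustration of the general pattern, not PyThaiNLP's actual code; `_build_state` is a hypothetical stand-in for the expensive per-dictionary setup an engine might perform:

```python
import threading

_cache = {}
_cache_lock = threading.Lock()


def _build_state(key):
    # Hypothetical stand-in for expensive setup (e.g. loading a model
    # or precomputing data for a dictionary).
    return {"key": key}


def get_state(key):
    # The check ("is it cached?") and the act ("build and insert it")
    # happen entirely under one lock, so two threads cannot both miss
    # the cache and race to insert conflicting entries.
    with _cache_lock:
        state = _cache.get(key)
        if state is None:
            state = _build_state(key)
            _cache[key] = state
        return state
```

Without the lock, two threads could interleave between the lookup and the insert, each building its own state and silently overwriting the other's.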
## Usage in multi-threaded applications
Using a tokenization engine safely in multi-threaded contexts:
```python
import threading

from pythainlp.tokenize import word_tokenize


def tokenize_worker(text, results, index):
    # Thread-safe for all engines
    results[index] = word_tokenize(text, engine="longest")


texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]
results = [None] * len(texts)

threads = []
for i, text in enumerate(texts):
    thread = threading.Thread(target=tokenize_worker, args=(text, results, i))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

# All results are correctly populated
print(results)
```
## Performance considerations
Lock-based synchronization (`longest`, `attacut`):

- Minimal overhead for cache access
- Cache lookups are very fast
- Lock contention is minimal in typical usage

Thread-local storage (`icu`):

- Each thread maintains its own instance
- No synchronization overhead after initialization
- Slightly higher memory usage (one instance per thread)

Stateless engines (`newmm`, `mm`):

- Zero synchronization overhead
- Best performance in multi-threaded scenarios
- Recommended for high-throughput applications
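The thread-local storage approach used by the `icu` engine can be illustrated with `threading.local()`. This is a generic sketch of the pattern, not the engine's actual code; `make_instance` is a hypothetical stand-in for an object (like a `BreakIterator`) that must not be shared across threads:

```python
import threading

# threading.local() gives each thread its own attribute namespace,
# so each thread lazily creates and then reuses its own instance.
_local = threading.local()


def make_instance():
    # Hypothetical stand-in for a non-shareable object
    return {"owner": threading.get_ident()}


def get_instance():
    inst = getattr(_local, "instance", None)
    if inst is None:
        inst = make_instance()
        # No lock needed: _local's attributes are per-thread
        _local.instance = inst
    return inst
```

After the first call in a given thread, subsequent calls return the cached per-thread instance with no synchronization at all, which is why this approach has no steady-state overhead.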
## Best practices
- For high-throughput applications: Consider using stateless engines like `newmm` or `mm` for optimal performance.
- For custom dictionaries: The `longest` engine with custom dictionaries maintains a cache per dictionary object. Reuse dictionary objects across threads to maximize cache efficiency.
- For process pools: All engines work correctly with multiprocessing, as each process has its own memory space.
IMPORTANT: Do not modify custom dictionaries during tokenization:

- Create your custom Trie/dictionary before starting threads
- Never call `trie.add()` or `trie.remove()` while tokenization is in progress
- If you need to update the dictionary, create a new Trie instance and pass it to subsequent tokenization calls
- The Trie data structure itself is NOT thread-safe for concurrent modifications
## Example of safe custom dictionary usage
```python
import threading

from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie

# SAFE: Create dictionary once before threading
custom_words = set(thai_words())
custom_words.add("คำใหม่")
custom_dict = dict_trie(custom_words)

texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]


def worker(text, custom_dict):
    # SAFE: Only reading from the dictionary
    return word_tokenize(text, engine="newmm", custom_dict=custom_dict)


# All threads share the same dictionary (read-only)
threads = []
for text in texts:
    t = threading.Thread(target=worker, args=(text, custom_dict))
    threads.append(t)
    t.start()

# Wait for all threads to finish
for t in threads:
    t.join()
```
## Example of UNSAFE usage (DO NOT DO THIS)
```python
# UNSAFE: Modifying dictionary while threads are using it
custom_dict = dict_trie(thai_words())


def unsafe_worker(text, custom_dict):
    result = word_tokenize(text, engine="newmm", custom_dict=custom_dict)
    # DANGER: Modifying the shared dictionary
    custom_dict.add("คำใหม่")  # This is NOT thread-safe!
    return result
```
## Testing
Comprehensive thread safety tests are available in
`tests/core/test_tokenize_thread_safety.py`.
The test suite includes:

- Concurrent tokenization with multiple threads
- Race condition testing with multiple dictionaries
- Verification of result consistency across threads
- Stress testing with up to 200 concurrent operations (20 threads × 10 iterations)
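The shape of such a stress test can be reproduced without PyThaiNLP. In this sketch, `tokenize_stub` is a hypothetical stand-in for `word_tokenize`: 20 threads each run 10 iterations and compare their results against a single-threaded baseline, mirroring the 20 threads × 10 iterations pattern described above:

```python
import threading


def tokenize_stub(text):
    # Hypothetical stand-in for word_tokenize; any pure function
    # works for demonstrating the test pattern.
    return text.split()


TEXTS = ["a b c", "d e", "f g h i"]
# Single-threaded baseline computed up front
EXPECTED = [tokenize_stub(t) for t in TEXTS]


def worker(results, index, errors):
    for _ in range(10):  # 10 iterations per thread
        out = [tokenize_stub(t) for t in TEXTS]
        if out != EXPECTED:
            errors.append(index)  # inconsistent result detected
    results[index] = True


results = [False] * 20  # 20 threads
errors = []
threads = [
    threading.Thread(target=worker, args=(results, i, errors))
    for i in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread finished, and no thread observed a divergent result
assert all(results) and not errors
```

Comparing every concurrent result against a precomputed single-threaded baseline is what turns a race condition from a silent corruption into a visible test failure.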
## Maintenance notes
When adding new tokenization engines to PyThaiNLP:
- Avoid global mutable state whenever possible
- If caching is necessary, use thread-safe locks
- If per-thread state is needed, use `threading.local()`
- Always add thread safety tests for new engines
- Document thread safety guarantees in docstrings
## See also
- Installation - For using PyThaiNLP in distributed computing environments like Apache Spark, including configuration of data directories for distributed operations