Thread safety

Summary

PyThaiNLP’s core word tokenization engines are designed with thread safety in mind. The internal implementations (mm, newmm, newmm-safe, longest, icu) are fully thread-safe.

For engines that wrap external libraries (attacut, budoux, deepcut, nercut, nlpo3, oskut, sefr_cut, tltk, wtsplit), the wrapper code is thread-safe, but we cannot guarantee thread-safety of the underlying external libraries themselves.

Thread safety implementation

Internal implementations (fully thread-safe):

  • mm, newmm, newmm-safe: Stateless implementations; all data is local to each call

  • longest: Uses a lock-protected check-then-act pattern to manage a global cache shared across threads

  • icu: Each thread gets its own BreakIterator instance via thread-local storage
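
The per-thread instance approach used by the icu engine can be sketched with threading.local(). The HeavyTokenizer class below is a hypothetical stand-in for a non-thread-safe object such as a BreakIterator, not PyThaiNLP’s actual implementation:

```python
import threading

class HeavyTokenizer:
    """Hypothetical stand-in for a non-thread-safe object (e.g. a BreakIterator)."""
    pass

_local = threading.local()

def get_tokenizer():
    # Lazily create one instance per thread. No lock is needed because
    # each thread only ever touches its own attribute on _local.
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = HeavyTokenizer()
    return _local.tokenizer
```

After the first call in a given thread, subsequent calls return the same instance with no synchronization overhead.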

External library wrappers (wrapper code is thread-safe):

  • attacut: Uses a lock-protected check-then-act pattern to manage a global cache; thread safety of the underlying library is not guaranteed

  • budoux: Uses lock-protected lazy initialization of the parser; thread safety of the underlying library is not guaranteed

  • deepcut, nercut, nlpo3, tltk: Stateless wrappers; thread safety of the underlying libraries is not guaranteed

  • oskut, sefr_cut, wtsplit: Use lock-protected model loading when switching models/engines; thread safety of the underlying libraries is not guaranteed

Usage in multi-threaded applications

The following example shows how to use a tokenization engine safely from multiple threads:

import threading
from pythainlp.tokenize import word_tokenize

def tokenize_worker(text, results, index):
    # Thread-safe for all engines
    results[index] = word_tokenize(text, engine="longest")

texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]
results = [None] * len(texts)
threads = []

for i, text in enumerate(texts):
    thread = threading.Thread(target=tokenize_worker, args=(text, results, i))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

# All results are correctly populated
print(results)

Performance considerations

  1. Lock-based synchronization (longest, attacut):

    • Minimal overhead for cache access

    • Cache lookups are very fast

    • Lock contention is minimal in typical usage

  2. Thread-local storage (icu):

    • Each thread maintains its own instance

    • No synchronization overhead after initialization

    • Slightly higher memory usage (one instance per thread)

  3. Stateless engines (newmm, mm):

    • Zero synchronization overhead

    • Best performance in multi-threaded scenarios

    • Recommended for high-throughput applications
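
For high-throughput workloads, a thread pool avoids the cost of creating a thread per text. A minimal sketch with concurrent.futures is shown below; the tokenize function is a trivial placeholder for word_tokenize(text, engine="newmm") so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(text):
    # Placeholder for word_tokenize(text, engine="newmm"); a stateless
    # engine is safe to call concurrently without any locking.
    return text.split()

texts = ["a b", "c d e", "f"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so results[i] corresponds to texts[i]
    results = list(pool.map(tokenize, texts))
```

Because stateless engines need no synchronization, the pool can run tokenization calls fully in parallel up to max_workers.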

Best practices

  1. For high-throughput applications: Consider using stateless engines like newmm or mm for optimal performance.

  2. For custom dictionaries: The longest engine with custom dictionaries maintains a cache per dictionary object. Reuse dictionary objects across threads to maximize cache efficiency.

  3. For process pools: All engines work correctly with multiprocessing as each process has its own memory space.

  4. IMPORTANT: Do not modify custom dictionaries during tokenization:

    • Create your custom Trie/dictionary before starting threads

    • Never call trie.add() or trie.remove() while tokenization is in progress

    • If you need to update the dictionary, create a new Trie instance and pass it to subsequent tokenization calls

    • The Trie data structure itself is NOT thread-safe for concurrent modifications

Example of safe custom dictionary usage

from pythainlp.tokenize import word_tokenize
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
import threading

# SAFE: Create dictionary once before threading
custom_words = set(thai_words())
custom_words.add("คำใหม่")
custom_dict = dict_trie(custom_words)

texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]

def worker(text, custom_dict):
    # SAFE: Only reading from the dictionary
    return word_tokenize(text, engine="newmm", custom_dict=custom_dict)

# All threads share the same dictionary (read-only)
threads = []
for text in texts:
    t = threading.Thread(target=worker, args=(text, custom_dict))
    threads.append(t)
    t.start()

# Wait for all threads to finish
for t in threads:
    t.join()

Example of UNSAFE usage (DO NOT DO THIS)

from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie

# UNSAFE: Modifying dictionary while threads are using it
custom_dict = dict_trie(thai_words())

def unsafe_worker(text, custom_dict):
    result = word_tokenize(text, engine="newmm", custom_dict=custom_dict)
    # DANGER: Modifying the shared dictionary
    custom_dict.add("คำใหม่")  # This is NOT thread-safe!
    return result

Testing

Comprehensive thread safety tests are available in:

  • tests/core/test_tokenize_thread_safety.py

The test suite includes:

  • Concurrent tokenization with multiple threads

  • Race condition testing with multiple dictionaries

  • Verification of result consistency across threads

  • Stress testing with up to 200 concurrent operations (20 threads × 10 iterations)
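
A minimal version of such a consistency check can be sketched as follows; the tokenize function is a placeholder for word_tokenize so the sketch runs standalone, and the thread/iteration counts mirror the stress-test shape described above:

```python
import threading

def tokenize(text):
    # Placeholder for word_tokenize(text, engine="newmm")
    return text.split()

def check_consistency(text, n_threads=20, n_iter=10):
    """Return True if concurrent tokenization always matches the serial result."""
    expected = tokenize(text)
    failures = []

    def worker():
        for _ in range(n_iter):
            if tokenize(text) != expected:
                failures.append(text)  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not failures
```

The real test suite applies the same idea to each engine and to multiple custom dictionaries.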

Maintenance notes

When adding new tokenization engines to PyThaiNLP:

  1. Avoid global mutable state whenever possible

  2. If caching is necessary, use thread-safe locks

  3. If per-thread state is needed, use threading.local()

  4. Always add thread safety tests for new engines

  5. Document thread safety guarantees in docstrings

See also

  • Installation - For using PyThaiNLP in distributed computing environments like Apache Spark, including configuration of data directories for distributed operations