Thread safety
=============

Summary
-------

PyThaiNLP's core word tokenization engines are designed with thread-safety
in mind. Internal implementations (``mm``, ``newmm``, ``newmm-safe``,
``longest``, ``icu``) are thread-safe.

For engines that wrap external libraries (``attacut``, ``budoux``, ``deepcut``,
``nercut``, ``nlpo3``, ``oskut``, ``sefr_cut``, ``tltk``, ``wtsplit``), the
wrapper code is thread-safe, but we cannot guarantee thread-safety of the
underlying external libraries themselves.

Thread safety implementation
-----------------------------

**Internal implementations (fully thread-safe):**

- ``mm``, ``newmm``, ``newmm-safe``: Stateless implementation,
  all data is local
- ``longest``: uses lock-protected check-then-act for
  the management of global cache shared across threads
- ``icu``: each thread gets its own ``BreakIterator`` instance

**External library wrappers (wrapper code is thread-safe):**

- ``attacut``: uses lock-protected check-then-act for
  the management of global cache; underlying library thread-safety not guaranteed
- ``budoux``: uses lock-protected lazy initialization of parser;
  underlying library thread-safety not guaranteed
- ``deepcut``, ``nercut``, ``nlpo3``, ``tltk``: Stateless wrapper,
  underlying library thread-safety not guaranteed
- ``oskut``, ``sefr_cut``, ``wtsplit``: use lock-protected model
  loading when switching models/engines; underlying library thread-safety not guaranteed

Usage in multi-threaded applications
-------------------------------------

Using a tokenization engine safely in multi-threaded contexts:

.. code-block:: python

    import threading
    from pythainlp.tokenize import word_tokenize

    def tokenize_worker(text, results, index):
        # Thread-safe for all engines
        results[index] = word_tokenize(text, engine="longest")

    texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]
    results = [None] * len(texts)
    threads = []

    for i, text in enumerate(texts):
        thread = threading.Thread(target=tokenize_worker, args=(text, results, i))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    # All results are correctly populated
    print(results)

Performance considerations
--------------------------

1. **Lock-based synchronization** (longest, attacut):
   
   - Minimal overhead for cache access
   - Cache lookups are very fast
   - Lock contention is minimal in typical usage

2. **Thread-local storage** (icu):
   
   - Each thread maintains its own instance
   - No synchronization overhead after initialization
   - Slightly higher memory usage (one instance per thread)

3. **Stateless engines** (newmm, mm):
   
   - Zero synchronization overhead
   - Best performance in multi-threaded scenarios
   - Recommended for high-throughput applications

Best practices
--------------

1. **For high-throughput applications**: Consider using stateless engines like
   ``newmm`` or ``mm`` for optimal performance.

2. **For custom dictionaries**: The ``longest`` engine with custom dictionaries
   maintains a cache per dictionary object. Reuse dictionary objects across
   threads to maximize cache efficiency.

3. **For process pools**: All engines work correctly with multiprocessing as
   each process has its own memory space.

4. **IMPORTANT: Do not modify custom dictionaries during tokenization**:
   
   - Create your custom Trie/dictionary before starting threads
   - Never call ``trie.add()`` or ``trie.remove()`` while tokenization is in progress
   - If you need to update the dictionary,
     create a new Trie instance and pass it to subsequent tokenization calls
   - The Trie data structure itself is NOT thread-safe for concurrent modifications

Example of safe custom dictionary usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from pythainlp.tokenize import word_tokenize
    from pythainlp.corpus.common import thai_words
    from pythainlp.util import dict_trie
    import threading

    # SAFE: Create dictionary once before threading
    custom_words = set(thai_words())
    custom_words.add("คำใหม่")
    custom_dict = dict_trie(custom_words)

    texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]

    def worker(text, custom_dict):
        # SAFE: Only reading from the dictionary
        return word_tokenize(text, engine="newmm", custom_dict=custom_dict)

    # All threads share the same dictionary (read-only)
    threads = []
    for text in texts:
        t = threading.Thread(target=worker, args=(text, custom_dict))
        threads.append(t)
        t.start()

    # Wait for all threads to finish
    for t in threads:
        t.join()

Example of UNSAFE usage (DO NOT DO THIS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # UNSAFE: Modifying dictionary while threads are using it
    custom_dict = dict_trie(thai_words())

    def unsafe_worker(text, custom_dict):
        result = word_tokenize(text, engine="newmm", custom_dict=custom_dict)
        # DANGER: Modifying the shared dictionary
        custom_dict.add("คำใหม่")  # This is NOT thread-safe!
        return result

Testing
-------

Comprehensive thread safety tests are available in:

- ``tests/core/test_tokenize_thread_safety.py``

The test suite includes:

- Concurrent tokenization with multiple threads
- Race condition testing with multiple dictionaries
- Verification of result consistency across threads
- Stress testing with up to 200 concurrent operations (20 threads × 10 iterations)

Maintenance notes
-----------------

When adding new tokenization engines to PyThaiNLP:

1. **Avoid global mutable state** whenever possible
2. If caching is necessary, use thread-safe locks
3. If per-thread state is needed, use ``threading.local()``
4. Always add thread safety tests for new engines
5. Document thread safety guarantees in docstrings

Related files
-------------

- Core implementation: ``pythainlp/tokenize/core.py``
- Engine implementations: ``pythainlp/tokenize/*.py``
- Tests: ``tests/core/test_tokenize_thread_safety.py``

See also
--------

- :doc:`installation` - For using PyThaiNLP in distributed computing environments
  like Apache Spark, including configuration of data directories for distributed operations