pythainlp.tools

The pythainlp.tools module collects miscellaneous helper functions, most of them intended for internal use within PyThaiNLP. Although they are primarily internal, knowing what they do helps when debugging or contributing to the library.

Modules

pythainlp.tools.get_full_data_path(path: str) -> str

Join the PyThaiNLP data directory path with path and return the result.

Parameters:

path (str) – relative path or filename to append to the data directory.

Returns:

normalized absolute path within the PyThaiNLP data directory.

Return type:

str

Raises:

ValueError – if path resolves to a location outside the PyThaiNLP data directory (path traversal attempt).

Example:

from pythainlp.tools import get_full_data_path

get_full_data_path("ttc_freq.txt")
# output: '/root/pythainlp-data/ttc_freq.txt'

Returns the full path to a file or subdirectory inside the PyThaiNLP data directory. PyThaiNLP uses this function internally to locate its data resources.
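The ValueError raised on path traversal can be illustrated with a minimal sketch. The helper name join_within is hypothetical and this is not PyThaiNLP's actual implementation. It only shows one common way to confine a joined path to a base directory, assuming the check is made on resolved paths.

```python
import os
import tempfile


def join_within(base_dir: str, path: str) -> str:
    """Join path onto base_dir, refusing results outside base_dir.

    Hypothetical helper for illustration; not part of PyThaiNLP.
    """
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, path))
    # commonpath equals base only when candidate is base or lies below it.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes the data directory: {path!r}")
    return candidate


# A temporary directory stands in for the data directory.
data_dir = tempfile.mkdtemp()
inside = join_within(data_dir, "ttc_freq.txt")  # accepted

try:
    join_within(data_dir, "../outside.txt")  # rejected
    traversal_blocked = False
except ValueError:
    traversal_blocked = True
```

Resolving both paths before comparing them means that ".." segments and symbolic links cannot be used to step outside the base directory.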

pythainlp.tools.get_pythainlp_data_path() -> str

Return the full path where PyThaiNLP keeps its (downloaded) data.

The directory is created if it does not yet exist.

The path is resolved in the following order:

  1. PYTHAINLP_DATA environment variable (preferred).

  2. PYTHAINLP_DATA_DIR environment variable (deprecated; shows a warning).

  3. If neither is set, ~/pythainlp-data is used.

If both variables are set, the function raises ValueError because the conflict must be resolved explicitly.

Deprecated: PYTHAINLP_DATA_DIR is deprecated. Use PYTHAINLP_DATA instead (follows the same pattern as NLTK_DATA).

Returns:

full path of the directory for PyThaiNLP downloaded data

Return type:

str

Example:

from pythainlp.tools import get_pythainlp_data_path

get_pythainlp_data_path()
# output: '/root/pythainlp-data'

Returns the path to the PyThaiNLP data directory, where the library keeps its downloaded data.
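The resolution order described for get_pythainlp_data_path can be sketched as follows. The function name resolve_data_dir, the env mapping parameter, and the DeprecationWarning category are assumptions made for illustration. The sketch only resolves the path; unlike the real function, it does not create the directory.

```python
import os
import warnings
from typing import Mapping


def resolve_data_dir(env: Mapping[str, str]) -> str:
    """Pick a data directory from an environment-like mapping.

    Hypothetical sketch; not PyThaiNLP's actual implementation.
    """
    preferred = env.get("PYTHAINLP_DATA")
    legacy = env.get("PYTHAINLP_DATA_DIR")
    if preferred and legacy:
        # Both variables set: refuse to guess which one should win.
        raise ValueError(
            "Set only one of PYTHAINLP_DATA and PYTHAINLP_DATA_DIR."
        )
    if preferred:
        return os.path.expanduser(preferred)
    if legacy:
        # The warning category is an assumption of this sketch.
        warnings.warn(
            "PYTHAINLP_DATA_DIR is deprecated; use PYTHAINLP_DATA instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return os.path.expanduser(legacy)
    # Neither variable set: fall back to ~/pythainlp-data.
    return os.path.join(os.path.expanduser("~"), "pythainlp-data")


default_dir = resolve_data_dir({})
custom_dir = resolve_data_dir({"PYTHAINLP_DATA": "/tmp/thai-data"})
```

Passing a mapping instead of reading os.environ directly keeps the sketch easy to test; calling resolve_data_dir(os.environ) would apply it to the real environment.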

pythainlp.tools.get_pythainlp_path() -> str

Return the full path of the PyThaiNLP code, that is, the installed package directory.

Note: When the package is installed as a zip file, the returned path may not be a standard filesystem path and should not be used for direct file I/O operations. Use importlib.resources for accessing package files in a zip-safe manner.

Returns:

full path of the PyThaiNLP code

Return type:

str

Example:

from pythainlp.tools import get_pythainlp_path

get_pythainlp_path()
# output: '/usr/local/lib/python3.6/dist-packages/pythainlp'

Returns the path to the directory where the PyThaiNLP package is installed.
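The zip-safe alternative mentioned in the note for get_pythainlp_path can be sketched with importlib.resources.files (Python 3.9 or later). The helper name read_package_text is hypothetical. The standard-library json package is used as a stand-in so the example runs without assuming the name of any file inside PyThaiNLP.

```python
from importlib.resources import files


def read_package_text(package: str, resource: str, encoding: str = "utf-8") -> str:
    """Read a text resource bundled with an importable package.

    Hypothetical helper for illustration. It goes through the import
    system instead of building a filesystem path, so it also works
    when the package is imported from a zip archive.
    """
    return files(package).joinpath(resource).read_text(encoding=encoding)


# Stand-in package and resource; substitute the package and file you need.
text = read_package_text("json", "__init__.py")
```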

pythainlp.tools.safe_print(text: str) -> None

Print text to console, handling UnicodeEncodeError.

Parameters:

text (str) – Text to print.
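A UnicodeEncodeError typically occurs when the console encoding cannot represent Thai characters. The following is a minimal sketch of one way to handle it, not the actual implementation of safe_print. The name print_with_fallback and the stream parameter are hypothetical, and replacing unencodable characters with "?" is an assumption of this sketch.

```python
import io
import sys
from typing import Optional, TextIO


def print_with_fallback(text: str, stream: Optional[TextIO] = None) -> None:
    """Print text, degrading gracefully if the stream cannot encode it.

    Hypothetical sketch for illustration; not part of PyThaiNLP.
    """
    out = stream if stream is not None else sys.stdout
    try:
        print(text, file=out)
    except UnicodeEncodeError:
        # Re-encode with replacement so unencodable characters become "?".
        encoding = getattr(out, "encoding", None) or "ascii"
        safe = text.encode(encoding, errors="replace").decode(encoding)
        print(safe, file=out)


# Simulate an ASCII-only console with an in-memory stream.
raw = io.BytesIO()
ascii_console = io.TextIOWrapper(raw, encoding="ascii")
print_with_fallback("ภาษา abc", ascii_console)
ascii_console.flush()
```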

pythainlp.tools.misspell.misspell(sentence: str, ratio: float = 0.05) -> str

Simulate some misspellings of the input sentence. The number of misspelled locations is governed by ratio.

Parameters:

sentence (str) – sentence to be misspelled

ratio (float) – fraction of characters to misspell (for example, 0.05 means about 5 misspellings per 100 characters). Defaults to 0.05.

Returns:

sentence containing some misspelled words

Return type:

str

Example:

from pythainlp.tools.misspell import misspell

sentence = "ภาษาไทยปรากฏครั้งแรกในพุทธศักราช 1826"

misspell(sentence, ratio=0.1)
# output (random; varies between runs):
# ภาษาไทยปรากฏครั้งแรกในกุทธศักราช 1727

The pythainlp.tools.misspell module simulates misspellings; it does not detect or correct them. It can be used to generate noisy text, for example to check how robust a model or preprocessing pipeline is to typos.
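To make the role of ratio concrete, the following is a minimal sketch of ratio-governed typo simulation. It is not the algorithm used by misspell. The name simulate_typos, the replacement alphabet, and the seed parameter are hypothetical, and the sketch assumes ratio is treated as a fraction of the sentence length.

```python
import math
import random
from typing import Optional

# Hypothetical replacement alphabet: a few Thai consonants.
THAI_CONSONANTS = "กขคงจชซดตทนบปผพฟมยรลวสหอ"


def simulate_typos(
    sentence: str, ratio: float = 0.05, seed: Optional[int] = None
) -> str:
    """Replace floor(len(sentence) * ratio) characters at random positions.

    Hypothetical sketch for illustration; not part of PyThaiNLP.
    """
    rng = random.Random(seed)
    count = math.floor(len(sentence) * ratio)
    positions = rng.sample(range(len(sentence)), count)
    chars = list(sentence)
    for pos in positions:
        # Exclude the current character so every chosen position changes.
        candidates = [c for c in THAI_CONSONANTS if c != chars[pos]]
        chars[pos] = rng.choice(candidates)
    return "".join(chars)


original = "ภาษาไทยปรากฏครั้งแรกในพุทธศักราช 1826"
noisy = simulate_typos(original, ratio=0.1, seed=0)
```

Fixing the seed makes the output reproducible, which is convenient in tests.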

These functions mainly support PyThaiNLP's internal workings, such as locating the data directory and the installed package. External users rarely need to call them directly, but contributors and developers working on PyThaiNLP may find them useful for understanding how the library manages its data.