pythainlp.util

The pythainlp.util module provides utility functions for text conversion, formatting, and other Thai language processing tasks.

Modules

pythainlp.util.analyze_thai_text(text: str) → dict[str, int][source]

Analyze Thai text and return a character count by descriptive name.

Process the text character by character and map each Thai character to its descriptive name or to itself (for consonants and digits).

Parameters:

text (str) – Thai text string to be analyzed

Returns:

dict mapping character names to their count in the text

Return type:

dict[str, int]

Example:

>>> from pythainlp.util import analyze_thai_text
>>> analyze_thai_text("คนดี")
{'ค': 1, 'น': 1, 'ด': 1, 'สระ อี': 1}
>>> analyze_thai_text("เล่น")
{'สระ เอ': 1, 'ล': 1, 'ไม้เอก': 1, 'น': 1}

Analyzes a string of Thai text and returns a dictionary in which each key is a character or its descriptive name and each value is the number of times it occurs in the text.

pythainlp.util.abbreviation_to_full_text(text: str, top_k: int = 2) → list[tuple[str, float | None]][source]

Converts Thai text (with abbreviations) to full text.

Uses KhamYo to handle abbreviations. See more: KhamYo.

Parameters:

text (str) – Thai text
top_k (int) – number of top-ranked candidate expansions to return (default: 2)

Returns:

list of (full_text, cosine_similarity) tuples.

Return type:

list[tuple[str, Optional[float]]]

Example:

>>> from pythainlp.util import abbreviation_to_full_text

>>> text = "รร.ของเราน่าอยู่"

>>> abbreviation_to_full_text(text)
[
('โรงเรียนของเราน่าอยู่', tensor(0.3734)),
('โรงแรมของเราน่าอยู่', tensor(0.2438))
]

The abbreviation_to_full_text function expands common Thai abbreviations in a text into their full forms. It returns candidate expansions, each paired with a similarity score.

pythainlp.util.arabic_digit_to_thai_digit(text: str) → str[source]

Converts Arabic digits (e.g. 1, 3, 10) to Thai digits (e.g. ๑, ๓, ๑๐).

Parameters:

text (str) – Text with Arabic digits such as ‘1’, ‘2’, ‘3’

Returns:

Text with Arabic digits converted to Thai digits such as ‘๑’, ‘๒’, ‘๓’

Return type:

str

Example:

>>> from pythainlp.util import arabic_digit_to_thai_digit
>>> text = "เป็นจำนวน 123,400.25 บาท"
>>> arabic_digit_to_thai_digit(text)
'เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท'

The arabic_digit_to_thai_digit function allows you to transform Arabic numerals into their Thai numeral equivalents. This utility is especially useful when working with Thai numbers in text data.
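As a rough illustration of how a digit-for-digit conversion like this can work, the sketch below uses Python's built-in str.translate. It is not PyThaiNLP's implementation: the name to_thai_digits and the table constant are hypothetical, and the Thai digits are generated from their Unicode code points (U+0E50 to U+0E59).

```python
# Stand-alone sketch, assuming a plain character-for-character mapping.
# Not PyThaiNLP's code; `to_thai_digits` is a hypothetical name.

ARABIC_DIGITS = "0123456789"
# Thai digits occupy U+0E50..U+0E59, in the same order as 0..9.
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))

_ARABIC_TO_THAI = str.maketrans(ARABIC_DIGITS, THAI_DIGITS)


def to_thai_digits(text: str) -> str:
    """Replace every Arabic digit with the matching Thai digit."""
    return text.translate(_ARABIC_TO_THAI)


print(to_thai_digits("123,400.25"))
```

Swapping the two arguments of str.maketrans gives the reverse direction, from Thai digits to Arabic digits.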

pythainlp.util.bahttext(number: float) → str[source]

Converts a number to Thai text and adds a suffix “บาท” (Baht). The precision will be fixed at two decimal places (0.00) to fit “สตางค์” (Satang) unit. This function works similarly to the BAHTTEXT function in Microsoft Excel.

Parameters:

number (float) – number to be converted into Thai Baht currency format

Returns:

text representing the amount of money in the format of Thai currency

Return type:

str

Raises:

TypeError – if number is not a numeric type

Example:

>>> from pythainlp.util import bahttext
>>> bahttext(1)
'หนึ่งบาทถ้วน'
>>> bahttext(21)
'ยี่สิบเอ็ดบาทถ้วน'
>>> bahttext(200)
'สองร้อยบาทถ้วน'

The bahttext function specializes in converting numerical values into Thai Baht text, an essential feature for rendering financial data or monetary amounts in a user-friendly Thai format.

pythainlp.util.check_khuap_klam(word: str) → bool | None[source]

Check whether a Thai word is a consonant cluster (Kham Khuap Klam).

Parameters:

word (str) – Thai word to check.

Returns:

True if the word is a true consonant cluster (คำควบกล้ำแท้), False if it is a false consonant cluster (คำควบกล้ำไม่แท้), or None if it is not a consonant cluster.

Return type:

Optional[bool]

Example:

>>> from pythainlp.util import check_khuap_klam

>>> # True consonant clusters (คำควบกล้ำแท้)
>>> print(check_khuap_klam("กราบ"))  # True
>>> print(check_khuap_klam("ปลา"))  # True
>>> print(check_khuap_klam("เพราะ"))  # True
>>> print(check_khuap_klam("ตรง"))  # True

>>> # False consonant clusters (คำควบกล้ำไม่แท้)
>>> print(check_khuap_klam("จริง"))  # False
>>> print(check_khuap_klam("ทราย"))  # False
>>> print(check_khuap_klam("เศร้า"))  # False

>>> # Not a consonant cluster
>>> print(check_khuap_klam("แม่"))  # None
>>> print(check_khuap_klam("ตา"))  # None

The check_khuap_klam function checks whether a Thai word is a consonant cluster (Kham Khuap Klam, คำควบกล้ำ). It returns True for a true consonant cluster (คำควบกล้ำแท้), False for a false consonant cluster (คำควบกล้ำไม่แท้), or None if the word is not a consonant cluster.

pythainlp.util.censor_profanity(text: str, replacement: str = '*', custom_words: set[str] | None = None, engine: str = 'newmm') → str[source]

Replace profanity words in the text with a replacement character.

Parameters:

text (str) – Thai text to censor
replacement (str) – character to replace profanity with (default: “*”)
custom_words (set) – additional profanity words to censor (default: None)
engine (str) – tokenization engine (default: “newmm”)

Returns:

Text with profanity words censored

Return type:

str

Example:

>>> from pythainlp.util import censor_profanity

>>> print(censor_profanity("สวัสดีครับ"))
สวัสดีครับ

>>> print(censor_profanity("text with profanity word"))
text with *** word

>>> # Add custom profanity words
>>> print(censor_profanity("คำใหม่", custom_words={"คำใหม่"}))
******

The censor_profanity function replaces profanity words in Thai text with a replacement character (default: “*”). Users can provide custom profanity words in addition to the built-in list for content moderation and filtering.

pythainlp.util.collate(data: Iterable[str], reverse: bool = False) → list[str][source]

Sorts strings (almost) according to Thai dictionary.

Important notes: this implementation ignores tone marks and symbols

Parameters:

data (Iterable[str]) – an iterable of words to be sorted
reverse (bool, optional) – If reverse is set to True the result will be sorted in descending order. Otherwise, the result will be sorted in ascending order, defaults to False

Returns:

a list of strings, sorted alphabetically, (almost) according to Thai dictionary

Return type:

list[str]

Example:

>>> from pythainlp.util import collate
>>> collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'])
['กาล', 'เกิด', 'ไก่', 'เป็ด', 'วันที่', 'วัว', 'หมู']
>>> collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'],
...     reverse=True)
['หมู', 'วัว', 'วันที่', 'เป็ด', 'ไก่', 'เกิด', 'กาล']

The collate function sorts Thai strings in approximately Thai dictionary order. Because the implementation ignores tone marks and symbols, the result can differ from strict dictionary order in some cases.
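One common way to approximate Thai dictionary order is to sort on a derived key rather than on the raw string. The sketch below is an assumption about how such a key could be built, not a copy of PyThaiNLP's code: it drops the four tone marks and moves each leading vowel (เ แ โ ใ ไ) behind the consonant it is written in front of. The names thai_sort_key and collate_sketch are hypothetical.

```python
import re

# Assumed character classes from the Unicode Thai block.
_TONE_MARKS = re.compile("[\u0e48-\u0e4b]")  # mai ek .. mai chattawa
_LEAD_VOWEL_THEN_CONSONANT = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word: str) -> str:
    """Drop tone marks, then put each leading vowel after its consonant."""
    without_tones = _TONE_MARKS.sub("", word)
    return _LEAD_VOWEL_THEN_CONSONANT.sub(r"\2\1", without_tones)


def collate_sketch(words, reverse=False):
    """Sort words by the derived key (hypothetical helper)."""
    return sorted(words, key=thai_sort_key, reverse=reverse)


print(collate_sketch(["ไก่", "เกิด", "กาล", "เป็ด", "หมู", "วัว", "วันที่"]))
```

Sorting on a key keeps the original strings untouched, so tone marks are still present in the returned list.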

pythainlp.util.contains_profanity(text: str, custom_words: set[str] | None = None, engine: str = 'newmm') → bool[source]

Check if the given text contains profanity words.

Parameters:

text (str) – Thai text to check
custom_words (set) – additional profanity words to check (default: None)
engine (str) – tokenization engine (default: “newmm”)

Returns:

True if text contains profanity, False otherwise

Return type:

bool

Example:

>>> from pythainlp.util import contains_profanity

>>> print(contains_profanity("สวัสดีครับ"))
False

>>> print(contains_profanity("คำหยาบคาย"))
True if the word is in the profanity list

>>> # Add custom profanity words
>>> print(contains_profanity("คำใหม่", custom_words={"คำใหม่"}))
True

The contains_profanity function checks if Thai text contains profanity words. It returns True if profanity is detected and False otherwise. Users can provide custom profanity words for enhanced content moderation.

pythainlp.util.convert_years(year: str, src: str = 'be', target: str = 'ad') → str[source]

Convert years

Parameters:

year (str) – the year to convert, given as a string
src (str) – the calendar of the input year
target (str) – the calendar to convert the year to

Returns:

The converted year

Return type:

str

Options for src and target

be - Buddhist calendar
ad - Anno Domini
re - Rattanakosin era
ah - Anno Hejira

Warning: This function works properly only for years after 1941, because Thailand changed its calendar in 1941. If you are a time traveler or a historian, take care to use the correct calendar.

Example:

>>> from pythainlp.util import convert_years
>>> # Convert Buddhist Era (BE) to Anno Domini (AD)
>>> convert_years("2566", src="be", target="ad")
'2023'
>>> # Convert AD to BE
>>> convert_years("2023", src="ad", target="be")
'2566'
>>> # Convert BE to Rattanakosin Era (RE)
>>> convert_years("2566", src="be", target="re")
'242'

The convert_years function converts a year between the Buddhist Era (BE), Anno Domini (AD), the Rattanakosin Era (RE), and Anno Hejira (AH). The year is passed and returned as a string.
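The arithmetic behind the BE, AD, and RE examples above can be sketched with fixed offsets. This is only an illustration under stated assumptions: the offsets (BE = AD + 543, RE = AD - 1781) agree with the documented examples, AH is left out because it does not reduce to a fixed offset, and convert_year_sketch is a hypothetical name rather than the library function.

```python
# Offset added to an AD year to reach each era (assumed; modern dates only).
_OFFSET_FROM_AD = {"ad": 0, "be": 543, "re": -1781}


def convert_year_sketch(year: str, src: str = "be", target: str = "ad") -> str:
    """Convert a year string between eras by going through AD."""
    ad_year = int(year) - _OFFSET_FROM_AD[src]
    return str(ad_year + _OFFSET_FROM_AD[target])


print(convert_year_sketch("2566", src="be", target="ad"))
print(convert_year_sketch("2566", src="be", target="re"))
```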

pythainlp.util.count_thai_chars(text: str) → dict[str, int][source]

Count Thai characters by type.

Count Thai characters by type: consonants, vowels, lead_vowels, follow_vowels, above_vowels, below_vowels, tonemarks, signs, thai_digits, punctuations, and non_thai.

Parameters:

text (str) – input text

Returns:

dict with counts of Thai characters by type

Return type:

dict[str, int]

Example:

>>> from pythainlp.util import count_thai_chars
>>> count_thai_chars("ทดสอบภาษาไทย")
{
'vowels': 3,
'lead_vowels': 1,
'follow_vowels': 2,
'above_vowels': 0,
'below_vowels': 0,
'consonants': 9,
'tonemarks': 0,
'signs': 0,
'thai_digits': 0,
'punctuations': 0,
'non_thai': 0
}

The count_thai_chars function is a character counting tool specifically tailored for Thai text. It helps in quantifying Thai characters, which can be useful for various text processing tasks.

pythainlp.util.countthai(text: str, ignore_chars: str = ' \t\n\r\x0b\x0c0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') → float[source]

Find proportion of Thai characters in a given text.

Deprecated since version 5.3.2: Use count_thai() instead.

Parameters:

text (str) – input text
ignore_chars (str, optional) – characters to be ignored, defaults to whitespace, digits, and punctuation marks.

Returns:

proportion of Thai characters in the text (percentage)

Return type:

float

The countthai function returns the proportion of Thai characters in a text, as a percentage, after skipping the ignored characters. It is deprecated; use count_thai() instead.
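A minimal sketch of this kind of proportion calculation follows. It assumes that "Thai character" means a code point in the Unicode Thai block (U+0E00 to U+0E7F) and that ignored characters are left out of both the numerator and the denominator; thai_ratio is a hypothetical name, and the library's exact rules may differ.

```python
import string

# Same default as in the signature above: whitespace, digits, punctuation.
_DEFAULT_IGNORE = string.whitespace + string.digits + string.punctuation


def thai_ratio(text: str, ignore_chars: str = _DEFAULT_IGNORE) -> float:
    """Percentage of non-ignored characters that fall in the Thai block."""
    considered = [ch for ch in text if ch not in ignore_chars]
    if not considered:
        return 0.0
    thai = sum(1 for ch in considered if "\u0e00" <= ch <= "\u0e7f")
    return 100.0 * thai / len(considered)


print(thai_ratio("ไทย abc"))
```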

pythainlp.util.dict_trie(dict_source: str | Iterable[str] | Trie) → Trie[source]

Create a dictionary trie from a file or an iterable.

Parameters:

dict_source (str|Iterable[str]|pythainlp.util.Trie) – a path to a dictionary file, a list of words, or a pythainlp.util.Trie object

Returns:

a trie object

Return type:

pythainlp.util.Trie

The dict_trie function builds a Trie from a dictionary file, an iterable of words, or an existing Trie object. A trie supports fast word and prefix lookup.
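To show why a trie suits dictionary lookup, here is a small stand-alone trie built from nested dicts. It is a teaching sketch, not pythainlp.util.Trie: the class name TrieSketch and its methods are hypothetical and make no claim about the real class's interface.

```python
class TrieSketch:
    """A tiny character trie built from nested dicts (illustrative only)."""

    _END = None  # sentinel key meaning "a word ends at this node"

    def __init__(self, words=()):
        self._root = {}
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word: str) -> bool:
        node = self._root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return self._END in node

    def prefixes(self, text: str) -> list:
        """Return every stored word that is a prefix of `text`."""
        found, node = [], self._root
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if self._END in node:
                found.append(text[: i + 1])
        return found


trie = TrieSketch(["กา", "กาล", "เวลา"])
print(trie.prefixes("กาลเวลา"))
```

The prefixes method walks the text once, which is the property that makes tries useful for dictionary-based word segmentation.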

pythainlp.util.digit_to_text(text: str) → str[source]

Spell out digits in Thai.

Parameters:

text (str) – text with digits such as ‘1’, ‘2’, ‘๓’, ‘๔’

Returns:

text with digits spelled out in Thai

Return type:

str

Example:

>>> from pythainlp.util import digit_to_text
>>> digit_to_text("เบอร์โทร 0812345678")
'เบอร์โทร ศูนย์แปดหนึ่งสองสามสี่ห้าหกเจ็ดแปด'
>>> digit_to_text("123")
'หนึ่งสองสาม'
>>> digit_to_text("๕๖๗")
'ห้าหกเจ็ด'

The digit_to_text function spells out each Arabic or Thai digit in a text as a Thai word, one digit at a time. It does not read a multi-digit number as a single quantity, so "123" becomes 'หนึ่งสองสาม'.
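The digit-by-digit behaviour shown in the examples can be sketched with a translation table that maps both Arabic and Thai digits to the same Thai word. The names THAI_DIGIT_WORDS and spell_out_digits are hypothetical, and this is an assumption about one possible approach rather than the library's code.

```python
# Thai words for 0..9, indexed by digit value.
THAI_DIGIT_WORDS = [
    "ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่",
    "ห้า", "หก", "เจ็ด", "แปด", "เก้า",
]

# Map the code point of each Arabic and Thai digit to its Thai word.
_DIGIT_TO_WORD = {}
for value, word in enumerate(THAI_DIGIT_WORDS):
    _DIGIT_TO_WORD[ord(str(value))] = word   # "0".."9"
    _DIGIT_TO_WORD[0x0E50 + value] = word    # Thai digits U+0E50..U+0E59


def spell_out_digits(text: str) -> str:
    """Replace each digit with its Thai word; other characters are kept."""
    return text.translate(_DIGIT_TO_WORD)


print(spell_out_digits("123"))
```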

pythainlp.util.display_thai_char(ch: str) → str[source]

Prefix an underscore (_) to a high-position vowel or a tone mark, to ease readability.

Parameters:

ch (str) – input character

Returns:

“_” + ch

Return type:

str

Example:

>>> from pythainlp.util import display_thai_char
>>> display_thai_char("้")
'_้'

The display_thai_char function prefixes an underscore to a high-position vowel or a tone mark, so that the combining character is easier to see when it is displayed on its own.

pythainlp.util.emoji_to_thai(text: str, delimiters: tuple[str, str] = (':', ':')) → str[source]

Converts emojis to their Thai meanings.

Parameters:

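text (str) – Text with emojis
delimiters (tuple[str, str]) – strings placed before and after each Thai emoji name, defaults to (':', ':')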

Returns:

Text with emojis converted to their Thai meanings

Return type:

str

Example:

>>> from pythainlp.util import emoji_to_thai
>>> emoji_to_thai("จะมานั่งรถเมล์เหมือนผมก็ได้นะครับ ใกล้ชิดประชาชนดี 😀")
'จะมานั่งรถเมล์เหมือนผมก็ได้นะครับ ใกล้ชิดประชาชนดี :หน้ายิ้มยิงฟัน:'
>>> emoji_to_thai("หิวข้าวอยากกินอาหารญี่ปุ่น 🍣")
'หิวข้าวอยากกินอาหารญี่ปุ่น :ซูชิ:'
>>> emoji_to_thai("🇹🇭 นี่คือธงประเทศไทย")
':ธง_ไทย: นี่คือธงประเทศไทย'

The emoji_to_thai function replaces each emoji in a text with its Thai name, wrapped in the given delimiters.

pythainlp.util.eng_to_thai(text: str) → str[source]

Corrects the given text that was incorrectly typed using English-US Qwerty keyboard layout to the originally intended keyboard layout that is the Thai Kedmanee keyboard.

Parameters:

text (str) – incorrect text input (Thai typed using English keyboard)

Returns:

Thai text with the keyboard-layout error corrected

Return type:

str

Example:

The user intended to type “ธนาคารแห่งประเทศไทย” but got “Tok8kicsj'xitgmLwmp”:

>>> from pythainlp.util import eng_to_thai
>>> eng_to_thai("Tok8kicsj'xitgmLwmp")
'ธนาคารแห่งประเทศไทย'

The eng_to_thai function corrects text that was typed with the English (US QWERTY) keyboard layout when the Thai Kedmanee layout was intended. It remaps keys; it does not translate or transliterate English.
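Key remapping of this kind is a character-for-character lookup. The sketch below is deliberately partial: its table contains only the sixteen key pairs that can be read off the documented example, not the full Kedmanee layout, and fix_layout_sketch is a hypothetical name.

```python
# Partial QWERTY -> Kedmanee table, inferred solely from the documented
# example "Tok8kicsj'xitgmLwmp" -> "ธนาคารแห่งประเทศไทย".
_QWERTY_TO_KEDMANEE = {
    "T": "ธ", "o": "น", "k": "า", "8": "ค", "i": "ร", "c": "แ",
    "s": "ห", "j": "\u0e48",  # mai ek, a combining tone mark
    "'": "ง", "x": "ป", "t": "ะ", "g": "เ", "m": "ท", "L": "ศ",
    "w": "ไ", "p": "ย",
}
_TABLE = str.maketrans(_QWERTY_TO_KEDMANEE)


def fix_layout_sketch(text: str) -> str:
    """Remap known keys; characters missing from the table are kept."""
    return text.translate(_TABLE)


print(fix_layout_sketch("Tok8kicsj'xitgmLwmp"))
```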

pythainlp.util.find_keyword(word_list: list[str], min_len: int = 3) → dict[str, int][source]

Counts the frequencies of words in the list where stopwords are excluded and returns a frequency dictionary.

Parameters:

word_list (list) – a list of words
min_len (int) – the minimum frequency for words to be retained

Returns:

a dictionary object with key-value pair being words and their raw counts

Return type:

dict[str, int]

Example:

>>> from pythainlp.util import find_keyword

>>> words = ["บันทึก", "เหตุการณ์", "บันทึก", "เหตุการณ์",
...          " ", "มี", "การ", "บันทึก", "เป็น", " ", "ลายลักษณ์อักษร",
...          "และ", "การ", "บันทึก","เสียง","ใน","เหตุการณ์"]

>>> find_keyword(words)
{'บันทึก': 4, 'เหตุการณ์': 3}

>>> find_keyword(words, min_len=1)
{' ': 2, 'บันทึก': 4, 'ลายลักษณ์อักษร': 1,
         'เสียง': 1, 'เหตุการณ์': 3}

The find_keyword function counts word frequencies in a list, excludes stopwords, and keeps only the words that occur at least min_len times. The result can serve as a simple frequency-based keyword list.

pythainlp.util.find_profanity(text: str, custom_words: set[str] | None = None, engine: str = 'newmm') → list[str][source]

Find all profanity words in the given text.

Parameters:

text (str) – Thai text to check
custom_words (set) – additional profanity words to check (default: None)
engine (str) – tokenization engine (default: “newmm”)

Returns:

list of profanity words found in the text

Return type:

list[str]

Example:

>>> from pythainlp.util import find_profanity

>>> print(find_profanity("สวัสดีครับ"))
[]

>>> print(find_profanity("text with profanity words"))
['profanity_word1', 'profanity_word2']

>>> # Add custom profanity words
>>> print(find_profanity("คำใหม่", custom_words={"คำใหม่"}))
['คำใหม่']

The find_profanity function identifies and returns a list of all profanity words found in Thai text. Users can provide custom profanity words to enhance detection capabilities for content moderation.

pythainlp.util.ipa_to_rtgs(ipa: str) → str[source]

Convert IPA system to The Royal Thai General System of Transcription (RTGS)

Docs: https://en.wikipedia.org/wiki/Help:IPA/Thai

Parameters:

ipa (str) – IPA phoneme

Returns:

The RTGS that is converted, according to rules listed in the Wikipedia page

Return type:

str

Example:

>>> from pythainlp.util import ipa_to_rtgs

>>> print(ipa_to_rtgs("kluaj"))
kluai

The ipa_to_rtgs function focuses on converting International Phonetic Alphabet (IPA) transcriptions into Royal Thai General System of Transcription (RTGS) format. This is valuable for phonetic analysis and pronunciation guides.

pythainlp.util.isthai(text: str, ignore_chars: str = '.') → bool[source]

Check if every character in a string is a Thai character.

Deprecated since version 5.3.2: Use is_thai() instead.

Parameters:

text (str) – input text
ignore_chars (str, optional) – characters to be ignored, defaults to “.”

Returns:

True if every character in the input string is Thai, otherwise False.

Return type:

bool

The isthai function checks whether every character in a string, apart from the ignored characters, is a Thai character. It is deprecated; use is_thai() instead.

pythainlp.util.isthaichar(ch: str) → bool[source]

Check if a character is a Thai character.

Deprecated since version 5.3.2: Use is_thai_char() instead.

Parameters:

ch (str) – input character

Returns:

True if ch is a Thai character, otherwise False.

Return type:

bool

The isthaichar function is designed to check if a character belongs to the Thai script. It helps in character-level language identification and text processing.

pythainlp.util.maiyamok(sent: str | list[str]) → list[str][source]

Expand Maiyamok.

Deprecated since version 5.0.5: Use expand_maiyamok() instead.

Maiyamok (ๆ) (Unicode U+0E46) is a Thai character indicating word repetition. This function preprocesses Thai text by replacing Maiyamok with a word being repeated.

Parameters:

sent (Union[str, list[str]]) – sentence (list or string)

Returns:

list of words

Return type:

list[str]

Example:

>>> from pythainlp.util import maiyamok

>>> maiyamok("คนๆนก")
['คน', 'คน', 'นก']

The maiyamok function expands the repetition mark Maiyamok (ๆ) by repeating the word it follows. Maiyamok is a repetition sign, not a tone mark. This function is deprecated; use expand_maiyamok() instead.
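The expansion step itself is simple once the text has been split into tokens. The sketch below assumes the input is already a list of tokens in which Maiyamok stands alone; expand_maiyamok_tokens is a hypothetical helper and skips the tokenization that the library function performs on string input.

```python
MAIYAMOK = "\u0e46"  # the Thai repetition mark


def expand_maiyamok_tokens(tokens):
    """Replace each Maiyamok token with a copy of the preceding word."""
    expanded = []
    for token in tokens:
        if token == MAIYAMOK:
            if expanded:  # nothing to repeat at the very start
                expanded.append(expanded[-1])
        else:
            expanded.append(token)
    return expanded


print(expand_maiyamok_tokens(["คน", MAIYAMOK, "นก"]))
```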

pythainlp.util.nectec_to_ipa(pronunciation: str) → str[source]

Convert NECTEC system to IPA system

Parameters:

pronunciation (str) – NECTEC phoneme

Returns:

IPA that is converted

Return type:

str

Example:

>>> from pythainlp.util import nectec_to_ipa

>>> print(nectec_to_ipa("kl-uua-j^-2"))
kl uua j ˥˩

References

Pornpimon Palingoon, Sumonmas Thatphithakkul. Chapter 4 Speech processing and Speech corpus. In: Handbook of Thai Electronic Corpus. 1st ed. p. 122–56.

The nectec_to_ipa function focuses on converting text from the NECTEC phonetic transcription system to the International Phonetic Alphabet (IPA). This conversion is vital for linguistic analysis and phonetic representation.

pythainlp.util.normalize(text: str) → str[source]

Normalize and clean Thai text with normalizing rules as follows:

Remove zero-width spaces

Remove duplicate spaces

Remove spaces before tone marks and non-base characters

Reorder tone marks and vowels to standard order/spelling

Remove duplicate vowels and signs

Remove duplicate tone marks

Remove dangling non-base characters at the beginning of text

normalize() simply calls remove_zw(), remove_dup_spaces(), remove_spaces_before_marks(), remove_repeat_vowels(), and remove_dangling(), in that order.

If a user wants to customize the selection or the order of rules to be applied, they can choose to call those functions by themselves.

Note: for Unicode normalization, see unicodedata.normalize().

Parameters:

text (str) – input text

Returns:

normalized text according to the rules

Return type:

str

Example:

>>> from pythainlp.util import normalize
>>> normalize("เเปลก")  # starts with two Sara E
'แปลก'
>>> normalize("นานาาา")
'นานา'

The normalize function cleans Thai text by removing zero-width characters and duplicate spaces, reordering vowels and tone marks into the standard order, and removing duplicated or dangling non-base characters. It does not strip tone marks; use remove_tonemark() for that.

pythainlp.util.now_reign_year() → int[source]

Return the reign year of the 10th King of Chakri dynasty.

Returns:

reign year of the 10th King of Chakri dynasty.

Return type:

int

Example:

>>> from pythainlp.util import now_reign_year
>>> text = "เป็นปีที่ {reign_year} ในรัชกาลปัจจุบัน"\
...     .format(reign_year=now_reign_year())
>>> print(text)
เป็นปีที่ 11 ในรัชกาลปัจจุบัน

The now_reign_year function returns the current reign year of the 10th King of the Chakri dynasty as an integer. It does not return the Buddhist Era year.

pythainlp.util.num_to_thaiword(number: int | None) → str[source]

Converts a number to Thai text.

Parameters:

number (int) – an integer number to be converted to Thai text

Returns:

text representing the number in Thai

Return type:

str

Example:

>>> from pythainlp.util import num_to_thaiword
>>> num_to_thaiword(1)
'หนึ่ง'
>>> num_to_thaiword(11)
'สิบเอ็ด'

The num_to_thaiword function is a numeral conversion tool for translating Arabic numerals into Thai word form. It is crucial for rendering numbers in a natural Thai textual format.

pythainlp.util.rank(words: list[str], exclude_stopwords: bool = False) → Counter[str] | None[source]

Count word frequencies given a list of Thai words with an option to exclude stopwords.

Parameters:

words (list) – a list of words
exclude_stopwords (bool) – If this parameter is set to True, exclude stopwords from counting. Otherwise, the stopwords will be counted. By default, exclude_stopwords is set to False

Returns:

a Counter object representing word frequencies in the text, or None if words is empty

Return type:

Optional[collections.Counter[str]]

Example:

Include stopwords when counting word frequencies:

>>> from pythainlp.util import rank

>>> words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก",
... "เป็น", " ", "ลายลักษณ์อักษร"]

>>> rank(words)
Counter(
    {
        ' ': 2,
        'การ': 1,
        'บันทึก': 2,
        'มี': 1,
        'ลายลักษณ์อักษร': 1,
        'เป็น': 1,
        'เหตุการณ์': 1
    })

Exclude stopwords when counting word frequencies:

>>> from pythainlp.util import rank

>>> words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก",
...     "เป็น", " ", "ลายลักษณ์อักษร"]

>>> rank(words, exclude_stopwords=True)
Counter(
    {
        ' ': 2,
        'บันทึก': 2,
        'ลายลักษณ์อักษร': 1,
        'เหตุการณ์': 1
    })

The rank function counts how often each word occurs in a list of Thai words and returns the counts as a Counter, optionally excluding stopwords.
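Frequency counting with optional stopword filtering can be sketched with collections.Counter. In the sketch below the stopword set is passed in explicitly and is only a small hypothetical sample; the library uses its own stopword list, and rank_sketch is not the library function.

```python
from collections import Counter


def rank_sketch(words, stopwords=frozenset()):
    """Count word frequencies, skipping any word found in `stopwords`."""
    if not words:
        return None
    return Counter(word for word in words if word not in stopwords)


words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก",
         "เป็น", " ", "ลายลักษณ์อักษร"]
sample_stopwords = {"มี", "การ", "เป็น"}  # hypothetical sample, not the real list

print(rank_sketch(words, stopwords=sample_stopwords).most_common(2))
```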

pythainlp.util.reign_year_to_ad(reign_year: int, reign: int) → int[source]

Convert reign year to AD.

Return AD year according to the reign year for the 7th to 10th King of Chakri dynasty, Thailand. For instance, the AD year of the 4th reign year of the 10th King is 2019.

Parameters:

reign_year (int) – reign year of the King
reign (int) – the reign of the King (i.e. 7, 8, 9, and 10)

Returns:

the year in AD of the King given the reign and reign year.

Return type:

int

Example:

>>> from pythainlp.util import reign_year_to_ad
>>> print("The 4th reign year of the King Rama X is in",
...     reign_year_to_ad(4, 10))
The 4th reign year of the King Rama X is in 2019
>>> print("The 1st reign year of the King Rama IX is in",
...     reign_year_to_ad(1, 9))
The 1st reign year of the King Rama IX is in 1946

The reign_year_to_ad function converts a reign year of the 7th to 10th King of the Chakri dynasty into the corresponding AD year. It does not convert Buddhist Era years.
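The conversion is an offset from the first year of each reign. The sketch below covers only the 9th and 10th reigns, because their starting years (1946 and 2016) can be read off the two examples above; reigns 7 and 8 are omitted rather than guessed. Both function names are hypothetical.

```python
# First AD year of each reign, inferred only from the documented examples:
# reign_year_to_ad(4, 10) == 2019 and reign_year_to_ad(1, 9) == 1946.
_FIRST_REIGN_YEAR_AD = {9: 1946, 10: 2016}


def reign_year_to_ad_sketch(reign_year: int, reign: int) -> int:
    """AD year for the given year of a reign (reigns 9 and 10 only)."""
    return _FIRST_REIGN_YEAR_AD[reign] + reign_year - 1


def ad_to_reign_year_sketch(ad_year: int, reign: int = 10) -> int:
    """Inverse direction: which year of the reign a given AD year is."""
    return ad_year - _FIRST_REIGN_YEAR_AD[reign] + 1


print(reign_year_to_ad_sketch(4, 10))
```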

pythainlp.util.remove_dangling(text: str) → str[source]

Remove Thai non-base characters at the beginning of text and after spaces.

This is a common “typo”, especially in form input fields, because these non-base characters can be visually hidden from a user who may have typed them accidentally.

A character to be removed should be both:

tone mark, above vowel, below vowel, or non-base sign AND

located at the beginning of the text or after spaces

Parameters:

text (str) – input text

Returns:

text without dangling Thai characters at the beginning and after spaces

Return type:

str

Example:

>>> from pythainlp.util import remove_dangling
>>> remove_dangling("๊ก")
'ก'
>>> remove_dangling("คำ ่ที่สอง")
'คำ ที่สอง'

The remove_dangling function removes Thai non-base characters (tone marks, above and below vowels, and non-base signs) that are left dangling at the beginning of the text or after spaces. It is useful for text cleaning and normalization.

pythainlp.util.remove_dup_spaces(text: str) → str[source]

Remove duplicate spaces. Replace multiple spaces with one space.

Multiple newline characters and empty lines will be replaced with one newline character.

Parameters:

text (str) – input text

Returns:

text without duplicated spaces and newlines

Return type:

str

Example:

>>> from pythainlp.util import remove_dup_spaces
>>> remove_dup_spaces("ก    ข    ค")
'ก ข ค'

The remove_dup_spaces function focuses on removing duplicate space characters from text data, making it more consistent and readable.

pythainlp.util.remove_repeat_vowels(text: str) → str[source]

Remove repeating vowels, tone marks, and signs.

Calls reorder_vowels() first to ensure that double Sara E will be converted to Sara Ae and not be removed.

Parameters:

text (str) – input text

Returns:

text without repeating Thai vowels, tone marks, and signs

Return type:

str

Example:

>>> from pythainlp.util import remove_repeat_vowels
>>> remove_repeat_vowels("นานาาา")
'นานา'
>>> remove_repeat_vowels("ดีีีี")
'ดี'

The remove_repeat_vowels function is designed to eliminate repeated vowel characters in text, improving text readability and consistency.

pythainlp.util.remove_tone_ipa(ipa: str) → str[source]

Remove Thai Tones from IPA system

Parameters:

ipa (str) – IPA phoneme

Returns:

IPA phoneme with tones removed

Return type:

str

Example:

>>> from pythainlp.util import remove_tone_ipa

>>> print(remove_tone_ipa("laː˦˥.sa˨˩.maj˩˩˦"))
laː.sa.maj

The remove_tone_ipa function serves as a phonetic conversion tool for removing tone marks from IPA transcriptions. This is crucial for phonetic analysis and linguistic research.

pythainlp.util.remove_tonemark(text: str) → str[source]

Remove all Thai tone marks from the text.

Thai script has four tone marks indicating four tones as follows:

Down tone (Thai: ไม้เอก _่ )

Falling tone (Thai: ไม้โท _้ )

High tone (Thai: ไม้ตรี _๊ )

Rising tone (Thai: ไม้จัตวา _๋ )

Using the wrong tone mark is a common mistake in Thai writing. Removing tone marks from a string makes it usable for approximate string matching.

Parameters:

text (str) – input text

Returns:

text without Thai tone marks

Return type:

str

Example:

>>> from pythainlp.util import remove_tonemark
>>> remove_tonemark("สองพันหนึ่งร้อยสี่สิบเจ็ดล้านสี่แสนแปดหมื่นสามพันหกร้อยสี่สิบเจ็ด")
'สองพันหนึงรอยสีสิบเจ็ดลานสีแสนแปดหมืนสามพันหกรอยสีสิบเจ็ด'

The remove_tonemark function removes the four Thai tone marks from text and leaves vowels and other signs in place, which is useful for approximate string matching.
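Deleting a fixed set of characters is a one-line job for str.translate. The sketch below assumes the four tone marks are the code points U+0E48 to U+0E4B (mai ek, mai tho, mai tri, mai chattawa); strip_tone_marks is a hypothetical name and not the library function.

```python
# Map each tone-mark code point to None, which tells translate() to delete it.
_TONE_MARK_TABLE = dict.fromkeys(range(0x0E48, 0x0E4C))


def strip_tone_marks(text: str) -> str:
    """Remove the four Thai tone marks; everything else is kept."""
    return text.translate(_TONE_MARK_TABLE)


print(strip_tone_marks("สี่สิบ"))
```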

pythainlp.util.remove_zw(text: str) → str[source]

Remove zero-width characters.

These non-visible characters may cause unexpected results from the user’s point of view. Removing them can make string matching more robust.

Characters to be removed:

Zero-width space (ZWSP)

Zero-width non-joiner (ZWNJ)

Parameters:

text (str) – input text

Returns:

text without zero-width characters

Return type:

str

Example:

>>> from pythainlp.util import remove_zw
>>> remove_zw("สวัสดี​ครับ")
'สวัสดีครับ'
>>> remove_zw("ภาษา‌ไทย")
'ภาษาไทย'

The remove_zw function is designed to remove zero-width characters from text data, ensuring that text is free from invisible or unwanted characters.
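Because zero-width characters are invisible in most editors, it helps to write them as escapes. The sketch below removes the two characters listed above, U+200B (ZWSP) and U+200C (ZWNJ); strip_zero_width is a hypothetical name, and any further characters the library may handle are not covered.

```python
# Code points to delete: zero-width space and zero-width non-joiner.
_ZERO_WIDTH_TABLE = {0x200B: None, 0x200C: None}


def strip_zero_width(text: str) -> str:
    """Delete ZWSP and ZWNJ; all other characters are kept."""
    return text.translate(_ZERO_WIDTH_TABLE)


raw = "สวัสดี\u200bครับ"
cleaned = strip_zero_width(raw)
print(len(raw), len(cleaned))  # the cleaned string is one character shorter
```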

pythainlp.util.reorder_vowels(text: str) → str[source]

Reorder vowels and tone marks to the standard logical order/spelling.

Characters in input text will be reordered/transformed, according to these rules:

Sara E + Sara E -> Sara Ae

Nikhahit + Sara Aa -> Sara Am

tone mark + non-base vowel -> non-base vowel + tone mark

follow vowel + tone mark -> tone mark + follow vowel

Parameters:

text (str) – input text

Returns:

text with vowels and tone marks in the standard logical order

Return type:

str

Example:

>>> from pythainlp.util import reorder_vowels
>>> reorder_vowels("เเปลก")  # two Sara E become Sara Ae
'แปลก'
>>> reorder_vowels("ก้ำ")  # reorder tone marks and vowels
'ก้ำ'

The reorder_vowels function rewrites vowels and tone marks into the standard logical order, so that strings that look the same on screen are stored as the same character sequence. It is useful for text cleaning and string matching.

pythainlp.util.rhyme(word: str) → list[str][source]

Find Thai rhyme

Parameters:

word (str) – A Thai word

Returns:

a list of Thai words that rhyme with the input word

Return type:

List[str]

Example:

>>> from pythainlp.util import rhyme
>>> rhyme("จีบ")
['กลีบ', 'กีบ', 'ครีบ', 'คีบ', 'งีบ', ... ]

The rhyme function finds Thai words that rhyme with a given Thai word.

pythainlp.util.sound_syllable(syllable: str) → str[source]

Sound syllable classification

This function classifies a Thai syllable as either a live syllable or a dead syllable.

Parameters:

syllable (str) – Thai syllable

Returns:

syllable’s type (“live” or “dead”)

Return type:

str

Example:

>>> from pythainlp.util import sound_syllable
>>> sound_syllable("มา")
'live'
>>> sound_syllable("เลข")
'dead'

The sound_syllable function classifies a Thai syllable as live or dead. This is useful for phonetic and linguistic analysis.

pythainlp.util.syllable_length(syllable: str) → str[source]

Thai syllable length

This function determines whether a Thai syllable is long or short.

Parameters:

syllable (str) – Thai syllable

Returns:

syllable’s length (long or short)

Return type:

str

Example:

>>> from pythainlp.util import syllable_length
>>> syllable_length("มาก")
'long'
>>> syllable_length("คะ")
'short'

The syllable_length function classifies a Thai syllable as long or short; it does not measure length in characters. This is useful for linguistic analysis.

pythainlp.util.syllable_open_close_detector(syllable: str) → str[source]

Open/close Thai syllables detector

This function determines whether a Thai syllable is open or closed.

Parameters:

syllable (str) – Thai syllable

Returns:

open / close

Return type:

str

Example:

>>> from pythainlp.util import syllable_open_close_detector
>>> syllable_open_close_detector("มาก")
'close'
>>> syllable_open_close_detector("คะ")
'open'

The syllable_open_close_detector function reports whether a Thai syllable is open or closed. This is useful for phonetic analysis and linguistic research.

pythainlp.util.text_to_arabic_digit(text: str) → str[source]

Converts spelled out digits in Thai to Arabic digits.

Parameters:

text (str) – A digit spelled out in Thai

Returns:

An Arabic digit such as ‘1’, ‘2’, ‘3’ if the text is digit spelled out in Thai (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.

Return type:

str

Example:

>>> from pythainlp.util import text_to_arabic_digit

>>> text_to_arabic_digit("ศูนย์")
'0'
>>> text_to_arabic_digit("หนึ่ง")
'1'
>>> text_to_arabic_digit("แปด")
'8'
>>> text_to_arabic_digit("เก้า")
'9'

>>> # For text that is not digit spelled out in Thai
>>> text_to_arabic_digit("สิบ") == ""
True
>>> text_to_arabic_digit("เก้าร้อย") == ""
True

The text_to_arabic_digit function is a numeral conversion tool that translates Thai text numerals into Arabic numeral form. It is useful for numerical data extraction and processing.
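The behaviour described here, a single spelled-out digit in and a one-character string or an empty string out, matches a plain dictionary lookup. The sketch below is an assumption about one way to do it, with hypothetical names throughout.

```python
# Thai words for 0..9, indexed by digit value.
THAI_DIGIT_WORDS = [
    "ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่",
    "ห้า", "หก", "เจ็ด", "แปด", "เก้า",
]

_WORD_TO_ARABIC = {word: str(i) for i, word in enumerate(THAI_DIGIT_WORDS)}


def word_to_arabic_digit(text: str) -> str:
    """Return '0'..'9' for a spelled-out Thai digit, else an empty string."""
    return _WORD_TO_ARABIC.get(text, "")


print(repr(word_to_arabic_digit("แปด")))
print(repr(word_to_arabic_digit("สิบ")))
```

Returning an empty string instead of raising keeps the function easy to use in filters, at the cost of hiding unexpected input.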

pythainlp.util.text_to_num(text: str) → list[str][source]

Thai text to list of Thai words with floating point numbers

Parameters:

text (str) – Thai text with the spelled-out numerals

Returns:

list of Thai words with float values of the input

Return type:

List[str]

Example:

>>> from pythainlp.util import text_to_num
>>> text_to_num("เก้าร้อยแปดสิบจุดเก้าห้าบาทนี่คือจำนวนทั้งหมด")
['980.95', 'บาท', 'นี่', 'คือ', 'จำนวน', 'ทั้งหมด']
>>> text_to_num("สิบล้านสองหมื่นหนึ่งพันแปดร้อยแปดสิบเก้าบาท")
['10021889', 'บาท']

The text_to_num function focuses on extracting numerical values from text data. This is essential for converting textual numbers into numerical form for computation.

pythainlp.util.text_to_thai_digit(text: str) → str[source]

Converts spelled out digits in Thai to Thai digits.

Parameters:

text – A digit spelled out in Thai

Returns:

A Thai digit such as ‘๑’, ‘๒’, ‘๓’ if the text is a digit spelled out in Thai (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.

Return type:

str

Example:

>>> from pythainlp.util import text_to_thai_digit

>>> text_to_thai_digit("ศูนย์")
'๐'
>>> text_to_thai_digit("หนึ่ง")
'๑'
>>> text_to_thai_digit("แปด")
'๘'
>>> text_to_thai_digit("เก้า")
'๙'

>>> # For text that is not a digit spelled out in Thai
>>> text_to_thai_digit("สิบ") == ""
True
>>> text_to_thai_digit("เก้าร้อย") == ""
True

The text_to_thai_digit function converts a single digit spelled out in Thai into the corresponding Thai digit. This is useful for rendering numbers naturally in Thai text.

pythainlp.util.thai_digit_to_arabic_digit(text: str) → str[source]

Converts Thai digits (i.e. ๑, ๓, ๑๐) to Arabic digits (i.e. 1, 3, 10).

Parameters:

text (str) – Text with Thai digits such as ‘๑’, ‘๒’, ‘๓’

Returns:

Text with Thai digits converted to Arabic digits such as ‘1’, ‘2’, ‘3’

Return type:

str

Example:

>>> from pythainlp.util import thai_digit_to_arabic_digit
>>> text = "เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท"
>>> thai_digit_to_arabic_digit(text)
'เป็นจำนวน 123,400.25 บาท'

The thai_digit_to_arabic_digit function allows you to transform Thai numeral text into Arabic numeral format. This is valuable for numerical data extraction and computation tasks.
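A digit-for-digit conversion like this can be sketched with a translation table. The snippet below is only an illustration under that assumption; the name thai_to_arabic_digits is hypothetical and this is not necessarily how PyThaiNLP implements the function.

```python
# Thai digits occupy the contiguous range U+0E50 (๐) to U+0E59 (๙).
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))

# Translation table mapping each Thai digit to its Arabic counterpart.
_THAI_TO_ARABIC = str.maketrans(THAI_DIGITS, "0123456789")


def thai_to_arabic_digits(text: str) -> str:
    """Replace Thai digits with Arabic digits; leave all other characters as-is.

    Hypothetical helper for illustration only.
    """
    return text.translate(_THAI_TO_ARABIC)


print(thai_to_arabic_digits("เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท"))
```

Because only the ten digit characters are in the table, punctuation such as the thousands separator and the decimal point passes through unchanged.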

pythainlp.util.thai_strftime(dt_obj: datetime, fmt: str = '%-d %b %y', thaidigit: bool = False) → str[source]

Convert datetime.datetime into Thai date and time format.

The formatting directives are similar to datetime.strftime().

This function uses Thai names and Thai Buddhist Era for these directives:

%a - abbreviated weekday name (e.g. “จ”, “อ”, “พ”, “พฤ”, “ศ”, “ส”, “อา”)
%A - full weekday name (e.g. “วันจันทร์”, “วันอังคาร”, “วันเสาร์”, “วันอาทิตย์”)
%b - abbreviated month name (e.g. “ม.ค.”, “ก.พ.”, “มี.ค.”, “เม.ย.”, “พ.ค.”, “มิ.ย.”, “ธ.ค.”)
%B - full month name (e.g. “มกราคม”, “กุมภาพันธ์”, “พฤศจิกายน”, “ธันวาคม”)
%y - year without century (e.g. “56”, “10”)
%Y - year with century (e.g. “2556”, “2410”)
%c - date and time representation (e.g. “พ 6 ต.ค. 01:40:00 2519”)
%v - short date representation (e.g. “ 6-ม.ค.-2562”, “27-ก.พ.-2555”)

Other directives will be passed to datetime.strftime()

Note:

The Thai Buddhist Era (BE) year is simply converted from AD by adding 543. This is not accurate for years before 1941 AD, because of the change in Thai New Year’s Day.
This is meant to be an interim solution, since the locale module in Python’s standard library (which relies on C’s strftime()) does not support the “th” or “th_TH” locale yet. If it were supported, we could just call locale.setlocale(locale.LC_TIME, “th_TH”) and then use the native datetime.strftime().

We are trying to make this platform-independent and to support as many extensions as possible. See these links for strftime() extensions in POSIX, BSD, and GNU libc:

Python https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

C https://en.cppreference.com/w/cpp/chrono/c/strftime

GNU https://metacpan.org/pod/POSIX::strftime::GNU

Linux https://linux.die.net/man/3/strftime

OpenBSD https://man.openbsd.org/strftime.3

FreeBSD https://www.unix.com/man-page/FreeBSD/3/strftime/

macOS https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/strftime.3.html

PHP https://secure.php.net/manual/en/function.strftime.php

JavaScript’s implementation https://github.com/samsonjs/strftime

strftime() quick reference https://strftime.net/

Parameters:

dt_obj (datetime) – an instantiated object of datetime.datetime
fmt (str) – string containing date and time directives
thaidigit (bool) – If thaidigit is set to False (default), numbers are represented in Arabic digits. If it is set to True, they are represented in Thai digits.

Returns:

Date and time text, with the month in its Thai name and the year in Thai Buddhist Era. The year is simply converted from AD by adding 543 (this will not be accurate for years before 1941 AD, due to the change in Thai New Year’s Day).

Return type:

str

Example:

>>> from datetime import datetime
>>> from pythainlp.util import thai_strftime

>>> datetime_obj = datetime(year=2019, month=6, day=9, \
...     hour=5, minute=59, second=0, microsecond=0)

>>> print(datetime_obj)
2019-06-09 05:59:00

>>> thai_strftime(datetime_obj, "%A %d %B %Y")
'วันอาทิตย์ 09 มิถุนายน 2562'

>>> thai_strftime(datetime_obj, "%a %-d %b %y")  # no padding
'อา 9 มิ.ย. 62'

>>> thai_strftime(datetime_obj, "%a %_d %b %y")  # space padding
'อา  9 มิ.ย. 62'

>>> thai_strftime(datetime_obj, "%a %0d %b %y")  # zero padding
'อา 09 มิ.ย. 62'

>>> thai_strftime(datetime_obj, "%-H นาฬิกา %-M นาที", thaidigit=True)
'๕ นาฬิกา ๕๙ นาที'

>>> thai_strftime(datetime_obj, "%D (%v)")
'06/09/62 ( 9-มิ.ย.-2562)'

>>> thai_strftime(datetime_obj, "%c")
'อา  9 มิ.ย. 05:59:00 2562'

>>> thai_strftime(datetime_obj, "%H:%M %p")
'05:59 AM'

>>> thai_strftime(datetime_obj, "%H:%M %#p")
'05:59 am'

The thai_strftime function is a date formatting tool tailored for Thai culture. It is essential for displaying dates and times in a format that adheres to Thai conventions.
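The year handling described in the note (BE = AD + 543) can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the helper short_thai_date and the month table are hypothetical names introduced here, and the sketch handles only one fixed layout roughly equivalent to "%-d %b %y".

```python
from datetime import datetime

BE_OFFSET = 543  # Buddhist Era year = AD year + 543, as stated in the note above

# Abbreviated Thai month names, January to December (table assumed for this sketch).
THAI_ABBR_MONTHS = [
    "ม.ค.", "ก.พ.", "มี.ค.", "เม.ย.", "พ.ค.", "มิ.ย.",
    "ก.ค.", "ส.ค.", "ก.ย.", "ต.ค.", "พ.ย.", "ธ.ค.",
]


def short_thai_date(dt: datetime) -> str:
    """Format a datetime as 'day abbreviated-month two-digit-BE-year'.

    Hypothetical helper; the same caveat applies for years before 1941 AD.
    """
    be_year = dt.year + BE_OFFSET
    month = THAI_ABBR_MONTHS[dt.month - 1]
    return f"{dt.day} {month} {be_year % 100:02d}"


print(short_thai_date(datetime(2019, 6, 9, 5, 59)))
```

For the datetime used in the examples above, 2019 + 543 gives 2562, so the two-digit year is 62.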

pythainlp.util.thai_strptime(text: str, fmt: str, year: str = 'be', add_year: int | None = None, tzinfo: ZoneInfo | None = zoneinfo.ZoneInfo(key='Asia/Bangkok')) → datetime[source]

Thai strptime: parse Thai date and time text into datetime.datetime.

Parameters:

text (str) – text
fmt (str) – string containing date and time directives
year (str) – era of the year in the text (ad is Anno Domini and be is Buddhist Era)
add_year (Optional[int]) – add to year when converting to ad. Default is None.
tzinfo (object) – tzinfo (default is Asia/Bangkok)

Returns:

The date and time parsed from the text, as a datetime.datetime

Return type:

datetime.datetime

The fmt chars that are supported:

%d - Day (1 - 31)
%B - Thai month (03, 3, มี.ค., or มีนาคม)
%Y - Year (66, 2566, or 2023)
%H - Hour (0 - 23)
%M - Minute (0 - 59)
%S - Second (0 - 59)
%f - Microsecond

Example:

>>> from pythainlp.util import thai_strptime

>>> thai_strptime("15 ก.ค. 2565 09:00:01","%d %B %Y %H:%M:%S")
datetime.datetime(2022, 7, 15, 9, 0, 1, tzinfo=zoneinfo.ZoneInfo(key='Asia/Bangkok'))

The thai_strptime function focuses on parsing dates and times in a Thai-specific format, making it easier to work with date and time data in a Thai context.

pythainlp.util.thai_to_eng(text: str) → str[source]

Corrects text that was mistakenly typed with the Thai Kedmanee keyboard layout when the English-US Qwerty keyboard layout was intended.

Parameters:

text (str) – incorrect text input (English typed using Thai keyboard)

Returns:

English text with the keyboard layout error corrected

Return type:

str

Example:

Intentionally type “Bank of Thailand”, but got “ฺฟืา นด ธ้ฟรสฟืก”:

>>> from pythainlp.util import thai_to_eng
>>> thai_to_eng("ฺฟืา นด ธ้ฟรสฟืก")
'Bank of Thailand'

The thai_to_eng function is a keyboard layout correction tool. It recovers the intended English text from input that was typed with the Thai Kedmanee layout active; it does not translate or transliterate Thai.

pythainlp.util.to_idna(text: str) → str[source]

Encode text with IDNA, as used in Internationalized Domain Name (IDN).

Parameters:

text (str) – Thai text

Returns:

IDNA-encoded text

Return type:

str

Example:

>>> from pythainlp.util import to_idna
>>> to_idna("คนละครึ่ง.com")
'xn--42caj4e6bk1f5b1j.com'

The to_idna function encodes Thai text into the ASCII-compatible form used by Internationalized Domain Names (IDN), which is useful for handling Thai domain names.
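Python's standard library ships an "idna" codec, so the encoding step can be sketched without any third-party package. The helper names below are hypothetical, and whether PyThaiNLP uses this codec internally is an assumption rather than something stated here.

```python
def to_idna_sketch(text: str) -> str:
    """Encode a domain name into its ASCII-compatible (IDNA) form.

    Hypothetical helper built on the standard library "idna" codec.
    """
    return text.encode("idna").decode("ascii")


def from_idna_sketch(encoded: str) -> str:
    """Decode an IDNA-encoded domain name back to Unicode text."""
    return encoded.encode("ascii").decode("idna")


encoded = to_idna_sketch("คนละครึ่ง.com")
print(encoded)
print(from_idna_sketch(encoded))
```

Each dot-separated label is encoded on its own: labels containing non-ASCII characters receive the "xn--" prefix, while plain ASCII labels such as "com" are left unchanged.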

pythainlp.util.thai_word_tone_detector(word: str | None) → list[tuple[str, str]][source]

Thai tone detector for word.

It uses pythainlp.transliterate.pronunciate for converting word to pronunciation.

Parameters:

word (str, optional) – Thai word, or None

Returns:

list of tuples (syllable, tone) for each syllable. Tone values: l (low), m (mid), h (high), r (rising), f (falling), or empty string if it cannot be detected. Returns [] if word is None or empty.

Return type:

list[tuple[str, str]]

Example:

>>> from pythainlp.util import thai_word_tone_detector
>>> print(thai_word_tone_detector("คนดี"))
[('คน', 'm'), ('ดี', 'm')]
>>> print(thai_word_tone_detector("มือถือ"))
[('มือ', 'm'), ('ถือ', 'r')]
>>> print(thai_word_tone_detector(None))
[]

The thai_word_tone_detector function splits a Thai word into syllables and reports the tone of each one. It is useful for phonetic analysis and pronunciation guides.

pythainlp.util.thaiword_to_date(text: str, date: datetime | None = None) → datetime | None[source]

Convert Thai relative date to datetime.datetime.

Parameters:

text (str) – Thai text containing relative date
date (datetime.datetime) – date (default is datetime.datetime.now())

Returns:

datetime object, if it can be calculated. Otherwise, None.

Return type:

datetime.datetime or None

Example:

>>> from pythainlp.util import thaiword_to_date
>>> thaiword_to_date("พรุ่งนี้")  # returns the datetime of tomorrow

The thaiword_to_date function facilitates the conversion of Thai word representations of dates into standardized date formats. This is important for date data extraction and processing.
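The idea of resolving a relative date word against a reference date can be sketched with a small lookup table and timedelta. The table below covers only four words and is an assumption made for illustration; the vocabulary actually supported by thaiword_to_date is not listed here, and relative_word_to_date is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Optional

# Day offsets for a few relative date words (illustrative subset only).
RELATIVE_DAYS = {
    "เมื่อวาน": -1,  # yesterday
    "วันนี้": 0,      # today
    "พรุ่งนี้": 1,     # tomorrow
    "มะรืนนี้": 2,    # the day after tomorrow
}


def relative_word_to_date(
    text: str, base: Optional[datetime] = None
) -> Optional[datetime]:
    """Return base shifted by the offset for text, or None if text is unknown."""
    if text not in RELATIVE_DAYS:
        return None
    if base is None:
        base = datetime.now()
    return base + timedelta(days=RELATIVE_DAYS[text])


print(relative_word_to_date("พรุ่งนี้", datetime(2024, 1, 31, 12, 0)))
```

Passing an explicit base date, as the date parameter above allows, keeps the result reproducible in tests.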

pythainlp.util.thaiword_to_num(word: str) → int[source]

Converts the spelled-out numerals in Thai scripts into an actual integer.

Parameters:

word (str) – Spelled-out numerals in Thai scripts

Returns:

Corresponding integer value of the input

Return type:

int

Example:

>>> from pythainlp.util import thaiword_to_num
>>> thaiword_to_num("ศูนย์")
0
>>> thaiword_to_num("สองล้านสามแสนหกร้อยสิบสอง")
2300612

The thaiword_to_num function is a numeral conversion tool for translating Thai word numerals into numerical form. This is essential for numerical data extraction and computation.

pythainlp.util.thaiword_to_time(text: str, padding: bool = True) → str[source]

Convert Thai time in words into time (H:M).

Parameters:

text (str) – Thai time in words
padding (bool) – Zero pad the hour if True

Returns:

time string

Return type:

str

Example:

>>> from pythainlp.util import thaiword_to_time
>>> thaiword_to_time("บ่ายโมงครึ่ง")
'13:30'

The thaiword_to_time function is designed for converting Thai word representations of time into standardized time formats. It is crucial for time data extraction and processing.

pythainlp.util.time_to_thaiword(time_data: time | datetime | str, fmt: str = '24h', precision: str | None = None) → str[source]

Spell out time as Thai words.

Parameters:

time_data (datetime.time or datetime.datetime or str) – time input; a datetime.time object, a datetime.datetime object, or a string in H:M or H:M:S format (24-hour clock)
fmt (str) – time output format * 24h - 24-hour clock (default) * 6h - 6-hour clock * m6h - Modified 6-hour clock
precision (str) – precision of the spell out time * m - always spell out at minute level * s - always spell out at second level * None - spell out only non-zero parts

Returns:

Time spelled out as Thai words

Return type:

str

Example:

>>> from datetime import time
>>> from pythainlp.util import time_to_thaiword
>>> time_to_thaiword("8:17")
'แปดนาฬิกาสิบเจ็ดนาที'
>>> time_to_thaiword("8:17", "6h")
'สองโมงเช้าสิบเจ็ดนาที'
>>> time_to_thaiword("8:17", "m6h")
'แปดโมงสิบเจ็ดนาที'
>>> time_to_thaiword("18:30", fmt="m6h")
'หกโมงครึ่ง'
>>> time_to_thaiword(time(12, 3, 0))
'สิบสองนาฬิกาสามนาที'
>>> time_to_thaiword(time(12, 3, 0), precision="s")
'สิบสองนาฬิกาสามนาทีศูนย์วินาที'

The time_to_thaiword function focuses on converting time values into Thai word representations. This is valuable for rendering time in a natural Thai textual format.

pythainlp.util.tis620_to_utf8(text: str) → str[source]

Convert TIS-620 to UTF-8

Parameters:

text (str) – TIS-620 encoded text

Returns:

UTF-8 encoded text

Return type:

str

Example:

>>> from pythainlp.util import tis620_to_utf8
>>> tis620_to_utf8("¡ÃÐ·ÃÇ§ÍØµÊÒË¡ÃÃÁ")
'กระทรวงอุตสาหกรรม'

The tis620_to_utf8 function serves as a character encoding conversion tool for converting TIS-620 encoded text into UTF-8 format. This is significant for character encoding compatibility.
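The garbled input in the example is what TIS-620 bytes look like when they are misread as a Latin single-byte encoding. One way to reproduce and undo that effect with the standard library codecs is sketched below; the helper names are hypothetical, and the assumption that the bytes were misread as Latin-1 is ours, not a statement about how tis620_to_utf8 works internally.

```python
def tis620_bytes_as_latin1(text: str) -> str:
    """Simulate the mojibake: encode as TIS-620, then misread as Latin-1."""
    return text.encode("tis_620").decode("latin-1")


def repair_tis620_mojibake(garbled: str) -> str:
    """Recover Thai text by re-reading the same bytes as TIS-620."""
    return garbled.encode("latin-1").decode("tis_620")


original = "กระทรวง"
garbled = tis620_bytes_as_latin1(original)
print(garbled)
print(repair_tis620_mojibake(garbled))
```

The round trip works because both encodings use exactly one byte per character, so no information is lost when the bytes are misread.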

pythainlp.util.tone_detector(syllable: str) → str[source]

Thai tone detector for syllables

Return tone of a syllable.

l: low
m: mid
r: rising
f: falling
h: high
empty string: cannot be detected

Parameters:

syllable (str) – Thai syllable

Returns:

syllable’s tone (l, m, h, r, f) or empty if it cannot be detected

Return type:

str

Example:

>>> from pythainlp.util import tone_detector
>>> tone_detector("มา")
'm'
>>> tone_detector("ไม้")
'h'

The tone_detector function determines the tone (low, mid, high, rising, or falling) of a single Thai syllable. It is useful for phonetic analysis and pronunciation guides.

pythainlp.util.words_to_num(words: list[str]) → float[source]

Thai words to float.

Parameters:

words (list[str]) – Thai words (a number broken into tokens)

Returns:

float value of the words

Return type:

float

Example:

>>> from pythainlp.util import words_to_num
>>> words_to_num(["ห้า", "สิบ", "จุด", "เก้า", "ห้า"])
50.95

The words_to_num function is a numeral conversion utility that translates Thai word numerals into numerical form. It is important for numerical data extraction and computation.
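To illustrate how a tokenized spelled-out number can be folded into a value, here is a minimal sketch. It assumes the tokens are already split into digit words, place-value words, and the decimal marker "จุด". The tables and the name tokens_to_number are hypothetical, and the logic is deliberately simpler than a production parser (for example, it performs little validation).

```python
# Digit words, including the variants used in compounds (เอ็ด = 1, ยี่ = 2).
DIGITS = {
    "ศูนย์": 0, "หนึ่ง": 1, "เอ็ด": 1, "สอง": 2, "ยี่": 2, "สาม": 3,
    "สี่": 4, "ห้า": 5, "หก": 6, "เจ็ด": 7, "แปด": 8, "เก้า": 9,
}
# Place-value words below one million.
UNITS = {"สิบ": 10, "ร้อย": 100, "พัน": 1_000, "หมื่น": 10_000, "แสน": 100_000}
MILLION = "ล้าน"
POINT = "จุด"


def tokens_to_number(tokens: list[str]) -> float:
    """Fold Thai numeral tokens into a float (illustrative sketch only)."""
    if POINT in tokens:
        split = tokens.index(POINT)
        int_tokens, frac_tokens = tokens[:split], tokens[split + 1:]
    else:
        int_tokens, frac_tokens = tokens, []

    total = 0       # integer value accumulated so far
    pending = None  # digit still waiting for its place-value word
    for tok in int_tokens:
        if tok in DIGITS:
            pending = DIGITS[tok]
        elif tok in UNITS:
            # A bare place word such as "สิบ" counts as one of that unit.
            total += (1 if pending is None else pending) * UNITS[tok]
            pending = None
        elif tok == MILLION:
            total = (total + (pending or 0)) * 1_000_000
            pending = None
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    total += pending or 0

    # Digits after the decimal marker are read one by one.
    frac = "".join(str(DIGITS[tok]) for tok in frac_tokens)
    return float(f"{total}.{frac or '0'}")


print(tokens_to_number(["ห้า", "สิบ", "จุด", "เก้า", "ห้า"]))
```

Building the result from a decimal string, rather than summing fractional powers of ten, avoids accumulating floating-point rounding in the fractional part.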

pythainlp.util.thai_consonant_to_spelling(c: str) → str[source]

Thai consonants to spelling

Parameters:

c (str) – A Thai consonant

Returns:

spelling

Return type:

str

Example:

>>> from pythainlp.util import thai_consonant_to_spelling
>>> print(thai_consonant_to_spelling("ก"))
กอ

pythainlp.util.tone_to_spelling(t: str) → str[source]

Thai tonemarks to spelling

Parameters:

t (str) – A Thai tone mark

Returns:

spelling

Return type:

str

Example:

>>> from pythainlp.util import tone_to_spelling
>>> print(tone_to_spelling("่"))  # ไม้เอก
ไม้เอก

pythainlp.util.spell_words.spell_syllable(text: str) → list[str][source]

Spell out syllables in Thai word distribution form.

Parameters:

text (str) – Thai syllables only

Returns:

list of spelled-out syllable components

Return type:

list[str]

Example:

>>> from pythainlp.util.spell_words import spell_syllable
>>> spell_syllable("แมว")
['มอ', 'วอ', 'แอ', 'แมว']

The pythainlp.util.spell_words.spell_syllable function focuses on spelling syllables in Thai text, an important feature for phonetic analysis and linguistic research.

pythainlp.util.spell_words.spell_word(text: str | None) → list[str][source]

Spell out words in Thai word distribution form.

Parameters:

text (Optional[str]) – Thai words only, or None

Returns:

List of spelled out words, empty list if text is None or empty

Return type:

list[str]

Example:

>>> from pythainlp.util.spell_words import spell_word
>>> spell_word("คนดี")
['คอ', 'นอ', 'คน', 'ดอ', 'อี', 'ดี', 'คนดี']
>>> spell_word(None)
[]

The pythainlp.util.spell_words.spell_word function is designed for spelling individual words in Thai text, facilitating phonetic analysis and pronunciation guides.

pythainlp.util.to_lunar_date(input_date: date) → str[source]

Convert the solar date to Thai Lunar Date

Parameters:

input_date (date) – date of the day.

Returns:

Thai text lunar date

Return type:

str

Example:

>>> from datetime import date
>>> from pythainlp.util import to_lunar_date
>>> to_lunar_date(date(2024, 1, 1))
'แรม 5 ค่ำ เดือน 1'
>>> to_lunar_date(date(2024, 12, 31))
'ขึ้น 2 ค่ำ เดือน 2'

The to_lunar_date function converts a solar date to the corresponding Thai lunar date.

pythainlp.util.th_zodiac(year: int, output_type: int = 1) → str | int[source]

Thai Zodiac Year Name. Converts a Gregorian year to its corresponding Zodiac name.

Parameters:

year (int) – The Gregorian year. AD (Anno Domini)
output_type (int) – Output type (1 = Thai, 2 = English, 3 = Number).

Returns:

The Zodiac name or number corresponding to the input year.

Return type:

Union[str, int]

Example:

>>> from pythainlp.util import th_zodiac
>>> # Get Thai zodiac name
>>> th_zodiac(2024, output_type=1)
'มะโรง'
>>> # Get English zodiac name
>>> th_zodiac(2024, output_type=2)
'DRAGON'
>>> # Get zodiac number
>>> th_zodiac(2024, output_type=3)
5

The th_zodiac function converts a Gregorian year to its corresponding Thai Zodiac name.
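The mapping from year to zodiac animal follows a 12-year cycle, which can be sketched with modular arithmetic. The snippet assumes the cycle is aligned so that 2020 AD is year 1 (the Rat) and that the animal changes on 1 January; both are simplifications made for this sketch, and the helper names are hypothetical.

```python
# Thai zodiac year names in cycle order, starting from the Rat (ชวด).
ZODIAC_TH = [
    "ชวด", "ฉลู", "ขาล", "เถาะ", "มะโรง", "มะเส็ง",
    "มะเมีย", "มะแม", "วอก", "ระกา", "จอ", "กุน",
]


def zodiac_number(year: int) -> int:
    """Return the 1-based position of a Gregorian (AD) year in the cycle."""
    return (year - 4) % 12 + 1


def zodiac_name_th(year: int) -> str:
    """Return the Thai zodiac name for a Gregorian (AD) year."""
    return ZODIAC_TH[zodiac_number(year) - 1]


print(zodiac_number(2024), zodiac_name_th(2024))
```

For 2024 this gives position 5, which agrees with the number and the Thai name shown in the example above.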

class pythainlp.util.Trie(words: Iterable[str])[source]

Trie data structure for efficient prefix-based word search.

A Trie (prefix tree) is a tree-like data structure used to store a collection of strings. It enables fast retrieval of words with common prefixes, making it ideal for dictionary-based tokenization and autocomplete features.

Parameters:

words (Iterable[str]) – An iterable collection of words to initialize the Trie

Example:

>>> from pythainlp.util import Trie
>>> trie = Trie(["สวัสดี", "สวัส", "ดี", "ครับ"])
>>> "สวัสดี" in trie
True
>>> trie.prefixes("สวัสดีครับ")
['สวัส', 'สวัสดี']
>>> trie.add("สวัสดีตอนเช้า")
>>> len(trie)
5

The Trie class is a data structure for efficient dictionary operations. It’s a valuable resource for managing and searching word lists and dictionaries in a structured and efficient manner.

class Node[source]

__init__() → None[source]

end: bool

children: dict[str, Node] | None

__init__(words: Iterable[str]) → None[source]

root: Node

add(word: str) → None[source]

Add a word to the trie. Spaces in front of and following the word will be removed.

Parameters:

word (str) – a word

remove(word: str) → None[source]

Remove a word from the trie. If the word is not found, do nothing.

Parameters:

word (str) – a word

prefixes(text: str, start: int = 0) → list[str][source]

List all words in the trie that match the sequence of characters in text beginning at position start.

Parameters:

text (str) – text to search for prefixes
start (int) – starting position in text, defaults to 0

Returns:

a list of possible words starting at start

Return type:

list[str]
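To show how prefix lookup works in a trie, here is a minimal dictionary-based sketch. It is an independent illustration and not PyThaiNLP's Trie class: the class name SimpleTrie and its internal layout are assumptions made for this example.

```python
from collections.abc import Iterable

_END = ""  # marker key; cannot collide with a one-character edge label


class SimpleTrie:
    """Minimal trie storing words as nested dictionaries (illustrative only)."""

    def __init__(self, words: Iterable[str] = ()) -> None:
        self._root: dict = {}
        self._size = 0
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        """Insert a word, ignoring surrounding whitespace and empty input."""
        word = word.strip()
        if not word:
            return
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        if _END not in node:
            node[_END] = True
            self._size += 1

    def __contains__(self, word: str) -> bool:
        node = self._root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node

    def __len__(self) -> int:
        return self._size

    def prefixes(self, text: str, start: int = 0) -> list[str]:
        """Return every stored word that matches text beginning at start."""
        found = []
        node = self._root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if _END in node:
                found.append(text[start:i + 1])
        return found


trie = SimpleTrie(["สวัสดี", "สวัส", "ดี", "ครับ"])
print(trie.prefixes("สวัสดีครับ"))
```

A single walk down the tree finds all matching words at a position, which is why this structure suits dictionary-based tokenization.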

pythainlp.util.longest_common_subsequence(str1: str, str2: str) → str[source]

Find the longest common subsequence between two strings.

Parameters:

str1 (str) – The first string.
str2 (str) – The second string.

Returns:

The longest common subsequence.

Return type:

str

Example:

>>> from pythainlp.util.lcs import longest_common_subsequence
>>> longest_common_subsequence("ABCBDAB", "BDCAB")
'BDAB'

The longest_common_subsequence function finds the longest common subsequence between two strings.
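The classic way to compute a longest common subsequence is dynamic programming over a table of prefix lengths, followed by backtracking. The sketch below shows that textbook approach under the hypothetical name lcs; it is not claimed to be the library's implementation, and when several subsequences share the maximum length it may return a different one than the example above.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("ABCBDAB", "BDCAB"))
```

The table takes O(len(a) × len(b)) time and memory, which is fine for short strings but worth keeping in mind for long inputs.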

pythainlp.util.morse.morse_encode(text: str, lang: str = 'th') → str[source]

Convert text to Morse code (support Thai and English)

Parameters:

text (str) – Text
lang (str) – Language Code (th is Thai and en is English)

Returns:

Morse code

Return type:

str

Example:

>>> from pythainlp.util.morse import morse_encode
>>> morse_encode("แมว", lang="th")
'.-.- -- .--'
>>> morse_encode("cat", lang="en")
'-.-. .- -'

The pythainlp.util.morse.morse_encode function converts text to Morse code.

pythainlp.util.morse.morse_decode(morse_text: str, lang: str = 'th') → str[source]

Convert Morse code to text.

Thai decoding may produce incorrect characters that can be fixed with a spell corrector.

Parameters:

morse_text (str) – Morse code
lang (str) – language code ('th' for Thai, 'en' for English)

Returns:

decoded text

Return type:

str

Example:

>>> from pythainlp.util.morse import morse_decode
>>> morse_decode(".-.- -- .--", lang="th")
'แมว'
>>> morse_decode("-.-. .- -", lang="en")
'CAT'

The pythainlp.util.morse.morse_decode function converts Morse code to text.
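Encoding and decoding are both table lookups, with one space separating the code of each character. The sketch below covers only the 26 English letters of International Morse code; the function names are hypothetical, Thai is not handled, and it is not taken from the library's source.

```python
# International Morse code for the English letters A to Z.
MORSE_EN = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
}
# Reverse table for decoding.
MORSE_EN_REVERSE = {code: letter for letter, code in MORSE_EN.items()}


def encode_en(text: str) -> str:
    """Encode English letters as Morse code, one space between letters."""
    return " ".join(MORSE_EN[ch] for ch in text.upper())


def decode_en(morse_text: str) -> str:
    """Decode space-separated Morse code into upper-case English letters."""
    return "".join(MORSE_EN_REVERSE[code] for code in morse_text.split())


print(encode_en("cat"))
print(decode_en("-.-. .- -"))
```

Morse code carries no letter case, so decoding yields upper-case text, consistent with the 'CAT' result in the example above.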