pythainlp.util¶
The pythainlp.util module contains utility functions for tasks such as text conversion and formatting.
Modules¶
-
pythainlp.util.
arabic_digit_to_thai_digit
(text: str) → str[source]¶ This function converts Arabic digits (e.g. 1, 3, 10) to Thai digits (e.g. ๑, ๓, ๑๐).
- Parameters
text (str) – Text with Arabic digits such as ‘1’, ‘2’, ‘3’
- Returns
Text with Arabic digits being converted to Thai digits such as ‘๑’, ‘๒’, ‘๓’
- Return type
- Example
from pythainlp.util import arabic_digit_to_thai_digit
text = 'เป็นจำนวน 123,400.25 บาท'
arabic_digit_to_thai_digit(text)
# output: เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท
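As a rough illustration of how a digit-for-digit conversion like this can work, the sketch below builds a translation table with the standard library. The name to_thai_digits is hypothetical, and this is not necessarily how pythainlp implements the function.

```python
# Hypothetical helper for illustration; not part of pythainlp.
ARABIC_DIGITS = "0123456789"
# Thai digits occupy U+0E50 (zero) through U+0E59 (nine).
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))

# Map each Arabic digit to the Thai digit at the same position.
_TO_THAI = str.maketrans(ARABIC_DIGITS, THAI_DIGITS)


def to_thai_digits(text: str) -> str:
    """Replace Arabic digits with Thai digits; other characters are unchanged."""
    return text.translate(_TO_THAI)


print(to_thai_digits("123,400.25"))
```

Swapping the two arguments of str.maketrans gives the reverse mapping, from Thai digits back to Arabic digits.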
-
pythainlp.util.
bahttext
(number: float) → str[source]¶ This function converts a number to Thai text and adds the suffix “บาท” (Baht). The precision is fixed at two decimal places (0.00) to fit the “สตางค์” (Satang) unit. This function works similarly to the BAHTTEXT function in Microsoft Excel.
- Parameters
number (float) – number to be converted into Thai Baht currency format
- Returns
text representing the amount of money in the format of Thai currency
- Return type
- Example
from pythainlp.util import bahttext
bahttext(1)
# output: หนึ่งบาทถ้วน
bahttext(21)
# output: ยี่สิบเอ็ดบาทถ้วน
bahttext(200)
# output: สองร้อยบาทถ้วน
-
pythainlp.util.
collate
(data: Iterable, reverse: bool = False) → List[str][source]¶ This function sorts a list of strings in Thai alphabetical order.
- Parameters
- Returns
a list of strings, sorted alphabetically, according to Thai alphabets
- Return type
- Example
from pythainlp.util import collate
collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'])
# output: ['กาล', 'เกิด', 'ไก่', 'เป็ด', 'วันที่', 'วัว', 'หมู']
collate(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่'], reverse=True)
# output: ['หมู', 'วัว', 'วันที่', 'เป็ด', 'ไก่', 'เกิด', 'กาล']
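Plain code-point sorting does not give Thai dictionary order, because leading vowels such as เ and ไ are written before the consonant they belong to. The sketch below shows one simplified way to build a sort key that accounts for this. The names thai_sort_key and thai_sorted and the character ranges are assumptions made for illustration, and this is not claimed to be the library's implementation.

```python
import re
from typing import Iterable, List, Tuple

# Assumed character ranges in the Unicode Thai block:
#   U+0E47-U+0E4C: Maitaikhu, the four tone marks, and Thanthakhat
#   U+0E40-U+0E44: leading vowels (written before the consonant)
#   U+0E01-U+0E2E: consonants
_MARKS = re.compile("[\u0e47-\u0e4c]")
_LEAD_VOWEL = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word: str) -> Tuple[str, str]:
    """Build a key that compares the consonant before its leading vowel."""
    stripped = _MARKS.sub("", word)               # ignore marks when comparing
    swapped = _LEAD_VOWEL.sub(r"\2\1", stripped)  # consonant first, then vowel
    return (swapped, word)                        # original word breaks ties


def thai_sorted(words: Iterable[str], reverse: bool = False) -> List[str]:
    """Sort words with the simplified Thai key above."""
    return sorted(words, key=thai_sort_key, reverse=reverse)


print(thai_sorted(['ไก่', 'เกิด', 'กาล', 'เป็ด', 'หมู', 'วัว', 'วันที่']))
```

With the default key, sorted() would place every word that starts with a leading vowel after all words that start with a consonant, which is why the swap is needed.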
-
pythainlp.util.
dict_trie
(dict_source: Union[str, Iterable[str], pythainlp.util.trie.Trie]) → pythainlp.util.trie.Trie[source]¶ Create a dictionary trie from a file or an iterable.
- Parameters
dict_source (str|Iterable[str]|pythainlp.util.Trie) – a path to dictionary file or a list of words or a pythainlp.util.Trie object
- Returns
a trie object
- Return type
-
pythainlp.util.
digit_to_text
(text: str) → str[source]¶ - Parameters
text (str) – Text with digits such as ‘1’, ‘2’, ‘๓’, ‘๔’
- Returns
Text with digits being spelled out in Thai
-
pythainlp.util.
display_thai_char
(char: str) → str[source]¶ This function adds an underscore (_) prefix to high-position vowels and tone marks to ease readability.
- Parameters
char (str) – the character to display
- Returns
the input character, prefixed with an underscore (_) if it is a high-position vowel or a tone mark
- Return type
- Example
from pythainlp.util import display_thai_char
display_thai_char("้")
# output: "_้"
-
pythainlp.util.
emoji_to_thai
(text: str, delimiters=(':', ':')) → str[source]¶ This function converts emoji to their Thai meanings.
- Parameters
text (str) – Text with Emoji
- Returns
Text with emoji converted to their Thai meanings
- Return type
- Example
from pythainlp.util import emoji_to_thai
emoji_to_thai("จะมานั่งรถเมล์เหมือนผมก็ได้นะครับ ใกล้ชิดประชาชนดี 😀")
# output: จะมานั่งรถเมล์เหมือนผมก็ได้นะครับ ใกล้ชิดประชาชนดี :หน้ายิ้มยิงฟัน:
emoji_to_thai("หิวข้าวอยากกินอาหารญี่ปุ่น 🍣")
# output: หิวข้าวอยากกินอาหารญี่ปุ่น :ซูชิ:
emoji_to_thai("🇹🇭 นี่คิือธงประเทศไทย")
# output: :ธง_ไทย: นี่คิือธงประเทศไทย
-
pythainlp.util.
eng_to_thai
(text: str) → str[source]¶ Corrects text that was mistakenly typed with the English-US QWERTY keyboard layout when the Thai Kedmanee layout was intended.
- Parameters
text (str) – incorrect text input (type Thai with English keyboard)
- Returns
Thai text where incorrect typing with a keyboard layout is corrected
- Return type
- Example
Intending to type “ธนาคารแห่งประเทศไทย”, but typed “Tok8kicsj'xitgmLwmp” instead:
from pythainlp.util import eng_to_thai
eng_to_thai("Tok8kicsj'xitgmLwmp")
# output: ธนาคารแห่งประเทศไทย
-
pythainlp.util.
find_keyword
(word_list: List[str], min_len: int = 3) → Dict[str, int][source]¶ This function counts the frequency of words in the list, excluding stopwords, and returns the result as a frequency dictionary.
- Parameters
- Returns
a dictionary object with key-value pair as word and its raw count
- Return type
- Example
from pythainlp.util import find_keyword
words = ["บันทึก", "เหตุการณ์", "บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก", "เป็น", " ", "ลายลักษณ์อักษร" "และ", "การ", "บันทึก","เสียง","ใน","เหตุการณ์"]
find_keyword(words)
# output: {'บันทึก': 4, 'เหตุการณ์': 3}
find_keyword(words, min_len=1)
# output: {' ': 2, 'บันทึก': 4, 'ลายลักษณ์อักษรและ': 1, 'เสียง': 1, 'เหตุการณ์': 3}
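Judging from the outputs above, min_len appears to act as a minimum count rather than a minimum word length. The sketch below shows that counting-and-filtering idea with collections.Counter. The name keyword_counts, the parameter min_count, and the tiny stopword set are all hypothetical; pythainlp uses its own stopword list and may differ in detail.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# Hypothetical, tiny stopword set for illustration only.
STOPWORDS: Set[str] = {"มี", "การ", "เป็น", "ใน", "และ"}


def keyword_counts(words: Iterable[str], min_count: int = 3) -> Dict[str, int]:
    """Count non-stopwords and keep those seen at least `min_count` times."""
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w: n for w, n in counts.items() if n >= min_count}


words = ["บันทึก", "เหตุการณ์", "บันทึก", "มี", "การ", "บันทึก", "เหตุการณ์"]
print(keyword_counts(words, min_count=2))
```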
-
pythainlp.util.
countthai
(text: str, ignore_chars: str = ' \t\n\r\x0b\x0c0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') → float[source]¶ This function calculates the percentage of Thai characters in the text, with an option to ignore some characters.
- Parameters
- Returns
percentage of Thai characters in the text
- Return type
- Example
Find the percentage of Thai characters in the text with the default set of ignored characters (whitespace, newline characters, punctuation, and digits):
from pythainlp.util import countthai
countthai("ดอนัลด์ จอห์น ทรัมป์ English: Donald John Trump")
# output: 45.0
countthai("(English: Donald John Trump)")
# output: 0.0
Find the percentage of Thai characters in the text while ignoring only punctuation but not whitespace, newline character and digits:
import string
string.punctuation
# output: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
countthai("ดอนัลด์ จอห์น ทรัมป์ English: Donald John Trump", ignore_chars=string.punctuation)
# output: 39.130434782608695
countthai("ดอนัลด์ จอห์น ทรัมป์ (English: Donald John Trump)", ignore_chars=string.punctuation)
# output: 0.0
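The calculation described above can be sketched with the standard library alone. This sketch assumes that "Thai character" means any code point in the Unicode Thai block (U+0E00 to U+0E7F). The names is_thai_char and thai_percentage are hypothetical, and results may differ from countthai in edge cases.

```python
import string

# Default characters to skip: whitespace, digits, and ASCII punctuation.
_DEFAULT_IGNORE = string.whitespace + string.digits + string.punctuation


def is_thai_char(ch: str) -> bool:
    """Assumption: 'Thai' means the Unicode Thai block, U+0E00-U+0E7F."""
    return "\u0e00" <= ch <= "\u0e7f"


def thai_percentage(text: str, ignore_chars: str = _DEFAULT_IGNORE) -> float:
    """Percentage of Thai characters among the characters not ignored."""
    counted = [ch for ch in text if ch not in ignore_chars]
    if not counted:
        return 0.0  # nothing left to measure
    thai = sum(1 for ch in counted if is_thai_char(ch))
    return 100.0 * thai / len(counted)


print(thai_percentage("(English: Donald John Trump)"))
```

Passing a narrower ignore_chars, such as string.punctuation only, makes whitespace and digits count toward the denominator, which lowers the percentage.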
-
pythainlp.util.
is_native_thai
(word: str) → bool[source]¶ Check if a word is a “native Thai word” (Thai: “คำไทยแท้”). This function is based on a simple heuristic algorithm and is not entirely reliable.
English word:
from pythainlp.util import is_native_thai
is_native_thai("Avocado")
# output: False
Native Thai word:
is_native_thai("มะม่วง")
# output: True
is_native_thai("ตะวัน")
# output: True
Non-native Thai word:
is_native_thai("สามารถ")
# output: False
is_native_thai("อิสริยาภรณ์")
# output: False
-
pythainlp.util.
isthai
(word: str, ignore_chars: str = '.') → bool[source]¶ This function checks whether all characters in the input string are Thai characters.
- Parameters
- Returns
returns True if all characters in the input text are Thai characters, otherwise returns False
- Return type
- Example
Check whether all characters are Thai characters. By default, only the full stop (“.”) is ignored:
from pythainlp.util import isthai
isthai("กาลเวลา")
# output: True
isthai("กาลเวลา.")
# output: True
Explicitly ignore digits, whitespace, and the following characters (“-”, “.”, “$”, “,”):
from pythainlp.util import isthai
isthai("กาลเวลา, การเวลา-ก, 3.75$", ignore_chars="1234567890.-,$ ")
# output: True
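The effect of ignore_chars can be pictured as a filter applied before the per-character check. The sketch below assumes that a Thai character is any code point in the Unicode Thai block (U+0E00 to U+0E7F); the name is_thai_text is hypothetical and the library's behaviour may differ in edge cases.

```python
def is_thai_text(text: str, ignore_chars: str = ".") -> bool:
    """True if every character not in `ignore_chars` is in the Thai block."""
    return all(
        "\u0e00" <= ch <= "\u0e7f"  # assumed definition of a Thai character
        for ch in text
        if ch not in ignore_chars   # skip ignored characters entirely
    )


print(is_thai_text("กาลเวลา."))
print(is_thai_text("กาลเวลา, 3.75$", ignore_chars="1234567890.,$ "))
```

Note that this sketch returns True for an empty string, because all() over an empty sequence is True.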
-
pythainlp.util.
isthaichar
(ch: str) → bool[source]¶ This function checks if the input character is a Thai character.
- Parameters
ch (str) – input character
- Returns
returns True if the input character is a Thai character, otherwise returns False
- Return type
- Example
from pythainlp.util import isthaichar
isthaichar("ก")  # THAI CHARACTER KO KAI
# output: True
isthaichar("๐")  # THAI DIGIT ZERO
# output: True
isthaichar("๕")  # THAI DIGIT FIVE
# output: True
-
pythainlp.util.
normalize
(text: str) → str[source]¶ Normalize and clean Thai text with normalizing rules as follows:
Remove zero-width spaces
Remove duplicate spaces
Reorder tone marks and vowels to standard order/spelling
Remove duplicate vowels and signs
Remove duplicate tone marks
Remove dangling non-base characters at the beginning of text
normalize() simply calls remove_zw(), remove_dup_spaces(), remove_repeat_vowels(), and remove_dangling(), in that order.
If a user wants to customize the selection or the order of rules to be applied, they can choose to call those functions by themselves.
Note: for Unicode normalization, see unicodedata.normalize().
- Parameters
text (str) – input text
- Returns
normalized text according to the rules
- Return type
- Example
from pythainlp.util import normalize
normalize('เเปลก')  # starts with two Sara E
# output: แปลก
normalize('นานาาา')
# output: นานา
-
pythainlp.util.
now_reign_year
() → int[source]¶ Return the reign year of the 10th King of Chakri dynasty.
- Returns
reign year of the 10th King of Chakri dynasty.
- Return type
- Example
from pythainlp.util import now_reign_year
text = "เป็นปีที่ {reign_year} ในรัชกาลปัจจุบัน".format(reign_year=now_reign_year())
print(text)
# output (when run in 2019): เป็นปีที่ 4 ในรัชกาลปัจจุบัน
-
pythainlp.util.
num_to_thaiword
(number: int) → str[source]¶ This function converts a number to Thai text.
- Parameters
number (int) – an integer number to be converted to Thai text
- Returns
text representing the number in Thai
- Return type
- Example
from pythainlp.util import num_to_thaiword
num_to_thaiword(1)
# output: หนึ่ง
num_to_thaiword(11)
# output: สิบเอ็ด
-
pythainlp.util.
rank
(words: List[str], exclude_stopwords: bool = False) → collections.Counter[source]¶ Count word frequency given a list of Thai words, with an option to exclude stopwords.
- Parameters
- Returns
a Counter object representing word frequency from the text
- Return type
- Example
Include stopwords in counting word frequency:
from pythainlp.util import rank
words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก", "เป็น", " ", "ลายลักษณ์อักษร"]
rank(words)
# output:
# Counter(
#     {
#         ' ': 2,
#         'การ': 1,
#         'บันทึก': 2,
#         'มี': 1,
#         'ลายลักษณ์อักษร': 1,
#         'เป็น': 1,
#         'เหตุการณ์': 1
#     })
Exclude stopwords in counting word frequency:
from pythainlp.util import rank
words = ["บันทึก", "เหตุการณ์", " ", "มี", "การ", "บันทึก", "เป็น", " ", "ลายลักษณ์อักษร"]
rank(words, exclude_stopwords=True)
# output:
# Counter(
#     {
#         ' ': 2,
#         'บันทึก': 2,
#         'ลายลักษณ์อักษร': 1,
#         'เหตุการณ์': 1
#     })
-
pythainlp.util.
reign_year_to_ad
(reign_year: int, reign: int) → int[source]¶ Convert reign year to AD.
Return AD year according to the reign year for the 7th to 10th King of Chakri dynasty, Thailand. For instance, the AD year of the 4th reign year of the 10th King is 2019.
- Parameters
- Returns
the year in AD of the King given the reign and reign year.
- Return type
- Example
from pythainlp.util import reign_year_to_ad
print("The 4th reign year of the King Rama X is in", reign_year_to_ad(4, 10))
# output: The 4th reign year of the King Rama X is in 2019
print("The 1st reign year of the King Rama IX is in", reign_year_to_ad(1, 9))
# output: The 1st reign year of the King Rama IX is in 1946
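The arithmetic behind the two examples above is a fixed offset per reign. The sketch below covers only the 9th and 10th reigns, with first-year values inferred from those examples; the name reign_year_to_ad_sketch is hypothetical and the function is not the library's implementation.

```python
# First AD year of each reign, inferred from the examples above:
# 1st year of Rama IX is 1946; 4th year of Rama X is 2019, so its 1st is 2016.
_FIRST_REIGN_YEAR_AD = {9: 1946, 10: 2016}


def reign_year_to_ad_sketch(reign_year: int, reign: int) -> int:
    """AD year = first AD year of the reign + (reign_year - 1)."""
    if reign not in _FIRST_REIGN_YEAR_AD:
        raise ValueError(f"unsupported reign in this sketch: {reign}")
    return _FIRST_REIGN_YEAR_AD[reign] + reign_year - 1


print(reign_year_to_ad_sketch(4, 10))  # 2019
print(reign_year_to_ad_sketch(1, 9))   # 1946
```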
-
pythainlp.util.
remove_dangling
(text: str) → str[source]¶ Remove Thai non-base characters at the beginning of text.
This is a common “typo”, especially for input fields in a form, as these non-base characters can be visually hidden from a user who may have accidentally typed them in.
A character to be removed should be both:
tone mark, above vowel, below vowel, or non-base sign AND
located at the beginning of the text
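The two conditions above map naturally onto a regular expression anchored at the start of the string. The sketch below uses an assumed set of non-base characters from the Unicode Thai block; the name strip_leading_non_base is hypothetical and the exact character set used by remove_dangling may differ.

```python
import re

# Assumed set of non-base characters in the Unicode Thai block:
#   U+0E31           Mai Han-Akat (above vowel)
#   U+0E34-U+0E3A    above and below vowels, Phinthu
#   U+0E47-U+0E4E    Maitaikhu, tone marks, and other signs
_LEADING_NON_BASE = re.compile("^[\u0e31\u0e34-\u0e3a\u0e47-\u0e4e]+")


def strip_leading_non_base(text: str) -> str:
    """Drop non-base characters only when they appear at the very start."""
    return _LEADING_NON_BASE.sub("", text)


# A tone mark before the first consonant is removed; later ones are kept.
print(strip_leading_non_base("\u0e49\u0e01\u0e49\u0e32"))
```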
-
pythainlp.util.
remove_dup_spaces
(text: str) → str[source]¶ Remove duplicate spaces. Replace multiple spaces with one space.
Multiple newline characters and empty lines will be replaced with one newline character.
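A minimal sketch of this behaviour with regular expressions follows. The name collapse_spaces is hypothetical, and details such as whether leading and trailing whitespace is trimmed are not specified here, so the sketch leaves them untouched.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces and runs of blank lines."""
    text = re.sub(r" {2,}", " ", text)            # runs of spaces -> one space
    text = re.sub(r"[ \n]*\n[ \n]*", "\n", text)  # blank lines -> one newline
    return text


print(repr(collapse_spaces("a   b\n\n \nc")))  # 'a b\nc'
```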
-
pythainlp.util.
remove_repeat_vowels
(text: str) → str[source]¶ Remove repeating vowels, tone marks, and signs.
This function will call reorder_vowels() first, to make sure that double Sara E will be converted to Sara Ae and not be removed.
-
pythainlp.util.
remove_tonemark
(text: str) → str[source]¶ Remove all Thai tone marks from the text.
Thai script has four tone marks indicating four tones as follows:
Down tone (Thai: ไม้เอก _่ )
Falling tone (Thai: ไม้โท _้ )
High tone (Thai: ไม้ตรี _๊ )
Rising tone (Thai: ไม้จัตวา _๋ )
Using the wrong tone mark is a common mistake in Thai writing. Removing tone marks from a string allows it to be used for approximate string matching.
from pythainlp.util import remove_tonemark
remove_tonemark('สองพันหนึ่งร้อยสี่สิบเจ็ดล้านสี่แสนแปดหมื่นสามพันหกร้อยสี่สิบเจ็ด')
# output: สองพันหนึงรอยสีสิบเจ็ดลานสีแสนแปดหมืนสามพันหกรอยสีสิบเจ็ด
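To make the approximate-matching idea concrete, the sketch below strips the four tone marks (U+0E48 to U+0E4B) and compares the results. The names strip_tone_marks and loosely_equal are hypothetical helpers, not pythainlp functions.

```python
# The four Thai tone marks occupy U+0E48-U+0E4B in Unicode.
_TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"
_DROP_TONES = str.maketrans("", "", _TONE_MARKS)


def strip_tone_marks(text: str) -> str:
    """Delete tone marks; every other character is kept."""
    return text.translate(_DROP_TONES)


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring tone marks."""
    return strip_tone_marks(a) == strip_tone_marks(b)


# Same consonant and vowel, different tone marks.
print(loosely_equal("\u0e01\u0e49\u0e32", "\u0e01\u0e48\u0e32"))  # True
```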
-
pythainlp.util.
remove_zw
(text: str) → str[source]¶ Remove zero-width characters.
These non-visible characters may cause unexpected results from the user’s point of view. Removing them can make string matching more robust.
Characters to be removed:
Zero-width space (ZWSP)
Zero-width non-joiner (ZWNJ)
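A short sketch of why this helps string matching: two strings that look identical on screen can compare unequal until the zero-width characters are removed. The name strip_zero_width is hypothetical; the code points used are U+200B (ZWSP) and U+200C (ZWNJ).

```python
# U+200B is ZERO WIDTH SPACE; U+200C is ZERO WIDTH NON-JOINER.
_ZERO_WIDTH = str.maketrans("", "", "\u200b\u200c")


def strip_zero_width(text: str) -> str:
    """Delete ZWSP and ZWNJ; every other character is kept."""
    return text.translate(_ZERO_WIDTH)


visible = "abc"
hidden = "a\u200bb\u200cc"  # renders the same as "abc"
print(visible == hidden)                    # False
print(visible == strip_zero_width(hidden))  # True
```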
-
pythainlp.util.
reorder_vowels
(text: str) → str[source]¶ Reorder vowels and tone marks to the standard logical order/spelling.
Characters in input text will be reordered/transformed, according to these rules:
Sara E + Sara E -> Sara Ae
Nikhahit + Sara Aa -> Sara Am
tone mark + non-base vowel -> non-base vowel + tone mark
follow vowel + tone mark -> tone mark + follow vowel
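The four rules above can be sketched as two string replacements followed by two regular-expression swaps. The character sets for tone marks, non-base vowels, and follow vowels below are assumptions chosen for illustration, and reorder_sketch is a hypothetical name; the library may use different sets.

```python
import re

_TONES = "\u0e48-\u0e4b"                  # the four tone marks
_NON_BASE_VOWELS = "\u0e31\u0e34-\u0e39"  # above and below vowels (assumed set)
_FOLLOW_VOWELS = "\u0e30\u0e32\u0e33"     # Sara A, Sara Aa, Sara Am (assumed set)


def reorder_sketch(text: str) -> str:
    """Apply the four reordering rules in the order listed above."""
    text = text.replace("\u0e40\u0e40", "\u0e41")  # Sara E + Sara E -> Sara Ae
    text = text.replace("\u0e4d\u0e32", "\u0e33")  # Nikhahit + Sara Aa -> Sara Am
    # tone mark + non-base vowel -> non-base vowel + tone mark
    text = re.sub(f"([{_TONES}]+)([{_NON_BASE_VOWELS}]+)", r"\2\1", text)
    # follow vowel + tone mark -> tone mark + follow vowel
    text = re.sub(f"([{_FOLLOW_VOWELS}]+)([{_TONES}]+)", r"\2\1", text)
    return text


# Two Sara E followed by three consonants become Sara Ae plus the consonants.
print(reorder_sketch("\u0e40\u0e40\u0e1b\u0e25\u0e01"))
```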
-
pythainlp.util.
text_to_arabic_digit
(text: str) → str[source]¶ This function converts Thai spelled-out digits to Arabic digits.
- Parameters
text – A digit spelled out in Thai
- Returns
An Arabic digit such as ‘1’, ‘2’, ‘3’ if the text is Thai digit spelled out (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.
- Return type
- Example
from pythainlp.util import text_to_arabic_digit
text_to_arabic_digit("ศูนย์")
# output: 0
text_to_arabic_digit("หนึ่ง")
# output: 1
text_to_arabic_digit("แปด")
# output: 8
text_to_arabic_digit("เก้า")
# output: 9
# For text that is not a spelled-out Thai digit
text_to_arabic_digit("สิบ") == ""
# output: True
text_to_arabic_digit("เก้าร้อย") == ""
# output: True
-
pythainlp.util.
text_to_thai_digit
(text: str) → str[source]¶ This function converts Thai spelled-out digits to Thai digits.
- Parameters
text – A digit spelled out in Thai
- Returns
A Thai digit such as ‘๑’, ‘๒’, ‘๓’ if the text is Thai digit spelled out (ศูนย์, หนึ่ง, สอง, …, เก้า). Otherwise, it returns an empty string.
- Return type
- Example
from pythainlp.util import text_to_thai_digit
text_to_thai_digit("ศูนย์")
# output: ๐
text_to_thai_digit("หนึ่ง")
# output: ๑
text_to_thai_digit("แปด")
# output: ๘
text_to_thai_digit("เก้า")
# output: ๙
# For text that is not a spelled-out Thai digit
text_to_thai_digit("สิบ") == ""
# output: True
text_to_thai_digit("เก้าร้อย") == ""
# output: True
-
pythainlp.util.
thai_strftime
(dt_obj: datetime.datetime, fmt: str = '%-d %b %y', thaidigit: bool = False) → str[source]¶ Convert datetime.datetime into Thai date and time format.
The formatting directives are similar to datetime.strftime().
This function uses Thai names and the Thai Buddhist Era for these directives:
%a - abbreviated weekday name (e.g. “จ”, “อ”, “พ”, “พฤ”, “ศ”, “ส”, “อา”)
%A - full weekday name (e.g. “วันจันทร์”, “วันอังคาร”, “วันเสาร์”, “วันอาทิตย์”)
%b - abbreviated month name (e.g. “ม.ค.”, “ก.พ.”, “มี.ค.”, “เม.ย.”, “พ.ค.”, “มิ.ย.”, “ธ.ค.”)
%B - full month name (e.g. “มกราคม”, “กุมภาพันธ์”, “พฤศจิกายน”, “ธันวาคม”)
%y - year without century (e.g. “56”, “10”)
%Y - year with century (e.g. “2556”, “2410”)
%c - date and time representation (e.g. “พ 6 ต.ค. 01:40:00 2519”)
%v - short date representation (e.g. “ 6-ม.ค.-2562”, “27-ก.พ.-2555”)
Other directives will be passed to datetime.strftime()
- Note
The Thai Buddhist Era (BE) year is simply converted from AD by adding 543. This is not accurate for years before 1941 AD, due to the change in Thai New Year’s Day.
This is meant to be an interim solution, since the locale module in the Python standard library (which relies on C’s strftime()) does not support the “th” or “th_TH” locale yet. If it were supported, we could just call locale.setlocale(locale.LC_TIME, "th_TH") and then use the native datetime.strftime().
We are trying to make this platform-independent and to support as many extensions as possible. See these links for strftime() extensions in POSIX, BSD, and GNU libc:
Python https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
JavaScript’s implementation https://github.com/samsonjs/strftime
strftime() quick reference http://www.strftime.net/
- Parameters
- Returns
Date and time text, with the month as a Thai name and the year in the Thai Buddhist Era. The year is simply converted from AD by adding 543 (this is not accurate for years before 1941 AD, due to the change in Thai New Year’s Day).
- Return type
- Example
from datetime import datetime
from pythainlp.util import thai_strftime
datetime_obj = datetime(year=2019, month=6, day=9, hour=5, minute=59, second=0, microsecond=0)
print(datetime_obj)
# output: 2019-06-09 05:59:00
thai_strftime(datetime_obj, "%A %d %B %Y")
# output: 'วันอาทิตย์ 09 มิถุนายน 2562'
thai_strftime(datetime_obj, "%a %-d %b %y")  # no padding
# output: 'อา 9 มิ.ย. 62'
thai_strftime(datetime_obj, "%a %_d %b %y")  # space padding
# output: 'อา  9 มิ.ย. 62'
thai_strftime(datetime_obj, "%a %0d %b %y")  # zero padding
# output: 'อา 09 มิ.ย. 62'
thai_strftime(datetime_obj, "%-H นาฬิกา %-M นาที", thaidigit=True)
# output: '๕ นาฬิกา ๕๙ นาที'
thai_strftime(datetime_obj, "%D (%v)")
# output: '06/09/62 ( 9-มิ.ย.-2562)'
thai_strftime(datetime_obj, "%c")
# output: 'อา 9 มิ.ย. 05:59:00 2562'
thai_strftime(datetime_obj, "%H:%M %p")
# output: '05:59 AM'
thai_strftime(datetime_obj, "%H:%M %#p")
# output: '05:59 am'
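The Buddhist Era conversion described in the note for this function is a fixed offset of 543 years. The sketch below shows only that arithmetic; be_year and be_year_short are hypothetical helper names, and, as the note says, the result is not accurate for years before 1941 AD.

```python
from datetime import datetime

BE_OFFSET = 543  # Buddhist Era year = AD year + 543


def be_year(dt: datetime) -> int:
    """Year with century in the Thai Buddhist Era (compare %Y)."""
    return dt.year + BE_OFFSET


def be_year_short(dt: datetime) -> str:
    """Two-digit Buddhist Era year (compare %y)."""
    return f"{be_year(dt) % 100:02d}"


dt = datetime(year=2019, month=6, day=9)
print(be_year(dt))        # 2562
print(be_year_short(dt))  # 62
```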
-
pythainlp.util.
thai_to_eng
(text: str) → str[source]¶ Corrects text that was mistakenly typed with the Thai Kedmanee keyboard layout when the English-US QWERTY layout was intended.
- Parameters
text (str) – incorrect text input (type English with Thai keyboard)
- Returns
English text where incorrect typing with a keyboard layout is corrected
- Return type
- Example
Intending to type “Bank of Thailand”, but typed “ฺฟืา นด ธ้ฟรสฟืก” instead:
from pythainlp.util import thai_to_eng
thai_to_eng("ฺฟืา นด ธ้ฟรสฟืก")
# output: 'Bank of Thailand'
-
pythainlp.util.
thai_digit_to_arabic_digit
(text: str) → str[source]¶ This function converts Thai digits (e.g. ๑, ๓, ๑๐) to Arabic digits (e.g. 1, 3, 10).
- Parameters
text (str) – Text with Thai digits such as ‘๑’, ‘๒’, ‘๓’
- Returns
Text with Thai digits being converted to Arabic digits such as ‘1’, ‘2’, ‘3’
- Return type
- Example
from pythainlp.util import thai_digit_to_arabic_digit
text = 'เป็นจำนวน ๑๒๓,๔๐๐.๒๕ บาท'
thai_digit_to_arabic_digit(text)
# output: เป็นจำนวน 123,400.25 บาท
-
pythainlp.util.
thaiword_to_date
(text: str, date: Optional[datetime.datetime] = None) → Optional[datetime.datetime][source]¶ Convert a Thai relative date to datetime.datetime.
- Parameters
text (str) – Thai text containing a relative date
date (datetime.datetime) – date (default is datetime.datetime.now())
- Returns
datetime object, if it can be calculated. Otherwise, None.
- Return type
- Example
from pythainlp.util import thaiword_to_date
thaiword_to_date("พรุ่งนี้")
# output: datetime of tomorrow
-
pythainlp.util.
thaiword_to_num
(word: str) → int[source]¶ Converts the spelled-out numerals in Thai scripts into an actual integer.
- Parameters
word (str) – Spelled-out numerals in Thai scripts
- Returns
Corresponding integer value of the input
- Return type
- Example
from pythainlp.util import thaiword_to_num
thaiword_to_num("ศูนย์")
# output: 0
thaiword_to_num("สองล้านสามแสนหกร้อยสิบสอง")
# output: 2300612
-
pythainlp.util.
thaiword_to_time
(text: str, padding: bool = True) → str[source]¶ Convert Thai time in words into time (H:M).
-
pythainlp.util.
time_to_thaiword
(time_data: Union[datetime.time, datetime.datetime, str], fmt: str = '24h', precision: Optional[str] = None) → str[source]¶ Spell out time to Thai words.
- Parameters
time_data (str) – time input, can be a datetime.time object or a datetime.datetime object or a string (in H:M or H:M:S format, using 24-hour clock)
fmt (str) – time output format:
* 24h - 24-hour clock (default)
* 6h - 6-hour clock
* m6h - Modified 6-hour clock
precision (str) – precision of the spelled-out time:
* m - always spell out to minute level
* s - always spell out to second level
* None - spell out only non-zero parts
- Returns
Time spelled out in Thai words
- Return type
- Example
import datetime
from pythainlp.util import time_to_thaiword
time_to_thaiword("8:17")
# output: แปดนาฬิกาสิบเจ็ดนาที
time_to_thaiword("8:17", "6h")
# output: สองโมงเช้าสิบเจ็ดนาที
time_to_thaiword("8:17", "m6h")
# output: แปดโมงสิบเจ็ดนาที
time_to_thaiword("18:30", fmt="m6h")
# output: หกโมงครึ่ง
time_to_thaiword(datetime.time(12, 3, 0))
# output: สิบสองนาฬิกาสามนาที
time_to_thaiword(datetime.time(12, 3, 0), precision="s")
# output: สิบสองนาฬิกาสามนาทีศูนย์วินาที
-
class
pythainlp.util.
Trie
(words: Iterable[str])[source]¶ -
add
(word: str) → None[source]¶ Add a word to the trie. Spaces in front of and following the word will be removed.
- Parameters
word (str) – a word
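For readers unfamiliar with the data structure, the sketch below shows a minimal trie built from nested dictionaries, with an add() that strips surrounding spaces as described above. SimpleTrie is a hypothetical class written for illustration; it is not pythainlp.util.Trie and its internals are assumptions.

```python
from typing import Dict, Iterable


class SimpleTrie:
    """Minimal trie: each node is a dict mapping a character to a child node."""

    _END = ""  # end-of-word marker; can never clash with a one-character key

    def __init__(self, words: Iterable[str] = ()) -> None:
        self._root: Dict = {}
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        """Add a word, ignoring whitespace before and after it."""
        node = self._root
        for ch in word.strip():
            node = node.setdefault(ch, {})  # create the child if missing
        node[self._END] = True

    def __contains__(self, word: str) -> bool:
        node = self._root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return self._END in node


trie = SimpleTrie(["กา", " กาล "])
print("กาล" in trie)  # True
print("ก" in trie)    # False, "ก" is only a prefix
```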
-