pythainlp.soundex
The pythainlp.soundex module provides soundex algorithms for the Thai language. Soundex is a phonetic algorithm used to encode words or names into a standardized representation based on their pronunciation, making it useful for tasks like name matching and search.
Modules
soundex
- pythainlp.soundex.soundex(text: str, engine: str = 'udom83', length: int = 4) -> str
This function converts Thai text into phonetic code.
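- Parameters:
text (str) – Thai word
engine (str) – soundex engine (default is 'udom83')
length (int) – preferred length of the Soundex code (default is 4), used by the engines that accept it (metasound and prayut_and_somchaip)
- Returns:
Soundex code
- Return type:
str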
- Options for engine:
udom83 (default) - Thai soundex algorithm proposed by Wannee Udompanich [2]
lk82 - Thai soundex algorithm proposed by Vichit Lorchirachoonkul [3]
metasound - Thai soundex algorithm based on a combination of Metaphone and Soundex proposed by Snae & Brückner [1]
prayut_and_somchaip - Thai-English Cross-Language Transliterated Word Retrieval using Soundex Technique [4]
complete_soundex - Complete Soundex for Thai Words Similarity Analysis [5]
- Example:
    from pythainlp.soundex import soundex

    soundex("ลัก"), soundex("ลัก", engine='lk82'), \
        soundex("ลัก", engine='metasound')
    # output: ('ร100000', 'ร1000', 'ล100')

    soundex("รัก"), soundex("รัก", engine='lk82'), \
        soundex("รัก", engine='metasound')
    # output: ('ร100000', 'ร1000', 'ร100')

    soundex("รักษ์"), soundex("รักษ์", engine='lk82'), \
        soundex("รักษ์", engine='metasound')
    # output: ('ร100000', 'ร1000', 'ร100')

    soundex("บูรณการ"), soundex("บูรณการ", engine='lk82'), \
        soundex("บูรณการ", engine='metasound')
    # output: ('บ931900', 'บE419', 'บ551')

    soundex("ปัจจุบัน"), soundex("ปัจจุบัน", engine='lk82'), \
        soundex("ปัจจุบัน", engine='metasound')
    # output: ('ป775300', 'ป3E54', 'ป223')

    soundex("vp", engine="prayut_and_somchaip")
    # output: '11'

    soundex("วีพี", engine="prayut_and_somchaip")
    # output: '11'

    soundex("ก้าน", engine="complete_soundex")
    # output: 'กก1Bน2-'

    soundex("ทราย", engine="complete_soundex")
    # output: 'ซซ1Bย0-'
The soundex function is the common entry point to the Thai Soundex algorithms in this module. It encodes a Thai word with the engine selected by the engine argument, so that words with similar pronunciation receive the same or similar codes and can be matched approximately.
lk82
- pythainlp.soundex.lk82(text: str) -> str
This function converts Thai text into phonetic code with the Thai soundex algorithm named LK82 [3].
- Parameters:
text (str) – Thai word
- Returns:
LK82 soundex of the given Thai word
- Return type:
str
- Example:
    from pythainlp.soundex import lk82

    lk82("ลัก")
    # output: 'ร1000'

    lk82("รัก")
    # output: 'ร1000'

    lk82("รักษ์")
    # output: 'ร1000'

    lk82("บูรณการ")
    # output: 'บE419'

    lk82("ปัจจุบัน")
    # output: 'ป3E54'
The lk82 module implements the Thai Soundex algorithm proposed by Vichit Lorchirachoonkul in 1982. This module is suitable for encoding Thai words into Soundex codes for phonetic comparisons.
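As a rough illustration of what "phonetic comparison" can look like in practice, the sketch below buckets words that share a code. The helper `group_by_code` and the table `DOCUMENTED_LK82` are hypothetical names, not part of PyThaiNLP. The table is only a stand-in encoder built from the LK82 codes quoted in the example above, so the sketch runs without the library installed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_code(
    words: Iterable[str], encode: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Bucket words under the phonetic code that `encode` assigns to them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[encode(word)].append(word)
    return dict(groups)


# Stand-in encoder (assumption): the LK82 codes quoted in the example above.
DOCUMENTED_LK82 = {
    "ลัก": "ร1000",
    "รัก": "ร1000",
    "รักษ์": "ร1000",
    "บูรณการ": "บE419",
    "ปัจจุบัน": "ป3E54",
}

groups = group_by_code(DOCUMENTED_LK82, DOCUMENTED_LK82.__getitem__)
print(groups["ร1000"])  # ['ลัก', 'รัก', 'รักษ์']
```

With PyThaiNLP installed, passing `lk82` itself as `encode` should give the same grouping for these words.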
udom83
- pythainlp.soundex.udom83(text: str) -> str
This function converts Thai text into phonetic code with the Thai soundex algorithm named Udom83 [2].
- Example:
    from pythainlp.soundex import udom83

    udom83("ลัก")
    # output: 'ร100000'

    udom83("รัก")
    # output: 'ร100000'

    udom83("รักษ์")
    # output: 'ร100000'

    udom83("บูรณการ")
    # output: 'บ931900'

    udom83("ปัจจุบัน")
    # output: 'ป775300'
The udom83 module is based on a homonymic approach for sound-alike string search. It encodes Thai words using the Wannee Udompanich Soundex algorithm developed in 1983.
metasound
- pythainlp.soundex.metasound(text: str, length: int = 4) -> str
This function converts Thai text into phonetic code with the matching technique called MetaSound [1] (combination between Soundex and Metaphone algorithms). MetaSound algorithm was developed specifically for the Thai language.
- Parameters:
text (str) – Thai text
length (int) – preferred length of the MetaSound code (default is 4)
- Returns:
MetaSound for the given text
- Return type:
str
- Example:
    from pythainlp.soundex.metasound import metasound

    metasound("ลัก")
    # output: 'ล100'

    metasound("รัก")
    # output: 'ร100'

    metasound("รักษ์")
    # output: 'ร100'

    metasound("บูรณการ", 5)
    # output: 'บ5515'

    metasound("บูรณการ", 6)
    # output: 'บ55150'

    metasound("บูรณการ", 4)
    # output: 'บ551'
The metasound module implements MetaSound, the phonetic name matching algorithm that Snae & Brückner [1] proposed together with a statistical ontology for analyzing names given in accordance with Thai astrology. It is intended for phonetic matching of Thai names.
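The example outputs for "บูรณการ" suggest that the length argument either truncates the code or right-pads it with zeros. The sketch below reproduces only that observed behaviour on an existing code. `fit_to_length` is a hypothetical helper, and this is an inference from the documented outputs rather than the library's implementation.

```python
def fit_to_length(code: str, length: int = 4) -> str:
    """Truncate `code` to `length` characters, or right-pad it with '0'."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return code[:length].ljust(length, "0")


full_code = "บ5515"  # metasound("บูรณการ", 5), as quoted in the example above

print(fit_to_length(full_code, 4))  # บ551
print(fit_to_length(full_code, 6))  # บ55150
```

When comparing MetaSound codes, it is probably safest to generate both codes with the same length so that padding does not affect the comparison.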
prayut_and_somchaip
- pythainlp.soundex.prayut_and_somchaip(text: str, length: int = 4) -> str
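This function converts an English or Thai transliterated word into a phonetic code using the cross-language Soundex technique of [4], so that a word and its transliteration can receive the same code.
- Parameters:
text (str) – English or Thai text
length (int) – preferred length of the Soundex code (default is 4)
- Returns:
Soundex for the given text
- Return type:
str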
- Example:
    from pythainlp.soundex.prayut_and_somchaip import prayut_and_somchaip

    prayut_and_somchaip("king", 2)
    # output: '52'

    prayut_and_somchaip("คิง", 2)
    # output: '52'
The prayut_and_somchaip module is designed for Thai-English cross-language transliterated word retrieval using the Soundex technique. It is particularly useful for matching transliterated words in both languages.
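Cross-language retrieval with such codes can be sketched as an index keyed by code: Thai words are indexed once, and an English query is answered by looking up its own code. `build_index`, `lookup`, and `documented_code` are hypothetical names. `documented_code` is a stand-in encoder that only knows the four codes quoted on this page, so the sketch runs without PyThaiNLP.

```python
from typing import Callable, Dict, Iterable, List

Encoder = Callable[[str], str]


def build_index(words: Iterable[str], encode: Encoder) -> Dict[str, List[str]]:
    """Map each phonetic code to the words that produce it."""
    index: Dict[str, List[str]] = {}
    for word in words:
        index.setdefault(encode(word), []).append(word)
    return index


def lookup(query: str, index: Dict[str, List[str]], encode: Encoder) -> List[str]:
    """Return the indexed words whose code equals the query's code."""
    return index.get(encode(query), [])


# Stand-in encoder (assumption): codes copied from the examples on this page.
_DOCUMENTED = {"king": "52", "คิง": "52", "vp": "11", "วีพี": "11"}


def documented_code(word: str) -> str:
    return _DOCUMENTED[word]


index = build_index(["คิง", "วีพี"], documented_code)
print(lookup("king", index, documented_code))  # ['คิง']
print(lookup("vp", index, documented_code))    # ['วีพี']
```

In real use, `documented_code` would be replaced by a call to `prayut_and_somchaip` with one fixed `length` for both indexing and querying.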
complete_soundex
- pythainlp.soundex.complete_soundex(text: str) -> str
Convert a Thai word into phonetic code using the Complete Soundex algorithm.
This function handles both single and multi-syllable words by internally tokenizing multi-syllable words when the syllable_tokenize dependency is available.
- Example:
    from pythainlp.soundex import complete_soundex

    # Single syllable encoding
    complete_soundex("ก้าน")
    # output: 'กก1Bน2-'

    complete_soundex("กลับ")
    # output: 'กก1Aบ0ล'

    # Multi-syllable words (automatically tokenized)
    complete_soundex("ปุญญา")
    # output: 'ปป4G0น-ยย1B0--*'

    complete_soundex("สวรรค์")
    # output: 'ซศ1A-0-วว1Aน0-'

    complete_soundex("ปันนา")
    # output: 'ปป1A0น-นน1B0--'
The complete_soundex function implements the Complete Soundex algorithm for Thai word phonetic encoding based on Tapsai et al. (2020). Unlike traditional Soundex methods, it generates variable-length codes representing every syllable in a word.
Each syllable is encoded using a 7-character block structure:
Initial Consonant (2 chars) - Phonetic grouping
Vowel (2 chars) - Including length markers
Final Consonant (1 char) - Sonorant clustering
Tone (1 char) - Tone mark encoding
Cluster Symbol (1 char) - Second consonant in clusters
The algorithm handles complex Thai phonetic patterns including ทร transformation, รร special rules, cluster detection, and implicit vowels. Multi-syllable words are automatically tokenized and encoded. This soundex is particularly effective for handling misspelled words, character variations, and similar pronunciations.
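To make the 7-character block layout concrete, the sketch below splits a code into per-syllable blocks and names each field. `SyllableBlock` and `split_blocks` are hypothetical names. The field order is an assumption taken from the list above, and codes whose length is not a multiple of 7 are rejected rather than interpreted.

```python
from typing import List, NamedTuple

BLOCK_SIZE = 7  # 2 + 2 + 1 + 1 + 1 characters per syllable


class SyllableBlock(NamedTuple):
    initial: str  # initial consonant group (2 chars)
    vowel: str    # vowel code, including length marker (2 chars)
    final: str    # final consonant (1 char)
    tone: str     # tone (1 char)
    cluster: str  # second consonant of a cluster, or '-' (1 char)


def split_blocks(code: str) -> List[SyllableBlock]:
    """Split a Complete Soundex code into 7-character syllable blocks."""
    if not code or len(code) % BLOCK_SIZE != 0:
        raise ValueError(f"code length must be a positive multiple of {BLOCK_SIZE}")
    blocks = []
    for start in range(0, len(code), BLOCK_SIZE):
        b = code[start:start + BLOCK_SIZE]
        blocks.append(SyllableBlock(b[0:2], b[2:4], b[4], b[5], b[6]))
    return blocks


print(split_blocks("กก1Aบ0ล")[0].cluster)  # ล
print(len(split_blocks("ซศ1A-0-วว1Aน0-")))  # 2
```

Read this way, the cluster field of the documented code for "กลับ" holds ล, the second consonant of the กล cluster.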
complete_soundex_similarity
- pythainlp.soundex.complete_soundex_similarity(code1: str, code2: str) -> float
Calculate similarity between two Complete Soundex codes based on the character-wise comparison formula defined in Tapsai et al. (2020).
The similarity is calculated character-by-character using the formula: S(X,Y) = Sum(sim(c_xi, c_yi)) / max(len(X), len(Y))
Where sim(c_xi, c_yi) = 1 if characters match, else 0.
This implements Equation (1) from the paper (Section 3.3, page 55), which compares codes position-by-position rather than by syllable blocks.
- Parameters:
code1 (str) – first Complete Soundex code
code2 (str) – second Complete Soundex code
- Returns:
Similarity score between 0.0 and 1.0
- Return type:
float
- Example:
    from pythainlp.soundex import complete_soundex, complete_soundex_similarity

    # Encode two words
    code1 = complete_soundex("ข้มขืน")  # Bitter/Forced (with tone)
    code2 = complete_soundex("ขมขืน")  # Bitter (no tone)

    # Calculate similarity
    similarity = complete_soundex_similarity(code1, code2)
    # output: ~0.93 (13 matches out of 14 characters)

    # Perfect match
    code_a = complete_soundex("ก้าน")
    code_b = complete_soundex("ก้าน")
    complete_soundex_similarity(code_a, code_b)
    # output: 1.0

    # No match
    code_x = complete_soundex("ทราย")
    code_y = complete_soundex("น้ำ")
    complete_soundex_similarity(code_x, code_y)
    # output: 0.0 (completely different)
The complete_soundex_similarity function scores two Complete Soundex codes with the character-wise formula given above. Because the number of matching positions is divided by the length of the longer code, the result is a float between 0.0 (no position matches) and 1.0 (identical codes). This function is useful for finding phonetically similar Thai words and handling spelling variations.
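The position-by-position formula can be sketched in a few lines of plain Python. `positionwise_similarity` is a hypothetical name, and this is a re-implementation of the stated formula for illustration, not the library's source. Returning 0.0 for two empty codes is an assumption, since the formula is undefined in that case.

```python
def positionwise_similarity(code_x: str, code_y: str) -> float:
    """S(X, Y) = (number of positions with equal characters) / max(len(X), len(Y))."""
    longest = max(len(code_x), len(code_y))
    if longest == 0:
        return 0.0  # assumption: the formula is undefined for two empty codes
    # zip stops at the shorter code; unmatched trailing positions count as 0.
    matches = sum(1 for a, b in zip(code_x, code_y) if a == b)
    return matches / longest


# Codes copied from the complete_soundex examples: "ก้าน" and "กลับ".
print(positionwise_similarity("กก1Bน2-", "กก1Bน2-"))            # 1.0
print(round(positionwise_similarity("กก1Bน2-", "กก1Aบ0ล"), 2))  # 0.43
```

In the second call, only the first three positions agree, so the score is 3/7.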
pythainlp.soundex.sound.word_approximation
The pythainlp.soundex.sound.word_approximation module offers word approximation functionality. It allows users to find Thai words that are phonetically similar to a given word.
pythainlp.soundex.sound.audio_vector
The pythainlp.soundex.sound.audio_vector module provides audio vector functionality for Thai words. It allows users to work with audio vectors based on phonetic properties.
pythainlp.soundex.sound.word2audio
The pythainlp.soundex.sound.word2audio module is designed for converting Thai words to audio representations. It enables users to obtain audio vectors for Thai words, which can be used for various applications.