pythainlp.transliterate
The pythainlp.transliterate module is dedicated to the transliteration of Thai text into romanized form, that is, spelling it out with the Latin alphabet. This makes Thai text more accessible to readers who do not read Thai script and is useful in a variety of language processing tasks.
Modules
- pythainlp.transliterate.romanize(text: str, engine: str = 'royin', fallback_engine: str = 'royin') → str
This function renders Thai words in the Latin alphabet (“romanization”), by default using the Royal Thai General System of Transcription (RTGS) [1]. RTGS is the official system published by the Royal Institute of Thailand. (Thai: ถอดเสียงภาษาไทยเป็นอักษรละติน)
- Parameters:
text (str) – Thai text to be romanized
engine (str) – One of ‘royin’ (default), ‘thai2rom’, ‘thai2rom_onnx’, ‘tltk’, and ‘lookup’. See the options for engines below.
fallback_engine (str) – If engine is ‘lookup’, fallback_engine is used for words that are not in the transliteration dictionary. It has no effect with other engines. Defaults to ‘royin’.
- Returns:
A string of Thai words rendered in the Latin alphabet.
- Return type:
str
- Options for engines:
royin - (default) based on the Royal Thai General System of Transcription issued by the Royal Institute of Thailand.
thai2rom - a deep learning-based Thai romanization engine (requires PyTorch).
thai2rom_onnx - a deep learning-based Thai romanization engine that runs on ONNX Runtime.
tltk - TLTK: Thai Language Toolkit.
lookup - looks words up in the Thai-English Transliteration Dictionary v1.4 compiled by Wannaphong.
- Example:
from pythainlp.transliterate import romanize

romanize("สามารถ", engine="royin")
# output: 'samant'

romanize("สามารถ", engine="thai2rom")
# output: 'samat'

romanize("สามารถ", engine="tltk")
# output: 'samat'

romanize("ภาพยนตร์", engine="royin")
# output: 'phapn'

romanize("ภาพยนตร์", engine="thai2rom")
# output: 'phapphayon'

romanize("ภาพยนตร์", engine="thai2rom_onnx")
# output: 'phapphayon'

romanize("ก็อปปี้", engine="lookup")
# output: 'copy'
The romanize function allows you to transliterate Thai text, converting it into a phonetic representation using the English alphabet. It’s a fundamental tool for rendering Thai words and phrases in a more familiar format.
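The interaction between the ‘lookup’ engine and fallback_engine described above can be pictured with a small standalone sketch. Nothing here is PyThaiNLP code: TOY_DICT, toy_fallback, and romanize_with_fallback are hypothetical names, the dictionary holds a single entry taken from the example above, and joining results with spaces is a choice made for this sketch only.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the transliteration dictionary used by 'lookup'.
TOY_DICT: Dict[str, str] = {"ก็อปปี้": "copy"}


def toy_fallback(word: str) -> str:
    """Placeholder for a fallback engine such as 'royin' (not the real engine)."""
    return f"<fallback:{word}>"


def romanize_with_fallback(
    words: List[str],
    table: Dict[str, str] = TOY_DICT,
    fallback: Callable[[str], str] = toy_fallback,
) -> str:
    """Use the dictionary entry when one exists, otherwise call the fallback."""
    romanized = []
    for word in words:
        if word in table:
            romanized.append(table[word])      # dictionary hit
        else:
            romanized.append(fallback(word))   # word missing from the dictionary
    return " ".join(romanized)


print(romanize_with_fallback(list(TOY_DICT) + ["xyz"]))
```

With the real function, the equivalent choice is made through the fallback_engine argument, and it only matters when engine is ‘lookup’.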
- pythainlp.transliterate.transliterate(text: str, engine: str = 'thaig2p') → str
This function transliterates Thai text.
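- Parameters:
text (str) – Thai text to be transliterated
engine (str) – transliteration engine; defaults to ‘thaig2p’. See the options for engines below.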
- Returns:
A string of phonetic alphabets indicating how the input text should be pronounced.
- Return type:
str
- Options for engines:
thaig2p - (default) Thai Grapheme-to-Phoneme, output is IPA (requires PyTorch)
icu - pyicu, based on International Components for Unicode (ICU)
ipa - epitran, output is International Phonetic Alphabet (IPA)
tltk_g2p - Thai Grapheme-to-Phoneme from TLTK
iso_11940 - converts Thai text into Latin characters following ISO 11940
tltk_ipa - tltk, output is International Phonetic Alphabet (IPA)
- Example:
from pythainlp.transliterate import transliterate

transliterate("สามารถ", engine="icu")
# output: 's̄āmārt̄h'

transliterate("สามารถ", engine="ipa")
# output: 'saːmaːrot'

transliterate("สามารถ", engine="thaig2p")
# output: 's aː ˩˩˦ . m aː t̚ ˥˩'

transliterate("สามารถ", engine="tltk_ipa")
# output: 'saː5.maːt3'

transliterate("สามารถ", engine="tltk_g2p")
# output: 'saa4~maat2'

transliterate("สามารถ", engine="iso_11940")
# output: 's̄āmārt̄h'

transliterate("ภาพยนตร์", engine="icu")
# output: 'p̣hāphyntr̒'

transliterate("ภาพยนตร์", engine="ipa")
# output: 'pʰaːpjanot'

transliterate("ภาพยนตร์", engine="thaig2p")
# output: 'pʰ aː p̚ ˥˩ . pʰ a ˦˥ . j o n ˧'

transliterate("ภาพยนตร์", engine="iso_11940")
# output: 'p̣hāphyntr'
The transliterate function serves as a versatile transliteration tool, offering a range of transliteration engines to choose from. It provides flexibility and customization for your transliteration needs.
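The thaig2p outputs in the example above look like space-separated phonemes, with " . " between syllables and a tone contour as the last token of each syllable. Assuming that layout holds in general (it is inferred from the two examples only, not from a documented format), a string of this kind could be split into syllables as sketched below. Syllable and parse_thaig2p are hypothetical helper names, not part of PyThaiNLP.

```python
from typing import List, NamedTuple


class Syllable(NamedTuple):
    phonemes: List[str]  # IPA segments of the syllable
    tone: str            # tone contour, e.g. a run of tone letters


def parse_thaig2p(output: str) -> List[Syllable]:
    """Split a thaig2p-style string into syllables.

    Assumption: tokens are separated by whitespace, a lone "." marks a
    syllable boundary, and the last token of each syllable is its tone.
    """
    syllables: List[Syllable] = []
    current: List[str] = []
    for token in output.split():
        if token == ".":
            if current:
                syllables.append(Syllable(current[:-1], current[-1]))
                current = []
        else:
            current.append(token)
    if current:  # flush the final syllable
        syllables.append(Syllable(current[:-1], current[-1]))
    return syllables


for syllable in parse_thaig2p("s aː ˩˩˦ . m aː t̚ ˥˩"):
    print(syllable.phonemes, syllable.tone)
```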
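- pythainlp.transliterate.pronunciate(word: str, engine: str = 'w2p') → str
This function gives the pronunciation of a Thai word, spelled out in Thai letters.
- Parameters:
word (str) – Thai word whose pronunciation is wanted
engine (str) – pronunciation engine; defaults to ‘w2p’. See the options for engines below.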
- Returns:
A string of Thai letters indicating how the input text should be pronounced.
- Return type:
str
- Options for engines:
w2p - Thai Word-to-Phoneme
- Example:
from pythainlp.transliterate import pronunciate

pronunciate("สามารถ", engine="w2p")
# output: 'สา-มาด'

pronunciate("ภาพยนตร์", engine="w2p")
# output: 'พาบ-พะ-ยน'
This function provides assistance in generating phonetic representations of Thai words, which is particularly useful for language learning and pronunciation practice.
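- pythainlp.transliterate.puan(word: str, show_pronunciation: bool = True) → str
Thai Spoonerism
This function converts a Thai word into its spoonerism.
- Parameters:
word (str) – Thai word to be converted
show_pronunciation (bool) – if True (the default), the syllables of the result are separated by hyphens; if False, they are joined without a separator, as in the example below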
- Returns:
The spoonerism of the input word, as a string in Thai script.
- Return type:
str
- Example:
from pythainlp.transliterate import puan

puan("นาริน")
# output: 'นิน-รา'

puan("นาริน", False)
# output: 'นินรา'
The puan function produces Thai spoonerisms (“Puan”), a form of wordplay in which sounds are exchanged between the syllables of a word. Unlike the romanization functions above, its output is Thai text, not Latin characters.
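The example นาริน → นิน-รา (roughly na-rin → nin-ra) is consistent with exchanging the rimes (vowel plus final consonant) of the first and last syllables while the initial consonants stay in place. The sketch below illustrates that idea on romanized (onset, rime) pairs. It is a simplified illustration, not the PyThaiNLP implementation: swap_rimes is a hypothetical name, the input is already split into syllables, and tones and Thai spelling rules are ignored.

```python
from typing import List, Tuple

# A syllable as (onset, rime), romanized here purely for readability.
Syllable = Tuple[str, str]


def swap_rimes(syllables: List[Syllable]) -> List[Syllable]:
    """Exchange the rimes of the first and last syllables."""
    if len(syllables) < 2:
        return list(syllables)  # nothing to exchange
    first_onset, first_rime = syllables[0]
    last_onset, last_rime = syllables[-1]
    result = list(syllables)
    result[0] = (first_onset, last_rime)
    result[-1] = (last_onset, first_rime)
    return result


word = [("n", "a"), ("r", "in")]  # na-rin
print("-".join(onset + rime for onset, rime in swap_rimes(word)))  # prints: nin-ra
```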
- class pythainlp.transliterate.wunsen.WunsenTransliterate
Transliterates romanized Japanese, Korean, Mandarin, or Vietnamese text into Thai script using Wunsen.
- See Also:
The WunsenTransliterate class wraps the Wunsen transliteration engine. Unlike the functions above, it works in the opposite direction: it takes romanized foreign-language text and renders it in Thai script.
- transliterate(text: str, lang: str, jp_input: str | None = None, zh_sandhi: bool | None = None, system: str | None = None)
Use Wunsen for transliteration
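- Parameters:
text (str) – romanized text to be transliterated
lang (str) – source language; see the options for lang below
jp_input (str) – Japanese input style (Japanese only); see the options for jp_input below
zh_sandhi (bool) – whether to apply the third tone sandhi rule (Mandarin only)
system (str) – transliteration system (Japanese and Mandarin only); see the options for system below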
- Returns:
Thai text
- Return type:
- Options for lang:
jp - Japanese (from Hepburn romanization)
ko - Korean (from Revised Romanization)
vi - Vietnamese (Latin script)
zh - Mandarin (from Hanyu Pinyin)
- Options for jp_input:
Hepburn-no diacritic - Hepburn-no diacritic (without macron)
- Options for zh_sandhi:
True - apply third tone sandhi rule
False - do not apply third tone sandhi rule
- Options for system:
- ORS61 - for Japanese หลักเกณฑ์การทับศัพท์ภาษาญี่ปุ่น (สำนักงานราชบัณฑิตยสภา พ.ศ. 2561)
- RI35 - for Japanese หลักเกณฑ์การทับศัพท์ภาษาญี่ปุ่น (ราชบัณฑิตยสถาน พ.ศ. 2535)
- RI49 - for Mandarin หลักเกณฑ์การทับศัพท์ภาษาจีน (ราชบัณฑิตยสถาน พ.ศ. 2549)
- THC43 - for Mandarin เกณฑ์การถ่ายทอดเสียงภาษาจีนแมนดารินด้วยอักขรวิธีไทย (คณะกรรมการสืบค้นประวัติศาสตร์ไทยในเอกสารภาษาจีน พ.ศ. 2543)
- Example:
from pythainlp.transliterate.wunsen import WunsenTransliterate

wt = WunsenTransliterate()

wt.transliterate("ohayō", lang="jp")
# output: 'โอฮาโย'

wt.transliterate(
    "ohayou",
    lang="jp",
    jp_input="Hepburn-no diacritic"
)
# output: 'โอฮาโย'

wt.transliterate("ohayō", lang="jp", system="RI35")
# output: 'โอะฮะโย'

wt.transliterate("annyeonghaseyo", lang="ko")
# output: 'อันนย็องฮาเซโย'

wt.transliterate("xin chào", lang="vi")
# output: 'ซีน จ่าว'

wt.transliterate("ni3 hao3", lang="zh")
# output: 'หนี เห่า'

wt.transliterate("ni3 hao3", lang="zh", zh_sandhi=False)
# output: 'หนี่ เห่า'

wt.transliterate("ni3 hao3", lang="zh", system="RI49")
# output: 'หนี ห่าว'
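Because several options apply to one language only, it can help to check a combination before calling the engine. The sketch below encodes the option lists from this section in plain Python. build_wunsen_kwargs is a hypothetical helper, not part of PyThaiNLP or Wunsen, and rejecting mismatched options is this sketch's own policy; the library itself may handle such combinations differently.

```python
from typing import Any, Dict, Optional

# Option lists copied from the documentation above.
LANGS = {"jp", "ko", "vi", "zh"}
SYSTEMS_BY_LANG = {"jp": {"ORS61", "RI35"}, "zh": {"RI49", "THC43"}}


def build_wunsen_kwargs(
    lang: str,
    jp_input: Optional[str] = None,
    zh_sandhi: Optional[bool] = None,
    system: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate an option combination and return keyword arguments."""
    if lang not in LANGS:
        raise ValueError(f"unknown lang: {lang!r}")
    if jp_input is not None and lang != "jp":
        raise ValueError("jp_input only applies to lang='jp'")
    if zh_sandhi is not None and lang != "zh":
        raise ValueError("zh_sandhi only applies to lang='zh'")
    if system is not None and system not in SYSTEMS_BY_LANG.get(lang, set()):
        raise ValueError(f"system {system!r} is not listed for lang={lang!r}")

    kwargs: Dict[str, Any] = {"lang": lang}
    # Pass only the options that were set, leaving the rest at their defaults.
    if jp_input is not None:
        kwargs["jp_input"] = jp_input
    if zh_sandhi is not None:
        kwargs["zh_sandhi"] = zh_sandhi
    if system is not None:
        kwargs["system"] = system
    return kwargs


print(build_wunsen_kwargs("zh", zh_sandhi=False, system="RI49"))
```

The returned dictionary could then be unpacked into the call, for example wt.transliterate("ni3 hao3", **build_wunsen_kwargs("zh", system="RI49")).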
Transliteration Engines
thai2rom
The thai2rom engine is a deep learning-based romanization engine for Thai text. It requires PyTorch; the thai2rom_onnx variant runs on ONNX Runtime instead.
royin
Renders Thai words in the Latin alphabet using the Royal Thai General System of Transcription (RTGS), the official system of the Royal Institute of Thailand.
- Parameters:
text (str) – Thai text to be romanized
- Returns:
A string of Thai words rendered in the Latin alphabet
- Return type:
str
The royin engine romanizes Thai text according to RTGS and is the default engine for romanize.
Transliterate Engines
The transliterate function supports several engines, which differ in their output notation:
icu: Uses the ICU transliteration rules, through pyicu.
ipa: Produces an International Phonetic Alphabet (IPA) representation of Thai text, through epitran.
thaig2p: A Thai Grapheme-to-Phoneme (G2P) model; its output is IPA.
tltk: Uses TLTK (Thai Language Toolkit); available as tltk_g2p and tltk_ipa.
iso_11940: Converts Thai text into Latin characters following the ISO 11940 standard.
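Since some engines need optional dependencies (thaig2p requires PyTorch, for instance), a quick way to compare engines is to call each one and record failures without stopping. The helper below is a sketch under assumptions: compare_engines is a hypothetical name, the function to call is passed in so that the block runs without PyThaiNLP installed, and catching Exception broadly is deliberate because this page does not say which error an unavailable engine raises.

```python
from typing import Callable, Dict, Iterable


def compare_engines(
    func: Callable[..., str], text: str, engines: Iterable[str]
) -> Dict[str, str]:
    """Call func(text, engine=...) for each engine and collect the results."""
    results: Dict[str, str] = {}
    for engine in engines:
        try:
            results[engine] = func(text, engine=engine)
        except Exception as exc:  # broad on purpose: the error type is unknown
            results[engine] = f"unavailable ({type(exc).__name__})"
    return results


def fake_transliterate(text: str, engine: str = "a") -> str:
    """Stand-in for a real transliteration function, used for the demo only."""
    if engine == "needs_torch":
        raise ImportError("torch is not installed")
    return f"{engine}:{text}"


print(compare_engines(fake_transliterate, "abc", ["a", "needs_torch"]))
```

With PyThaiNLP installed, the same helper could be given the real function, for example compare_engines(transliterate, "สามารถ", ["icu", "ipa", "thaig2p"]).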
References
The pythainlp.transliterate module offers a comprehensive set of tools and engines for transliterating Thai text into Romanized form. Whether you need a simple transliteration, specific engines for accurate representation, or phonetic rendering, this module provides a wide range of options. Additionally, the module references a publication that highlights the significance of Romanization, Transliteration, and Transcription in making the Thai language accessible to a global audience.