pythainlp.spell
The pythainlp.spell module finds the closest correctly spelled word to a given Thai word. It provides functions that correct spelling errors and functions that list candidate corrections.
Modules
correct
- pythainlp.spell.correct(word: str, engine: str = 'pn') -> str
Corrects the spelling of the given word by returning the correctly spelled word.
- Parameters:
word (str) – The word to correct
engine (str) – Name of the spell-checking engine (default = 'pn')
- Returns:
the corrected word
- Return type:
str
- Example:
    from pythainlp.spell import correct

    correct("เส้นตรบ")
    # output: 'เส้นตรง'

    correct("ครัช")
    # output: 'ครับ'

    correct("สังเกตุ")
    # output: 'สังเกต'

    correct("กระปิ")
    # output: 'กะปิ'

    correct("เหตการณ")
    # output: 'เหตุการณ์'
The correct function is designed to correct the spelling of a single Thai word. Given an input word, this function returns the closest correctly spelled word from the dictionary, making it valuable for spell-checking and text correction tasks.
correct_sent
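- pythainlp.spell.correct_sent(list_words: List[str], engine: str = 'pn') -> List[str]
Corrects the spelling of each word in the given list of words and returns the corrected list.
- Parameters:
list_words (List[str]) – A sentence that is already split into a list of words
engine (str) – Name of the spell-checking engine (default = 'pn')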
- Returns:
the corrected list of words in sentence
- Return type:
List[str]
- Example:
    from pythainlp.spell import correct_sent

    correct_sent(["เด็", "อินอร์เน็ต", "แรง"], engine='symspellpy')
    # output: ['เด็ก', 'อินเทอร์เน็ต', 'แรง']
The correct_sent function extends correct to an entire sentence. It takes a sentence that is already tokenized into a list of words, corrects each word, and returns the corrected list. This is useful for proofreading Thai text.
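The per-word behaviour described above can be sketched without the library. This is a minimal illustration, assuming the sentence-level function simply applies a single-word corrector to every token; `correct_tokens`, `toy_correct`, and `TOY_FIXES` are hypothetical names, and the toy lookup table stands in for a real engine.

```python
from typing import Callable, Dict, List

# Hypothetical lookup table standing in for a real single-word corrector.
TOY_FIXES: Dict[str, str] = {
    "เด็": "เด็ก",
    "อินอร์เน็ต": "อินเทอร์เน็ต",
}


def toy_correct(word: str) -> str:
    """Return the toy correction for a word, or the word itself if none is known."""
    return TOY_FIXES.get(word, word)


def correct_tokens(tokens: List[str], correct_word: Callable[[str], str]) -> List[str]:
    """Apply a single-word corrector to each token, keeping token order."""
    return [correct_word(token) for token in tokens]


print(correct_tokens(["เด็", "อินอร์เน็ต", "แรง"], toy_correct))
```

The output list always has the same length as the input list, because each token is corrected independently.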
spell
- pythainlp.spell.spell(word: str, engine: str = 'pn') -> List[str]
Provides a list of possible correct spellings of the given word. The candidates are the dictionary words within an edit distance of 1 or 2 from the given word. The result is sorted by word occurrences in the spelling dictionary, in descending order.
- Parameters:
word (str) – The word to check the spelling of
engine (str) – Name of the spell-checking engine (default = 'pn')
- Returns:
list of possible correct words within 1 or 2 edit distance and sorted by frequency of word occurrences in the spelling dictionary in descending order.
- Return type:
List[str]
- Example:
    from pythainlp.spell import spell

    spell("เส้นตรบ", engine="pn")
    # output: ['เส้นตรง']

    spell("เส้นตรบ")
    # output: ['เส้นตรง']

    spell("เส้นตรบ", engine="tltk")
    # output: ['เส้นตรง']

    spell("ครัช")
    # output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน', 'วรัช', 'ครัส',
    #          'ปรัช', 'บรัช', 'ครัง', 'คัช', 'คลัช', 'ครัย', 'ครัด']

    spell("กระปิ")
    # output: ['กะปิ', 'กระบิ']

    spell("สังเกตุ")
    # output: ['สังเกต']

    spell("เหตการณ")
    # output: ['เหตุการณ์']
The spell function lists candidate corrections for a single Thai word. Given an input word, it returns the dictionary words within an edit distance of 1 or 2, sorted by frequency in descending order. This is useful when you want to show or rank several suggestions instead of accepting a single correction.
spell_sent
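- pythainlp.spell.spell_sent(list_words: List[str], engine: str = 'pn') -> List[List[str]]
Provides a list of possible correct spellings of the given sentence.
- Parameters:
list_words (List[str]) – A sentence that is already split into a list of words
engine (str) – Name of the spell-checking engine (default = 'pn')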
- Returns:
list of possibly correct words
- Return type:
List[List[str]]
- Example:
    from pythainlp.spell import spell_sent

    spell_sent(["เด็", "อินอร์เน็ต", "แรง"], engine='symspellpy')
    # output: [['เด็ก', 'อินเทอร์เน็ต', 'แรง']]
The spell_sent function extends spell to a sentence that is already split into words. It takes a list of words and returns a list of lists of words (List[List[str]]). In the example above, the single inner list is one corrected version of the whole sentence.
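One plausible reading of the List[List[str]] result is that each inner list is a candidate sentence built from per-word candidates. The sketch below illustrates that reading with a Cartesian product. It is an assumption about how candidates could be combined, not something this page confirms; `candidate_sentences`, `toy_spell`, and `TOY_CANDIDATES` are hypothetical names.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical per-word candidate lists, taken from the spell examples on this page.
TOY_CANDIDATES: Dict[str, List[str]] = {
    "ครัช": ["ครับ", "ครัว"],
    "กระปิ": ["กะปิ", "กระบิ"],
}


def toy_spell(word: str) -> List[str]:
    """Return toy candidates for a word, or the word itself if none are known."""
    return TOY_CANDIDATES.get(word, [word])


def candidate_sentences(
    tokens: List[str], spell_word: Callable[[str], List[str]]
) -> List[List[str]]:
    """Combine per-word candidates into every possible candidate sentence."""
    per_word = [spell_word(token) for token in tokens]
    return [list(combo) for combo in product(*per_word)]


for sentence in candidate_sentences(["ครัช", "กระปิ"], toy_spell):
    print(sentence)
```

With one candidate per word, this produces exactly one inner list, which matches the shape of the example output above. The number of sentences grows multiplicatively with the number of candidates per word.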
NorvigSpellChecker
- class pythainlp.spell.NorvigSpellChecker(custom_dict: Dict[str, int] | Iterable[str] | Iterable[Tuple[str, int]] | None = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: Callable[[str], bool] | None = _is_thai_and_not_num)
- __init__(custom_dict: Dict[str, int] | Iterable[str] | Iterable[Tuple[str, int]] | None = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: Callable[[str], bool] | None = _is_thai_and_not_num)
Initializes Peter Norvig’s spell checker object. The spelling dictionary can be customized. By default, it comes from the Thai National Corpus.
Norvig’s spell checker searches for candidate corrections of a word based on edit distance, then selects the candidate with the highest word occurrence probability.
- Parameters:
custom_dict (dict or iterable) – A custom spelling dictionary. This can be:
(1) a dictionary (dict), with words (str) as keys and frequencies (int) as values;
(2) an iterable (list, tuple, or set) of word (str) and frequency (int) tuples: (str, int); or
(3) an iterable of just words (str), without frequencies – in this case a frequency of 1 is assigned to every word.
Default is from the Thai National Corpus (around 40,000 words).
min_freq (int) – Minimum frequency of a word to keep (default = 2)
min_len (int) – Minimum length (in characters) of a word to keep (default = 2)
max_len (int) – Maximum length (in characters) of a word to keep (default = 40)
dict_filter (func) – A function to filter the dictionary. Default filter removes any word with numbers or non-Thai characters. If no filter is required, use None.
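The parameters above can be read as a normalise-then-filter step. The following is a minimal sketch under stated assumptions, not the library's implementation: `build_dictionary` and `is_thai_word` are hypothetical names, the Thai check uses the Unicode block U+0E00 to U+0E7F, and `min_freq` is applied only to entries that came with an explicit frequency (whether the library treats words-only input the same way is an assumption).

```python
from typing import Callable, Dict, Iterable, Optional, Tuple, Union

CustomDict = Union[Dict[str, int], Iterable[str], Iterable[Tuple[str, int]]]


def is_thai_word(word: str) -> bool:
    """Hypothetical filter: every character is in the Thai block and none is a digit."""
    return bool(word) and all(
        "\u0e00" <= ch <= "\u0e7f" and not ch.isdigit() for ch in word
    )


def build_dictionary(
    custom_dict: CustomDict,
    min_freq: int = 2,
    min_len: int = 2,
    max_len: int = 40,
    dict_filter: Optional[Callable[[str], bool]] = is_thai_word,
) -> Dict[str, int]:
    """Normalise the three accepted input forms to {word: freq}, then filter."""
    if isinstance(custom_dict, dict):
        entries = [(word, freq, True) for word, freq in custom_dict.items()]
    else:
        entries = []
        for item in custom_dict:
            if isinstance(item, str):
                # Words without frequencies get a frequency of 1.
                entries.append((item, 1, False))
            else:
                word, freq = item
                entries.append((word, freq, True))

    kept: Dict[str, int] = {}
    for word, freq, has_explicit_freq in entries:
        if has_explicit_freq and freq < min_freq:
            continue  # too rare (assumption: only checked for explicit frequencies)
        if not (min_len <= len(word) <= max_len):
            continue  # too short or too long, counted in code points
        if dict_filter is not None and not dict_filter(word):
            continue  # rejected by the custom filter
        kept[word] = freq
    return kept


pairs = [("หวาน", 30), ("มะนาว", 2), ("แอบ", 3223), ("abc", 100), ("ก", 50), ("เก่า", 1)]
print(build_dictionary(pairs))
```

Passing `dict_filter=None` keeps non-Thai entries, which mirrors the "If no filter is required, use None" note above.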
- dictionary() -> ItemsView[str, int]
Returns the spelling dictionary currently used by this spell checker
- Example:
    from pythainlp.spell import NorvigSpellChecker

    dictionary = [("หวาน", 30), ("มะนาว", 2), ("แอบ", 3223)]

    checker = NorvigSpellChecker(custom_dict=dictionary)
    checker.dictionary()
    # output: dict_items([('หวาน', 30), ('มะนาว', 2), ('แอบ', 3223)])
- known(words: Iterable[str]) -> List[str]
Returns a list of given words found in the spelling dictionary
- Parameters:
words (list[str]) – A list of words to check if they exist in the spelling dictionary
- Returns:
intersection of the given word list and words in the spelling dictionary
- Return type:
List[str]
- Example:
    from pythainlp.spell import NorvigSpellChecker

    checker = NorvigSpellChecker()

    checker.known(["เพยน", "เพล", "เพลง"])
    # output: ['เพล', 'เพลง']

    checker.known(['ยกไ', 'ไฟล์ม'])
    # output: []

    checker.known([])
    # output: []
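The intersection described for known can be sketched as a membership test against the dictionary. This is an illustration only: `known_words` is a hypothetical helper, the frequencies in `toy_freqs` are made up, and preserving the input order is an assumption.

```python
from typing import Dict, Iterable, List


def known_words(words: Iterable[str], freqs: Dict[str, int]) -> List[str]:
    """Keep only the words that appear in the spelling dictionary, in input order."""
    return [word for word in words if word in freqs]


# Toy dictionary with made-up frequencies.
toy_freqs = {"เพล": 12, "เพลง": 340}

print(known_words(["เพยน", "เพล", "เพลง"], toy_freqs))
print(known_words([], toy_freqs))
```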
- prob(word: str) -> float
Returns the probability of an input word, according to the spelling dictionary
- Parameters:
word (str) – A word to check occurrence probability of
- Returns:
word occurrence probability
- Return type:
float
- Example:
    from pythainlp.spell import NorvigSpellChecker

    checker = NorvigSpellChecker()

    checker.prob("ครัช")
    # output: 0.0

    checker.prob("รัก")
    # output: 0.0006959172792052158

    checker.prob("น่ารัก")
    # output: 9.482306849763902e-05
- freq(word: str) -> int
Returns the frequency of an input word, according to the spelling dictionary
- Parameters:
word (str) – A word to check frequency of
- Returns:
frequency of the given word in the spelling dictionary
- Return type:
int
- Example:
    from pythainlp.spell import NorvigSpellChecker

    checker = NorvigSpellChecker()

    checker.freq("ปัญญา")
    # output: 3639

    checker.freq("บิญชา")
    # output: 0
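The relationship between prob and freq above can be illustrated on a small dictionary. The sketch assumes that a word's probability is its frequency divided by the sum of all frequencies in the dictionary, which is consistent with unknown words getting 0 and 0.0 in the examples but is not stated on this page. `word_freq` and `word_prob` are hypothetical helpers.

```python
from typing import Dict


def word_freq(word: str, freqs: Dict[str, int]) -> int:
    """Frequency of a word in the dictionary, or 0 if it is unknown."""
    return freqs.get(word, 0)


def word_prob(word: str, freqs: Dict[str, int]) -> float:
    """Assumed definition: frequency divided by the total of all frequencies."""
    total = sum(freqs.values())
    return word_freq(word, freqs) / total if total else 0.0


# The custom dictionary from the dictionary() example on this page.
freqs = {"หวาน": 30, "มะนาว": 2, "แอบ": 3223}

print(word_freq("หวาน", freqs), word_prob("หวาน", freqs))
print(word_freq("ครัช", freqs), word_prob("ครัช", freqs))
```

Under this definition the probabilities of all dictionary words sum to 1, so they can be compared directly when ranking candidates.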
- spell(word: str) -> List[str]
Returns a list of all correctly spelled words whose spelling is similar to the given word by edit distance metrics. The returned list is sorted in decreasing order of word frequency in the spelling dictionary.
First, if the input word is spelled correctly (it is found in the dictionary), this method returns a list containing only that word. Otherwise, it looks for all correctly spelled words at edit distance 1 from the input word. If there are none, the search expands to words at edit distance 2. If that also fails, a list containing only the input word is returned.
- Parameters:
word (str) – A word to check spelling of
- Returns:
list of possibly correct words within 1 or 2 edit distance and sorted by frequency of word occurrence in the spelling dictionary in descending order.
- Return type:
List[str]
- Example:
    from pythainlp.spell import NorvigSpellChecker

    checker = NorvigSpellChecker()

    checker.spell("เส้นตรบ")
    # output: ['เส้นตรง']

    checker.spell("ครัช")
    # output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน',
    #          'วรัช', 'ครัส', 'ปรัช', 'บรัช', 'ครัง',
    #          'คัช', 'คลัช', 'ครัย', 'ครัด']
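The fallback order described for spell (known word, then edit distance 1, then edit distance 2, then the input itself) can be sketched in the style of Peter Norvig's spelling corrector. This is a simplified illustration, not the library's code: `edits1` and `suggest` are hypothetical names, the frequencies are made up, edits are counted per Unicode code point, and the replacement alphabet is taken from the toy dictionary rather than from the full Thai alphabet.

```python
from typing import Dict, List, Set


def edits1(word: str, alphabet: str) -> Set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word: str, freqs: Dict[str, int]) -> List[str]:
    """Candidates for word, most frequent first, following the fallback order."""
    if word in freqs:
        return [word]
    # Assumption for this sketch: the alphabet is whatever the dictionary uses.
    alphabet = "".join(sorted({ch for entry in freqs for ch in entry}))
    one_edit = edits1(word, alphabet)
    candidates = {w for w in one_edit if w in freqs}
    if not candidates:
        candidates = {
            w2 for w1 in one_edit for w2 in edits1(w1, alphabet) if w2 in freqs
        }
    if not candidates:
        return [word]
    # Sort by descending frequency; break ties alphabetically for determinism.
    return sorted(candidates, key=lambda w: (-freqs[w], w))


# Toy dictionary with made-up frequencies.
toy_freqs = {"ครับ": 50, "ครัว": 20, "เส้นตรง": 5}

print(suggest("ครัช", toy_freqs))
print(suggest("เส้นตรบ", toy_freqs))
```

Because the search stops at the first distance that yields any known word, a rare word at distance 1 is preferred over a frequent word at distance 2.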
- correct(word: str) -> str
Returns the most likely word, using the probabilities from the spelling dictionary
- Parameters:
word (str) – A word to correct spelling of
- Returns:
the correct spelling of the given word
- Return type:
str
- Example:
    from pythainlp.spell import NorvigSpellChecker

    checker = NorvigSpellChecker()

    checker.correct("ปัญชา")
    # output: 'ปัญหา'

    checker.correct("บิญชา")
    # output: 'บัญชา'

    checker.correct("มิตรภาบ")
    # output: 'มิตรภาพ'
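Choosing "the most likely word" can be sketched as taking the candidate with the highest dictionary probability. The helper below is hypothetical (`most_likely` is not a library function), the frequencies are made up, and falling back to the input word when no candidate is known is an assumption.

```python
from typing import Dict, Iterable


def most_likely(candidates: Iterable[str], freqs: Dict[str, int], fallback: str) -> str:
    """Return the known candidate with the highest probability, else the fallback."""
    known = [word for word in candidates if word in freqs]
    if not known:
        return fallback
    total = sum(freqs.values())
    return max(known, key=lambda word: freqs[word] / total)


# Toy dictionary with made-up frequencies.
toy_freqs = {"ปัญหา": 500, "ปัญญา": 300, "บัญชา": 20}

print(most_likely(["ปัญญา", "ปัญหา"], toy_freqs, fallback="ปัญชา"))
```

Dividing by the total does not change the ranking within one dictionary; it is kept here only to mirror the probability wording of the method description.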
The NorvigSpellChecker class implements a spell-checking algorithm based on the work of Peter Norvig. Use it directly when you need a custom spelling dictionary or different filtering settings (min_freq, min_len, max_len, dict_filter) than the defaults.
DEFAULT_SPELL_CHECKER
- pythainlp.spell.DEFAULT_SPELL_CHECKER
Default instance of the standard NorvigSpellChecker, using word list data from the Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/
The DEFAULT_SPELL_CHECKER is an instance of the NorvigSpellChecker class with default settings. It is pre-configured to use word list data from the Thai National Corpus, making it a reliable choice for general spell-checking tasks.