pythainlp.spell
The pythainlp.spell module finds the closest correctly spelled word to the given text.
Modules
-
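pythainlp.spell.correct(word: str, engine: str = 'pn') → str
Corrects the spelling of the given word and returns the correctly spelled word.
- Parameters
word (str) – The word whose spelling is to be corrected
engine (str) – The spell checker engine to use (default = 'pn', Peter Norvig’s algorithm)
- Returns
the corrected word
- Return type
str
- Example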
from pythainlp.spell import correct

correct("เส้นตรบ")
# output: 'เส้นตรง'

correct("ครัช")
# output: 'ครับ'

correct("สังเกตุ")
# output: 'สังเกต'

correct("กระปิ")
# output: 'กะปิ'

correct("เหตการณ")
# output: 'เหตุการณ์'
-
pythainlp.spell.spell(word: str, engine: str = 'pn') → List[str]
Provides a list of possible correct spellings of the given word. The candidates are the words in the dictionary that are within an edit distance of 1 or 2 of the given word. The result is sorted by word frequency in the spelling dictionary, in descending order.
- Parameters
word (str) – The word whose spelling is to be checked
engine (str) – The spell checker engine to use (default = 'pn', Peter Norvig’s algorithm)
- Returns
list of possible correct words within an edit distance of 1 or 2, sorted by frequency of word occurrence in the spelling dictionary in descending order.
- Return type
List[str]
- Example
from pythainlp.spell import spell

spell("เส้นตรบ", engine="pn")
# output: ['เส้นตรง']

spell("เส้นตรบ")
# output: ['เส้นตรง']

spell("ครัช")
# output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน', 'วรัช', 'ครัส',
#          'ปรัช', 'บรัช', 'ครัง', 'คัช', 'คลัช', 'ครัย', 'ครัด']

spell("กระปิ")
# output: ['กะปิ', 'กระบิ']

spell("สังเกตุ")
# output: ['สังเกต']

spell("เหตการณ")
# output: ['เหตุการณ์']
-
class pythainlp.spell.NorvigSpellChecker(custom_dict: Optional[Union[Dict[str, int], Iterable[str], Iterable[Tuple[str, int]]]] = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: Optional[Callable[[str], bool]] = <function _is_thai_and_not_num>)
-
__init__(custom_dict: Optional[Union[Dict[str, int], Iterable[str], Iterable[Tuple[str, int]]]] = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: Optional[Callable[[str], bool]] = <function _is_thai_and_not_num>)
Initializes Peter Norvig’s spell checker object. The spelling dictionary can be customized. By default, the spelling dictionary is from the Thai National Corpus.
Norvig’s spell checker chooses the most likely spelling correction for a given word. It searches for candidate corrections based on edit distance, then selects the candidate with the highest word occurrence probability.
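The procedure described above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not PyThaiNLP's implementation: the Latin alphabet, the toy frequency table WORD_FREQS, and the helper names edits1, known, and correct are all assumptions made for this example.

```python
from typing import Dict, Iterable, Set

# Toy alphabet and made-up word frequencies, for illustration only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
WORD_FREQS: Dict[str, int] = {"spell": 50, "smell": 20, "shell": 10, "spelt": 5}
TOTAL = sum(WORD_FREQS.values())


def edits1(word: str) -> Set[str]:
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in ALPHABET
    ]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(words: Iterable[str]) -> Set[str]:
    """Keep only the words that appear in the toy dictionary."""
    return {w for w in words if w in WORD_FREQS}


def correct(word: str) -> str:
    """Pick the most probable candidate, preferring smaller edit distances."""
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}  # nothing found: fall back to the input itself
    )
    return max(candidates, key=lambda w: WORD_FREQS.get(w, 0) / TOTAL)
```

With this toy table, correct("sjell") has three candidates one edit away ("spell", "smell", "shell") and returns the most frequent one.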
- Parameters
custom_dict (dict or iterable) –
A custom spelling dictionary. This can be:
(1) a dictionary (dict), with words (str) as keys and frequencies (int) as values;
(2) an iterable (list, tuple, or set) of word (str) and frequency (int) tuples: (str, int); or
(3) an iterable of just words (str), without frequencies (in this case, a frequency of 1 is assigned to every word).
Default is from the Thai National Corpus (around 40,000 words).
min_freq (int) – Minimum frequency of a word to keep (default = 2)
min_len (int) – Minimum length (in characters) of a word to keep (default = 2)
max_len (int) – Maximum length (in characters) of a word to keep (default = 40)
dict_filter (func) – A function to filter the dictionary. The default filter removes any word that contains digits or non-Thai characters. If no filter is required, use None.
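To make the interaction of these parameters concrete, here is a rough sketch of how the three accepted dictionary forms could be normalized and filtered. The helper names build_dictionary and is_thai_without_digits are hypothetical and are not part of PyThaiNLP. The sketch also assumes that word-only entries are exempt from min_freq, so that the default threshold of 2 does not discard all of them; the library may handle this differently.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple, Union

CustomDict = Union[Dict[str, int], Iterable[str], Iterable[Tuple[str, int]]]


def is_thai_without_digits(word: str) -> bool:
    """Hypothetical stand-in for the default filter: Thai block only, no digits."""
    return bool(word) and all(
        "\u0e00" <= ch <= "\u0e7f" and not ch.isdigit() for ch in word
    )


def build_dictionary(
    custom_dict: CustomDict,
    min_freq: int = 2,
    min_len: int = 2,
    max_len: int = 40,
    dict_filter: Optional[Callable[[str], bool]] = is_thai_without_digits,
) -> Dict[str, int]:
    """Normalize any accepted form to {word: frequency}, then apply the limits."""
    if isinstance(custom_dict, dict):
        entries = [(word, freq, min_freq) for word, freq in custom_dict.items()]
    else:
        entries = []
        for item in custom_dict:
            if isinstance(item, str):
                # Word without a frequency: assign 1 and (assumption) skip min_freq.
                entries.append((item, 1, 1))
            else:
                word, freq = item
                entries.append((word, freq, min_freq))
    return {
        word: freq
        for word, freq, threshold in entries
        if freq >= threshold
        and min_len <= len(word) <= max_len
        and (dict_filter is None or dict_filter(word))
    }
```

Passing dict_filter=None in this sketch keeps non-Thai entries, which mirrors the documented way to disable filtering.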
-
correct(word: str) → str
Returns the most probable word, using the probabilities from the spelling dictionary.
- Parameters
word (str) – The word whose spelling is to be corrected
- Returns
the correct spelling of the given word
- Return type
str
- Example
from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.correct("ปัญชา")
# output: 'ปัญหา'

checker.correct("บิญชา")
# output: 'บัญชา'

checker.correct("มิตรภาบ")
# output: 'มิตรภาพ'
-
dictionary() → ItemsView[str, int]
Returns the spelling dictionary currently used by this spell checker.
- Example
from pythainlp.spell import NorvigSpellChecker

dictionary = [("หวาน", 30), ("มะนาว", 2), ("แอบ", 3223)]
checker = NorvigSpellChecker(custom_dict=dictionary)

checker.dictionary()
# output: dict_items([('หวาน', 30), ('มะนาว', 2), ('แอบ', 3223)])
-
freq(word: str) → int
Returns the frequency of an input word, according to the spelling dictionary.
- Parameters
word (str) – The word whose frequency is to be looked up
- Returns
frequency of the given word in the spelling dictionary
- Return type
int
- Example
from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.freq("ปัญญา")
# output: 3639

checker.freq("บิญชา")
# output: 0
-
known(words: Iterable[str]) → List[str]
Returns a list of the given words that are found in the spelling dictionary.
- Parameters
words (list[str]) – A list of words to check for in the spelling dictionary
- Returns
intersection of the given word list and the words in the spelling dictionary
- Return type
List[str]
- Example
from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.known(["เพยน", "เพล", "เพลง"])
# output: ['เพล', 'เพลง']

checker.known(['ยกไ', 'ไฟล์ม'])
# output: []

checker.known([])
# output: []
-
prob(word: str) → float
Returns the probability of an input word, according to the spelling dictionary.
- Parameters
word (str) – The word whose probability of occurrence is to be looked up
- Returns
word occurrence probability
- Return type
float
- Example
from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.prob("ครัช")
# output: 0.0

checker.prob("รัก")
# output: 0.0006959172792052158

checker.prob("น่ารัก")
# output: 9.482306849763902e-05
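One common way to obtain such a probability is the word's relative frequency: its count divided by the total count of all words in the dictionary. The sketch below assumes that definition; the documentation above does not spell out the exact formula, and the counts here are made up.

```python
from collections import Counter

# Made-up counts for illustration; a Counter returns 0 for unseen words.
word_freqs = Counter({"ครับ": 60, "รัก": 30, "น่ารัก": 10})
total = sum(word_freqs.values())


def freq(word: str) -> int:
    """Raw count of the word in the toy dictionary (0 if absent)."""
    return word_freqs[word]


def prob(word: str) -> float:
    """Relative frequency: the word's count divided by the total count."""
    return freq(word) / total
```

Under this definition, an unknown word such as "ครัช" has frequency 0 and therefore probability 0.0, which matches the shape of the outputs shown in the example above.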
-
spell(word: str) → List[str]
Returns a list of all correctly spelled words whose spelling is similar to the given word, as measured by edit distance. The returned list is sorted by word frequency in the spelling dictionary, in descending order.
First, if the input word is spelled correctly, this method returns a list containing only that word. Otherwise, it looks for all correctly spelled words within an edit distance of 1 from the input word. If there is no such word, the search expands to words within an edit distance of 2. If that still fails, a list containing only the input word is returned.
- Parameters
word (str) – The word whose spelling is to be checked
- Returns
list of possible correct words within an edit distance of 1 or 2, sorted by frequency of word occurrence in the spelling dictionary in descending order.
- Return type
List[str]
- Example
from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.spell("เส้นตรบ")
# output: ['เส้นตรง']

checker.spell("ครัช")
# output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน',
#          'วรัช', 'ครัส', 'ปรัช', 'บรัช', 'ครัง',
#          'คัช', 'คลัช', 'ครัย', 'ครัด']
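The fallback order described for this method (exact match, then distance 1, then distance 2, then the input itself) can be illustrated with a small stand-alone function. Note the swapped-in technique: instead of generating edits as Norvig's approach does, this sketch scans a toy dictionary with a plain Levenshtein distance, which counts a transposition as two edits. The frequencies are made up, and the function signature is an assumption for the example only.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, replace), row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete ch_a
                    current[j - 1] + 1,  # insert ch_b
                    previous[j - 1] + (ch_a != ch_b),  # replace or match
                )
            )
        previous = current
    return previous[-1]


def spell(word: str, word_freqs: Dict[str, int]) -> List[str]:
    """Return candidates from the nearest non-empty tier, most frequent first."""
    if word in word_freqs:
        return [word]  # already spelled correctly
    for distance in (1, 2):
        candidates = [w for w in word_freqs if edit_distance(word, w) == distance]
        if candidates:
            return sorted(candidates, key=lambda w: word_freqs[w], reverse=True)
    return [word]  # nothing within distance 2: return the input unchanged


# Made-up frequencies for illustration.
toy_freqs = {"ครับ": 60, "ครัว": 20, "เส้นตรง": 5}
suggestions = spell("ครัช", toy_freqs)
```

Here suggestions holds the two words one replacement away from "ครัช", ordered by their toy frequencies.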
-
pythainlp.spell.DEFAULT_SPELL_CHECKER
Default instance of the standard NorvigSpellChecker, using the word list from the Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/
References
- Peter Norvig (2007). How to Write a Spelling Corrector.