pythainlp.phayathaibert
The pythainlp.phayathaibert module is built upon the phayathaibert base model.
Modules
- class pythainlp.phayathaibert.ThaiTextProcessor[source]
- replace_url(text: str) str [source]
Replace URLs in text with TK_URL (https://stackoverflow.com/a/6041965).
:param str text: text to replace URLs in
:return: text where URLs are replaced
:rtype: str
:Example:
>>> replace_url("go to https://github.com")
go to <url>
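To make the substitution concrete, here is a minimal sketch of URL replacement. The pattern is a deliberately simplified assumption (it only matches http/https links), not the Stack Overflow expression referenced above, and `replace_url_sketch` and `URL_TOKEN` are hypothetical names used only for illustration.

```python
import re

# Hypothetical placeholder; the documented example shows "<url>" as the replacement.
URL_TOKEN = "<url>"

# Simplified assumption: a URL is "http://" or "https://" followed by non-space characters.
_URL_RE = re.compile(r"https?://\S+")


def replace_url_sketch(text: str) -> str:
    """Replace every http(s) URL in text with URL_TOKEN."""
    return _URL_RE.sub(URL_TOKEN, text)


print(replace_url_sketch("go to https://github.com"))
```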
- rm_brackets(text: str) str [source]
Remove all empty brackets and artifacts within brackets from text.
:param str text: text to remove useless brackets
:return: text where all useless brackets are removed
:rtype: str
:Example:
>>> rm_brackets("hey() whats[;] up{*&} man(hey)")
hey whats up man(hey)
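One way to read "useless brackets" is: a bracket pair whose contents hold no letters or digits. The sketch below implements that interpretation with a single regular expression. The rule itself and the name `rm_brackets_sketch` are assumptions, not the library's implementation.

```python
import re

# Match (), [] or {} pairs whose contents contain no word characters
# (letters, digits, underscore) and no nested bracket of the same kind.
_USELESS_BRACKETS_RE = re.compile(
    r"\([^\w()]*\)|\[[^\w\[\]]*\]|\{[^\w{}]*\}"
)


def rm_brackets_sketch(text: str) -> str:
    """Delete bracket pairs that are empty or hold only punctuation."""
    return _USELESS_BRACKETS_RE.sub("", text)


print(rm_brackets_sketch("hey() whats[;] up{*&} man(hey)"))
```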
- replace_newlines(text: str) str [source]
Replace newlines in text with spaces.
:param str text: text to replace all newlines with spaces
:return: text where all newlines are replaced with spaces
:rtype: str
:Example:
>>> replace_newlines("hey whats\nup")
hey whats up
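A minimal sketch of newline replacement, assuming that a newline together with any whitespace that follows it should become a single space, and that leading and trailing whitespace is dropped first. `replace_newlines_sketch` is a hypothetical name.

```python
import re


def replace_newlines_sketch(text: str) -> str:
    """Turn each newline (plus any whitespace right after it) into one space."""
    return re.sub(r"\n\s*", " ", text.strip())


print(replace_newlines_sketch("hey whats\nup"))
```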
- rm_useless_spaces(text: str) str [source]
Collapse runs of multiple spaces in text into one. (code from fastai)
:param str text: text to replace useless spaces
:return: text where all runs of spaces are reduced to one space
:rtype: str
:Example:
>>> rm_useless_spaces("oh     no")
oh no
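The collapsing step can be sketched with one substitution. This version only touches runs of the space character (tabs and newlines are left alone), which is an assumption about the intended behavior. `rm_useless_spaces_sketch` is a hypothetical name.

```python
import re


def rm_useless_spaces_sketch(text: str) -> str:
    """Reduce every run of two or more spaces to a single space."""
    return re.sub(r" {2,}", " ", text)


print(rm_useless_spaces_sketch("oh     no"))
```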
- replace_spaces(text: str, space_token: str = '<_>') str [source]
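Replace spaces with space_token (default '<_>').
:param str text: text to replace spaces
:param str space_token: token substituted for each space
:return: text where all spaces are replaced with space_token
:rtype: str
:Example:
>>> replace_spaces("oh no")
oh<_>no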
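The effect of the `space_token` argument is easiest to see side by side. The sketch below assumes a plain one-for-one replacement of each space character. `replace_spaces_sketch` is a hypothetical name.

```python
def replace_spaces_sketch(text: str, space_token: str = "<_>") -> str:
    """Replace each space character with space_token."""
    return text.replace(" ", space_token)


print(replace_spaces_sketch("oh no"))                   # default token
print(replace_spaces_sketch("oh no", space_token="_"))  # custom token
```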
- replace_rep_after(text: str) str [source]
Replace repetitions at the character level in text.
:param str text: input text to replace character repetition
:return: text with repetitive tokens removed.
:rtype: str
:Example:
>>> text = "กาาาาาาา"
>>> replace_rep_after(text)
'กา'
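Character-level repetition can be collapsed with a backreference. The sketch below treats three or more consecutive copies of the same non-space character as a repetition. That threshold and the name `collapse_repeats` are assumptions chosen for illustration.

```python
import re

# (\S) captures one non-space character; \1{2,} requires at least two more copies,
# so only runs of three or more identical characters are collapsed.
_REPEAT_RE = re.compile(r"(\S)\1{2,}")


def collapse_repeats(text: str) -> str:
    """Collapse runs of 3+ identical characters down to a single character."""
    return _REPEAT_RE.sub(r"\1", text)


print(collapse_repeats("กาาาาาาา"))
```

A threshold of three keeps ordinary doubled letters (as in "see") untouched.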
- replace_wrep_post(toks: List[str]) List[str] [source]
Replace repetitive words post tokenization; fastai replace_wrep does not work well with Thai.
:param List[str] toks: list of tokens
:return: list of tokens where repetitive words are removed.
:rtype: List[str]
:Example:
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post(toks)
['กา', 'น้ำ']
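Working on tokens rather than raw text makes word-level deduplication straightforward. The sketch below drops consecutive duplicate tokens with `itertools.groupby`. Whether the library collapses every adjacent duplicate or only longer runs is not stated here, so treat this as one plausible reading. `dedupe_adjacent` is a hypothetical name.

```python
from itertools import groupby
from typing import List


def dedupe_adjacent(toks: List[str]) -> List[str]:
    """Keep one token from each run of identical consecutive tokens."""
    return [tok for tok, _run in groupby(toks)]


print(dedupe_adjacent(["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]))
```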
- remove_space(toks: List[str]) List[str] [source]
Do not include space for bag-of-word models.
:param List[str] toks: list of tokens
:return: list of tokens where space tokens (" ") are filtered out
:rtype: List[str]
:Example:
>>> toks = ["ฉัน", "เดิน", " ", "กลับ", "บ้าน"]
>>> remove_space(toks)
['ฉัน', 'เดิน', 'กลับ', 'บ้าน']
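Filtering space tokens is a one-line comprehension. The sketch below drops any token that is empty or consists only of whitespace, which is slightly broader than removing the exact token " " and is an assumption. `drop_space_tokens` is a hypothetical name.

```python
from typing import List


def drop_space_tokens(toks: List[str]) -> List[str]:
    """Remove tokens that are empty or contain only whitespace."""
    return [tok for tok in toks if tok.strip()]


print(drop_space_tokens(["ฉัน", "เดิน", " ", "กลับ", "บ้าน"]))
```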
- preprocess(text: str, pre_rules: List[Callable] = [ThaiTextProcessor.rm_brackets, ThaiTextProcessor.replace_newlines, ThaiTextProcessor.rm_useless_spaces, ThaiTextProcessor.replace_spaces, ThaiTextProcessor.replace_rep_after], tok_func: Callable = word_tokenize) str [source]
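The signature suggests a pipeline: each callable in `pre_rules` transforms the text in order, then `tok_func` tokenizes the result. The sketch below shows that pattern with standard-library stand-ins for the rules and tokenizer. How the library turns the tokens back into the returned string is not described here, so the sketch stops at the token list. `run_pipeline` is a hypothetical name.

```python
from typing import Callable, List


def run_pipeline(
    text: str,
    pre_rules: List[Callable[[str], str]],
    tok_func: Callable[[str], List[str]],
) -> List[str]:
    """Apply each text rule in order, then tokenize the cleaned text."""
    for rule in pre_rules:
        text = rule(text)
    return tok_func(text)


# Stand-in rules and tokenizer, used only to keep the example self-contained.
print(run_pipeline("  Hello World ", [str.strip, str.lower], str.split))
```

Passing the rules as a list keeps their order explicit, which matters when one rule depends on the output of another (for example, collapsing spaces before replacing them with a token).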
- class pythainlp.phayathaibert.ThaiTextAugmenter[source]
- augment(text: str, num_augs: int = 3, sample: bool = False) List[str] [source]
Text augmentation from PhayaThaiBERT
- Parameters:
- Returns:
list of augmented texts
- Return type:
List[str]
- Example:
from pythainlp.augment.lm import ThaiTextAugmenter
aug = ThaiTextAugmenter()
aug.augment("ช้างมีทั้งหมด 50 ตัว บน", num_augs=5)
# output = ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้ครับ.', 'ช้างมีทั้งหมด 50 ตัว บนพื้นดินครับ...', 'ช้างมีทั้งหมด 50 ตัว บนท้องฟ้าครับ...', 'ช้างมีทั้งหมด 50 ตัว บนดวงจันทร์.‼', 'ช้างมีทั้งหมด 50 ตัว บนเขาค่ะ😁']
- class pythainlp.phayathaibert.PartOfSpeechTagger(model: str = 'lunarlist/pos_thai_phayathai')[source]
- get_tag(sentence: str, strategy: str = 'simple') List[List[Tuple[str, str]]] [source]
Marks sentences with part-of-speech (POS) tags.
- Parameters:
sentence (str) – the sentence to tag
- Returns:
a list of lists of tuples (word, POS tag)
- Return type:
List[List[Tuple[str, str]]]
- Example:
Labels POS for given sentence:
from pythainlp.phayathaibert.core import PartOfSpeechTagger
tagger = PartOfSpeechTagger()
tagger.get_tag("แมวทำอะไรตอนห้าโมงเช้า")
# output:
# [[('แมว', 'NOUN'), ('ทําอะไร', 'VERB'), ('ตอนห้าโมงเช้า', 'NOUN')]]
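Because the result is a list of sentences, each holding (word, POS tag) tuples, downstream code usually needs two nested loops. The sketch below groups words by tag, using the output shown above as literal sample data so that no model is loaded. `words_by_tag` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TaggedSentences = List[List[Tuple[str, str]]]


def words_by_tag(tagged: TaggedSentences) -> Dict[str, List[str]]:
    """Group words by POS tag across all sentences, keeping input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sentence in tagged:
        for word, tag in sentence:
            groups[tag].append(word)
    return dict(groups)


# Sample data copied from the documented output; no tagger is called here.
sample: TaggedSentences = [
    [("แมว", "NOUN"), ("ทําอะไร", "VERB"), ("ตอนห้าโมงเช้า", "NOUN")]
]
print(words_by_tag(sample))
```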
- class pythainlp.phayathaibert.NamedEntityTagger(model: str = 'Pavarissy/phayathaibert-thainer')[source]
- get_ner(text: str, tag: bool = False, pos: bool = False, strategy: str = 'simple') List[Tuple[str, str]] | List[Tuple[str, str, str]] | str [source]
This function tags named entities in text in IOB format.
- Parameters:
- Returns:
a list of tuples of tokenized words and NER tags by default; a list of tuples that also include POS tags if the parameter pos is True; or a string with HTML-like tags if the parameter tag is True
- Return type:
Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str]
- Example:
>>> from pythainlp.phayathaibert.core import NamedEntityTagger
>>>
>>> tagger = NamedEntityTagger()
>>> tagger.get_ner("ทดสอบนายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย")
[('นายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย', 'PERSON'), ('จาก', 'LOCATION'), ('ประเทศไทย', 'LOCATION')]
>>> tagger.get_ner("ทดสอบนายปวริศ เรืองจุติโพธิ์พานจากประเทศไทย", tag=True)
'ทดสอบ<PERSON>นายปวริศ เรืองจุติโพธิ์พาน</PERSON> <LOCATION>จาก</LOCATION><LOCATION>ประเทศไทย</LOCATION>'
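Since the return type is a union, calling code may want to guard against the string form before iterating. The sketch below filters two-element (text, label) tuples by label and rejects the HTML-like string returned when tag=True. `entities_with_label` is a hypothetical helper, and the sample rows are invented placeholders rather than model output.

```python
from typing import List, Tuple, Union

NerResult = Union[List[Tuple[str, str]], str]


def entities_with_label(result: NerResult, label: str) -> List[str]:
    """Return the text of every (text, label) tuple whose label matches."""
    if isinstance(result, str):
        # tag=True yields one tagged string, which has no tuples to filter.
        raise TypeError("expected a list of tuples, got a tagged string")
    return [text for text, tag in result if tag == label]


# Invented placeholder rows in the two-element form; not produced by the model.
rows = [("Alice", "PERSON"), ("Bangkok", "LOCATION"), ("Thailand", "LOCATION")]
print(entities_with_label(rows, "LOCATION"))
```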