pythainlp.augment

The pythainlp.augment module provides Thai text augmentation. Its classes are intended for text augmentation tasks.

Modules

class pythainlp.augment.WordNetAug[source]

Text augmentation using WordNet

__init__()[source]

find_synonyms(word: str, pos: str | None = None, postag_corpus: str = 'orchid') → List[str][source]

Find synonyms of a word in WordNet

Parameters:

word (str) – word to look up
pos (str) – part-of-speech type
postag_corpus (str) – name of the part-of-speech tagging corpus

Returns:

list of synonyms

Return type:

List[str]

augment(sentence: str, tokenize: object = <function word_tokenize>, max_syn_sent: int = 6, postag: bool = True, postag_corpus: str = 'orchid') → List[List[str]][source]

Text augmentation using WordNet

Parameters:

sentence (str) – Thai sentence
tokenize (object) – word tokenization function
max_syn_sent (int) – maximum number of augmented sentences
postag (bool) – whether to use part-of-speech tagging
postag_corpus (str) – name of the part-of-speech tagging corpus

Returns:

list of augmented sentences, each a tuple of tokens

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment import WordNetAug

aug = WordNetAug()
aug.augment("เราชอบไปโรงเรียน")
# output: [('เรา', 'ชอบ', 'ไป', 'ร.ร.'),
 ('เรา', 'ชอบ', 'ไป', 'รร.'),
 ('เรา', 'ชอบ', 'ไป', 'โรงเรียน'),
 ('เรา', 'ชอบ', 'ไป', 'อาคารเรียน'),
 ('เรา', 'ชอบ', 'ไปยัง', 'ร.ร.'),
 ('เรา', 'ชอบ', 'ไปยัง', 'รร.')]
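The shape of the output above (every combination of per-token alternatives, capped at max_syn_sent) can be illustrated with a minimal sketch. This is not the library's implementation: TOY_SYNONYMS and toy_augment are hypothetical names, and the synonym table is hand-written rather than looked up in WordNet.

```python
from itertools import islice, product
from typing import Dict, List, Tuple

# Hypothetical toy synonym table; a real augmenter would query WordNet instead.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "ไป": ["ไป", "ไปยัง"],
    "โรงเรียน": ["ร.ร.", "รร.", "โรงเรียน", "อาคารเรียน"],
}


def toy_augment(tokens: List[str], max_syn_sent: int = 6) -> List[Tuple[str, ...]]:
    """Combine per-token alternatives and keep at most max_syn_sent results."""
    # Tokens without an entry in the table are kept unchanged.
    candidates = [TOY_SYNONYMS.get(tok, [tok]) for tok in tokens]
    # product() enumerates every combination; islice() applies the cap.
    return list(islice(product(*candidates), max_syn_sent))


result = toy_augment(["เรา", "ชอบ", "ไป", "โรงเรียน"])
```

With the toy table, eight combinations are possible, and the cap of six keeps only the first six in enumeration order.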

class pythainlp.augment.word2vec.Word2VecAug(model: str, tokenize: object, type: str = 'file')[source]

__init__(model: str, tokenize: object, type: str = 'file') → None[source]

Parameters:

model (str) – path to the model
tokenize (object) – tokenization function
type (str) – model type (file, binary)

modify_sent(sent: str, p: float = 0.7) → List[List[str]][source]

Parameters:

sent (str) – text sentence
p (float) – probability

Return type:

List[List[str]]
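The List[List[str]] return type suggests one list of candidate words per input token. A minimal sketch of that idea follows, under a loudly stated assumption: p is treated here as a minimum similarity score for keeping a candidate, which is only one plausible reading of "probability". TOY_NEIGHBOURS and toy_modify_sent are hypothetical, and the scores are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical neighbour table standing in for a word2vec model's
# "most similar words" lookup: word -> [(neighbour, similarity score)].
# The scores are made up for this illustration.
TOY_NEIGHBOURS: Dict[str, List[Tuple[str, float]]] = {
    "ผม": [("ฉัน", 0.82), ("เขา", 0.74), ("คุณ", 0.55)],
    "เรียน": [("ศึกษา", 0.65)],
}


def toy_modify_sent(tokens: List[str], p: float = 0.7) -> List[List[str]]:
    """Return, for each token, the candidate words that may replace it."""
    out: List[List[str]] = []
    for tok in tokens:
        # Assumption: keep neighbours whose score is at least p.
        kept = [word for word, score in TOY_NEIGHBOURS.get(tok, []) if score >= p]
        # Fall back to the original token when nothing qualifies.
        out.append(kept or [tok])
    return out


candidates = toy_modify_sent(["ผม", "เรียน"], p=0.7)
```

Under this reading, a higher p keeps fewer candidates per token, so fewer distinct augmented sentences can be formed.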

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]][source]

Parameters:

sentence (str) – text sentence
n_sent (int) – maximum number of augmented sentences
p (float) – probability

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

class pythainlp.augment.word2vec.Thai2fitAug[source]

Text augmentation using word2vec from Thai2Fit

Thai2Fit: github.com/cstorm125/thai2fit

__init__()[source]

tokenizer(text: str) → List[str][source]

Parameters: text (str) – Thai text
Return type: List[str]

load_w2v()[source]

Load the Thai2Fit word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]][source]

Text augmentation using word2vec from Thai2Fit

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import Thai2fitAug

aug = Thai2fitAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('พวกเรา', 'เรียน'), ('ฉัน', 'เรียน')]
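Because augment returns tuples of tokens, callers usually need to join them back into strings before adding them to a training set. A minimal sketch, assuming plain concatenation is appropriate because Thai is written without spaces between words; join_tokens is a hypothetical helper and not part of PyThaiNLP.

```python
from typing import Iterable, List, Sequence


def join_tokens(augmented: Iterable[Sequence[str]]) -> List[str]:
    """Concatenate each token tuple into a single sentence string."""
    return ["".join(tokens) for tokens in augmented]


# Output copied from the documented example above.
documented_output = [("พวกเรา", "เรียน"), ("ฉัน", "เรียน")]
sentences = join_tokens(documented_output)
```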

class pythainlp.augment.word2vec.LTW2VAug[source]

Text augmentation using word2vec from LTW2V

LTW2V: github.com/PyThaiNLP/large-thaiword2vec

__init__()[source]

tokenizer(text: str) → List[str][source]

Parameters: text (str) – Thai text
Return type: List[str]

load_w2v()[source]

Load the LTW2V word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]][source]

Text augmentation using word2vec from LTW2V

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import LTW2VAug

aug = LTW2VAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('เขา', 'เรียนหนังสือ'), ('เขา', 'สมัครเรียน')]

class pythainlp.augment.lm.FastTextAug(model_path: str)[source]

Text augmentation using FastText

Parameters: model_path (str) – path to the model file

__init__(model_path: str)[source]

Parameters: model_path (str) – path to the model file

tokenize(text: str) → List[str][source]

Thai text tokenization for FastText

Parameters: text (str) – Thai text
Returns: list of words
Return type: List[str]

modify_sent(sent: str, p: float = 0.7) → List[List[str]][source]

Parameters:

sent (str) – text sentence
p (float) – probability

Return type:

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]][source]

Text augmentation using FastText

You need to download the Thai model from https://fasttext.cc/docs/en/crawl-vectors.html.

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]
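A usage sketch for FastTextAug, built only from the signatures listed above. The helper name augment_with_fasttext and the file name in the commented call are assumptions; the model must be downloaded separately, so the call itself is left commented out.

```python
from typing import List, Tuple


def augment_with_fasttext(
    model_path: str, sentence: str, n_sent: int = 2, p: float = 0.7
) -> List[Tuple[str, ...]]:
    """Load a FastText model from model_path and augment one sentence."""
    # Imported lazily so that defining this helper does not require
    # PyThaiNLP or a model file to be present.
    from pythainlp.augment.lm import FastTextAug

    aug = FastTextAug(model_path)
    return aug.augment(sentence, n_sent=n_sent, p=p)


# Hypothetical call; "cc.th.300.bin" stands for wherever the downloaded
# Thai model was saved on disk.
# augmented = augment_with_fasttext("cc.th.300.bin", "ผมเรียน")
```

Loading the model inside the helper keeps the sketch short; code that augments many sentences would more sensibly create the FastTextAug instance once and reuse it.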

class pythainlp.augment.lm.Thai2transformersAug[source]

__init__()[source]

generate(sentence: str, num_replace_tokens: int = 3)[source]

augment(sentence: str, num_replace_tokens: int = 3) → List[str][source]

Text augmentation using WangchanBERTa

Parameters:

sentence (str) – Thai sentence
num_replace_tokens (int) – number of tokens to replace

Returns:

list of augmented sentences

Return type:

List[str]

Example:

from pythainlp.augment.lm import Thai2transformersAug

aug = Thai2transformersAug()

aug.augment("ช้างมีทั้งหมด 50 ตัว บน")
# output: ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้',
 'ช้างมีทั้งหมด 50 ตัว บนสุด',
 'ช้างมีทั้งหมด 50 ตัว บนบก',
 'ช้างมีทั้งหมด 50 ตัว บนนั้น',
 'ช้างมีทั้งหมด 50 ตัว บนหัว']

class pythainlp.augment.word2vec.bpemb_wv.BPEmbAug(lang: str = 'th', vs: int = 100000, dim: int = 300)[source]

Thai text augmentation using word2vec from BPEmb

BPEmb: github.com/bheinzerling/bpemb

__init__(lang: str = 'th', vs: int = 100000, dim: int = 300)[source]

tokenizer(text: str) → List[str][source]

Parameters: text (str) – Thai text
Return type: List[str]

load_w2v()[source]

Load the BPEmb model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]][source]

Text augmentation using word2vec from BPEmb

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[str]

Example:

from pythainlp.augment.word2vec.bpemb_wv import BPEmbAug

aug = BPEmbAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: ['ผมสอน', 'ผมเข้าเรียน']