pythainlp.augment

The pythainlp.augment module provides text augmentation for Thai: it generates new variants of a sentence, for example to expand training data.

Modules

class pythainlp.augment.WordNetAug[source]

Text augmentation using WordNet

__init__()[source]
find_synonyms(word: str, pos: str | None = None, postag_corpus: str = 'orchid') List[str][source]

Find synonyms of a word from WordNet

Parameters:
  • word (str) – Thai word

  • pos (str) – part-of-speech tag of the word

  • postag_corpus (str) – name of the part-of-speech tagging corpus

Returns:

list of synonyms

Return type:

List[str]
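
The synonym lookup can be pictured with a toy synonym table; everything below (the table and the helper) is a hypothetical, self-contained sketch, while the real method queries WordNet through PyThaiNLP's corpus data:

```python
from typing import List, Optional

# Hypothetical miniature synonym table standing in for WordNet.
# Keys are (word, part-of-speech) pairs.
TOY_SYNSETS = {
    ("โรงเรียน", "n"): ["ร.ร.", "รร.", "อาคารเรียน"],
    ("ไป", "v"): ["ไปยัง"],
}


def find_synonyms_toy(word: str, pos: Optional[str] = None) -> List[str]:
    """Return synonyms of `word`, optionally restricted to one POS tag."""
    if pos is not None:
        return sorted(TOY_SYNSETS.get((word, pos), []))
    # No POS given: merge synonyms across all parts of speech.
    result = set()
    for (w, _), syns in TOY_SYNSETS.items():
        if w == word:
            result.update(syns)
    return sorted(result)


print(find_synonyms_toy("โรงเรียน", "n"))
```

Passing a POS tag narrows the lookup, mirroring the `pos` parameter above; an unknown word simply yields an empty list.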

augment(sentence: str, tokenize: object = <function word_tokenize>, max_syn_sent: int = 6, postag: bool = True, postag_corpus: str = 'orchid') List[List[str]][source]

Text augmentation using WordNet

Parameters:
  • sentence (str) – Thai sentence

  • tokenize (object) – word tokenization function

  • max_syn_sent (int) – maximum number of synonym sentences to generate

  • postag (bool) – enable part-of-speech tagging

  • postag_corpus (str) – name of the part-of-speech tagging corpus

Returns:

list of synonym sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment import WordNetAug

aug = WordNetAug()
aug.augment("เราชอบไปโรงเรียน")
# output: [('เรา', 'ชอบ', 'ไป', 'ร.ร.'),
 ('เรา', 'ชอบ', 'ไป', 'รร.'),
 ('เรา', 'ชอบ', 'ไป', 'โรงเรียน'),
 ('เรา', 'ชอบ', 'ไป', 'อาคารเรียน'),
 ('เรา', 'ชอบ', 'ไปยัง', 'ร.ร.'),
 ('เรา', 'ชอบ', 'ไปยัง', 'รร.')]
class pythainlp.augment.word2vec.Word2VecAug(model: str, tokenize: object, type: str = 'file')[source]
__init__(model: str, tokenize: object, type: str = 'file') None[source]
Parameters:
  • model (str) – path to the model

  • tokenize (object) – tokenization function

  • type (str) – model type ("file" or "binary")

modify_sent(sent: str, p: float = 0.7) List[List[str]][source]
Parameters:
  • sent (str) – text sentence

  • p (float) – probability of replacing each word

Return type:

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]
Parameters:
  • sentence (str) – text sentence

  • n_sent (int) – maximum number of augmented sentences to generate

  • p (float) – probability of replacing each word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]
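
The idea behind modify_sent and augment can be sketched with a stub nearest-neighbour table standing in for a real word2vec model; all names below are hypothetical and the sketch only illustrates the probabilistic word-replacement scheme:

```python
import random
from typing import Dict, List, Tuple

# Stub "most similar word" table standing in for a word2vec model.
NEIGHBOURS: Dict[str, str] = {"ผม": "ฉัน", "เรียน": "ศึกษา"}


def modify_sent_toy(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Swap each token for its nearest neighbour with probability p."""
    return [
        NEIGHBOURS[tok] if tok in NEIGHBOURS and rng.random() < p else tok
        for tok in tokens
    ]


def augment_toy(tokens: List[str], n_sent: int = 1, p: float = 0.7,
                seed: int = 0) -> List[Tuple[str, ...]]:
    """Generate n_sent augmented variants of the token list."""
    rng = random.Random(seed)
    return [tuple(modify_sent_toy(tokens, p, rng)) for _ in range(n_sent)]


# With p=1.0 every known word is replaced; with p=0.0 nothing changes.
print(augment_toy(["ผม", "เรียน"], n_sent=1, p=1.0))  # [('ฉัน', 'ศึกษา')]
```

Raising p makes the variants drift further from the input, which matches the role of the p parameter documented above.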

class pythainlp.augment.word2vec.Thai2fitAug[source]

Text augmentation using word2vec from Thai2Fit

Thai2Fit: github.com/cstorm125/thai2fit

__init__()[source]
tokenizer(text: str) List[str][source]
Parameters:

text (str) – Thai text

Return type:

List[str]

load_w2v()[source]

Load the Thai2Fit word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using word2vec from Thai2Fit

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability of replacing each word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import Thai2fitAug

aug = Thai2fitAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('พวกเรา', 'เรียน'), ('ฉัน', 'เรียน')]
class pythainlp.augment.word2vec.LTW2VAug[source]

Text augmentation using word2vec from LTW2V

LTW2V: github.com/PyThaiNLP/large-thaiword2vec

__init__()[source]
tokenizer(text: str) List[str][source]
Parameters:

text (str) – Thai text

Return type:

List[str]

load_w2v()[source]

Load the LTW2V word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using word2vec from LTW2V

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability of replacing each word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import LTW2VAug

aug = LTW2VAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('เขา', 'เรียนหนังสือ'), ('เขา', 'สมัครเรียน')]
class pythainlp.augment.lm.FastTextAug(model_path: str)[source]

Text augmentation using fastText

Parameters:

model_path (str) – path of model file

__init__(model_path: str)[source]
Parameters:

model_path (str) – path of model file

tokenize(text: str) List[str][source]

Tokenize Thai text for fastText

Parameters:

text (str) – Thai text

Returns:

list of words

Return type:

List[str]

modify_sent(sent: str, p: float = 0.7) List[List[str]][source]
Parameters:
  • sent (str) – text sentence

  • p (float) – probability of replacing each word

Return type:

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using fastText

Download a Thai fastText model from https://fasttext.cc/docs/en/crawl-vectors.html before using this method.

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability of replacing each word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]
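
The embedding-based replacement can be sketched with a hypothetical top-k neighbour table in place of a downloaded fastText model; the names and the table below are illustrative assumptions, not the library's internals:

```python
import random
from typing import Dict, List, Tuple

# Hypothetical top-k neighbour lists, as an embedding model might return.
TOP_K: Dict[str, List[str]] = {
    "ผม": ["ฉัน", "เขา"],
    "เรียน": ["ศึกษา", "เรียนรู้"],
}


def augment_fasttext_toy(tokens: List[str], n_sent: int = 1, p: float = 0.7,
                         seed: int = 0) -> List[Tuple[str, ...]]:
    """For each of n_sent variants, replace each token (with probability p)
    by a word drawn at random from its neighbour list."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_sent):
        sent = [
            rng.choice(TOP_K[tok]) if tok in TOP_K and rng.random() < p else tok
            for tok in tokens
        ]
        variants.append(tuple(sent))
    return variants


print(augment_fasttext_toy(["ผม", "เรียน"], n_sent=2, p=1.0))
```

Unlike the single-neighbour sketch, sampling from a top-k list lets repeated calls produce different variants of the same sentence, which is what n_sent relies on.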

class pythainlp.augment.lm.Thai2transformersAug[source]
__init__()[source]
generate(sentence: str, num_replace_tokens: int = 3)[source]
augment(sentence: str, num_replace_tokens: int = 3) List[str][source]

Text augmentation using WangchanBERTa

Parameters:
  • sentence (str) – Thai sentence

  • num_replace_tokens (int) – number of tokens to replace

Returns:

list of augmented sentences

Return type:

List[str]

Example:

from pythainlp.augment.lm import Thai2transformersAug

aug = Thai2transformersAug()

aug.augment("ช้างมีทั้งหมด 50 ตัว บน")
# output: ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้',
 'ช้างมีทั้งหมด 50 ตัว บนสุด',
 'ช้างมีทั้งหมด 50 ตัว บนบก',
 'ช้างมีทั้งหมด 50 ตัว บนนั้น',
 'ช้างมีทั้งหมด 50 ตัว บนหัว']
class pythainlp.augment.word2vec.bpemb_wv.BPEmbAug(lang: str = 'th', vs: int = 100000, dim: int = 300)[source]

Thai text augmentation using word2vec from BPEmb

BPEmb: github.com/bheinzerling/bpemb

__init__(lang: str = 'th', vs: int = 100000, dim: int = 300)[source]
tokenizer(text: str) List[str][source]
Parameters:

text (str) – Thai text

Return type:

List[str]

load_w2v()[source]

Load BPEmb model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]][source]

Text augmentation using word2vec from BPEmb

Parameters:
  • sentence (str) – Thai sentence

  • n_sent (int) – number of augmented sentences to generate

  • p (float) – probability of replacing each word

Returns:

list of augmented sentences

Return type:

List[str]

Example:

from pythainlp.augment.word2vec.bpemb_wv import BPEmbAug

aug = BPEmbAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: ['ผมสอน', 'ผมเข้าเรียน']