pythainlp.augment

Introduction

The pythainlp.augment module provides tools for text augmentation in Thai. Text augmentation generates alternative versions of an original text to enrich and diversify a dataset, which can improve the quality and variety of Thai data for NLP tasks.

TextAugment Class

The central component of the pythainlp.augment module is the TextAugment class. It provides text augmentation techniques for diversifying your text data.

WordNetAug Class

The WordNetAug class performs text augmentation using WordNet, a lexical database originally developed for English. It replaces words in a Thai sentence with synonyms looked up in WordNet, optionally using part-of-speech tags. The following methods are available within this class:

class pythainlp.augment.WordNetAug

Text augmentation using WordNet

__init__()

find_synonyms(word: str, pos: str | None = None, postag_corpus: str = 'orchid') → List[str]

Find synonyms using WordNet

Parameters:

word (str) – word to look up
pos (str | None) – part-of-speech type
postag_corpus (str) – name of POS tag corpus

Returns:

list of synonyms

Return type:

List[str]

augment(sentence: str, tokenize: object = <function word_tokenize>, max_syn_sent: int = 6, postag: bool = True, postag_corpus: str = 'orchid') → List[List[str]]

Text augmentation using WordNet

Parameters:

sentence (str) – Thai sentence
tokenize (object) – function for tokenizing words
max_syn_sent (int) – maximum number of synonymous sentences
postag (bool) – whether to use part-of-speech tagging
postag_corpus (str) – name of POS tag corpus

Returns:

list of augmented sentences, each given as a sequence of tokens

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment import WordNetAug

aug = WordNetAug()
aug.augment("เราชอบไปโรงเรียน")
# output:
# [('เรา', 'ชอบ', 'ไป', 'ร.ร.'),
#  ('เรา', 'ชอบ', 'ไป', 'รร.'),
#  ('เรา', 'ชอบ', 'ไป', 'โรงเรียน'),
#  ('เรา', 'ชอบ', 'ไป', 'อาคารเรียน'),
#  ('เรา', 'ชอบ', 'ไปยัง', 'ร.ร.'),
#  ('เรา', 'ชอบ', 'ไปยัง', 'รร.')]
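
The six results in the example are consistent with building a list of candidate words for each token and enumerating the combinations in order, stopping at max_syn_sent. The following is a minimal sketch of that idea, not the library's implementation. It assumes a capped Cartesian product; the SYNONYMS table and the name combine_synonyms are hypothetical stand-ins for the WordNet lookup.

```python
from itertools import islice, product
from typing import Dict, List, Tuple

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS: Dict[str, List[str]] = {
    "ไป": ["ไป", "ไปยัง"],
    "โรงเรียน": ["ร.ร.", "รร.", "โรงเรียน", "อาคารเรียน"],
}


def combine_synonyms(
    tokens: List[str], max_syn_sent: int = 6
) -> List[Tuple[str, ...]]:
    """Enumerate synonym combinations in order, keeping at most max_syn_sent."""
    # A token with no known synonyms contributes only itself.
    candidates = [SYNONYMS.get(tok, [tok]) for tok in tokens]
    # product() varies the last position fastest; islice() applies the cap.
    return list(islice(product(*candidates), max_syn_sent))


variants = combine_synonyms(["เรา", "ชอบ", "ไป", "โรงเรียน"])
for variant in variants:
    print(variant)
```

With two candidates for the third token and four for the last, eight combinations are possible, so the default cap of 6 cuts off the final two.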

Word2VecAug, Thai2fitAug, LTW2VAug Classes

The pythainlp.augment.word2vec package provides three classes that generate text variations from Word2Vec embeddings: Word2VecAug, Thai2fitAug, and LTW2VAug. Word2VecAug takes a model path and a tokenizer function. Thai2fitAug and LTW2VAug take no constructor arguments and load the Thai2Fit and LTW2V models respectively.

class pythainlp.augment.word2vec.Word2VecAug(model: str, tokenize: object, type: str = 'file')

__init__(model: str, tokenize: object, type: str = 'file') → None

Parameters:

model (str) – path to the model
tokenize (object) – tokenizer function
type (str) – model type ('file' or 'binary')

modify_sent(sent: str, p: float = 0.7) → List[List[str]]

Parameters:

sent (str) – text of sentence
p (float) – probability

Return type:

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Parameters:

sentence (str) – text of sentence
n_sent (int) – maximum number of synonymous sentences
p (float) – probability

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]
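
The reference describes p only as a "probability". One plausible reading, offered here as an assumption and not as the library's confirmed behaviour, is that p acts as a minimum similarity score that a neighbouring word must reach before it is used as a replacement. The sketch below illustrates that reading. The NEIGHBOURS table is hypothetical and replaces a trained Word2Vec model, and candidates_per_token and augment_tokens are illustrative names, not PyThaiNLP functions.

```python
from itertools import islice, product
from typing import Dict, List, Tuple

# Hypothetical nearest neighbours with similarity scores (not real model output).
NEIGHBOURS: Dict[str, List[Tuple[str, float]]] = {
    "ผม": [("ฉัน", 0.82), ("พวกเรา", 0.74), ("เขา", 0.55)],
    "เรียน": [("ศึกษา", 0.48)],
}


def candidates_per_token(tokens: List[str], p: float = 0.7) -> List[List[str]]:
    """For each token, list neighbours scoring at least p, else the token itself."""
    result: List[List[str]] = []
    for tok in tokens:
        close = [word for word, score in NEIGHBOURS.get(tok, []) if score >= p]
        # Fall back to the original token when nothing passes the threshold.
        result.append(close or [tok])
    return result


def augment_tokens(
    tokens: List[str], n_sent: int = 1, p: float = 0.7
) -> List[Tuple[str, ...]]:
    """Combine per-token candidates and keep at most n_sent variants."""
    return list(islice(product(*candidates_per_token(tokens, p)), n_sent))


print(augment_tokens(["ผม", "เรียน"], n_sent=2, p=0.7))
```

Under this assumption, lowering p admits more distant neighbours and therefore more, but looser, variants.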

class pythainlp.augment.word2vec.Thai2fitAug

Text augmentation using word2vec from Thai2Fit

Thai2Fit: github.com/cstorm125/thai2fit

__init__()

tokenizer(text: str) → List[str]

Parameters: text (str) – Thai text
Return type: List[str]

load_w2v()

Load Thai2Fit’s word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using word2vec from Thai2Fit

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import Thai2fitAug

aug = Thai2fitAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('พวกเรา', 'เรียน'), ('ฉัน', 'เรียน')]
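
As the example shows, augment returns each variant as a tuple of tokens, not as a string. Since Thai is normally written without spaces between words, joining the tokens with an empty string is usually enough to get sentence strings back. The sketch below uses the output shown above as literal data, so it runs without downloading a model; join_variants is a hypothetical helper name.

```python
from typing import List, Sequence


def join_variants(variants: Sequence[Sequence[str]]) -> List[str]:
    """Join each tuple of tokens into a single string with no separator."""
    return ["".join(tokens) for tokens in variants]


# Literal copy of the example output above, not a fresh model call.
augmented = [("พวกเรา", "เรียน"), ("ฉัน", "เรียน")]
sentences = join_variants(augmented)
print(sentences)
```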

class pythainlp.augment.word2vec.LTW2VAug

Text augmentation using word2vec from LTW2V

LTW2V: github.com/PyThaiNLP/large-thaiword2vec

__init__()

tokenizer(text: str) → List[str]

Parameters: text (str) – Thai text
Return type: List[str]

load_w2v()

Load LTW2V’s word2vec model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using word2vec from LTW2V

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

Example:

from pythainlp.augment.word2vec import LTW2VAug

aug = LTW2VAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: [('เขา', 'เรียนหนังสือ'), ('เขา', 'สมัครเรียน')]

FastTextAug and Thai2transformersAug Classes

The pythainlp.augment.lm package provides two classes for language model-based text augmentation: FastTextAug, which takes the path to a fastText model file, and Thai2transformersAug, which uses WangchanBERTa.

class pythainlp.augment.lm.FastTextAug(model_path: str)

Text augmentation using fastText

Parameters: model_path (str) – path to the model file

__init__(model_path: str)

Parameters: model_path (str) – path to the model file

tokenize(text: str) → List[str]

Thai text tokenization for fastText

Parameters: text (str) – Thai text
Returns: list of words
Return type: List[str]

modify_sent(sent: str, p: float = 0.7) → List[List[str]]

Parameters:

sent (str) – text of sentence
p (float) – probability

Return type:

List[List[str]]

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using fastText

You may want to download the Thai model from https://fasttext.cc/docs/en/crawl-vectors.html.

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[Tuple[str]]

class pythainlp.augment.lm.Thai2transformersAug

__init__()

generate(sentence: str, num_replace_tokens: int = 3)

augment(sentence: str, num_replace_tokens: int = 3) → List[str]

Text augmentation using WangchanBERTa

Parameters:

sentence (str) – Thai sentence
num_replace_tokens (int) – number of tokens to replace

Returns:

list of augmented texts

Return type:

List[str]

Example:

from pythainlp.augment.lm import Thai2transformersAug

aug = Thai2transformersAug()

aug.augment("ช้างมีทั้งหมด 50 ตัว บน")
# output:
# ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้',
#  'ช้างมีทั้งหมด 50 ตัว บนสุด',
#  'ช้างมีทั้งหมด 50 ตัว บนบก',
#  'ช้างมีทั้งหมด 50 ตัว บนนั้น',
#  'ช้างมีทั้งหมด 50 ตัว บนหัว']

BPEmbAug Class

The pythainlp.augment.word2vec.bpemb_wv module contains the BPEmbAug class, which augments Thai text using BPEmb subword embeddings.

class pythainlp.augment.word2vec.bpemb_wv.BPEmbAug(lang: str = 'th', vs: int = 100000, dim: int = 300)

Thai text augmentation using word2vec from BPEmb

BPEmb: github.com/bheinzerling/bpemb

__init__(lang: str = 'th', vs: int = 100000, dim: int = 300)

tokenizer(text: str) → List[str]

Parameters: text (str) – Thai text
Return type: List[str]

load_w2v()

Load BPEmb model

augment(sentence: str, n_sent: int = 1, p: float = 0.7) → List[Tuple[str]]

Text augmentation using word2vec from BPEmb

Parameters:

sentence (str) – Thai sentence
n_sent (int) – number of sentences
p (float) – probability of word

Returns:

list of augmented sentences

Return type:

List[str]

Example:

from pythainlp.augment.word2vec.bpemb_wv import BPEmbAug

aug = BPEmbAug()
aug.augment("ผมเรียน", n_sent=2, p=0.5)
# output: ['ผมสอน', 'ผมเข้าเรียน']

Additional Functions

The pythainlp.augment module also offers the following function:

postype2wordnet: This function maps part-of-speech tags to WordNet-compatible POS tags, facilitating the integration of WordNet augmentation with Thai text.
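
To illustrate the kind of mapping involved: WordNet identifies parts of speech with single-letter codes (n for noun, v for verb, a for adjective, r for adverb). The sketch below maps a few Universal POS tags to those codes. It is only an illustration under stated assumptions: the table UD_TO_WORDNET and the function to_wordnet_pos are hypothetical, and neither the tag set nor the signature is taken from postype2wordnet itself.

```python
from typing import Dict, Optional

# WordNet POS codes: n = noun, v = verb, a = adjective, r = adverb.
# The Universal POS tags on the left are chosen for illustration only.
UD_TO_WORDNET: Dict[str, str] = {
    "NOUN": "n",
    "PROPN": "n",
    "VERB": "v",
    "ADJ": "a",
    "ADV": "r",
}


def to_wordnet_pos(tag: str) -> Optional[str]:
    """Return the WordNet POS code for a tag, or None if there is no match."""
    return UD_TO_WORDNET.get(tag.upper())


for tag in ("NOUN", "verb", "PUNCT"):
    print(tag, "->", to_wordnet_pos(tag))
```

Returning None for tags without a WordNet counterpart lets a caller fall back to a lookup that ignores part of speech.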

Together, these classes and functions cover synonym-based, embedding-based, and language model-based augmentation for Thai text.

For detailed usage examples and guidelines, refer to the official PyThaiNLP documentation.