pythainlp.augment
Introduction
The pythainlp.augment module provides tools for text augmentation in the Thai language. Text augmentation generates alternative versions of an original text, which increases the size and variety of the data available for NLP tasks.
TextAugment Class
The central component of the pythainlp.augment module is the TextAugment class, which provides text augmentation techniques for diversifying your text data.
WordNetAug Class
The WordNetAug class performs text augmentation using WordNet, a lexical database that groups words into sets of synonyms (originally developed for English). The class looks up synonyms for the words in a Thai sentence and combines them into alternative sentences, as the Thai output in the example below shows. The following methods are available within this class:
- class pythainlp.augment.WordNetAug[source]
Text Augment using wordnet
- find_synonyms(word: str, pos: str | None = None, postag_corpus: str = 'orchid') List[str] [source]
Find synonyms using wordnet
- augment(sentence: str, tokenize: object = <function word_tokenize>, max_syn_sent: int = 6, postag: bool = True, postag_corpus: str = 'orchid') List[List[str]] [source]
Text Augment using wordnet
- Parameters:
- Returns:
list of augmented sentences, each a tuple of tokens in which words may be replaced by synonyms
- Return type:
List[Tuple[str]]
- Example:
    from pythainlp.augment import WordNetAug

    aug = WordNetAug()
    aug.augment("เราชอบไปโรงเรียน")
    # output:
    # [('เรา', 'ชอบ', 'ไป', 'ร.ร.'),
    #  ('เรา', 'ชอบ', 'ไป', 'รร.'),
    #  ('เรา', 'ชอบ', 'ไป', 'โรงเรียน'),
    #  ('เรา', 'ชอบ', 'ไป', 'อาคารเรียน'),
    #  ('เรา', 'ชอบ', 'ไปยัง', 'ร.ร.'),
    #  ('เรา', 'ชอบ', 'ไปยัง', 'รร.')]
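The documented output above is consistent with taking every combination of per-token synonym candidates and keeping only the first max_syn_sent results. The following is a minimal sketch of that idea, not the library's implementation: TOY_SYNONYMS and combine_synonyms are hypothetical names, and the synonym table is hand-written rather than looked up in WordNet.

```python
from itertools import islice, product
from typing import Dict, List, Tuple

# Hypothetical synonym table standing in for WordNet lookups.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "ไป": ["ไป", "ไปยัง"],
    "โรงเรียน": ["ร.ร.", "รร.", "โรงเรียน", "อาคารเรียน"],
}


def combine_synonyms(
    tokens: List[str], max_syn_sent: int = 6
) -> List[Tuple[str, ...]]:
    """Build sentence variants from per-token synonym candidates."""
    # A token with no known synonyms keeps itself as its only candidate.
    candidates = [TOY_SYNONYMS.get(tok, [tok]) for tok in tokens]
    # product() yields every combination lazily; islice() caps the count.
    return list(islice(product(*candidates), max_syn_sent))


variants = combine_synonyms(["เรา", "ชอบ", "ไป", "โรงเรียน"])
for variant in variants:
    print(variant)
```

Capping the result with islice avoids materializing the full product, which grows multiplicatively with the number of synonyms per token.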
Word2VecAug, Thai2fitAug, LTW2VAug Classes
The pythainlp.augment.word2vec package contains classes for text augmentation using Word2Vec models: Word2VecAug, Thai2fitAug, and LTW2VAug. Each class uses Word2Vec embeddings to generate variations of the input text.
- class pythainlp.augment.word2vec.Word2VecAug(model: str, tokenize: object, type: str = 'file')[source]
- class pythainlp.augment.word2vec.Thai2fitAug[source]
Text Augment using word2vec from Thai2Fit
Thai2Fit: github.com/cstorm125/thai2fit
- augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]] [source]
Text Augment using word2vec from Thai2Fit
- Parameters:
- Returns:
list of augmented text
- Return type:
List[Tuple[str]]
- Example:
    from pythainlp.augment.word2vec import Thai2fitAug

    aug = Thai2fitAug()
    aug.augment("ผมเรียน", n_sent=2, p=0.5)
    # output:
    # [('พวกเรา', 'เรียน'), ('ฉัน', 'เรียน')]
- class pythainlp.augment.word2vec.LTW2VAug[source]
Text Augment using word2vec from LTW2V
LTW2V: github.com/PyThaiNLP/large-thaiword2vec
- augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]] [source]
Text Augment using word2vec from LTW2V
- Parameters:
- Returns:
list of augmented text
- Return type:
List[Tuple[str]]
- Example:
    from pythainlp.augment.word2vec import LTW2VAug

    aug = LTW2VAug()
    aug.augment("ผมเรียน", n_sent=2, p=0.5)
    # output:
    # [('เขา', 'เรียนหนังสือ'), ('เขา', 'สมัครเรียน')]
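The Word2Vec-based classes above replace words with neighbors from an embedding space. The sketch below illustrates that general technique with cosine similarity over hand-written toy vectors. It is not the library's code: TOY_VECTORS, cosine, and similar_words are hypothetical names, and treating p as a minimum similarity score is an assumption, since this page does not describe what p controls.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical 2-d embeddings; real models use hundreds of dimensions.
TOY_VECTORS: Dict[str, Tuple[float, float]] = {
    "ผม": (1.0, 0.0),
    "ฉัน": (0.9, 0.1),
    "เรียน": (0.0, 1.0),
    "สอน": (0.6, 0.8),
}


def cosine(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def similar_words(word: str, p: float = 0.7) -> List[str]:
    """Return neighbors whose similarity is at least p (assumed meaning of p)."""
    if word not in TOY_VECTORS:
        return [word]  # out-of-vocabulary words are left unchanged
    target = TOY_VECTORS[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in TOY_VECTORS.items()
        if other != word
    ]
    # Most similar first, then drop everything below the threshold.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    kept = [other for other, score in scored if score >= p]
    return kept or [word]  # fall back to the original word


print(similar_words("ผม", p=0.5))
print(similar_words("เรียน"))
```

The candidate lists produced this way could then be combined into full sentences, with n_sent limiting how many are returned.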
FastTextAug and Thai2transformersAug Classes
The pythainlp.augment.lm package offers classes for text augmentation using language models: FastTextAug and Thai2transformersAug. These classes use language model-based techniques to diversify text data.
- class pythainlp.augment.lm.FastTextAug(model_path: str)[source]
Text Augment from fastText
- Parameters:
model_path (str) – path of model file
- class pythainlp.augment.lm.Thai2transformersAug[source]
- augment(sentence: str, num_replace_tokens: int = 3) List[str] [source]
Text augmentation from WangchanBERTa
- Parameters:
- Returns:
list of augmented text
- Return type:
List[str]
- Example:
    from pythainlp.augment.lm import Thai2transformersAug

    aug = Thai2transformersAug()
    aug.augment("ช้างมีทั้งหมด 50 ตัว บน")
    # output:
    # ['ช้างมีทั้งหมด 50 ตัว บนโลกใบนี้',
    #  'ช้างมีทั้งหมด 50 ตัว บนสุด',
    #  'ช้างมีทั้งหมด 50 ตัว บนบก',
    #  'ช้างมีทั้งหมด 50 ตัว บนนั้น',
    #  'ช้างมีทั้งหมด 50 ตัว บนหัว']
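In the documented output above, each result is the input sentence with a different continuation, which is what a masked language model produces when a mask token is appended and filled in. The sketch below shows that pattern with a stub in place of a real model. Everything here is hypothetical: fake_fill_mask stands in for the model, the "<mask>" string and the candidate scores are invented for illustration, and augment_with_mask is not the library's method.

```python
from typing import Callable, Dict, List

MASK = "<mask>"  # assumed mask token, for illustration only


def fake_fill_mask(text: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in for a masked language model."""
    candidates = [("หัว", 0.1), ("โลกใบนี้", 0.4), ("บก", 0.3)]
    return [
        {"sequence": text.replace(MASK, token), "score": score}
        for token, score in candidates
    ]


def augment_with_mask(
    sentence: str,
    fill_mask: Callable[[str], List[Dict[str, object]]],
    top_k: int = 2,
) -> List[str]:
    """Append a mask token and keep the top_k highest-scoring completions."""
    predictions = fill_mask(sentence + MASK)
    ranked = sorted(predictions, key=lambda pred: pred["score"], reverse=True)
    return [str(pred["sequence"]) for pred in ranked[:top_k]]


results = augment_with_mask("ช้างมีทั้งหมด 50 ตัว บน", fake_fill_mask)
print(results)
```

Passing the model in as a callable keeps the ranking logic separate from the model, so the stub can be swapped for a real fill-mask function.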
BPEmbAug Class
The pythainlp.augment.word2vec.bpemb_wv module contains the BPEmbAug class, which augments Thai text using subword embeddings from BPEmb.
- class pythainlp.augment.word2vec.bpemb_wv.BPEmbAug(lang: str = 'th', vs: int = 100000, dim: int = 300)[source]
Thai Text Augment using word2vec from BPEmb
BPEmb: github.com/bheinzerling/bpemb
- augment(sentence: str, n_sent: int = 1, p: float = 0.7) List[Tuple[str]] [source]
Text Augment using word2vec from BPEmb
- Parameters:
- Returns:
list of augmented text
- Return type:
List[str]
- Example:
    from pythainlp.augment.word2vec.bpemb_wv import BPEmbAug

    aug = BPEmbAug()
    aug.augment("ผมเรียน", n_sent=2, p=0.5)
    # output:
    # ['ผมสอน', 'ผมเข้าเรียน']
Additional Functions
The pythainlp.augment module also offers the following function:
postype2wordnet: This function maps part-of-speech tags to WordNet-compatible POS tags, facilitating the integration of WordNet augmentation with Thai text.
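A mapping of this kind can be pictured as a lookup table from corpus-specific tags to WordNet's single-letter part-of-speech codes. The sketch below is illustrative only: TOY_ORCHID_TO_WORDNET and map_pos_to_wordnet are hypothetical names, the table covers just a handful of ORCHID tags, and it should not be read as the mapping that postype2wordnet actually uses.

```python
from typing import Dict, Optional

# WordNet's single-letter part-of-speech codes.
WN_NOUN, WN_VERB, WN_ADV = "n", "v", "r"

# Partial, illustrative table for a few ORCHID tags (hypothetical).
TOY_ORCHID_TO_WORDNET: Dict[str, str] = {
    "NCMN": WN_NOUN,  # common noun
    "NPRP": WN_NOUN,  # proper noun
    "VACT": WN_VERB,  # active verb
    "VSTA": WN_VERB,  # stative verb
    "ADVN": WN_ADV,   # adverb
}


def map_pos_to_wordnet(tag: str) -> Optional[str]:
    """Return the WordNet POS code for a tag, or None if it is not mapped."""
    return TOY_ORCHID_TO_WORDNET.get(tag)


print(map_pos_to_wordnet("NCMN"))
print(map_pos_to_wordnet("VACT"))
```

Returning None for unmapped tags lets a caller fall back to a synonym lookup that is not restricted by part of speech.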
Together, these functions and classes cover WordNet-based, embedding-based, and language model-based augmentation for Thai text.
For detailed usage examples and guidelines, please refer to the official PyThaiNLP documentation.