pythainlp.generate

The pythainlp.generate module provides classes and functions for generating Thai text using n-gram and neural language models.

N-gram generators

class pythainlp.generate.Unigram(name: str = 'tnc')[source]

Text generator using a unigram language model

Parameters:

name (str) – corpus name:

  • tnc - Thai National Corpus (default)

  • ttc - Thai Textbook Corpus (TTC)

  • oscar - OSCAR Corpus

__init__(name: str = 'tnc') None[source]
counts: dict[str, int]
word: list[str]
n: int
prob: dict[str, float]
gen_sentence(start_seq: str = '', N: int = 3, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]

Generate a sentence using the unigram model.

Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • prob (float) – minimum word probability threshold

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

Union[list[str], str]

Example:
>>> from pythainlp.generate import Unigram
>>> gen = Unigram()
>>> gen.gen_sentence("แมว")
'แมวเวลานะนั้น'
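Under the hood, unigram generation repeatedly samples words in proportion to their corpus frequency, discarding candidates whose probability falls below prob and, when duplicate is False, words that have already been used. A minimal stdlib-only sketch of that scheme, using invented English counts rather than the TNC data the class actually loads:

```python
import random

# Toy unigram counts standing in for a real corpus (hypothetical data).
counts = {"cat": 50, "sat": 30, "mat": 15, "hat": 4, "rare": 1}
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}

def gen_sentence(start_seq="", n=3, min_prob=0.001, output_str=True,
                 duplicate=False, seed=0):
    """Sample up to n words, skipping low-probability and repeated words."""
    rng = random.Random(seed)
    words = [start_seq] if start_seq else []
    candidates = {w: p for w, p in probs.items() if p >= min_prob}
    while len(words) < n and candidates:
        choices = list(candidates)
        word = rng.choices(choices, weights=[candidates[w] for w in choices])[0]
        if not duplicate and word in words:
            candidates.pop(word)  # never consider this word again
            continue
        words.append(word)
    # Thai is written without spaces, hence the "" join in the real class.
    return "".join(words) if output_str else words

print(gen_sentence("cat", n=3))
```

This is only an illustration of the sampling logic; the real class estimates probs from corpus counts it downloads on first use.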
class pythainlp.generate.Bigram(name: str = 'tnc')[source]

Text generator using a bigram language model

Parameters:

name (str) – corpus name:

  • tnc - Thai National Corpus (default)

__init__(name: str = 'tnc') None[source]
uni: dict[str, int]
bi: dict[tuple[str, str], int]
uni_keys: list[str]
bi_keys: list[tuple[str, str]]
words: list[str]
prob(t1: str, t2: str) float[source]

Compute bigram probability P(t2 | t1).

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

Returns:

probability value

Return type:

float

gen_sentence(start_seq: str = '', N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]

Generate a sentence using the bigram model.

Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • prob (float) – minimum word probability threshold

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

Union[list[str], str]

Example:
>>> from pythainlp.generate import Bigram
>>> gen = Bigram()
>>> gen.gen_sentence("แมว")
'แมวไม่ได้รับเชื้อมัน'
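prob(t1, t2) is the maximum-likelihood estimate P(t2 | t1) = count(t1, t2) / count(t1), and generation chains these conditionals word by word. A self-contained sketch with an invented corpus (the real class loads TNC counts instead):

```python
import random
from collections import Counter

# Hypothetical corpus as a flat word sequence.
corpus = ["the", "cat", "sat", "the", "cat", "ran", "the", "dog", "sat"]
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))

def prob(t1, t2):
    """P(t2 | t1) = count(t1, t2) / count(t1)."""
    return bi[(t1, t2)] / uni[t1] if uni[t1] else 0.0

def gen_sentence(start, n=4, seed=0):
    """Extend the sentence by sampling each next word given the last one."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < n:
        nexts = [w2 for (w1, w2) in bi if w1 == words[-1]]
        if not nexts:
            break
        words.append(rng.choices(nexts, weights=[prob(words[-1], w) for w in nexts])[0])
    return words

print(prob("the", "cat"))  # 2 of the 3 "the" tokens are followed by "cat"
```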
class pythainlp.generate.Trigram(name: str = 'tnc')[source]

Text generator using a trigram language model

Parameters:

name (str) – corpus name:

  • tnc - Thai National Corpus (default)

__init__(name: str = 'tnc') None[source]
uni: dict[str, int]
bi: dict[tuple[str, str], int]
ti: dict[tuple[str, str, str], int]
uni_keys: list[str]
bi_keys: list[tuple[str, str]]
ti_keys: list[tuple[str, str, str]]
words: list[str]
prob(t1: str, t2: str, t3: str) float[source]

Compute trigram probability P(t3 | t1, t2).

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

  • t3 (str) – third word

Returns:

probability value

Return type:

float

gen_sentence(start_seq: str | tuple[str, str] = '', N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]

Generate a sentence using the trigram model.

Parameters:
  • start_seq (Union[str, tuple[str, str]]) – word or bigram to begin sentence with

  • N (int) – number of words

  • prob (float) – minimum word probability threshold

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

Union[list[str], str]

Example:
>>> from pythainlp.generate import Trigram
>>> gen = Trigram()
>>> gen.gen_sentence()
'ยังทำตัวเป็นเซิร์ฟเวอร์คือ'
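A trigram model conditions each word on the previous two, which is why start_seq may also be a (w1, w2) tuple. A stdlib-only sketch of prob and tuple-seeded generation over an invented symbol corpus (not the class's actual TNC data):

```python
import random
from collections import Counter

corpus = ["a", "b", "c", "a", "b", "d", "a", "b", "c"]
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def prob(t1, t2, t3):
    """P(t3 | t1, t2) = count(t1, t2, t3) / count(t1, t2)."""
    return tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0

def gen_sentence(start_seq, n=5, seed=0):
    """start_seq is a (w1, w2) tuple, mirroring Trigram.gen_sentence."""
    rng = random.Random(seed)
    words = list(start_seq)
    while len(words) < n:
        ctx = (words[-2], words[-1])
        nexts = [w3 for (w1, w2, w3) in tri if (w1, w2) == ctx]
        if not nexts:
            break
        words.append(rng.choices(nexts, weights=[prob(ctx[0], ctx[1], w) for w in nexts])[0])
    return words

print(prob("a", "b", "c"))  # 2 of the 3 "a b" contexts continue with "c"
```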

WangChanGLM

class pythainlp.generate.wangchanglm.WangChanGLM[source]
device: str
torch_dtype: torch.dtype
model_path: str
model: PreTrainedModel
tokenizer: PreTrainedTokenizerBase
df: pd.DataFrame
exclude_ids: list[int]
__init__() None[source]
exclude_pattern: re.Pattern[str]
stop_token: str
PROMPT_DICT: dict[str, str]
is_exclude(text: str) bool[source]
load_model(model_path: str = 'pythainlp/wangchanglm-7.5B-sft-en-sharded', return_dict: bool = True, load_in_8bit: bool = False, device: str = 'cuda', torch_dtype: 'torch.dtype' | None = None, offload_folder: str = './', low_cpu_mem_usage: bool = True) None[source]

Load the pretrained model and tokenizer.

Parameters:
  • model_path (str) – Hugging Face model path

  • return_dict (bool) – whether the model's forward pass returns a dict-like output

  • load_in_8bit (bool) – load the model weights with 8-bit quantization

  • device (str) – device to run the model on (cpu, cuda, or other)

  • torch_dtype (Optional[torch.dtype]) – torch data type for the model weights

  • offload_folder (str) – folder to offload model weights to when they do not fit in memory

  • low_cpu_mem_usage (bool) – reduce CPU memory usage while loading the model

gen_instruct(text: str, max_new_tokens: int = 512, top_p: float = 0.95, temperature: float = 0.9, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1.0, thai_only: bool = True, skip_special_tokens: bool = True) str[source]

Generate text from an instruction-style prompt.

Parameters:
  • text (str) – the fully formatted prompt text

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – nucleus (top-p) sampling threshold

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that may not be repeated

  • typical_p (float) – probability mass for typical decoding

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – remove special tokens from the output

Returns:

the generated answer

Return type:

str

instruct_generate(instruct: str, context: str = '', max_new_tokens: int = 512, temperature: float = 0.9, top_p: float = 0.95, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1, thai_only: bool = True, skip_special_tokens: bool = True) str[source]

Generate text from an instruction, with optional context.

Parameters:
  • instruct (str) – the instruction text

  • context (str) – additional context (optional, default is an empty string)

  • max_new_tokens (int) – maximum number of new tokens to generate

  • temperature (float) – sampling temperature

  • top_p (float) – nucleus (top-p) sampling threshold

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that may not be repeated

  • typical_p (float) – probability mass for typical decoding

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – remove special tokens from the output

Returns:

the generated answer

Return type:

str

Example:
>>> from pythainlp.generate.wangchanglm import WangChanGLM
>>> import torch
>>> model = WangChanGLM()
>>> model.load_model(device="cpu", torch_dtype=torch.bfloat16)
>>> print(model.instruct_generate(instruct="ขอวิธีลดน้ำหนัก"))
ลดน้ําหนักให้ได้ผล ต้องทําอย่างค่อยเป็นค่อยไป
ปรับเปลี่ยนพฤติกรรมการกินอาหาร
ออกกําลังกายอย่างสม่ําเสมอ
และพักผ่อนให้เพียงพอ
ที่สําคัญควรหลีกเลี่ยงอาหารที่มีแคลอรี่สูง
เช่น อาหารทอด อาหารมัน อาหารที่มีน้ําตาลสูง
และเครื่องดื่มแอลกอฮอล์
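The temperature, top_k, and top_p arguments are standard decoding filters applied to the model's next-token distribution: temperature rescales the logits, top-k keeps only the k most probable tokens, and top-p (nucleus sampling) keeps the smallest set whose cumulative probability reaches p. A stdlib-only sketch of the filtering step on toy logits; this is an illustration of the concepts, not WangChanGLM's actual implementation, which is handled by the underlying transformers generation call:

```python
import math

def sample_filter(logits, temperature=0.9, top_k=50, top_p=0.95):
    """Apply temperature, then top-k, then top-p to a list of logits.

    Returns a renormalized {token_id: probability} dict of kept tokens.
    """
    # Temperature: divide logits before the softmax; <1 sharpens, >1 flattens.
    weights = [math.exp(l / temperature) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-k: keep the k most probable token ids, most probable first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    # Top-p: walk down the sorted list until the cumulative mass reaches p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

dist = sample_filter([3.0, 2.0, 0.5, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(dist)
```

A real decoder would then sample one token id from the returned distribution and repeat until max_new_tokens is reached or a stop token appears.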

Usage

Choose the generator class or function for the model you want, initialize it with appropriate parameters, and call its generation methods. Generated text can be used for chatbots, content generation, or data augmentation.

Example

::

    from pythainlp.generate import Unigram

    unigram = Unigram()
    sentence = unigram.gen_sentence("สวัสดีครับ")
    print(sentence)