pythainlp.generate

The pythainlp.generate module provides classes and functions for generating Thai text using n-gram and neural language models.

N-gram generators

class pythainlp.generate.Unigram(name: str = 'tnc')[source]

Text generator using a unigram language model

Parameters:

name (str) – corpus name:

  • tnc - Thai National Corpus (default)

  • ttc - Thai Textbook Corpus (TTC)

  • oscar - OSCAR Corpus

__init__(name: str = 'tnc') None[source]
counts: dict[str, int]
word: list[str]
n: int
prob: dict[str, float]
gen_sentence(start_seq: str = '', N: int = 3, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]

Generate a sentence using the unigram model.

Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • prob (float) – minimum word probability threshold

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

Union[list[str], str]

Example:
>>> from pythainlp.generate import Unigram
>>> gen = Unigram()
>>> gen.gen_sentence("แมว")
'แมวเวลานะนั้น'
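Under the hood, unigram generation repeatedly samples words in proportion to their corpus frequency, discarding candidates whose probability falls below prob and, when duplicate is False, words that have already been used. A minimal stdlib-only sketch of that scheme, using invented English counts rather than the TNC data the class actually loads:

```python
import random

# Toy unigram counts standing in for a real corpus (hypothetical data).
counts = {"cat": 50, "sat": 30, "mat": 15, "hat": 4, "rare": 1}
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}

def gen_sentence(start_seq="", n=3, min_prob=0.001, output_str=True,
                 duplicate=False, seed=0):
    """Sample up to n words, skipping low-probability and repeated words."""
    rng = random.Random(seed)
    words = [start_seq] if start_seq else []
    candidates = {w: p for w, p in probs.items() if p >= min_prob}
    while len(words) < n and candidates:
        choices = list(candidates)
        word = rng.choices(choices, weights=[candidates[w] for w in choices])[0]
        if not duplicate and word in words:
            candidates.pop(word)  # never consider this word again
            continue
        words.append(word)
    # Thai is written without spaces, hence the "" join in the real class.
    return "".join(words) if output_str else words

print(gen_sentence("cat", n=3))
```

This is only an illustration of the sampling logic; the real class estimates probs from corpus counts it downloads on first use.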
class pythainlp.generate.Bigram(name: str = 'tnc')[source]

Text generator using a bigram language model

Parameters:

name (str) – corpus name:

  • tnc - Thai National Corpus (default)

__init__(name: str = 'tnc') None[source]
uni: dict[str, int]
bi: dict[tuple[str, str], int]
uni_keys: list[str]
bi_keys: list[tuple[str, str]]
words: list[str]
prob(t1: str, t2: str) float[source]

Compute bigram probability P(t2 | t1).

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

Returns:

probability value

Return type:

float

gen_sentence(start_seq: str = '', N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]

Generate a sentence using the bigram model.

Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • prob (float) – minimum word probability threshold

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

Union[list[str], str]

Example:
>>> from pythainlp.generate import Bigram
>>> gen = Bigram()
>>> gen.gen_sentence("แมว")
'แมวไม่ได้รับเชื้อมัน'
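prob(t1, t2) is the maximum-likelihood estimate P(t2 | t1) = count(t1, t2) / count(t1), and generation chains these conditionals word by word. A self-contained sketch with an invented corpus (the real class loads TNC counts instead):

```python
import random
from collections import Counter

# Hypothetical corpus as a flat word sequence.
corpus = ["the", "cat", "sat", "the", "cat", "ran", "the", "dog", "sat"]
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))

def prob(t1, t2):
    """P(t2 | t1) = count(t1, t2) / count(t1)."""
    return bi[(t1, t2)] / uni[t1] if uni[t1] else 0.0

def gen_sentence(start, n=4, seed=0):
    """Extend the sentence by sampling each next word given the last one."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < n:
        nexts = [w2 for (w1, w2) in bi if w1 == words[-1]]
        if not nexts:
            break
        words.append(rng.choices(nexts, weights=[prob(words[-1], w) for w in nexts])[0])
    return words

print(prob("the", "cat"))  # 2 of the 3 "the" tokens are followed by "cat"
```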
class pythainlp.generate.Trigram(name: str = 'tnc')[source]

Text generator using a trigram language model

Parameters:

name (str) – corpus name:

  • tnc - Thai National Corpus (default)

__init__(name: str = 'tnc') None[source]
uni: dict[str, int]
bi: dict[tuple[str, str], int]
ti: dict[tuple[str, str, str], int]
uni_keys: list[str]
bi_keys: list[tuple[str, str]]
ti_keys: list[tuple[str, str, str]]
words: list[str]
prob(t1: str, t2: str, t3: str) float[source]

Compute trigram probability P(t3 | t1, t2).

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

  • t3 (str) – third word

Returns:

probability value

Return type:

float

gen_sentence(start_seq: str | tuple[str, str] = '', N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]

Generate a sentence using the trigram model.

Parameters:
  • start_seq (Union[str, tuple[str, str]]) – word or bigram to begin sentence with

  • N (int) – number of words

  • prob (float) – minimum word probability threshold

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

Union[list[str], str]

Example:
>>> from pythainlp.generate import Trigram
>>> gen = Trigram()
>>> gen.gen_sentence()
'ยังทำตัวเป็นเซิร์ฟเวอร์คือ'
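A trigram model conditions each word on the previous two, which is why start_seq may also be a (w1, w2) tuple. A stdlib-only sketch of prob and tuple-seeded generation over an invented symbol corpus (not the class's actual TNC data):

```python
import random
from collections import Counter

corpus = ["a", "b", "c", "a", "b", "d", "a", "b", "c"]
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def prob(t1, t2, t3):
    """P(t3 | t1, t2) = count(t1, t2, t3) / count(t1, t2)."""
    return tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0

def gen_sentence(start_seq, n=5, seed=0):
    """start_seq is a (w1, w2) tuple, mirroring Trigram.gen_sentence."""
    rng = random.Random(seed)
    words = list(start_seq)
    while len(words) < n:
        ctx = (words[-2], words[-1])
        nexts = [w3 for (w1, w2, w3) in tri if (w1, w2) == ctx]
        if not nexts:
            break
        words.append(rng.choices(nexts, weights=[prob(ctx[0], ctx[1], w) for w in nexts])[0])
    return words

print(prob("a", "b", "c"))  # 2 of the 3 "a b" contexts continue with "c"
```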

WangChanGLM

class pythainlp.generate.wangchanglm.WangChanGLM[source]
device: str
torch_dtype: torch.dtype
model_path: str
model: PreTrainedModel
tokenizer: PreTrainedTokenizerBase
df: pd.DataFrame
exclude_ids: list[int]
__init__() None[source]
exclude_pattern: re.Pattern[str]
stop_token: str
PROMPT_DICT: dict[str, str]
is_exclude(text: str) bool[source]
load_model(model_path: str = 'pythainlp/wangchanglm-7.5B-sft-en-sharded', return_dict: bool = True, load_in_8bit: bool = False, device: str = 'cuda', torch_dtype: 'torch.dtype' | None = None, offload_folder: str = './', low_cpu_mem_usage: bool = True) None[source]

Load the pretrained model and tokenizer.

Parameters:
  • model_path (str) – Hugging Face model path

  • return_dict (bool) – whether the model's forward pass returns a dict-like output

  • load_in_8bit (bool) – load the model weights with 8-bit quantization

  • device (str) – device to run the model on (cpu, cuda, or other)

  • torch_dtype (Optional[torch.dtype]) – torch data type for the model weights

  • offload_folder (str) – folder to offload model weights to when they do not fit in memory

  • low_cpu_mem_usage (bool) – reduce CPU memory usage while loading the model

gen_instruct(text: str, max_new_tokens: int = 512, top_p: float = 0.95, temperature: float = 0.9, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1.0, thai_only: bool = True, skip_special_tokens: bool = True) str[source]

Generate text from an instruction-style prompt.

Parameters:
  • text (str) – the fully formatted prompt text

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – nucleus (top-p) sampling threshold

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that may not be repeated

  • typical_p (float) – probability mass for typical decoding

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – remove special tokens from the output

Returns:

the generated answer

Return type:

str

instruct_generate(instruct: str, context: str = '', max_new_tokens: int = 512, temperature: float = 0.9, top_p: float = 0.95, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1, thai_only: bool = True, skip_special_tokens: bool = True) str[source]

Generate text from an instruction, with optional context.

Parameters:
  • instruct (str) – the instruction text

  • context (str) – additional context (optional, default is an empty string)

  • max_new_tokens (int) – maximum number of new tokens to generate

  • temperature (float) – sampling temperature

  • top_p (float) – nucleus (top-p) sampling threshold

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that may not be repeated

  • typical_p (float) – probability mass for typical decoding

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – remove special tokens from the output

Returns:

the generated answer

Return type:

str

Example:
>>> from pythainlp.generate.wangchanglm import WangChanGLM
>>> import torch
>>> model = WangChanGLM()
>>> model.load_model(device="cpu", torch_dtype=torch.bfloat16)
>>> print(model.instruct_generate(instruct="ขอวิธีลดน้ำหนัก"))
ลดน้ําหนักให้ได้ผล ต้องทําอย่างค่อยเป็นค่อยไป
ปรับเปลี่ยนพฤติกรรมการกินอาหาร
ออกกําลังกายอย่างสม่ําเสมอ
และพักผ่อนให้เพียงพอ
ที่สําคัญควรหลีกเลี่ยงอาหารที่มีแคลอรี่สูง
เช่น อาหารทอด อาหารมัน อาหารที่มีน้ําตาลสูง
และเครื่องดื่มแอลกอฮอล์
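The temperature, top_k, and top_p arguments are standard decoding filters applied to the model's next-token distribution: temperature rescales the logits, top-k keeps only the k most probable tokens, and top-p (nucleus sampling) keeps the smallest set whose cumulative probability reaches p. A stdlib-only sketch of the filtering step on toy logits; this is an illustration of the concepts, not WangChanGLM's actual implementation, which is handled by the underlying transformers generation call:

```python
import math

def sample_filter(logits, temperature=0.9, top_k=50, top_p=0.95):
    """Apply temperature, then top-k, then top-p to a list of logits.

    Returns a renormalized {token_id: probability} dict of kept tokens.
    """
    # Temperature: divide logits before the softmax; <1 sharpens, >1 flattens.
    weights = [math.exp(l / temperature) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-k: keep the k most probable token ids, most probable first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    # Top-p: walk down the sorted list until the cumulative mass reaches p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

dist = sample_filter([3.0, 2.0, 0.5, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(dist)
```

A real decoder would then sample one token id from the returned distribution and repeat until max_new_tokens is reached or a stop token appears.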

Usage

Choose the generator class or function for the model you want, initialize it with appropriate parameters, and call its generation methods. Generated text can be used for chatbots, content generation, or data augmentation.

Example

::

    from pythainlp.generate import Unigram

    unigram = Unigram()
    sentence = unigram.gen_sentence("สวัสดีครับ")
    print(sentence)