pythainlp.generate

The pythainlp.generate module provides classes and functions for generating Thai text using n-gram and neural language models.

N-gram generators

class pythainlp.generate.Unigram(name: str = 'tnc')[source]

Text generator using a unigram language model

Parameters:

name (str) – corpus name

  • tnc – Thai National Corpus (default)

  • ttc – Thai Textbook Corpus (TTC)

  • oscar – OSCAR Corpus

__init__(name: str = 'tnc') None[source]
counts: dict[str, int]
word: list[str]
n: int
prob: dict[str, float]
gen_sentence(start_seq: str = '', N: int = 3, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]
Parameters:
  • start_seq (str) – word to begin the sentence with

  • N (int) – number of words

  • prob (float) – probability threshold for candidate words

  • output_str (bool) – return the output as a string

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

list of words or a word string

Return type:

list[str], str

Example:

from pythainlp.generate import Unigram

gen = Unigram()

gen.gen_sentence("แมว")
# output: 'แมวเวลานะนั้น'
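The effect of these parameters can be sketched with a toy unigram generator (illustrative only, not the pythainlp implementation): candidate words whose unigram probability falls below the prob threshold are dropped, and duplicate=False skips words already used.

```python
import random

def toy_gen_sentence(counts, start_seq="", N=3, prob=0.001,
                     output_str=True, duplicate=False, seed=0):
    """Toy unigram generator mirroring gen_sentence's parameters."""
    rng = random.Random(seed)
    total = sum(counts.values())
    # Keep only words whose unigram probability clears the threshold.
    candidates = [w for w, c in counts.items() if c / total >= prob]
    words = [start_seq] if start_seq else []
    while len(words) < N:
        w = rng.choice(candidates)
        if not duplicate and w in words:
            continue  # skip words already in the sentence
        words.append(w)
    return "".join(words) if output_str else words

# Hypothetical counts standing in for a corpus-derived frequency table.
counts = {"แมว": 5, "กิน": 3, "ปลา": 2, "นอน": 1}
print(toy_gen_sentence(counts, start_seq="แมว", N=3, output_str=False))
```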
class pythainlp.generate.Bigram(name: str = 'tnc')[source]

Text generator using a bigram language model

Parameters:

name (str) – corpus name

  • tnc – Thai National Corpus (default)

__init__(name: str = 'tnc') None[source]
uni: dict[str, int]
bi: dict[tuple[str, str], int]
uni_keys: list[str]
bi_keys: list[tuple[str, str]]
words: list[str]
prob(t1: str, t2: str) float[source]

Probability of the bigram (t1, t2)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

Returns:

probability value

Return type:

float
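This probability is presumably the maximum-likelihood estimate from the corpus counts. With hypothetical toy counts standing in for the uni and bi attributes above, it can be sketched as:

```python
# Toy counts standing in for the corpus-derived uni/bi dictionaries.
uni = {"แมว": 10, "กิน": 6}
bi = {("แมว", "กิน"): 4, ("กิน", "ปลา"): 3}

def toy_prob(t1, t2):
    """MLE bigram probability P(t2 | t1) = count(t1, t2) / count(t1)."""
    return bi.get((t1, t2), 0) / uni[t1]

print(toy_prob("แมว", "กิน"))  # 4 / 10 = 0.4
```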

gen_sentence(start_seq: str = '', N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]
Parameters:
  • start_seq (str) – word to begin the sentence with

  • N (int) – number of words

  • prob (float) – probability threshold for candidate words

  • output_str (bool) – return the output as a string

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

list of words or a word string

Return type:

list[str], str

Example:

from pythainlp.generate import Bigram

gen = Bigram()

gen.gen_sentence("แมว")
# output: 'แมวไม่ได้รับเชื้อมัน'
class pythainlp.generate.Trigram(name: str = 'tnc')[source]

Text generator using a trigram language model

Parameters:

name (str) – corpus name

  • tnc – Thai National Corpus (default)

__init__(name: str = 'tnc') None[source]
uni: dict[str, int]
bi: dict[tuple[str, str], int]
ti: dict[tuple[str, str, str], int]
uni_keys: list[str]
bi_keys: list[tuple[str, str]]
ti_keys: list[tuple[str, str, str]]
words: list[str]
prob(t1: str, t2: str, t3: str) float[source]

Probability of the trigram (t1, t2, t3)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

  • t3 (str) – third word

Returns:

probability value

Return type:

float

gen_sentence(start_seq: str | tuple[str, str] = '', N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) list[str] | str[source]
Parameters:
  • start_seq (str | tuple[str, str]) – word, or tuple of two words, to begin the sentence with

  • N (int) – number of words

  • prob (float) – probability threshold for candidate words

  • output_str (bool) – return the output as a string

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

list of words or a word string

Return type:

list[str], str

Example:

from pythainlp.generate import Trigram

gen = Trigram()

gen.gen_sentence()
# output: 'ยังทำตัวเป็นเซิร์ฟเวอร์คือ'
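The chaining idea behind a trigram generator can be sketched with toy counts: the last two words form the context, and a continuation is chosen from trigrams matching that context. The sketch below picks the most frequent continuation for determinism; the real generator's selection strategy may differ.

```python
# Toy trigram counts; keys are (w1, w2, w3) tuples, as in the ti dict above.
ti = {
    ("แมว", "กิน", "ปลา"): 3,
    ("กิน", "ปลา", "ทอด"): 2,
    ("กิน", "ปลา", "ดิบ"): 1,
}

def toy_trigram_gen(start, N=4):
    """Greedily extend (w1, w2) with the most frequent trigram continuation."""
    words = list(start)
    while len(words) < N:
        context = tuple(words[-2:])
        options = {k[2]: v for k, v in ti.items() if k[:2] == context}
        if not options:
            break  # no known continuation for this context
        words.append(max(options, key=options.get))
    return "".join(words)

print(toy_trigram_gen(("แมว", "กิน"), N=4))
# output: 'แมวกินปลาทอด'
```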

WangChanGLM

class pythainlp.generate.wangchanglm.WangChanGLM[source]
device: str
torch_dtype: torch.dtype
model_path: str
model: PreTrainedModel
tokenizer: PreTrainedTokenizerBase
df: pd.DataFrame
exclude_ids: list[int]
__init__() None[source]
exclude_pattern: re.Pattern
stop_token: str
PROMPT_DICT: dict[str, str]
is_exclude(text: str) bool[source]
load_model(model_path: str = 'pythainlp/wangchanglm-7.5B-sft-en-sharded', return_dict: bool = True, load_in_8bit: bool = False, device: str = 'cuda', torch_dtype: 'torch.dtype' | None = None, offload_folder: str = './', low_cpu_mem_usage: bool = True) None[source]

Load model

Parameters:
  • model_path (str) – model path

  • return_dict (bool) – whether the model returns outputs as a dict

  • load_in_8bit (bool) – load the model in 8-bit precision

  • device (str) – device (cpu, cuda, or other)

  • torch_dtype (Optional[torch.dtype]) – torch dtype for the model weights

  • offload_folder (str) – folder for offloading model weights

  • low_cpu_mem_usage (bool) – reduce CPU memory usage while loading

gen_instruct(text: str, max_new_tokens: int = 512, top_p: float = 0.95, temperature: float = 0.9, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1.0, thai_only: bool = True, skip_special_tokens: bool = True) str[source]

Generate text from an instruction

Parameters:
  • text (str) – input text

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – nucleus (top-p) sampling probability

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that may not repeat

  • typical_p (float) – typical sampling probability

  • thai_only (bool) – keep Thai text only in the output

  • skip_special_tokens (bool) – skip special tokens in the output

Returns:

the generated answer

Return type:

str

instruct_generate(instruct: str, context: str = '', max_new_tokens: int = 512, temperature: float = 0.9, top_p: float = 0.95, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1, thai_only: bool = True, skip_special_tokens: bool = True) str[source]

Generate text from an instruction

Parameters:
  • instruct (str) – instruction text

  • context (str) – context (optional; default is an empty string)

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – nucleus (top-p) sampling probability

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that may not repeat

  • typical_p (float) – typical sampling probability

  • thai_only (bool) – keep Thai text only in the output

  • skip_special_tokens (bool) – skip special tokens in the output

Returns:

the generated answer

Return type:

str

Example:

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

model.load_model(device="cpu", torch_dtype=torch.bfloat16)

print(model.instruct_generate(instruct="ขอวิธีลดน้ำหนัก"))
# output: ลดน้ําหนักให้ได้ผล ต้องทําอย่างค่อยเป็นค่อยไป
# ปรับเปลี่ยนพฤติกรรมการกินอาหาร
# ออกกําลังกายอย่างสม่ําเสมอ
# และพักผ่อนให้เพียงพอ
# ที่สําคัญควรหลีกเลี่ยงอาหารที่มีแคลอรี่สูง
# เช่น อาหารทอด อาหารมัน อาหารที่มีน้ําตาลสูง
# และเครื่องดื่มแอลกอฮอล์
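instruct_generate presumably fills one of the templates in PROMPT_DICT depending on whether a context string is supplied, before passing the prompt to the model. The template strings below are purely hypothetical stand-ins used to illustrate that with-context/no-context branching; the real templates in the class differ.

```python
# Hypothetical stand-ins for PROMPT_DICT; the real template strings differ.
PROMPT_DICT = {
    "prompt_no_input": "<instruction>: {instruct}\n<answer>:",
    "prompt_input": "<context>: {context}\n<instruction>: {instruct}\n<answer>:",
}

def build_prompt(instruct, context=""):
    """Pick the with-context or no-context template and fill it in."""
    key = "prompt_input" if context else "prompt_no_input"
    return PROMPT_DICT[key].format(instruct=instruct, context=context)

print(build_prompt("ขอวิธีลดน้ำหนัก"))
```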

Usage

Choose the generator class or function for the model you want, initialize it with appropriate parameters, and call its generation methods. Generated text can be used for chatbots, content generation, or data augmentation.

Example

from pythainlp.generate import Unigram

unigram = Unigram()
sentence = unigram.gen_sentence("สวัสดีครับ")
print(sentence)