pythainlp.wangchanberta

WangchanBERTa base model: wangchanberta-base-att-spm-uncased [1]

We use WangchanBERTa for Thai named-entity tagging, part-of-speech tagging, and subword tokenization.

To fine-tune the model, see https://github.com/vistec-AI/thai2transformers

Speed Benchmark

Function	Named Entity Recognition	Part of Speech
PyThaiNLP basic function	89.7 ms	312 ms
pythainlp.wangchanberta (CPU)	9.64 s	9.65 s
pythainlp.wangchanberta (GPU)	8.02 s	8 s
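
The table does not say how these timings were taken. As a rough sketch of one way to make a comparable measurement, the hypothetical helper `time_call` below reports the best wall-clock time over a few runs using `time.perf_counter`. The workload shown is only a placeholder standing in for a call such as `pos_tag(text)`.

```python
import time
from typing import Callable


def time_call(func: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder workload (an assumption for illustration); swap in
# e.g. `lambda: pos_tag(text)` once the model is installed.
elapsed = time_call(lambda: sum(range(10_000)))
print(f"{elapsed * 1000:.3f} ms")
```

Taking the best of several runs reduces noise from one-off costs such as a first-call model load, which may otherwise dominate a single measurement.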

Notebook:

Modules

class pythainlp.wangchanberta.ThaiNameTagger(dataset_name: str = 'thainer', grouped_entities: bool = True)

get_ner(text: str, tag: bool = False) → Union[List[Tuple[str, str]], str]

Tags named entities in text using the IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters

text (str) – Thai text to be tagged
tag (bool) – if True, return the output as a string with HTML-like tags (default is False)

Returns

a list of tuples, each pairing a tokenized word (or word group) with its NER tag. If the parameter tag is True, the result is instead returned as a string with HTML-like tags.

Return type

Union[list[tuple[str, str]], str]
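
As a minimal sketch of how the list output might be consumed, the helper below merges consecutive IOB-tagged tokens into whole entities. It assumes tags of the form `B-TYPE`, `I-TYPE`, and `O`; the function name `group_iob`, the sample tokens, and the tag names are hypothetical and are not real model output. In practice the input would come from `ThaiNameTagger().get_ner(text)`.

```python
from typing import List, Tuple


def group_iob(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    etype = ""
    for word, tag in tagged:
        if tag.startswith("B-"):
            # A new entity starts: flush any entity in progress.
            if words:
                entities.append(("".join(words), etype))
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and words and tag[2:] == etype:
            # Continuation of the current entity.
            words.append(word)
        else:
            # "O" or an I- tag with no matching B-: close the current entity.
            if words:
                entities.append(("".join(words), etype))
            words, etype = [], ""
    if words:
        entities.append(("".join(words), etype))
    return entities


# Hypothetical sample input for illustration only.
# Real input: tagged = ThaiNameTagger().get_ner(text)
sample = [
    ("นาย", "B-PERSON"),
    ("สมชาย", "I-PERSON"),
    (" ", "O"),
    ("ไป", "O"),
    ("เชียงใหม่", "B-LOCATION"),
]
print(group_iob(sample))
```

Tokens are joined without a separator because Thai words are normally written without spaces between them.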

pythainlp.wangchanberta.pos_tag(text: str, corpus: str = 'lst20', grouped_word: bool = False) → List[Tuple[str, str]]

Marks words with part-of-speech (POS) tags.

Parameters

text (str) – Thai text to be tagged
corpus (str) – the corpus the tagger was trained on:
- lst20 - an LST20 tagger (default)
grouped_word (bool) – whether to group tokens into words (default is False)

Returns

a list of tuples (word, POS tag)

Return type

list[tuple[str, str]]
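
Because the result is a plain list of (word, POS tag) tuples, it can be processed with standard Python tools. The sketch below counts how often each tag occurs; the helper name `tag_frequencies`, the sample words, and the tag labels are hypothetical and are not real model output. In practice the input would come from `pos_tag(text)`.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tag_frequencies(tagged: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count occurrences of each POS tag in a list of (word, tag) tuples."""
    return dict(Counter(tag for _, tag in tagged))


# Hypothetical sample input for illustration only.
# Real input: sample_tagged = pos_tag(text)
sample_tagged = [
    ("ฉัน", "PR"),
    ("กิน", "VV"),
    ("ข้าว", "NN"),
    ("และ", "CC"),
    ("ปลา", "NN"),
]
print(tag_frequencies(sample_tagged))
```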

pythainlp.wangchanberta.segment(text: str) → List[str]

Tokenizes text into subwords using the SentencePiece tokenizer from the WangchanBERTa model.

Parameters: text (str) – text to be tokenized
Returns: list of subwords
Return type: list[str]
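
A rough sketch of rejoining subwords into text follows. It assumes the pieces use the general SentencePiece convention of marking a preceding space with "▁" (U+2581); whether `segment` returns pieces in exactly this form is an assumption and is not confirmed by this page. The helper name `join_pieces` and the sample pieces are hypothetical.

```python
from typing import List


def join_pieces(pieces: List[str], space_marker: str = "\u2581") -> str:
    """Concatenate subword pieces, turning the space marker back into spaces."""
    return "".join(pieces).replace(space_marker, " ").strip()


# Hypothetical sample pieces for illustration only.
# Real input: pieces = segment(text)
pieces = ["\u2581การ", "ทดสอบ", "\u2581ระบบ"]
print(join_pieces(pieces))
```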

References

[1] Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. WangchanBERTa: Pretraining transformer-based Thai Language Models. arXiv:2101.09635 [cs] [Internet]. 2021 Jan 23 [cited 2021 Feb 27]; Available from: http://arxiv.org/abs/2101.09635