pythainlp.wangchanberta

The pythainlp.wangchanberta module is built upon the WangchanBERTa base model, specifically the wangchanberta-base-att-spm-uncased model, as detailed in the paper by Lowphansirikul et al. [1].

This base model is utilized for various natural language processing tasks in the Thai language, including named entity recognition, part-of-speech tagging, and subword tokenization.

If you intend to fine-tune the model or explore its capabilities further, please refer to the thai2transformers repository.

Speed Benchmark

Function	Named Entity Recognition	Part of Speech
PyThaiNLP basic function	89.7 ms	312 ms
pythainlp.wangchanberta (CPU)	9.64 s	9.65 s
pythainlp.wangchanberta (GPU)	8.02 s	8 s
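
Timings like those in the table can be gathered with a small wall-clock harness. The following is only a sketch of one way to do that, not the method used for the figures above; `best_time` is a hypothetical helper name, and the call being timed here is a trivial stand-in.

```python
import time
from typing import Any, Callable


def best_time(fn: Callable[..., Any], *args: Any, repeat: int = 3) -> float:
    """Return the fastest wall-clock time, in seconds, over `repeat` calls.

    Hypothetical helper for illustration; it is not part of PyThaiNLP.
    """
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs without any model download.
elapsed = best_time(str.split, "a b c")
print(f"best of 3: {elapsed:.6f} s")
```

To time the tagger itself, the same helper could be given the `get_ner` method of a `NamedEntityRecognition` instance and a Thai string. Creating the instance outside the timed call keeps model loading out of the measurement.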

For a comprehensive performance benchmark, the following notebooks are available:

- PyThaiNLP basic function and pythainlp.wangchanberta CPU at Google Colab
- pythainlp.wangchanberta GPU

Modules

class pythainlp.wangchanberta.NamedEntityRecognition(model: str = 'pythainlp/thainer-corpus-v2-base-model')

The NamedEntityRecognition class is a fundamental component for identifying named entities in Thai text. It allows you to extract entities such as names, locations, and organizations from text data.

__init__(model: str = 'pythainlp/thainer-corpus-v2-base-model') → None

This function tags named entities in text in IOB format.

Powered by wangchanberta from VISTEC-depa AI Research Institute of Thailand.

Parameters:

model (str) – the pretrained wangchanberta-based model to use

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

This function tags named entities in text in IOB format. Powered by wangchanberta from VISTEC-depa AI Research Institute of Thailand

Parameters:

text (str) – text in Thai to be tagged
tag (bool) – output HTML-like tags.

Returns:

if tag is True, a string in which entities are wrapped in HTML-like tags; otherwise, a list of tuples, each pairing a tokenized word (or word group) with its NER tag

Return type:

Union[list[tuple[str, str]], str]
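
Because get_ner returns tokens with IOB tags by default, a common next step is to merge consecutive B-/I- tokens into whole entities. The sketch below shows one way to do that. `collect_entities` and `tag_with_wangchanberta` are hypothetical helper names, and the sample list is hand-written for illustration rather than real model output; the label names in it are assumptions.

```python
from typing import List, Tuple


def collect_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = ""
    for token, tag in tagged:
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: flush the previous one, if any.
            if current_tokens:
                entities.append(("".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                entities.append(("".join(current_tokens), current_label))
            current_tokens, current_label = [], ""
    if current_tokens:
        entities.append(("".join(current_tokens), current_label))
    return entities


def tag_with_wangchanberta(text: str) -> List[Tuple[str, str]]:
    """Tag text with the default model (not called in this sketch).

    Requires pythainlp with its wangchanberta dependencies installed;
    the model is downloaded on first use.
    """
    from pythainlp.wangchanberta import NamedEntityRecognition

    ner = NamedEntityRecognition()
    return ner.get_ner(text)


# Hand-written sample in the (token, IOB tag) shape described above.
sample = [
    ("นาย", "O"),
    ("สมชาย", "B-PERSON"),
    (" ", "O"),
    ("อยู่", "O"),
    ("ที่", "O"),
    ("กรุงเทพ", "B-LOCATION"),
    ("มหานคร", "I-LOCATION"),
]
entities = collect_entities(sample)
print(entities)
```

With the library installed, `collect_entities(tag_with_wangchanberta(text))` would apply the same grouping to real tagger output.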

class pythainlp.wangchanberta.ThaiNameTagger(dataset_name: str = 'thainer', grouped_entities: bool = True)

The ThaiNameTagger class tags named entities in Thai text. It is useful for tasks such as entity recognition, information extraction, and text classification.

__init__(dataset_name: str = 'thainer', grouped_entities: bool = True)

This function tags named entities in text in IOB format.

Powered by wangchanberta from VISTEC-depa AI Research Institute of Thailand

Parameters:

dataset_name (str) – name of the dataset:
- thainer - ThaiNER dataset
grouped_entities (bool) – whether to group the tokens of an entity together in the output

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

This function tags named entities in text in IOB format. Powered by wangchanberta from VISTEC-depa AI Research Institute of Thailand

Parameters:

text (str) – text in Thai to be tagged
tag (bool) – output HTML-like tags.

Returns:

if tag is True, a string in which entities are wrapped in HTML-like tags; otherwise, a list of tuples, each pairing a tokenized word (or word group) with its NER tag

Return type:

Union[list[tuple[str, str]], str]
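
To make the two output shapes concrete, the sketch below converts a list of (token, IOB tag) tuples into a string with HTML-like tags. It assumes the tagged form looks like `<LABEL>text</LABEL>`; the exact markup produced by the library may differ. `to_html_like` and `tag_as_markup` are hypothetical helper names, and the sample data is hand-written rather than real model output.

```python
from typing import List, Tuple


def to_html_like(tagged: List[Tuple[str, str]]) -> str:
    """Render (token, IOB tag) pairs as text with <LABEL>...</LABEL> spans."""
    parts: List[str] = []
    open_label = ""
    for token, tag in tagged:
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == open_label
        if open_label and not continues:
            # The open entity ends before this token.
            parts.append(f"</{open_label}>")
            open_label = ""
        if prefix in ("B", "I") and not continues:
            # This token starts a new entity.
            parts.append(f"<{label}>")
            open_label = label
        parts.append(token)
    if open_label:
        parts.append(f"</{open_label}>")
    return "".join(parts)


def tag_as_markup(text: str) -> str:
    """Ask the tagger for HTML-like output directly (not called here).

    Requires pythainlp with its wangchanberta dependencies installed.
    """
    from pythainlp.wangchanberta import ThaiNameTagger

    tagger = ThaiNameTagger(dataset_name="thainer")
    return tagger.get_ner(text, tag=True)


# Hand-written sample; the label names are illustrative assumptions.
sample = [
    ("สมชาย", "B-PERSON"),
    ("อยู่", "O"),
    ("กรุงเทพ", "B-LOCATION"),
    ("มหานคร", "I-LOCATION"),
]
markup = to_html_like(sample)
print(markup)
```

In practice, passing tag=True to get_ner avoids the manual conversion; the helper is only meant to show how the list form and the tagged form relate.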

pythainlp.wangchanberta.segment(text: str) → List[str]

Subword tokenization using the SentencePiece tokenizer from the wangchanberta model.

Parameters:

text (str) – text to be tokenized

Returns:

list of subwords

Return type:

list[str]

The segment function is a subword tokenization tool that breaks down text into subword units, offering a foundation for further text processing and analysis.
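
As a rough illustration of working with subword output, the sketch below joins a list of pieces back into text. It assumes the pieces follow the usual SentencePiece convention of marking a preceding space with "▁" (U+2581); whether segment emits exactly this form is an assumption. `join_pieces` and `subwords_of` are hypothetical helper names, and the sample pieces are hand-written rather than real tokenizer output.

```python
from typing import List


def join_pieces(pieces: List[str]) -> str:
    """Rebuild text from subword pieces, treating "▁" as a space marker."""
    return "".join(pieces).replace("\u2581", " ").strip()


def subwords_of(text: str) -> List[str]:
    """Tokenize with the library itself (not called in this sketch).

    Requires pythainlp with its wangchanberta dependencies installed.
    """
    from pythainlp.wangchanberta import segment

    return segment(text)


# Hand-written pieces for illustration only.
pieces = ["\u2581", "ทด", "สอบ", "ระบบ"]
restored = join_pieces(pieces)
print(restored)
```

With the library installed, `join_pieces(subwords_of(text))` gives a quick check of how the tokenizer split a given input.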
