pythainlp.wangchanberta

The pythainlp.wangchanberta module is built on WangchanBERTa, specifically the wangchanberta-base-att-spm-uncased model described in the paper by Lowphansirikul et al. [1].

The module uses this base model for several Thai natural language processing tasks: named entity recognition, part-of-speech tagging, and subword tokenization.

If you intend to fine-tune the model or explore its capabilities further, please refer to the thai2transformers repository.

Speed Benchmark

Function                        Named Entity Recognition   Part of Speech
PyThaiNLP basic function        89.7 ms                    312 ms
pythainlp.wangchanberta (CPU)   9.64 s                     9.65 s
pythainlp.wangchanberta (GPU)   8.02 s                     8 s
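The page does not state how the timings above were measured. As a rough sketch only, a comparable wall-clock measurement can be taken with the standard library. The helper name best_time is hypothetical, and str.split stands in for a real tagging call so the snippet runs without downloading a model.

```python
import time
from typing import Any, Callable


def best_time(fn: Callable[..., Any], *args: Any, repeat: int = 3) -> float:
    """Return the fastest wall-clock time, in seconds, over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload (assumption): replace str.split with the tagging
# function you want to measure, for example a get_ner method.
elapsed = best_time(str.split, "ทดสอบ ระบบ")
print(f"{elapsed * 1000:.3f} ms")
```

Taking the minimum over a few repeats reduces noise from one-off costs such as model loading on the first call.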

For a comprehensive performance benchmark, the following notebooks are available:

Modules

class pythainlp.wangchanberta.NamedEntityRecognition(model: str = 'pythainlp/thainer-corpus-v2-base-model')

The NamedEntityRecognition class is a fundamental component for identifying named entities in Thai text. It allows you to extract entities such as names, locations, and organizations from text data.

__init__(model: str = 'pythainlp/thainer-corpus-v2-base-model') → None

Creates a tagger that labels named entities in text in IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:

model (str) – the model to load, built on pretrained WangchanBERTa

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

Tags named entities in text in IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • text (str) – text in Thai to be tagged

  • tag (bool) – output HTML-like tags.

Returns:

if tag is True, the text with HTML-like tags marking the named entities; otherwise, a list of tuples pairing each tokenized word with its NER tag

Return type:

Union[list[tuple[str, str]], str]
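The list of (word, tag) tuples described above can be post-processed into whole entities. The following is a minimal sketch, not part of PyThaiNLP: the helper group_iob is hypothetical, the sample is hand-written rather than real model output, and the label names PERSON and LOCATION are assumed for illustration. The function tag_text shows the documented call but is not executed here, since it requires pythainlp and a model download.

```python
from typing import List, Tuple


def tag_text(text: str):
    """Documented usage; needs pythainlp installed and downloads a model."""
    from pythainlp.wangchanberta import NamedEntityRecognition

    ner = NamedEntityRecognition()
    return ner.get_ner(text)


def group_iob(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label = ""
    for word, tag in tagged:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and words and name == label:
            # Continuation of the entity that is currently open.
            words.append(word)
            continue
        if words:
            # Any other tag closes the open entity.
            entities.append(("".join(words), label))
            words, label = [], ""
        if prefix in ("B", "I") and name:
            # "B-" opens a new entity; a stray "I-" is treated the same way.
            words, label = [word], name
    if words:
        entities.append(("".join(words), label))
    return entities


# Hand-written sample in the documented (word, tag) shape (assumed labels).
sample_iob = [
    ("นาย", "B-PERSON"),
    ("สมชาย", "I-PERSON"),
    ("อยู่", "O"),
    ("ที่", "O"),
    ("กรุงเทพ", "B-LOCATION"),
]
print(group_iob(sample_iob))
```

Tokens are joined without spaces because Thai text does not separate words with spaces.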

class pythainlp.wangchanberta.ThaiNameTagger(dataset_name: str = 'thainer', grouped_entities: bool = True)

The ThaiNameTagger class tags named entities in Thai text. This supports tasks such as entity recognition, information extraction, and text classification.

__init__(dataset_name: str = 'thainer', grouped_entities: bool = True)

Creates a tagger that labels named entities in text in IOB format.

Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • dataset_name (str) – name of the dataset; supported value:

    • thainer - ThaiNER dataset

  • grouped_entities (bool) – whether to group the tokens that belong to the same entity

get_ner(text: str, pos: bool = False, tag: bool = False) → List[Tuple[str, str]] | str

Tags named entities in text in IOB format. Powered by WangchanBERTa from the VISTEC-depa AI Research Institute of Thailand.

Parameters:
  • text (str) – text in Thai to be tagged

  • tag (bool) – output HTML-like tags.

Returns:

if tag is True, the text with HTML-like tags marking the named entities; otherwise, a list of tuples pairing each tokenized word with its NER tag

Return type:

Union[list[tuple[str, str]], str]
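When tag is True, the result is a string rather than a list, so entities have to be read back out of the markup. The sketch below assumes the HTML-like output has the form <LABEL>text</LABEL>; that format, the helper extract_tagged_entities, and the sample string are assumptions for illustration, not confirmed by this page. The function tag_as_markup shows the documented call and is not executed here.

```python
import re
from typing import List, Tuple


def tag_as_markup(text: str):
    """Documented usage; needs pythainlp installed and downloads a model."""
    from pythainlp.wangchanberta import ThaiNameTagger

    tagger = ThaiNameTagger()
    return tagger.get_ner(text, tag=True)


# Assumed markup shape: an opening <LABEL>, the entity text, a matching </LABEL>.
_TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>(.*?)</\1>", re.DOTALL)


def extract_tagged_entities(tagged_text: str) -> List[Tuple[str, str]]:
    """Return (entity_text, label) pairs found in HTML-like tagged text."""
    return [(m.group(2), m.group(1)) for m in _TAG_PATTERN.finditer(tagged_text)]


# Hand-written sample in the assumed format, not real model output.
sample_tagged = "<PERSON>นายสมชาย</PERSON>อยู่ที่<LOCATION>กรุงเทพ</LOCATION>"
print(extract_tagged_entities(sample_tagged))
```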

pythainlp.wangchanberta.segment(text: str) → List[str]

Tokenizes text into subwords using the SentencePiece tokenizer from the WangchanBERTa model.

Parameters:

text (str) – text to be tokenized

Returns:

list of subwords

Return type:

list[str]

The segment function is a subword tokenization tool that breaks down text into subword units, offering a foundation for further text processing and analysis.
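One simple use of a subword tokenizer is measuring how long each text is in subword units, for example before feeding it to a model with a length limit. The sketch below is illustrative: subword_counts and char_tokenize are hypothetical helpers, and the character-level stand-in tokenizer is used only so the snippet runs without a model download. Any callable with the same shape as segment (str in, list of str out) can be passed instead, such as the one returned by load_segment.

```python
from typing import Callable, Dict, Iterable, List

Tokenizer = Callable[[str], List[str]]


def load_segment() -> Tokenizer:
    """Return the documented tokenizer; needs pythainlp and a model download."""
    from pythainlp.wangchanberta import segment

    return segment


def subword_counts(texts: Iterable[str], tokenize: Tokenizer) -> Dict[str, int]:
    """Map each text to the number of subword tokens it produces."""
    return {text: len(tokenize(text)) for text in texts}


def char_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per character."""
    return list(text)


counts = subword_counts(["abc", "de"], char_tokenize)
print(counts)
```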

References