pythainlp.benchmarks

Introduction

The pythainlp.benchmarks module is a collection of utility functions for benchmarking Thai Natural Language Processing (NLP) tasks. It includes tools for word tokenization benchmarking and evaluation metrics for text generation and recognition tasks (BLEU, ROUGE, WER, and CER).

Tokenization

Word tokenization is a fundamental task in NLP, and it plays a crucial role in various applications, such as text analysis and language processing. The pythainlp.benchmarks module offers a set of functions to assist in the benchmarking and evaluation of word tokenization methods.

Quality Evaluation

The quality of word tokenization can significantly impact the accuracy of downstream NLP tasks. To assess the quality of word tokenization, the module provides a qualitative evaluation using various metrics and techniques.

Figure: Qualitative evaluation of word tokenization.

Tokenization Functions

pythainlp.benchmarks.word_tokenization.compute_stats(ref_sample: str, raw_sample: str) → dict[str, dict[str, int | str]]

Compute statistics for tokenization quality.

These statistics include:

Character-Level:

True Positive, False Positive, True Negative, False Negative, Precision, Recall, and F1

Word-Level:

Precision, Recall, and F1

Other:
  • Correct tokenization indicator: {0, 1} sequence indicating that the corresponding word is tokenized correctly.

Parameters:
  • ref_sample (str) – ground truth for samples

  • raw_sample (str) – tokenized sample that we want to evaluate

Returns:

metrics at character- and word-level and indicators of correctly tokenized words

Return type:

dict[str, dict[str, int | str]]

This function is used to compute various statistics and metrics related to word tokenization. It allows you to assess the performance of different tokenization methods.
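To make the character-level metrics concrete, here is an illustrative sketch (not PyThaiNLP's implementation) that computes them for pipe-delimited samples, assuming "|" marks word boundaries as in "ฉัน|กิน|ข้าว". Each character is labeled 1 if it starts a word and 0 otherwise, and the two label sequences are compared:

```python
# Illustrative sketch of character-level tokenization metrics.
# Assumption: samples use "|" as the word boundary marker.

def boundary_labels(sample: str) -> list[int]:
    """1 if the character starts a word, 0 otherwise."""
    labels = []
    start = True
    for ch in sample:
        if ch == "|":
            start = True
        else:
            labels.append(1 if start else 0)
            start = False
    return labels

def char_level_stats(ref_sample: str, raw_sample: str) -> dict:
    ref, hyp = boundary_labels(ref_sample), boundary_labels(raw_sample)
    tp = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    tn = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 0)
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

print(char_level_stats("ab|c", "a|bc"))
```

A perfectly matching sample yields precision, recall, and F1 of 1.0; each misplaced boundary shows up as a false positive or false negative.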

pythainlp.benchmarks.word_tokenization.benchmark(ref_samples: list[str], samples: list[str]) → pd.DataFrame

Performance benchmarking for samples.

Please see pythainlp.benchmarks.word_tokenization.compute_stats() for the computed metrics.

Parameters:
  • ref_samples (list[str]) – ground truth for samples

  • samples (list[str]) – samples that we want to evaluate

Returns:

DataFrame of shape len(samples) × len(metrics): one row per sample, one column per metric

Return type:

pandas.DataFrame

The benchmark function facilitates the benchmarking of word tokenization methods. It provides an organized framework for evaluating and comparing the effectiveness of different tokenization tools.

pythainlp.benchmarks.word_tokenization.preprocessing(txt: str, remove_space: bool = True) → str

Clean up text before performing evaluation.

Parameters:
  • txt (str) – text to be preprocessed

  • remove_space (bool) – whether to remove white space

Returns:

preprocessed text

Return type:

str

Preprocessing is a crucial step in NLP tasks. The preprocessing function assists in preparing text data for tokenization, which is essential for accurate and consistent benchmarking.
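A minimal sketch of the kind of clean-up such a function performs, assuming it normalizes whitespace; the exact rules are defined by PyThaiNLP itself:

```python
# Hypothetical whitespace normalization, mirroring the documented
# remove_space flag. Not PyThaiNLP's actual implementation.
import re

def simple_preprocess(txt: str, remove_space: bool = True) -> str:
    txt = txt.strip()
    if remove_space:
        txt = re.sub(r"\s+", "", txt)   # drop all whitespace
    else:
        txt = re.sub(r"\s+", " ", txt)  # collapse runs to a single space
    return txt

print(simple_preprocess("  สวัสดี   ครับ "))                      # "สวัสดีครับ"
print(simple_preprocess("  สวัสดี   ครับ ", remove_space=False))  # "สวัสดี ครับ"
```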

Evaluation Metrics

The module provides pure Python implementations of common evaluation metrics (BLEU and ROUGE) that automatically handle Thai text tokenization. These metrics are essential for evaluating machine translation, text summarization, and other text generation tasks.

BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-translated text. It compares the generated text against one or more reference translations by measuring n-gram precision with a brevity penalty.
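The core computation can be sketched in a few lines of pure Python on pre-tokenized input (Thai text would first be word-tokenized). This is an illustration of the clipped n-gram precision and brevity penalty, not PyThaiNLP's code; the add-one smoothing shown here is one common variant:

```python
# Sketch of BLEU: clipped n-gram precision + brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, hypothesis, max_ngram=4):
    precisions = []
    for n in range(1, max_ngram + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # clipped: each hypothesis n-gram counts at most as often as in ref
        match = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((match + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_ngram)
    # brevity penalty: punish hypotheses shorter than the reference
    bp = (1.0 if len(hypothesis) >= len(reference)
          else math.exp(1 - len(reference) / len(hypothesis)))
    return bp * geo_mean

ref = ["the", "cat", "is", "on", "the", "mat"]
print(simple_bleu(ref, ref))  # identical token lists → 1.0
```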

pythainlp.benchmarks.bleu_score(references: list[str] | list[list[str]], hypotheses: list[str], tokenize: str = 'newmm', lowercase: bool = False, max_ngram: int = 4, smooth: bool = True) → dict[str, float]

Calculate BLEU score for Thai text with automatic tokenization.

This is a pure Python implementation of BLEU (Bilingual Evaluation Understudy) metric that automatically tokenizes Thai text using PyThaiNLP before calculating the score.

Parameters:
  • references (list[str] | list[list[str]]) – reference translations: either a list of strings (one reference per hypothesis) or a list of lists of strings (multiple references per hypothesis)

  • hypotheses (list[str]) – hypothesis translations to evaluate

  • tokenize (str) – tokenization engine to use (default: “newmm”). See pythainlp.tokenize.word_tokenize() for available engines.

  • lowercase (bool) – whether to lowercase text before evaluation (default: False)

  • max_ngram (int) – maximum n-gram order (default: 4)

  • smooth (bool) – whether to use smoothing for zero counts (default: True)

Returns:

dictionary with ‘bleu’, ‘precisions’, ‘bp’, ‘length_ratio’, ‘hyp_length’, and ‘ref_length’

Return type:

dict[str, float]

Example:

from pythainlp.benchmarks import bleu_score

references = ["สวัสดีครับ วันนี้อากาศดีมาก"]
hypotheses = ["สวัสดีค่ะ วันนี้อากาศดี"]
score = bleu_score(references, hypotheses)
print(f"BLEU score: {score['bleu']:.2f}")
# Multiple references per hypothesis
references = [
    ["สวัสดีครับ", "สวัสดีค่ะ"],  # two refs for first hypothesis
    ["ลาก่อนครับ", "ลาก่อนค่ะ"],  # two refs for second hypothesis
]
hypotheses = ["สวัสดี", "ลาก่อน"]
score = bleu_score(references, hypotheses)

ROUGE Score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It measures the overlap between the generated text and reference text(s).
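The simplest variant, ROUGE-1, can be sketched on pre-tokenized input (Thai text would be word-tokenized first). This illustrates the (precision, recall, F-measure) triple the function returns; it is not PyThaiNLP's actual implementation:

```python
# Sketch of ROUGE-1: unigram overlap between reference and hypothesis.
from collections import Counter

def rouge1(reference, hypothesis):
    ref_counts, hyp_counts = Counter(reference), Counter(hypothesis)
    # clipped overlap: each word counts at most as often as in either side
    overlap = sum(min(c, hyp_counts[w]) for w, c in ref_counts.items())
    precision = overlap / len(hypothesis) if hypothesis else 0.0
    recall = overlap / len(reference) if reference else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return precision, recall, fmeasure

p, r, f = rouge1(["a", "b", "c", "d"], ["a", "b", "x"])
print(p, r, f)  # overlap = 2 → precision 2/3, recall 2/4
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-L scores the longest common subsequence instead of fixed n-grams.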

pythainlp.benchmarks.rouge_score(reference: str, hypothesis: str, tokenize: str = 'newmm', rouge_types: list[str] | None = None) → dict[str, tuple[float, float, float]]

Calculate ROUGE scores for Thai text with automatic tokenization.

This is a pure Python implementation of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric that automatically tokenizes Thai text using PyThaiNLP.

Supported ROUGE types:
  • rouge1: unigram-based scoring
  • rouge2: bigram-based scoring
  • rougeL: longest common subsequence-based scoring

Parameters:
  • reference (str) – reference text

  • hypothesis (str) – hypothesis text to evaluate

  • tokenize (str) – tokenization engine to use (default: “newmm”). See pythainlp.tokenize.word_tokenize() for available engines.

  • rouge_types (list[str] | None) – list of ROUGE types to calculate. Default is [“rouge1”, “rouge2”, “rougeL”]

Returns:

dictionary mapping ROUGE type to (precision, recall, fmeasure)

Return type:

dict[str, tuple[float, float, float]]

Example:

from pythainlp.benchmarks import rouge_score

reference = "สวัสดีครับ วันนี้อากาศดีมาก"
hypothesis = "สวัสดีค่ะ วันนี้อากาศดี"
scores = rouge_score(reference, hypothesis)
print(f"ROUGE-1 F-measure: {scores['rouge1'][2]:.4f}")
print(f"ROUGE-2 F-measure: {scores['rouge2'][2]:.4f}")
print(f"ROUGE-L F-measure: {scores['rougeL'][2]:.4f}")

Example:

from pythainlp.benchmarks import rouge_score

reference = "สวัสดีครับ วันนี้อากาศดีมาก"
hypothesis = "สวัสดีค่ะ วันนี้อากาศดี"
scores = rouge_score(reference, hypothesis)

for rouge_type, (precision, recall, fmeasure) in scores.items():
    print(f"{rouge_type}: P={precision:.4f}, R={recall:.4f}, F={fmeasure:.4f}")

Word Error Rate (WER)

Word Error Rate is a common metric for evaluating speech recognition and machine translation systems. It measures the minimum number of word-level edits (insertions, deletions, substitutions) needed to transform the hypothesis into the reference.
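The standard way to compute this is Levenshtein edit distance over word lists, divided by the reference length. A sketch on pre-tokenized input (the library function tokenizes Thai text for you); illustrative only:

```python
# Sketch of WER: word-level Levenshtein distance / reference length.

def edit_distance(ref, hyp):
    # Dynamic programming over prefixes; prev[j] holds the distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1       # substitution
            cur.append(min(prev[j] + 1,     # deletion
                           cur[j - 1] + 1,  # insertion
                           prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def simple_wer(ref_words, hyp_words):
    return edit_distance(ref_words, hyp_words) / len(ref_words)

print(simple_wer(["hello", "world", "today"],
                 ["hello", "word", "today"]))  # one substitution → 1/3
```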

pythainlp.benchmarks.word_error_rate(reference: str, hypothesis: str, tokenize: str = 'newmm') → float

Calculate Word Error Rate (WER) for Thai text with automatic tokenization.

Word Error Rate is a common metric for evaluating speech recognition and machine translation systems. It measures the minimum number of word-level edits (insertions, deletions, substitutions) needed to transform the hypothesis into the reference, normalized by the reference length.

WER = (S + D + I) / N

where:
  • S = number of substitutions
  • D = number of deletions
  • I = number of insertions
  • N = number of words in reference

Parameters:
  • reference (str) – reference text

  • hypothesis (str) – hypothesis text to evaluate

  • tokenize (str) – tokenization engine to use (default: “newmm”). See pythainlp.tokenize.word_tokenize() for available engines.

Returns:

word error rate as a float (0.0 = perfect, >1.0 = very poor)

Return type:

float

Example:

from pythainlp.benchmarks import word_error_rate

reference = "สวัสดีครับ วันนี้อากาศดีมาก"
hypothesis = "สวัสดีค่ะ วันนี้อากาศดี"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.4f}")

Character Error Rate (CER)

Character Error Rate is a metric for evaluating speech recognition and optical character recognition (OCR) systems. It measures the minimum number of character-level edits (insertions, deletions, substitutions) needed to transform the hypothesis into the reference.
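CER is the same edit-distance computation as WER, applied to characters instead of words, so no tokenizer is needed. An illustrative sketch, not the library code:

```python
# Sketch of CER: character-level Levenshtein distance / reference length.

def char_edit_distance(ref: str, hyp: str) -> int:
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,      # deletion
                           cur[j - 1] + 1,   # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1]

def simple_cer(reference: str, hypothesis: str) -> float:
    return char_edit_distance(reference, hypothesis) / len(reference)

print(simple_cer("kitten", "sitting"))  # 3 edits / 6 chars = 0.5
```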

pythainlp.benchmarks.character_error_rate(reference: str, hypothesis: str) → float

Calculate Character Error Rate (CER) for Thai text.

Character Error Rate is a metric for evaluating speech recognition and optical character recognition (OCR) systems. It measures the minimum number of character-level edits (insertions, deletions, substitutions) needed to transform the hypothesis into the reference, normalized by the reference length.

CER = (S + D + I) / N

where:
  • S = number of substitutions
  • D = number of deletions
  • I = number of insertions
  • N = number of characters in reference

Parameters:
  • reference (str) – reference text

  • hypothesis (str) – hypothesis text to evaluate

Returns:

character error rate as a float (0.0 = perfect, >1.0 = very poor)

Return type:

float

Example:

from pythainlp.benchmarks import character_error_rate

reference = "สวัสดีครับ"
hypothesis = "สวัสดีค่ะ"
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.4f}")

Usage

To use these benchmarking functions, follow the examples above and the guidelines in the official PyThaiNLP documentation. These tools are useful for researchers and developers evaluating Thai word tokenization methods and text generation systems.