pythainlp.benchmarks

The pythainlp.benchmarks module contains utility functions for benchmarking tasks related to Thai NLP. At the moment, only word tokenization is covered. Other tasks will be added soon.

Modules

Tokenization

Quality

Qualitative evaluation of word tokenization.

pythainlp.benchmarks.word_tokenization.compute_stats(ref_sample: str, raw_sample: str) → dict

Compute statistics for tokenization quality

These statistics include:

Character-Level:

True Positive, False Positive, True Negative, False Negative, Precision, Recall, and f1

Word-Level:

Precision, Recall, and f1

Other:

Correct tokenization indicator: a {0, 1} sequence indicating whether the corresponding word is tokenized correctly.

Parameters

ref_sample (str) – ground truth sample
raw_sample (str) – sample that we want to evaluate

Returns

character-level and word-level metrics, plus the correct tokenization indicators

Return type

dict[str, float | str]
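The character-level statistics listed above can be illustrated with a small standalone sketch. It is not the library's implementation: it assumes tokens are delimited by "|", treats each non-separator character as a binary decision ("does a token start here?"), and uses the hypothetical helper names boundary_vector and char_level_stats.

```python
from typing import Dict, List


def boundary_vector(sample: str, sep: str = "|") -> List[int]:
    """Return one flag per non-separator character: 1 if it starts a token."""
    flags: List[int] = []
    start_of_token = True  # the first character always starts a token
    for ch in sample:
        if ch == sep:
            start_of_token = True
            continue
        flags.append(1 if start_of_token else 0)
        start_of_token = False
    return flags


def char_level_stats(ref_sample: str, raw_sample: str, sep: str = "|") -> Dict[str, float]:
    """Compare token-start flags of a reference and a candidate tokenization."""
    ref = boundary_vector(ref_sample, sep)
    raw = boundary_vector(raw_sample, sep)
    if len(ref) != len(raw):
        raise ValueError("both samples must contain the same characters")

    pairs = list(zip(ref, raw))
    tp = sum(1 for r, s in pairs if r == 1 and s == 1)
    fp = sum(1 for r, s in pairs if r == 0 and s == 1)
    tn = sum(1 for r, s in pairs if r == 0 and s == 0)
    fn = sum(1 for r, s in pairs if r == 1 and s == 0)

    # Guard against division by zero when a class is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


stats = char_level_stats("ab|c|de", "a|b|c|de")
print(stats)
```

In this example the candidate over-segments "ab" into "a" and "b", which shows up as a single false positive while recall stays at 1.0. The library may count boundaries differently (for instance, how it treats the first character), so the numbers from compute_stats need not match this sketch exactly.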

pythainlp.benchmarks.word_tokenization.benchmark(ref_samples: List[str], samples: List[str]) → pandas.core.frame.DataFrame

Performance benchmark of samples.

Please see pythainlp.benchmarks.word_tokenization.compute_stats() for metrics being computed.

Parameters

ref_samples (list[str]) – ground truth samples
samples (list[str]) – samples that we want to evaluate

Returns

DataFrame with one row per sample and one column per metric (rows x columns = len(samples) x len(metrics))

Return type

pandas.DataFrame
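The "one row per sample, one column per metric" layout can be sketched without the library. The code below is an illustration under assumptions, not the actual benchmark() implementation: it assumes "|" as the token delimiter, computes only word-level precision, recall, and F1 by comparing token character spans, and uses hypothetical names (token_spans, word_level_row, benchmark_rows). It returns a list of dicts rather than a DataFrame so that it runs with the standard library alone.

```python
from typing import Dict, List, Set, Tuple


def token_spans(sample: str, sep: str = "|") -> Set[Tuple[int, int]]:
    """Map each token to its (start, end) character offsets, ignoring separators."""
    spans: Set[Tuple[int, int]] = set()
    pos = 0
    for token in sample.split(sep):
        if token:  # skip empty pieces produced by repeated separators
            spans.add((pos, pos + len(token)))
            pos += len(token)
    return spans


def word_level_row(ref_sample: str, raw_sample: str, sep: str = "|") -> Dict[str, float]:
    """Word-level metrics: a token is correct if its span matches a reference span."""
    ref = token_spans(ref_sample, sep)
    raw = token_spans(raw_sample, sep)
    correct = len(ref & raw)
    precision = correct / len(raw) if raw else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def benchmark_rows(ref_samples: List[str], samples: List[str]) -> List[Dict[str, float]]:
    """Build one metrics row per (reference, candidate) pair."""
    if len(ref_samples) != len(samples):
        raise ValueError("ref_samples and samples must have the same length")
    return [word_level_row(r, s) for r, s in zip(ref_samples, samples)]


rows = benchmark_rows(["ab|c|de", "xy|z"], ["a|b|c|de", "xy|z"])
for row in rows:
    print(row)
```

A list of dicts like rows can be passed directly to the pandas.DataFrame constructor to obtain a table of the same shape, after which column-wise aggregates such as the mean F1 across samples are straightforward.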

pythainlp.benchmarks.word_tokenization.preprocessing(txt: str, remove_space: bool = True) → str

Clean up text before performing evaluation.

Parameters

txt (str) – text to be preprocessed
remove_space (bool) – whether to remove white space

Returns

preprocessed text

Return type

str
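The documentation above does not spell out which clean-up steps are applied. As a rough idea of what such a step might involve, here is a minimal sketch under stated assumptions: tokens are delimited by a single-character separator ("|" by default), repeated and leading or trailing separators are noise, and white space is optionally dropped. The function name clean_for_eval is hypothetical, and its behaviour may differ from that of preprocessing().

```python
import re


def clean_for_eval(txt: str, remove_space: bool = True, sep: str = "|") -> str:
    """Normalize a delimited sample before comparing it with a reference."""
    if remove_space:
        # Drop all white space so it does not count as token content.
        txt = re.sub(r"\s+", "", txt)
    # Collapse runs of separators into a single separator.
    txt = re.sub(f"(?:{re.escape(sep)})+", sep, txt)
    # Remove separators at either end of the text.
    return txt.strip(sep)


print(clean_for_eval("|ab | c||de|"))
print(clean_for_eval("|ab | c||de|", remove_space=False))
```

Normalizing both the reference and the candidate in the same way keeps the two samples aligned character by character, which character-level comparison depends on.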