CLS
Blackboard CLS
V1.0
Model Details
- Developer: Wannaphong Phatthiyaphaibun
- This report author: Wannaphong Phatthiyaphaibun
- Model date: 2022-10-14
- Model version: 1.0
- Used in PyThaiNLP version: 3.2 +
- Filename:
pythainlp/corpus/blackboard-cls_v1.0.crfsuite
- GitHub: https://github.com/PyThaiNLP/pythainlp/issues/729
- CRF Model
- License: CC0
Intended Use
- Segmenting Thai text into clauses (smaller than a sentence but bigger than a word)
- Not suitable for other language or non-news domains.
Factors
- Based on known problems with thai natural Language processing.
Metrics
- Evaluation metrics include precision, recall and f1-score.
Training Data
Blackboard treebank
Evaluation Data
Blackboard treebank
Quantitative Analyses
precision recall f1-score support
B_CLS 1.00 1.00 1.00 91698
E_CLS 1.00 1.00 1.00 91700
I_CLS 1.00 1.00 1.00 707795
micro avg 1.00 1.00 1.00 891193
macro avg 1.00 1.00 1.00 891193
weighted avg 1.00 1.00 1.00 891193
samples avg 1.00 1.00 1.00 891193
Ethical Considerations
- It trains from Blackboard treebank. It is possible to have a bias from Blackboard treebank.
Caveats and Recommendations
- The user must perform word segmentation first before using this model.
- Thai text only
LST20 CLS
v0.2
Model Details
- Developer: Wannaphong Phatthiyaphaibun
- This report author: Wannaphong Phatthiyaphaibun
- Model date: 2020-10-03
- Model version: 0.2
- Used in PyThaiNLP version: 2.2.4 +
- Filename:
~/pythainlp-data/cls-v0.2.crfsuite
- GitHub: https://github.com/PyThaiNLP/pythainlp/pull/479
- CRF Model
- License: CC0
Intended Use
- Segmenting Thai text into clauses (smaller than a sentence but bigger than a word)
- Not suitable for other language or non-news domains.
Factors
- Based on known problems with thai natural Language processing.
Metrics
- Evaluation metrics include precision, recall and f1-score.
Training Data
LST20 Corpus Train set (news domain)
Evaluation Data
LST20 Corpus Test set (news domain)
Quantitative Analyses
precision recall f1-score support
B_CLS 0.90 0.94 0.92 16111
E_CLS 0.90 0.94 0.92 15947
I_CLS 0.99 0.97 0.98 169565
micro avg 0.97 0.97 0.97 201623
macro avg 0.93 0.95 0.94 201623
weighted avg 0.97 0.97 0.97 201623
samples avg 0.94 0.94 0.94 201623
Ethical Considerations
- It trains from LST20 Corpus. It is possible to have a bias from LST20 Corpus.
Caveats and Recommendations
- The user must perform word segmentation first before using this model.
- Thai text only