pythainlp.tag

The pythainlp.tag module contains functions for annotating parts of a text with linguistic and other labels, including part-of-speech (POS) tags and named-entity (NE) tags.

For POS tags, three tag sets are available: Universal POS tags, ORCHID POS tags [1], and LST20 POS tags [2].

The following table shows Universal POS tags as used in Universal Dependencies (UD):

Abbreviation	Part-of-Speech tag	Examples
ADJ	Adjective	ใหม่, พิเศษ, ก่อน, มาก, สูง
ADP	Adposition	แม้, ว่า, เมื่อ, ของ, สำหรับ
ADV	Adverb	ก่อน, ก็, เล็กน้อย, เลย, สุด
AUX	Auxiliary	เป็น, ใช่, คือ, คล้าย
CCONJ	Coordinating conjunction	แต่, และ, หรือ
DET	Determiner	ที่, นี้, ซึ่ง, ทั้ง, ทุก, หลาย
INTJ	Interjection	อุ้ย, โอ้ย
NOUN	Noun	กำมือ, พวก, สนาม, กีฬา, บัญชี
NUM	Numeral	5,000, 103.7, 2004, หนึ่ง, ร้อย
PART	Particle	มา, ขึ้น, ไม่, ได้, เข้า
PRON	Pronoun	เรา, เขา, ตัวเอง, ใคร, เธอ
PROPN	Proper noun	โอบามา, แคปิตอลฮิล, จีโอพี, ไมเคิล
PUNCT	Punctuation	(, ), “, ‘, :
SCONJ	Subordinating conjunction	หาก
VERB	Verb	เปิด, ให้, ใช้, เผชิญ, อ่าน

The following table shows POS tags as used in ORCHID:

Abbreviation	Part-of-Speech tag	Examples
NPRP	Proper noun	วินโดวส์ 95, โคโรน่า, โค้ก
NCNM	Cardinal number	หนึ่ง, สอง, สาม, 1, 2, 10
NONM	Ordinal number	ที่หนึ่ง, ที่สอง, ที่สาม, ที่1, ที่2
NLBL	Label noun	1, 2, 3, 4, ก, ข, a, b
NCMN	Common noun	หนังสือ, อาหาร, อาคาร, คน
NTTL	Title noun	ครู, พลเอก
PPRS	Personal pronoun	คุณ, เขา, ฉัน
PDMN	Demonstrative pronoun	นี่, นั้น, ที่นั่น, ที่นี่
PNTR	Interrogative pronoun	ใคร, อะไร, อย่างไร
PREL	Relative pronoun	ที่, ซึ่ง, อัน, ผู้
VACT	Active verb	ทำงาน, ร้องเพลง, กิน
VSTA	Stative verb	เห็น, รู้, คือ
VATT	Attributive verb	อ้วน, ดี, สวย
XVBM	Pre-verb auxiliary, before negator “ไม่”	เกิด, เกือบ, กำลัง
XVAM	Pre-verb auxiliary, after negator “ไม่”	ค่อย, น่า, ได้
XVMM	Pre-verb, before or after negator “ไม่”	ควร, เคย, ต้อง
XVBB	Pre-verb auxiliary, in imperative mood	กรุณา, จง, เชิญ, อย่า, ห้าม
XVAE	Post-verb auxiliary	ไป, มา, ขึ้น
DDAN	Definite determiner, after noun without classifier in between	นี่, นั่น, โน่น, ทั้งหมด
DDAC	Definite determiner, allowing classifier in between	นี้, นั้น, โน้น, นู้น
DDBQ	Definite determiner, between noun and classifier or preceding quantitative expression	ทั้ง, อีก, เพียง
DDAQ	Definite determiner, following quantitative expression	พอดี, ถ้วน
DIAC	Indefinite determiner, following noun; allowing classifier in between	ไหน, อื่น, ต่างๆ
DIBQ	Indefinite determiner, between noun and classifier or preceding quantitative expression	บาง, ประมาณ, เกือบ
DIAQ	Indefinite determiner, following quantitative expression	กว่า, เศษ
DCNM	Determiner, cardinal number expression	หนึ่งคน, เสือ, 2 ตัว
DONM	Determiner, ordinal number expression	ที่หนึ่ง, ที่สอง, ที่สุดท้าย
ADVN	Adverb with normal form	เก่ง, เร็ว, ช้า, สม่ำเสมอ
ADVI	Adverb with iterative form	เร็วๆ, เสมอๆ, ช้าๆ
ADVP	Adverb with prefixed form	โดยเร็ว
ADVS	Sentential adverb	โดยปกติ, ธรรมดา
CNIT	Unit classifier	ตัว, คน, เล่ม
CLTV	Collective classifier	คู่, กลุ่ม, ฝูง, เชิง, ทาง, ด้าน, แบบ, รุ่น
CMTR	Measurement classifier	กิโลกรัม, แก้ว, ชั่วโมง
CFQC	Frequency classifier	ครั้ง, เที่ยว
CVBL	Verbal classifier	ม้วน, มัด
JCRG	Coordinating conjunction	และ, หรือ, แต่
JCMP	Comparative conjunction	กว่า, เหมือนกับ, เท่ากับ
JSBR	Subordinating conjunction	เพราะว่า, เนื่องจาก, ที่, แม้ว่า, ถ้า
RPRE	Preposition	จาก, ละ, ของ, ใต้, บน
INT	Interjection	โอ้ย, โอ้, เออ, เอ๋, อ๋อ
FIXN	Nominal prefix	การทำงาน, ความสนุกสนาน
FIXV	Adverbial prefix	อย่างเร็ว
EAFF	Ending for affirmative sentence	จ๊ะ, จ้ะ, ค่ะ, ครับ, นะ, น่า, เถอะ
EITT	Ending for interrogative sentence	หรือ, เหรอ, ไหม, มั้ย
NEG	Negator	ไม่, มิได้, ไม่ได้, มิ
PUNC	Punctuation	(, ), “, ,, ;

The ORCHID corpus uses its own set of POS tags, so we also provide a version of the ORCHID corpus with UD POS tags.

The following table shows the mapping of POS tags from ORCHID to UD:

ORCHID POS tags	Corresponding UD POS tag
NOUN	NOUN
NCMN	NOUN
NTTL	NOUN
CNIT	NOUN
CLTV	NOUN
CMTR	NOUN
CFQC	NOUN
CVBL	NOUN
VACT	VERB
VSTA	VERB
PROPN	PROPN
NPRP	PROPN
ADJ	ADJ
NONM	ADJ
VATT	ADJ
DONM	ADJ
ADV	ADV
ADVN	ADV
ADVI	ADV
ADVP	ADV
ADVS	ADV
INT	INTJ
PRON	PRON
PPRS	PRON
PDMN	PRON
PNTR	PRON
DET	DET
DDAN	DET
DDAC	DET
DDBQ	DET
DDAQ	DET
DIAC	DET
DIBQ	DET
DIAQ	DET
NUM	NUM
NCNM	NUM
NLBL	NUM
DCNM	NUM
AUX	AUX
XVBM	AUX
XVAM	AUX
XVMM	AUX
XVBB	AUX
XVAE	AUX
ADP	ADP
RPRE	ADP
CCONJ	CCONJ
JCRG	CCONJ
SCONJ	SCONJ
PREL	SCONJ
JSBR	SCONJ
JCMP	SCONJ
PART	PART
FIXN	PART
FIXV	PART
EAFF	PART
EITT	PART
NEG	PART
PUNCT	PUNCT
PUNC	PUNCT
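The mapping table above can be applied as a plain dictionary lookup. The following is a minimal sketch that copies only a subset of the rows; `ORCHID_TO_UD`, `orchid_to_ud`, and the `X` fallback for unmapped tags are names and choices made for this illustration, not part of the PyThaiNLP API.

```python
# Subset of the ORCHID-to-UD rows from the table above (illustrative only).
ORCHID_TO_UD = {
    "NCMN": "NOUN",
    "NPRP": "PROPN",
    "VACT": "VERB",
    "VSTA": "VERB",
    "VATT": "ADJ",
    "RPRE": "ADP",
    "NCNM": "NUM",
    "JCRG": "CCONJ",
    "NEG": "PART",
    "PUNC": "PUNCT",
}


def orchid_to_ud(tagged_words, unknown="X"):
    """Convert (word, ORCHID tag) pairs to (word, UD tag) pairs.

    Tags missing from the subset above fall back to `unknown`,
    which is an assumption of this sketch.
    """
    return [(word, ORCHID_TO_UD.get(tag, unknown)) for word, tag in tagged_words]


tagged = [("มี", "VSTA"), ("ชีวิต", "NCMN"), ("ใน", "RPRE"), ("อาคาร", "NCMN")]
print(orchid_to_ud(tagged))
# [('มี', 'VERB'), ('ชีวิต', 'NOUN'), ('ใน', 'ADP'), ('อาคาร', 'NOUN')]
```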

Details about LST20 POS tags are available in [2].

The following table shows the mapping of POS tags from LST20 to UD:

LST20 POS tags	Corresponding UD POS tag
AJ	ADJ
AV	ADV
AX	AUX
CC	CCONJ
CL	NOUN
FX	NOUN
IJ	INTJ
NN	NOUN
NU	NUM
PA	PART
PR	PROPN
PS	ADP
PU	PUNCT
VV	VERB
XX	X

For NE tags, we use the inside-outside-beginning (IOB) format to tag each word.

The B- prefix indicates the beginning token of a chunk. The I- prefix indicates a token inside the chunk, after the first one. O indicates that the token does not belong to any NE chunk.

For instance, given the sentence “บารัค โอบามาเป็นประธานาธิบดี”, the tagger would label the tokens “บารัค”, “โอบามา”, “เป็น”, and “ประธานาธิบดี” with “B-PERSON”, “I-PERSON”, “O”, and “O”, respectively.
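To turn IOB-tagged tokens back into whole entities, consecutive B- and I- tokens of the same type can be merged. The following is a minimal sketch; `iob_to_spans` is a hypothetical helper written for this illustration, and it joins tokens without a separator, which suits Thai text.

```python
def iob_to_spans(tagged_tokens):
    """Group (token, IOB tag) pairs into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new chunk starts; flush any chunk that is still open.
            if current_tokens:
                spans.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # The token continues the chunk that is currently open.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk:
            # close the open chunk (if any) and skip this token.
            if current_tokens:
                spans.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append(("".join(current_tokens), current_type))
    return spans


tagged = [
    ("บารัค", "B-PERSON"),
    ("โอบามา", "I-PERSON"),
    ("เป็น", "O"),
    ("ประธานาธิบดี", "O"),
]
print(iob_to_spans(tagged))
# [('บารัคโอบามา', 'PERSON')]
```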

The following table shows named entity (NE) tags as used in PyThaiNLP:

Named Entity tag	Examples
DATE	2/21/2004, 16 ก.พ., จันทร์
TIME	16.30 น., 5 วัน, 1-3 ปี
EMAIL	info@nrpsc.ac.th
LEN	30 กิโลเมตร, 5 กม.
LOCATION	ไทย, จ.ปราจีนบุรี, กำแพงเพชร
ORGANIZATION	กรมวิทยาศาสตร์การแพทย์, อย.
PERSON	น.พ.จรัล, นางประนอม ทองจันทร์
PHONE	1200, 0 2670 8888
URL	http://www.bangkokhealth.com/
ZIP	10400, 11130
MONEY	2.7 ล้านบาท, 2,000 บาท
LAW	พ.ร.บ.โรคระบาด พ.ศ.2499, รัฐธรรมนูญ

Modules

pythainlp.tag.pos_tag(words: List[str], engine: str = 'perceptron', corpus: str = 'orchid') → List[Tuple[str, str]][source]

Marks words with part-of-speech (POS) tags, such as ‘NOUN’ and ‘VERB’.

Parameters

words (list) – a list of tokenized words
engine (str) –
- perceptron - perceptron tagger (default)
- unigram - unigram tagger
- wangchanberta - WangchanBERTa model (supports the lst20 corpus only; it accepts a string only, so a list of words is converted to a string first)
- tltk - TLTK: Thai Language Toolkit (supports the TNC corpus only; if another corpus is chosen, it is changed to TNC)
corpus (str) – the corpus used to create the language model for the tagger
- lst20 - LST20 corpus by National Electronics and Computer Technology Center, Thailand
- lst20_ud - LST20 text, with tags mapped to Universal POS tags from Universal Dependencies <https://universaldependencies.org/>
- orchid - ORCHID corpus, text from Thai academic articles (default)
- orchid_ud - ORCHID text, with tags mapped to Universal POS tags
- pud - Parallel Universal Dependencies (PUD) treebanks, natively use Universal POS tags
- tnc - Thai National Corpus (supports the tltk engine only)

Returns

a list of tuples (word, POS tag)

Return type

list[tuple[str, str]]

Example

Tag words with corpus orchid (default):

from pythainlp.tag import pos_tag

words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
    'นายก', 'เชอร์ชิล']
pos_tag(words)
# output:
# [('ฉัน', 'PPRS'), ('มี', 'VSTA'), ('ชีวิต', 'NCMN'), ('รอด', 'NCMN'),
#   ('ใน', 'RPRE'), ('อาคาร', 'NCMN'), ('หลบภัย', 'NCMN'),
#   ('ของ', 'RPRE'), ('นายก', 'NCMN'), ('เชอร์ชิล', 'NCMN')]

Tag words with corpus orchid_ud:

from pythainlp.tag import pos_tag

words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
    'นายก', 'เชอร์ชิล']
pos_tag(words, corpus='orchid_ud')
# output:
# [('ฉัน', 'PROPN'), ('มี', 'VERB'), ('ชีวิต', 'NOUN'),
#   ('รอด', 'NOUN'), ('ใน', 'ADP'),  ('อาคาร', 'NOUN'),
#   ('หลบภัย', 'NOUN'), ('ของ', 'ADP'), ('นายก', 'NOUN'),
#   ('เชอร์ชิล', 'NOUN')]

Tag words with corpus pud:

from pythainlp.tag import pos_tag

words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
    'นายก', 'เชอร์ชิล']
pos_tag(words, corpus='pud')
# output:
# [('ฉัน', 'PRON'), ('มี', 'VERB'), ('ชีวิต', 'NOUN'), ('รอด', 'VERB'),
#   ('ใน', 'ADP'), ('อาคาร', 'NOUN'), ('หลบภัย', 'NOUN'),
#   ('ของ', 'ADP'), ('นายก', 'NOUN'), ('เชอร์ชิล', 'PROPN')]

Tag words with different engines including perceptron and unigram:

from pythainlp.tag import pos_tag

words = ['เก้าอี้','มี','จำนวน','ขา', ' ', '=', '3']

pos_tag(words, engine='perceptron', corpus='orchid')
# output:
# [('เก้าอี้', 'NCMN'), ('มี', 'VSTA'), ('จำนวน', 'NCMN'),
#   ('ขา', 'NCMN'), (' ', 'PUNC'),
#   ('=', 'PUNC'), ('3', 'NCNM')]

pos_tag(words, engine='unigram', corpus='pud')
# output:
# [('เก้าอี้', None), ('มี', 'VERB'), ('จำนวน', 'NOUN'), ('ขา', None),
#   ('<space>', None), ('<equal>', None), ('3', 'NUM')]
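In the unigram output above, some words carry `None` instead of a tag, apparently because the model has no entry for them. If downstream code expects a string for every word, the gaps can be filled afterwards. This is a minimal sketch; `fill_unknown` is a hypothetical helper, and the choice of `X` as the fallback tag is an assumption of this example.

```python
def fill_unknown(tagged_words, fallback="X"):
    """Replace None tags with a fallback tag (hypothetical helper)."""
    return [
        (word, tag if tag is not None else fallback)
        for word, tag in tagged_words
    ]


# Literal copy of part of the unigram output shown above.
tagged = [("เก้าอี้", None), ("มี", "VERB"), ("จำนวน", "NOUN"), ("3", "NUM")]
print(fill_unknown(tagged))
# [('เก้าอี้', 'X'), ('มี', 'VERB'), ('จำนวน', 'NOUN'), ('3', 'NUM')]
```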

pythainlp.tag.pos_tag_sents(sentences: List[List[str]], engine: str = 'perceptron', corpus: str = 'orchid') → List[List[Tuple[str, str]]][source]

Marks sentences with part-of-speech (POS) tags.

Parameters

sentences (list) – a list of lists of tokenized words
engine (str) –
- perceptron - perceptron tagger (default)
- unigram - unigram tagger
- wangchanberta - WangchanBERTa model (supports the lst20 corpus only)
- tltk - TLTK: Thai Language Toolkit (supports the TNC corpus only; if another corpus is chosen, it is changed to TNC)
corpus (str) – the corpus used to create the language model for the tagger
- lst20 - LST20 corpus by National Electronics and Computer Technology Center, Thailand
- lst20_ud - LST20 text, with tags mapped to Universal POS tags from Universal Dependencies <https://universaldependencies.org/>
- orchid - ORCHID corpus, text from Thai academic articles (default)
- orchid_ud - ORCHID text, with tags mapped to Universal POS tags
- pud - Parallel Universal Dependencies (PUD) treebanks, natively use Universal POS tags
- tnc - Thai National Corpus (supports the tltk engine only)

Returns

a list of lists of tuples (word, POS tag)

Return type

list[list[tuple[str, str]]]

Example

Labels POS for two sentences:

from pythainlp.tag import pos_tag_sents

sentences = [['เก้าอี้','มี','3','ขา'], \
                    ['นก', 'บิน', 'กลับ', 'รัง']]
pos_tag_sents(sentences, corpus='pud')
# output:
# [[('เก้าอี้', 'PROPN'), ('มี', 'VERB'), ('3', 'NUM'),
#   ('ขา', 'NOUN')], [('นก', 'NOUN'), ('บิน', 'VERB'),
#   ('กลับ', 'VERB'), ('รัง', 'NOUN')]]

pythainlp.tag.tag_provinces(tokens: List[str]) → List[Tuple[str, str]][source]

This function recognizes the names of Thai provinces in text.

Note that it uses exact match and considers no context.

Parameters: tokens (list[str]) – a list of words
Returns: a list of tuples marking LOCATION named entities in IOB format
Return type: list[tuple[str, str]]
Example

from pythainlp.tag import tag_provinces

text = ['หนองคาย', 'น่าอยู่']
tag_provinces(text)
# output: [('หนองคาย', 'B-LOCATION'), ('น่าอยู่', 'O')]
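The note about exact matching can be illustrated without the library. The following is a minimal sketch of context-free, exact-match tagging; `PROVINCES` is a tiny hypothetical stand-in for a full province list, and `tag_provinces_sketch` is not the PyThaiNLP implementation.

```python
# Hypothetical, tiny stand-in for a complete list of Thai province names.
PROVINCES = {"หนองคาย", "กำแพงเพชร", "ปราจีนบุรี"}


def tag_provinces_sketch(tokens):
    """Tag a token as B-LOCATION only if it exactly equals a province name."""
    return [(tok, "B-LOCATION" if tok in PROVINCES else "O") for tok in tokens]


print(tag_provinces_sketch(["หนองคาย", "น่าอยู่"]))
# [('หนองคาย', 'B-LOCATION'), ('น่าอยู่', 'O')]

# A token that merely contains a province name is not matched.
print(tag_provinces_sketch(["ชาวหนองคาย"]))
# [('ชาวหนองคาย', 'O')]
```

Because no context is considered, a token that happens to equal a province name is tagged as a location even when it is used in another sense.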

pythainlp.tag.chunk_parse(sent: List[Tuple[str, str]], engine: str = 'crf', corpus: str = 'orchidpp') → List[str][source]

This function parses a Thai sentence into phrase structure in IOB format.

Parameters

sent (list) – a list of (word, part-of-speech) tuples
engine (str) – chunk parsing engine (currently only crf is available)
corpus (str) – chunk parsing corpus (currently only orchidpp is available)

Returns

a list of chunk tags in IOB format, one for each input (word, part-of-speech) tuple

Return type

List[str]

Example

from pythainlp.tag import chunk_parse, pos_tag

tokens = ["ผม", "รัก", "คุณ"]
tokens_pos = pos_tag(tokens, engine="perceptron", corpus="orchid")

print(chunk_parse(tokens_pos))
# output: ['B-NP', 'B-VP', 'I-VP']
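Since the result is a list of chunk tags aligned with the input, it can be zipped back onto the tokens. The following sketch uses literal data instead of calling the library: the chunk tags are copied from the example output above, while the POS tags are assumed for illustration.

```python
# (word, POS) pairs; the POS tags here are assumed for illustration.
tokens_pos = [("ผม", "PPRS"), ("รัก", "VACT"), ("คุณ", "PPRS")]
# Chunk tags copied from the example output above.
chunk_tags = ["B-NP", "B-VP", "I-VP"]

# Attach each chunk tag to its (word, POS) pair.
combined = [
    (word, pos, chunk)
    for (word, pos), chunk in zip(tokens_pos, chunk_tags)
]
print(combined)
# [('ผม', 'PPRS', 'B-NP'), ('รัก', 'VACT', 'B-VP'), ('คุณ', 'PPRS', 'I-VP')]
```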

class pythainlp.tag.NER(engine: str, corpus: str = 'thainer')[source]

Named-entity recognizer class

Parameters

engine (str) – Named-entity recognizer engine
corpus (str) – corpus

Options for engine

thainer - Thai NER engine
wangchanberta - wangchanberta model
tltk - wrapper for TLTK.

Options for corpus

thainer - Thai NER corpus
lst20 - lst20 corpus (wangchanberta only)

Note: the tltk engine supports only the NER model from TLTK.

tag(text, pos=True, tag=False) → Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str][source]

This function tags named entities in text using the IOB format.

Parameters

text (str) – text in Thai to be tagged
pos (bool) – whether to include part-of-speech tags in the output (not supported by wangchanberta)
tag (bool) – whether to return the output as a string with HTML-like tags

Returns

a list of (word, POS tag, NE tag) tuples if pos is True, or a list of (word, NE tag) tuples if pos is False; if tag is True, a string in which each named entity is wrapped in HTML-like tags

Return type

Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str]

Example

>>> from pythainlp.tag import NER
>>>
>>> ner = NER("thainer")
>>> ner.tag("ทดสอบนายวรรณพงษ์ ภัททิยไพบูลย์")
[('ทดสอบ', 'VV', 'O'),
('นาย', 'NN', 'B-PERSON'),
('วรรณ', 'NN', 'I-PERSON'),
('พงษ์', 'NN', 'I-PERSON'),
(' ', 'PU', 'I-PERSON'),
('ภัททิย', 'NN', 'I-PERSON'),
('ไพบูลย์', 'NN', 'I-PERSON')]
>>> ner.tag("ทดสอบนายวรรณพงษ์ ภัททิยไพบูลย์", tag=True)
'ทดสอบ<PERSON>นายวรรณพงษ์ ภัททิยไพบูลย์</PERSON>'
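The relationship between the two output forms above can be sketched in plain Python: the tagged string is the token list joined back together, with an opening tag before each B- token and a closing tag where the chunk ends. `iob_to_markup` is a hypothetical helper for illustration, not the library's implementation.

```python
def iob_to_markup(tagged_tokens):
    """Render (word, POS, NE tag) triples as text with <TYPE>...</TYPE> markup."""
    pieces = []
    open_type = None  # entity type of the chunk currently open, if any
    for word, _pos, ne_tag in tagged_tokens:
        if ne_tag.startswith("B-"):
            # Close a chunk that is still open, then open the new one.
            if open_type is not None:
                pieces.append(f"</{open_type}>")
            open_type = ne_tag[2:]
            pieces.append(f"<{open_type}>")
        elif ne_tag == "O" and open_type is not None:
            # An O token ends the open chunk.
            pieces.append(f"</{open_type}>")
            open_type = None
        pieces.append(word)
    if open_type is not None:
        pieces.append(f"</{open_type}>")
    return "".join(pieces)


# Literal copy of the list output shown above.
tagged = [
    ("ทดสอบ", "VV", "O"),
    ("นาย", "NN", "B-PERSON"),
    ("วรรณ", "NN", "I-PERSON"),
    ("พงษ์", "NN", "I-PERSON"),
    (" ", "PU", "I-PERSON"),
    ("ภัททิย", "NN", "I-PERSON"),
    ("ไพบูลย์", "NN", "I-PERSON"),
]
print(iob_to_markup(tagged))
# ทดสอบ<PERSON>นายวรรณพงษ์ ภัททิยไพบูลย์</PERSON>
```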

class pythainlp.tag.thainer.ThaiNameTagger(version: str = '1.5')[source]

Thai named-entity recognizer.

Parameters

version (str) – Thai NER version. Versions 1.4 and 1.5 are supported. The default value is 1.5.

Example

from pythainlp.tag.thainer import ThaiNameTagger

thainer15 = ThaiNameTagger(version="1.5")
thainer15.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")

thainer14 = ThaiNameTagger(version="1.4")
thainer14.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")

get_ner(text: str, pos: bool = True, tag: bool = False) → Union[List[Tuple[str, str]], List[Tuple[str, str, str]]][source]

This function tags named entities in text using the IOB format.

Parameters

text (str) – text in Thai to be tagged
pos (bool) – whether to include POS tags in the results (True) or exclude them (False). The default value is True.
tag (bool) – whether to return the output as a string with HTML-like tags

Returns

a list of (word, POS tag, NE tag) tuples if pos is True, or a list of (word, NE tag) tuples if pos is False; if tag is True, a string in which each named entity is wrapped in HTML-like tags

Return type

Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str]

Note

To include POS tags in the results, this function uses pythainlp.tag.pos_tag() with engine set to perceptron and corpus set to orchid_ud.

Example

>>> from pythainlp.tag.thainer import ThaiNameTagger
>>>
>>> ner = ThaiNameTagger()
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")
[('วันที่', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('15', 'NUM', 'B-DATE'), (' ', 'PUNCT', 'I-DATE'),
('ก.ย.', 'NOUN', 'I-DATE'), (' ', 'PUNCT', 'I-DATE'),
('61', 'NUM', 'I-DATE'), (' ', 'PUNCT', 'O'),
('ทดสอบ', 'VERB', 'O'), ('ระบบ', 'NOUN', 'O'),
('เวลา', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('14', 'NOUN', 'B-TIME'), (':', 'PUNCT', 'I-TIME'),
('49', 'NUM', 'I-TIME'), (' ', 'PUNCT', 'I-TIME'),
('น.', 'NOUN', 'I-TIME')]
>>>
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.",
...             pos=False)
[('วันที่', 'O'), (' ', 'O'),
('15', 'B-DATE'), (' ', 'I-DATE'),
('ก.ย.', 'I-DATE'), (' ', 'I-DATE'),
('61', 'I-DATE'), (' ', 'O'),
('ทดสอบ', 'O'), ('ระบบ', 'O'),
('เวลา', 'O'), (' ', 'O'),
('14', 'B-TIME'), (':', 'I-TIME'),
('49', 'I-TIME'), (' ', 'I-TIME'),
('น.', 'I-TIME')]
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.",
...             tag=True)
'วันที่ <DATE>15 ก.ย. 61</DATE> ทดสอบระบบเวลา <TIME>14:49 น.</TIME>'

pythainlp.tag.tltk.get_ner(text: str, pos: bool = True, tag: bool = False) → Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str][source]

Named-entity recognizer from TLTK

This function tags named entities in text using the IOB format.

Parameters

text (str) – text in Thai to be tagged
pos (bool) – whether to include POS tags in the results (True) or exclude them (False). The default value is True.
tag (bool) – whether to return the output as a string with HTML-like tags

Returns

a list of (word, POS tag, NE tag) tuples if pos is True, or a list of (word, NE tag) tuples if pos is False; if tag is True, a string in which each named entity is wrapped in HTML-like tags

Return type

Union[List[Tuple[str, str]], List[Tuple[str, str, str]], str]

Example

>>> from pythainlp.tag.tltk import get_ner
>>> get_ner("เขาเรียนที่โรงเรียนนางรอง")
[('เขา', 'PRON', 'O'),
('เรียน', 'VERB', 'O'),
('ที่', 'SCONJ', 'O'),
('โรงเรียน', 'NOUN', 'B-L'),
('นางรอง', 'VERB', 'I-L')]
>>> get_ner("เขาเรียนที่โรงเรียนนางรอง", pos=False)
[('เขา', 'O'),
('เรียน', 'O'),
('ที่', 'O'),
('โรงเรียน', 'B-L'),
('นางรอง', 'I-L')]
>>> get_ner("เขาเรียนที่โรงเรียนนางรอง", tag=True)
'เขาเรียนที่<L>โรงเรียนนางรอง</L>'

Tagger Engines

perceptron

The perceptron tagger performs part-of-speech tagging using the averaged, structured perceptron algorithm.
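The idea of an averaged perceptron can be shown with a much smaller model than a real tagger. The following is a simplified, greedy sketch: it predicts one tag at a time from three features (the word, the previous tag, and a bias), updates weights only on mistakes, and finally replaces each weight with its average over all training steps. The class, the feature set, and the toy training data are all assumptions made for this illustration; this is not PyThaiNLP's implementation.

```python
from collections import defaultdict


class TinyAveragedPerceptron:
    """Multiclass perceptron that averages its weights over all training steps."""

    def __init__(self, tags):
        self.tags = sorted(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> current weight
        self._totals = defaultdict(float)  # (feature, tag) -> accumulated weight
        self._stamps = defaultdict(int)    # (feature, tag) -> step of last change
        self._step = 0

    def predict(self, features):
        def score(tag):
            return sum(self.weights.get((f, tag), 0.0) for f in features)
        # Ties are broken by tag order because max() keeps the first maximum.
        return max(self.tags, key=score)

    def update(self, features, gold, guess):
        self._step += 1
        if gold == guess:
            return
        for f in features:
            for tag, delta in ((gold, 1.0), (guess, -1.0)):
                key = (f, tag)
                # Credit the old weight for every step it stayed unchanged.
                self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
                self._stamps[key] = self._step
                self.weights[key] += delta

    def average(self):
        if self._step == 0:
            return
        for key in list(self.weights):
            total = self._totals[key] + (self._step - self._stamps[key]) * self.weights[key]
            self.weights[key] = total / self._step


def extract_features(words, i, prev_tag):
    return [f"word={words[i]}", f"prev_tag={prev_tag}", "bias"]


def train(model, sentences, epochs=5):
    for _ in range(epochs):
        for sentence in sentences:
            words = [word for word, _ in sentence]
            prev_tag = "<s>"
            for i, (_, gold) in enumerate(sentence):
                feats = extract_features(words, i, prev_tag)
                model.update(feats, gold, model.predict(feats))
                prev_tag = gold  # condition on the gold history while training
    model.average()


def tag(model, words):
    tagged, prev_tag = [], "<s>"
    for i, word in enumerate(words):
        prev_tag = model.predict(extract_features(words, i, prev_tag))
        tagged.append((word, prev_tag))
    return tagged


# Toy training data (hypothetical, not taken from any corpus).
data = [
    [("นก", "NOUN"), ("บิน", "VERB")],
    [("นก", "NOUN"), ("กลับ", "VERB"), ("รัง", "NOUN")],
]
model = TinyAveragedPerceptron({"NOUN", "VERB"})
train(model, data)
print(tag(model, ["นก", "บิน"]))
```

Averaging damps the effect of late updates, so the final weights reflect how long each value was in effect during training rather than only the last state.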

unigram

Unigram tagger doesn’t take the ordering of words in the list into account.
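What ignoring word order means can be sketched as follows: each word is looked up on its own and receives the tag it carried most often in the training data, and a word that was never seen receives `None`. This is an illustration under those assumptions, with hypothetical helper names and toy data; it is not PyThaiNLP's implementation.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {
        word: tag_counts.most_common(1)[0][0]
        for word, tag_counts in counts.items()
    }


def tag_unigram(words, model):
    """Tag each word independently; unseen words get None."""
    return [(word, model.get(word)) for word in words]


# Toy training data (hypothetical, not taken from any corpus).
train_data = [
    [("นก", "NOUN"), ("บิน", "VERB")],
    [("นก", "NOUN"), ("กลับ", "VERB"), ("รัง", "NOUN")],
]
unigram_model = train_unigram(train_data)
print(tag_unigram(["บิน", "นก", "เก้าอี้"], unigram_model))
# [('บิน', 'VERB'), ('นก', 'NOUN'), ('เก้าอี้', None)]
```

Reordering the input words changes only the order of the output pairs, never the tag assigned to a given word.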

References

1: Virach Sornlertlamvanich, Naoto Takahashi, and Hitoshi Isahara. (1999). Building a Thai Part-Of-Speech Tagged Corpus (ORCHID). The Journal of the Acoustical Society of Japan (E), Vol. 20, No. 3, pp. 189-198, May 1999.
2: Prachya Boonkwan, Vorapon Luantangsrisuk, Sitthaa Phaholphinyo, Kanyanat Kriengket, Dhanon Leenoi, Charun Phrombut, Monthika Boriboon, Krit Kosawat, and Thepchai Supnithi. (2020). The Annotation Guideline of LST20 Corpus. arXiv:2008.05055.