pythainlp.tag

The pythainlp.tag module contains functions for tagging text, including Part-of-Speech (POS) tagging and Named Entity Recognition (NER).

For POS tagging, two tag sets are supported: Universal Dependencies (UD) POS tags and ORCHID POS tags [1].

The following table shows the list of Part-of-Speech (POS) tags according to Universal Dependencies (UD) POS tags:

Abbreviation	Part-of-Speech tag	Examples
ADJ	Adjective	ใหม่, พิเศษ, ก่อน, มาก, สูง
ADP	Adposition	แม้, ว่า, เมื่อ, ของ, สำหรับ
ADV	Adverb	ก่อน, ก็, เล็กน้อย, เลย, สุด
AUX	Auxiliary	เป็น, ใช่, คือ, คล้าย
CCONJ	Coordinating conjunction	แต่, และ, หรือ
DET	Determiner	ที่, นี้, ซึ่ง, ทั้ง, ทุก, หลาย
INTJ	Interjection	อุ้ย, โอ้ย
NOUN	Noun	กำมือ, พวก, สนาม, กีฬา, บัญชี
NUM	Numeral	5,000, 103.7, 2004, หนึ่ง, ร้อย
PART	Particle	มา, ขึ้น, ไม่, ได้, เข้า
PRON	Pronoun	เรา, เขา, ตัวเอง, ใคร, เธอ
PROPN	Proper noun	โอบามา, แคปิตอลฮิล, จีโอพี, ไมเคิล
PUNCT	Punctuation	(, ), “, ‘, :
SCONJ	Subordinating conjunction	หาก
VERB	Verb	เปิด, ให้, ใช้, เผชิญ, อ่าน

The following table lists the ORCHID POS tags, as defined in the ORCHID paper [1]:

Abbreviation	Part-of-Speech tag	Examples
NPRP	Proper noun	วินโดวส์ 95, โคโรน่า, โค้ก
NCNM	Cardinal number	หนึ่ง, สอง, สาม, 1, 2, 10
NONM	Ordinal number	ที่หนึ่ง, ที่สอง, ที่สาม, ที่1, ที่2
NLBL	Label noun	1, 2, 3, 4, ก, ข, a, b
NCMN	Common noun	หนังสือ, อาหาร, อาคาร, คน
NTTL	Title noun	ครู, พลเอก
PPRS	Personal pronoun	คุณ, เขา, ฉัน
PDMN	Demonstrative pronoun	นี่, นั้น, ที่นั่น, ที่นี่
PNTR	Interrogative pronoun	ใคร, อะไร, อย่างไร
PREL	Relative pronoun	ที่, ซึ่ง, อัน, ผู้
VACT	Active verb	ทำงาน, ร้องเพลง, กิน
VSTA	Stative verb	เห็น, รู้, คือ
VATT	Attributive verb	อ้วน, ดี, สวย
XVBM	Pre-verb auxiliary, before negator “ไม่”	เกิด, เกือบ, กำลัง
XVAM	Pre-verb auxiliary, after negator “ไม่”	ค่อย, น่า, ได้
XVMM	Pre-verb, before or after negator “ไม่”	ควร, เคย, ต้อง
XVBB	Pre-verb auxiliary, in imperative mood	กรุณา, จง, เชิญ, อย่า, ห้าม
XVAE	Post-verb auxiliary	ไป, มา, ขึ้น
DDAN	Definite determiner, after noun without classifier in between	นี่, นั่น, โน่น, ทั้งหมด
DDAC	Definite determiner, allowing classifier in between	นี้, นั้น, โน้น, นู้น
DDBQ	Definite determiner, between noun and classifier or preceding quantitative expression	ทั้ง, อีก, เพียง
DDAQ	Definite determiner, following quantitative expression	พอดี, ถ้วน
DIAC	Indefinite determiner, following noun; allowing classifier in between	ไหน, อื่น, ต่างๆ
DIBQ	Indefinite determiner, between noun and classifier or preceding quantitative expression	บาง, ประมาณ, เกือบ
DIAQ	Indefinite determiner, following quantitative expression	กว่า, เศษ
DCNM	Determiner, cardinal number expression	หนึ่งคน, เสือ, 2 ตัว
DONM	Determiner, ordinal number expression	ที่หนึ่ง, ที่สอง, ที่สุดท้าย
ADVN	Adverb with normal form	เก่ง, เร็ว, ช้า, สม่ำเสมอ
ADVI	Adverb with iterative form	เร็วๆ, เสมอๆ, ช้าๆ
ADVP	Adverb with prefixed form	โดยเร็ว
ADVS	Sentential adverb	โดยปกติ, ธรรมดา
CNIT	Unit classifier	ตัว, คน, เล่ม
CLTV	Collective classifier	คู่, กลุ่ม, ฝูง, เชิง, ทาง, ด้าน, แบบ, รุ่น
CMTR	Measurement classifier	กิโลกรัม, แก้ว, ชั่วโมง
CFQC	Frequency classifier	ครั้ง, เที่ยว
CVBL	Verbal classifier	ม้วน, มัด
JCRG	Coordinating conjunction	และ, หรือ, แต่
JCMP	Comparative conjunction	กว่า, เหมือนกับ, เท่ากับ
JSBR	Subordinating conjunction	เพราะว่า, เนื่องจาก, ที่, แม้ว่า, ถ้า
RPRE	Preposition	จาก, ละ, ของ, ใต้, บน
INT	Interjection	โอ้ย, โอ้, เออ, เอ๋, อ๋อ
FIXN	Nominal prefix	การทำงาน, ความสนุกสนาน
FIXV	Adverbial prefix	อย่างเร็ว
EAFF	Ending for affirmative sentence	จ๊ะ, จ้ะ, ค่ะ, ครับ, นะ, น่า, เถอะ
EITT	Ending for interrogative sentence	หรือ, เหรอ, ไหม, มั้ย
NEG	Negator	ไม่, มิได้, ไม่ได้, มิ
PUNC	Punctuation	(, ), “, ,, ;

The ORCHID corpus uses a different set of POS tags, so we provide a version of the ORCHID corpus whose tags are mapped to UD POS tags.

The following table shows the mapping of Part-of-Speech (POS) tags from ORCHID POS tags to UD POS tags:

ORCHID POS tag	Corresponding UD POS tag
NOUN	NOUN
NCMN	NOUN
NTTL	NOUN
CNIT	NOUN
CLTV	NOUN
CMTR	NOUN
CFQC	NOUN
CVBL	NOUN
VACT	VERB
VSTA	VERB
PROPN	PROPN
NPRP	PROPN
ADJ	ADJ
NONM	ADJ
VATT	ADJ
DONM	ADJ
ADV	ADV
ADVN	ADV
ADVI	ADV
ADVP	ADV
ADVS	ADV
INT	INTJ
PRON	PRON
PPRS	PRON
PDMN	PRON
PNTR	PRON
DET	DET
DDAN	DET
DDAC	DET
DDBQ	DET
DDAQ	DET
DIAC	DET
DIBQ	DET
DIAQ	DET
NUM	NUM
NCNM	NUM
NLBL	NUM
DCNM	NUM
AUX	AUX
XVBM	AUX
XVAM	AUX
XVMM	AUX
XVBB	AUX
XVAE	AUX
ADP	ADP
RPRE	ADP
CCONJ	CCONJ
JCRG	CCONJ
SCONJ	SCONJ
PREL	SCONJ
JSBR	SCONJ
JCMP	SCONJ
PART	PART
FIXN	PART
FIXV	PART
EAFF	PART
EITT	PART
AITT	PART
NEG	PART
PUNCT	PUNCT
PUNC	PUNCT
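
Applying the table amounts to a dictionary lookup. The following is a minimal sketch, not PyThaiNLP's own code: the names `ORCHID_TO_UD` and `to_ud` are hypothetical, and only a subset of the rows above is included for brevity.

```python
from typing import List, Tuple

# Subset of the ORCHID -> UD mapping table above (hypothetical name).
ORCHID_TO_UD = {
    "NCMN": "NOUN",
    "NPRP": "PROPN",
    "VACT": "VERB",
    "VSTA": "VERB",
    "VATT": "ADJ",
    "PPRS": "PRON",
    "RPRE": "ADP",
    "NCNM": "NUM",
    "JCRG": "CCONJ",
    "NEG": "PART",
    "PUNC": "PUNCT",
}


def to_ud(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Replace each ORCHID tag with its UD counterpart.

    Tags missing from the mapping are passed through unchanged.
    """
    return [(word, ORCHID_TO_UD.get(tag, tag)) for word, tag in tagged]


orchid_tagged = [("ฉัน", "PPRS"), ("มี", "VSTA"), ("ชีวิต", "NCMN")]
print(to_ud(orchid_tagged))
# [('ฉัน', 'PRON'), ('มี', 'VERB'), ('ชีวิต', 'NOUN')]
```

Passing unknown tags through unchanged is a design choice of this sketch; raising an error instead would surface tags that the table does not cover.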

For NER, we use the Inside-Outside-Beginning (IOB) format to tag each word. For instance, given the sentence “บารัค โอบามาเป็นประธานาธิบดี”, the tokens “บารัค”, “ ”, “โอบามา”, “เป็น”, and “ประธานาธิบดี” are tagged as “B-PERSON”, “I-PERSON”, “I-PERSON”, “O”, and “O”, respectively.

The B- prefix marks the beginning token of a chunk, here the person name “บารัค โอบามา”, and the I- prefix marks a token inside the chunk. The tag O indicates that a token does not belong to any named-entity chunk.
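
To make the IOB scheme concrete, the sketch below merges IOB-tagged tokens back into entity chunks. It is an illustration only: the helper name `iob_to_chunks` is hypothetical and not part of PyThaiNLP, and the input assumes that the space between the two name parts is itself a token.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity text, entity type) chunks."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in tagged:
        if tag.startswith("B-"):
            # A new chunk starts; close any open one first.
            if current_tokens:
                chunks.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open chunk.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) ends the open chunk.
            if current_tokens:
                chunks.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Close a chunk that runs to the end of the input.
    if current_tokens:
        chunks.append(("".join(current_tokens), current_type))
    return chunks


tagged = [
    ("บารัค", "B-PERSON"), (" ", "I-PERSON"), ("โอบามา", "I-PERSON"),
    ("เป็น", "O"), ("ประธานาธิบดี", "O"),
]
print(iob_to_chunks(tagged))
# [('บารัค โอบามา', 'PERSON')]
```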

The following table shows the list of Named Entity Recognition (NER) tags:

Named Entity Recognition tag	Examples
DATE	2/21/2004, 16 ก.พ., จันทร์
TIME	16.30 น., 5 วัน, 1-3 ปี
EMAIL	info@nrpsc.ac.th
LEN	30 กิโลเมตร, 5 กม.
LOCATION	ไทย, จ.ปราจีนบุรี, กำแพงเพชร
ORGANIZATION	กรมวิทยาศาสตร์การแพทย์, อย.
PERSON	น.พ.จรัล, นางประนอม ทองจันทร์
PHONE	1200, 0 2670 8888
URL	http://www.bangkokhealth.com/
ZIP	10400, 11130
MONEY	2.7 ล้านบาท, 2,000 บาท
LAW	พ.ร.บ.โรคระบาด พ.ศ.2499, รัฐธรรมนูญ

Modules

pythainlp.tag.pos_tag(words: List[str], engine: str = 'perceptron', corpus: str = 'orchid') → List[Tuple[str, str]]

This function tags a list of tokenized words with Part-of-Speech (POS) tags such as ‘NOUN’, ‘VERB’, ‘ADJ’, and ‘DET’.

Parameters

words (list) – a list of tokenized words
engine (str) –
- perceptron - perceptron tagger (default)
- unigram - unigram tagger
corpus (str) –
- orchid - ORCHID, a corpus of annotated Thai academic articles (default)
- orchid_ud - the ORCHID corpus with its POS tags mapped to Universal Dependencies POS tags
- pud - Parallel Universal Dependencies (PUD) treebanks

Returns

a list of (word, POS tag) tuples, one per input word

Return type

list[tuple[str, str]]

Example

Tag words with corpus orchid (default):

from pythainlp.tag import pos_tag

words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
    'นายก', 'เชอร์ชิล']
pos_tag(words)
# output:
# [('ฉัน', 'PPRS'), ('มี', 'VSTA'), ('ชีวิต', 'NCMN'), ('รอด', 'NCMN'),
#   ('ใน', 'RPRE'), ('อาคาร', 'NCMN'), ('หลบภัย', 'NCMN'),
#   ('ของ', 'RPRE'), ('นายก', 'NCMN'), ('เชอร์ชิล', 'NCMN')]

Tag words with corpus orchid_ud:

from pythainlp.tag import pos_tag

words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
    'นายก', 'เชอร์ชิล']
pos_tag(words, corpus='orchid_ud')
# output:
# [('ฉัน', 'PROPN'), ('มี', 'VERB'), ('ชีวิต', 'NOUN'),
#   ('รอด', 'NOUN'), ('ใน', 'ADP'),  ('อาคาร', 'NOUN'),
#   ('หลบภัย', 'NOUN'), ('ของ', 'ADP'), ('นายก', 'NOUN'),
#   ('เชอร์ชิล', 'NOUN')]

Tag words with corpus pud:

from pythainlp.tag import pos_tag

words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
    'นายก', 'เชอร์ชิล']
pos_tag(words, corpus='pud')
# output:
# [('ฉัน', 'PRON'), ('มี', 'VERB'), ('ชีวิต', 'NOUN'), ('รอด', 'VERB'),
#   ('ใน', 'ADP'), ('อาคาร', 'NOUN'), ('หลบภัย', 'NOUN'),
#   ('ของ', 'ADP'), ('นายก', 'NOUN'), ('เชอร์ชิล', 'PROPN')]

Tag words with different engines including perceptron and unigram:

from pythainlp.tag import pos_tag

words = ['เก้าอี้','มี','จำนวน','ขา', ' ', '=', '3']

pos_tag(words, engine='perceptron', corpus='orchid')
# output:
# [('เก้าอี้', 'NCMN'), ('มี', 'VSTA'), ('จำนวน', 'NCMN'),
#   ('ขา', 'NCMN'), (' ', 'PUNC'),
#   ('=', 'PUNC'), ('3', 'NCNM')]

pos_tag(words, engine='unigram', corpus='pud')
# output:
# [('เก้าอี้', None), ('มี', 'VERB'), ('จำนวน', 'NOUN'), ('ขา', None),
#   ('<space>', None), ('<equal>', None), ('3', 'NUM')]
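
In the unigram output above, some tokens come back tagged None (presumably tokens the tagger has no entry for). If downstream code expects every tag to be a string, one option is to substitute a placeholder. This is a minimal sketch: `fill_unknown` is a hypothetical helper, and the choice of the UD tag "X" (other) as the placeholder is an assumption.

```python
from typing import List, Optional, Tuple


def fill_unknown(
    tagged: List[Tuple[str, Optional[str]]], placeholder: str = "X"
) -> List[Tuple[str, str]]:
    """Replace None tags with a placeholder string."""
    return [
        (word, tag if tag is not None else placeholder)
        for word, tag in tagged
    ]


result = [("เก้าอี้", None), ("มี", "VERB"), ("จำนวน", "NOUN"), ("ขา", None)]
print(fill_unknown(result))
# [('เก้าอี้', 'X'), ('มี', 'VERB'), ('จำนวน', 'NOUN'), ('ขา', 'X')]
```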

pythainlp.tag.pos_tag_sents(sentences: List[List[str]], engine: str = 'perceptron', corpus: str = 'orchid') → List[List[Tuple[str, str]]]

This function tags multiple lists of tokenized words (one list per sentence) with Part-of-Speech (POS) tags.

Parameters

sentences (list) – a list of lists of tokenized words
engine (str) –
- perceptron - perceptron tagger (default)
- unigram - unigram tagger
corpus (str) –
- orchid - ORCHID, a corpus of annotated Thai academic articles (default)
- orchid_ud - the ORCHID corpus with its POS tags mapped to Universal Dependencies POS tags
- pud - Parallel Universal Dependencies (PUD) treebanks

Returns

a list with one entry per input sentence, each entry being a list of (word, POS tag) tuples

Return type

list[list[tuple[str, str]]]

Example

Tag two sentences with POS labels:

from pythainlp.tag import pos_tag_sents

sentences = [['เก้าอี้','มี','3','ขา'],
             ['นก', 'บิน', 'กลับ', 'รัง']]
pos_tag_sents(sentences, corpus='pud')
# output:
# [[('เก้าอี้', 'PROPN'), ('มี', 'VERB'), ('3', 'NUM'),
#   ('ขา', 'NOUN')], [('นก', 'NOUN'), ('บิน', 'VERB'),
#   ('กลับ', 'VERB'), ('รัง', 'NOUN')]]

pythainlp.tag.tag_provinces(tokens: List[str]) → List[Tuple[str, str]]

This function recognizes the names of Thai provinces in a list of tokens.

Parameters: tokens (list[str]) – a list of words
Returns: a list of tuples, each pairing a token with its NER tag for LOCATION in IOB format
Return type: list[tuple[str, str]]
Example

from pythainlp.tag import tag_provinces

text = ['หนองคาย', 'น่าอยู่']
tag_provinces(text)
# output: [('หนองคาย', 'B-LOCATION'), ('น่าอยู่', 'O')]

text = ['อำเภอ', 'ฝาง','เป็น','ส่วน','หนึ่ง','ของ', 'จังหวัด', \
    'เชียงใหม่']
tag_provinces(text)
# output: [('อำเภอ', 'O'), ('ฝาง', 'O'), ('เป็น', 'O'), ('ส่วน', 'O'),
#   ('หนึ่ง', 'O'), ('ของ', 'O'), ('จังหวัด', 'O'),
#   ('เชียงใหม่', 'B-LOCATION')]

class pythainlp.tag.named_entity.ThaiNameTagger

get_ner(text: str, pos: bool = True, tag: bool = False) → Union[List[Tuple[str, str]], List[Tuple[str, str, str]]]

This function tags named entities in text using the IOB format.

Parameters

text (string) – text in Thai to be tagged
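pos (boolean) – whether to include POS tags in the results (True) or exclude them (False). The default value is True
tag (boolean) – whether to return the result as a single string with each entity wrapped in HTML-like tags (True) instead of a list of tuples (False). The default value is False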

Returns

a list of tuples, each holding a tokenized word, its POS tag (only if the parameter pos is True), and its NER tag; or, if the parameter tag is True, a single string in which each named entity is wrapped in HTML-like tags

Return type

Union[list[tuple[str, str]], list[tuple[str, str, str]], str]

Note

To include the POS tags in the results, this function calls pythainlp.tag.pos_tag() with engine set to perceptron and corpus set to orchid_ud.

Example

>>> from pythainlp.tag.named_entity import ThaiNameTagger
>>>
>>> ner = ThaiNameTagger()
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")
[('วันที่', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('15', 'NUM', 'B-DATE'), (' ', 'PUNCT', 'I-DATE'),
('ก.ย.', 'NOUN', 'I-DATE'), (' ', 'PUNCT', 'I-DATE'),
('61', 'NUM', 'I-DATE'), (' ', 'PUNCT', 'O'),
('ทดสอบ', 'VERB', 'O'), ('ระบบ', 'NOUN', 'O'),
('เวลา', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('14', 'NOUN', 'B-TIME'), (':', 'PUNCT', 'I-TIME'),
('49', 'NUM', 'I-TIME'), (' ', 'PUNCT', 'I-TIME'),
('น.', 'NOUN', 'I-TIME')]
>>>
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.",
...             pos=False)
[('วันที่', 'O'), (' ', 'O'),
('15', 'B-DATE'), (' ', 'I-DATE'),
('ก.ย.', 'I-DATE'), (' ', 'I-DATE'),
('61', 'I-DATE'), (' ', 'O'),
('ทดสอบ', 'O'), ('ระบบ', 'O'),
('เวลา', 'O'), (' ', 'O'),
('14', 'B-TIME'), (':', 'I-TIME'),
('49', 'I-TIME'), (' ', 'I-TIME'),
('น.', 'I-TIME')]
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.",
...             tag=True)
'วันที่ <DATE>15 ก.ย. 61</DATE> ทดสอบระบบเวลา <TIME>14:49 น.</TIME>'

Tagger Engines

perceptron

The perceptron tagger performs part-of-speech tagging with the averaged, structured perceptron algorithm.

unigram

The unigram tagger tags each word on its own and doesn’t take the ordering of words in the list into account.
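
As an illustration of the general unigram technique (not PyThaiNLP's actual implementation), the sketch below assigns every word the tag it carried most often in a training corpus and returns None for unseen words. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def train_unigram(corpus: List[List[Tuple[str, str]]]) -> Dict[str, str]:
    """Map each word to its most frequent tag in the training corpus."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def unigram_tag(
    words: List[str], model: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Tag each word by lookup alone; unseen words get None."""
    return [(word, model.get(word)) for word in words]


# Toy training data, invented for this example.
toy_corpus = [
    [("นก", "NOUN"), ("บิน", "VERB")],
    [("นก", "NOUN"), ("กลับ", "VERB"), ("รัง", "NOUN")],
]
model = train_unigram(toy_corpus)

print(unigram_tag(["นก", "บิน", "เก้าอี้"], model))
# [('นก', 'NOUN'), ('บิน', 'VERB'), ('เก้าอี้', None)]
```

Because each word is looked up in isolation, reordering the input changes only the order of the output pairs, never the tag assigned to a given word.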

References

1: Virach Sornlertlamvanich, Naoto Takahashi and Hitoshi Isahara. Building a Thai Part-Of-Speech Tagged Corpus (ORCHID). The Journal of the Acoustical Society of Japan (E), Vol.20, No.3, pp 189-198, May 1999.