pythainlp.tag

The pythainlp.tag module contains functions that annotate parts of a text with linguistic and other labels, including part-of-speech (POS) tags and named entity (NE) tags.

For POS tags, there are three sets of available tags: Universal POS tags, ORCHID POS tags [1], and LST20 POS tags [2].

The following table shows Universal POS tags as used in Universal Dependencies (UD):

Abbreviation	Part-of-Speech tag	Examples
ADJ	Adjective	ใหม่, พิเศษ, ก่อน, มาก, สูง
ADP	Adposition	แม้, ว่า, เมื่อ, ของ, สำหรับ
ADV	Adverb	ก่อน, ก็, เล็กน้อย, เลย, สุด
AUX	Auxiliary	เป็น, ใช่, คือ, คล้าย
CCONJ	Coordinating conjunction	แต่, และ, หรือ
DET	Determiner	ที่, นี้, ซึ่ง, ทั้ง, ทุก, หลาย
INTJ	Interjection	อุ้ย, โอ้ย
NOUN	Noun	กำมือ, พวก, สนาม, กีฬา, บัญชี
NUM	Numeral	5,000, 103.7, 2004, หนึ่ง, ร้อย
PART	Particle	มา, ขึ้น, ไม่, ได้, เข้า
PRON	Pronoun	เรา, เขา, ตัวเอง, ใคร, เธอ
PROPN	Proper noun	โอบามา, แคปิตอลฮิล, จีโอพี, ไมเคิล
PUNCT	Punctuation	(, ), “, ‘, :
SCONJ	Subordinating conjunction	หาก
VERB	Verb	เปิด, ให้, ใช้, เผชิญ, อ่าน

The following table shows POS tags as used in ORCHID:

Abbreviation	Part-of-Speech tag	Examples
NPRP	Proper noun	วินโดวส์ 95, โคโรน่า, โค้ก
NCNM	Cardinal number	หนึ่ง, สอง, สาม, 1, 2, 10
NONM	Ordinal number	ที่หนึ่ง, ที่สอง, ที่สาม, ที่1, ที่2
NLBL	Label noun	1, 2, 3, 4, ก, ข, a, b
NCMN	Common noun	หนังสือ, อาหาร, อาคาร, คน
NTTL	Title noun	ครู, พลเอก
PPRS	Personal pronoun	คุณ, เขา, ฉัน
PDMN	Demonstrative pronoun	นี่, นั้น, ที่นั่น, ที่นี่
PNTR	Interrogative pronoun	ใคร, อะไร, อย่างไร
PREL	Relative pronoun	ที่, ซึ่ง, อัน, ผู้
VACT	Active verb	ทำงาน, ร้องเพลง, กิน
VSTA	Stative verb	เห็น, รู้, คือ
VATT	Attributive verb	อ้วน, ดี, สวย
XVBM	Pre-verb auxiliary, before negator “ไม่”	เกิด, เกือบ, กำลัง
XVAM	Pre-verb auxiliary, after negator “ไม่”	ค่อย, น่า, ได้
XVMM	Pre-verb, before or after negator “ไม่”	ควร, เคย, ต้อง
XVBB	Pre-verb auxiliary, in imperative mood	กรุณา, จง, เชิญ, อย่า, ห้าม
XVAE	Post-verb auxiliary	ไป, มา, ขึ้น
DDAN	Definite determiner, after noun without classifier in between	นี่, นั่น, โน่น, ทั้งหมด
DDAC	Definite determiner, allowing classifier in between	นี้, นั้น, โน้น, นู้น
DDBQ	Definite determiner, between noun and classifier or preceding quantitative expression	ทั้ง, อีก, เพียง
DDAQ	Definite determiner, following quantitative expression	พอดี, ถ้วน
DIAC	Indefinite determiner, following noun; allowing classifier in between	ไหน, อื่น, ต่างๆ
DIBQ	Indefinite determiner, between noun and classifier or preceding quantitative expression	บาง, ประมาณ, เกือบ
DIAQ	Indefinite determiner, following quantitative expression	กว่า, เศษ
DCNM	Determiner, cardinal number expression	หนึ่งคน, เสือ, 2 ตัว
DONM	Determiner, ordinal number expression	ที่หนึ่ง, ที่สอง, ที่สุดท้าย
ADVN	Adverb with normal form	เก่ง, เร็ว, ช้า, สม่ำเสมอ
ADVI	Adverb with iterative form	เร็วๆ, เสมอๆ, ช้าๆ
ADVP	Adverb with prefixed form	โดยเร็ว
ADVS	Sentential adverb	โดยปกติ, ธรรมดา
CNIT	Unit classifier	ตัว, คน, เล่ม
CLTV	Collective classifier	คู่, กลุ่ม, ฝูง, เชิง, ทาง, ด้าน, แบบ, รุ่น
CMTR	Measurement classifier	กิโลกรัม, แก้ว, ชั่วโมง
CFQC	Frequency classifier	ครั้ง, เที่ยว
CVBL	Verbal classifier	ม้วน, มัด
JCRG	Coordinating conjunction	และ, หรือ, แต่
JCMP	Comparative conjunction	กว่า, เหมือนกับ, เท่ากับ
JSBR	Subordinating conjunction	เพราะว่า, เนื่องจาก, ที่, แม้ว่า, ถ้า
RPRE	Preposition	จาก, ละ, ของ, ใต้, บน
INT	Interjection	โอ้ย, โอ้, เออ, เอ๋, อ๋อ
FIXN	Nominal prefix	การทำงาน, ความสนุกสนาน
FIXV	Adverbial prefix	อย่างเร็ว
EAFF	Ending for affirmative sentence	จ๊ะ, จ้ะ, ค่ะ, ครับ, นะ, น่า, เถอะ
EITT	Ending for interrogative sentence	หรือ, เหรอ, ไหม, มั้ย
NEG	Negator	ไม่, มิได้, ไม่ได้, มิ
PUNC	Punctuation	(, ), “, ,, ;

The ORCHID corpus uses its own set of POS tags, so PyThaiNLP also provides a version of the ORCHID tags mapped to UD POS tags.

The following table shows the mapping of POS tags from ORCHID to UD:

Details about LST20 POS tags are available in [2].

The following table shows the mapping of POS tags from LST20 to UD:

LST20 POS tags	Corresponding UD POS tag
AJ	ADJ
AV	ADV
AX	AUX
CC	CCONJ
CL	NOUN
FX	NOUN
IJ	INTJ
NN	NOUN
NU	NUM
PA	PART
PR	PROPN
PS	ADP
PU	PUNCT
VV	VERB
XX	X
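The mapping above can be expressed as a plain dictionary lookup. The following is a minimal sketch, not PyThaiNLP's internal code: the names `LST20_TO_UD` and `lst20_to_ud` are hypothetical, and falling back to `X` for tags outside the table is an assumption.

```python
# Mapping copied from the table above (LST20 tag -> UD tag).
LST20_TO_UD = {
    "AJ": "ADJ", "AV": "ADV", "AX": "AUX", "CC": "CCONJ",
    "CL": "NOUN", "FX": "NOUN", "IJ": "INTJ", "NN": "NOUN",
    "NU": "NUM", "PA": "PART", "PR": "PROPN", "PS": "ADP",
    "PU": "PUNCT", "VV": "VERB", "XX": "X",
}


def lst20_to_ud(tagged):
    """Convert (word, LST20 tag) pairs to (word, UD tag) pairs."""
    # Assumption: a tag that is missing from the table falls back to "X".
    return [(word, LST20_TO_UD.get(tag, "X")) for word, tag in tagged]


print(lst20_to_ud([("นก", "NN"), ("บิน", "VV")]))
```

Note that the mapping is many-to-one (CL, FX, and NN all become NOUN), so it cannot be inverted.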

For NE tagging, each word is tagged in the Inside-outside-beginning (IOB) format.

The B- prefix marks the first token of a chunk, the I- prefix marks a token inside a chunk, and O marks a token that does not belong to any NE chunk.

For instance, given the sentence “บารัค โอบามาเป็นประธานาธิบดี”, the tagger labels the tokens “บารัค”, “โอบามา”, “เป็น”, and “ประธานาธิบดี” with “B-PERSON”, “I-PERSON”, “O”, and “O”, respectively.
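To make the IOB convention concrete, here is a small sketch that groups IOB-tagged tokens back into entity chunks. The function name `iob_to_spans` is hypothetical and is not part of PyThaiNLP; the input is simply a list of (token, tag) pairs like the one described above.

```python
def iob_to_spans(tagged):
    """Group (token, IOB tag) pairs into (entity_type, text) chunks."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in tagged:
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk; flush any open one first.
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open chunk.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open chunk.
            if current_tokens:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, "".join(current_tokens)))
    return spans


tagged = [
    ("บารัค", "B-PERSON"),
    ("โอบามา", "I-PERSON"),
    ("เป็น", "O"),
    ("ประธานาธิบดี", "O"),
]
print(iob_to_spans(tagged))
```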

The following table shows named entity (NE) tags as used in PyThaiNLP:

Named Entity tag	Examples
DATE	2/21/2004, 16 ก.พ., จันทร์
TIME	16.30 น., 5 วัน, 1-3 ปี
EMAIL	info@nrpsc.ac.th
LEN	30 กิโลเมตร, 5 กม.
LOCATION	ไทย, จ.ปราจีนบุรี, กำแพงเพชร
ORGANIZATION	กรมวิทยาศาสตร์การแพทย์, อย.
PERSON	น.พ.จรัล, นางประนอม ทองจันทร์
PHONE	1200, 0 2670 8888
URL	https://www.bangkokhealth.com/
ZIP	10400, 11130
MONEY	2.7 ล้านบาท, 2,000 บาท
LAW	พ.ร.บ.โรคระบาด พ.ศ.2499, รัฐธรรมนูญ

Modules

pythainlp.tag.pos_tag(words: list[str], engine: str = 'perceptron', corpus: str = 'orchid') → list[tuple[str, str]]

Marks words with part-of-speech (POS) tags, such as ‘NOUN’ and ‘VERB’.

Parameters:

words (list) – a list of tokenized words
engine (str) –
- perceptron - perceptron tagger (default)
- unigram - unigram tagger
- wangchanberta - WangchanBERTa model
- tltk - TLTK: Thai Language Toolkit (supports the TNC corpus only; if another corpus is chosen, it is converted to the TNC corpus)
corpus (str) – the corpus that is used to create the language model for the tagger
- orchid - ORCHID corpus, text from Thai academic articles (default)
- orchid_ud - ORCHID text, with tags mapped to Universal POS tags
- blackboard - blackboard treebank
- blackboard_ud - blackboard text, with tags mapped to Universal POS tags from Universal Dependencies <https://universaldependencies.org/>
- pud - Parallel Universal Dependencies (PUD) treebanks, natively uses Universal POS tags
- tdtb - Thai Discourse Treebank, natively uses Universal POS tags
- tud - Thai Universal Dependency Treebank (TUD)

Returns:

a list of tuples (word, POS tag)

Return type:

list[tuple[str, str]]

Example:

Tag words with corpus orchid (default):

>>> from pythainlp.tag import pos_tag

>>> words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
...     'นายก', 'เชอร์ชิล']
>>> pos_tag(words)
[('ฉัน', 'PPRS'), ('มี', 'VSTA'), ('ชีวิต', 'NCMN'), ('รอด', 'NCMN'),
  ('ใน', 'RPRE'), ('อาคาร', 'NCMN'), ('หลบภัย', 'NCMN'),
  ('ของ', 'RPRE'), ('นายก', 'NCMN'), ('เชอร์ชิล', 'NCMN')]

Tag words with corpus orchid_ud:

>>> from pythainlp.tag import pos_tag

>>> words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
...     'นายก', 'เชอร์ชิล']
>>> pos_tag(words, corpus='orchid_ud')
[('ฉัน', 'PROPN'), ('มี', 'VERB'), ('ชีวิต', 'NOUN'),
  ('รอด', 'NOUN'), ('ใน', 'ADP'),  ('อาคาร', 'NOUN'),
  ('หลบภัย', 'NOUN'), ('ของ', 'ADP'), ('นายก', 'NOUN'),
  ('เชอร์ชิล', 'NOUN')]

Tag words with corpus pud:

>>> from pythainlp.tag import pos_tag

>>> words = ['ฉัน','มี','ชีวิต','รอด','ใน','อาคาร','หลบภัย','ของ', \
...     'นายก', 'เชอร์ชิล']
>>> pos_tag(words, corpus='pud')
>>> # [('ฉัน', 'PRON'), ('มี', 'VERB'), ('ชีวิต', 'NOUN'), ('รอด', 'VERB'),
>>> #   ('ใน', 'ADP'), ('อาคาร', 'NOUN'), ('หลบภัย', 'NOUN'),
>>> #   ('ของ', 'ADP'), ('นายก', 'NOUN'), ('เชอร์ชิล', 'PROPN')]

Tag words with different engines including perceptron and unigram:

>>> from pythainlp.tag import pos_tag

>>> words = ['เก้าอี้','มี','จำนวน','ขา', ' ', '=', '3']

>>> pos_tag(words, engine='perceptron', corpus='orchid')
[('เก้าอี้', 'NCMN'), ('มี', 'VSTA'), ('จำนวน', 'NCMN'),
  ('ขา', 'NCMN'), (' ', 'PUNC'),
  ('=', 'PUNC'), ('3', 'NCNM')]

>>> pos_tag(words, engine='unigram', corpus='pud')
[('เก้าอี้', None), ('มี', 'VERB'), ('จำนวน', 'NOUN'), ('ขา', None),
  ('<space>', None), ('<equal>', None), ('3', 'NUM')]
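The unigram output above contains None for some tokens. Downstream code that expects every tag to be a string may want to replace those values first. The sketch below does this on the documented output; the helper name `fill_untagged` is hypothetical, and using `X` (the UD tag for "other") as the placeholder is an assumption.

```python
def fill_untagged(tagged, placeholder="X"):
    """Replace None tags with a placeholder tag, leaving other tags as-is."""
    return [
        (word, tag if tag is not None else placeholder)
        for word, tag in tagged
    ]


# Output copied from the unigram example above.
unigram_output = [
    ("เก้าอี้", None), ("มี", "VERB"), ("จำนวน", "NOUN"), ("ขา", None),
    ("<space>", None), ("<equal>", None), ("3", "NUM"),
]
print(fill_untagged(unigram_output))
```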

pythainlp.tag.pos_tag_sents(sentences: list[list[str]], engine: str = 'perceptron', corpus: str = 'orchid') → list[list[tuple[str, str]]]

Marks sentences with part-of-speech (POS) tags.

Parameters:

sentences (list) – a list of lists of tokenized words
engine (str) –
- perceptron - perceptron tagger (default)
- unigram - unigram tagger
- tltk - TLTK: Thai Language Toolkit (supports the TNC corpus only; if another corpus is chosen, it is converted to the TNC corpus)
corpus (str) – the corpus that is used to create the language model for the tagger
- orchid - ORCHID corpus, text from Thai academic articles (default)
- orchid_ud - ORCHID text, with tags mapped to Universal POS tags
- blackboard - blackboard treebank
- blackboard_ud - blackboard text, with tags mapped to Universal POS tags from Universal Dependencies <https://universaldependencies.org/>
- pud - Parallel Universal Dependencies (PUD) treebanks, natively uses Universal POS tags
- tnc - Thai National Corpus (supported by the tltk engine only)

Returns:

a list of lists of tuples (word, POS tag)

Return type:

list[list[tuple[str, str]]]

Example:

Tag two sentences with POS tags:

>>> from pythainlp.tag import pos_tag_sents

>>> sentences = [['เก้าอี้','มี','3','ขา'], \
...                     ['นก', 'บิน', 'กลับ', 'รัง']]
>>> pos_tag_sents(sentences, corpus='pud')
[[('เก้าอี้', 'PROPN'), ('มี', 'VERB'), ('3', 'NUM'),
  ('ขา', 'NOUN')], [('นก', 'NOUN'), ('บิน', 'VERB'),
  ('กลับ', 'VERB'), ('รัง', 'NOUN')]]
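Because the result is a list of lists, it usually needs to be flattened before any corpus-level statistics are computed. As a small sketch, the following counts tag frequencies over the documented output using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# Output copied from the pos_tag_sents example above.
tagged_sents = [
    [("เก้าอี้", "PROPN"), ("มี", "VERB"), ("3", "NUM"), ("ขา", "NOUN")],
    [("นก", "NOUN"), ("บิน", "VERB"), ("กลับ", "VERB"), ("รัง", "NOUN")],
]

# Flatten the nested structure and count each POS tag.
tag_counts = Counter(tag for sent in tagged_sents for _, tag in sent)
print(tag_counts)
```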

pythainlp.tag.tag_provinces(tokens: list[str]) → list[tuple[str, str]]

Recognizes names of Thai provinces in a list of tokens.

Note that it uses exact match and considers no context.

Parameters:

tokens (list[str]) – a list of words

Returns:

a list of tuples indicating NER for LOCATION in IOB format

Return type:

list[tuple[str, str]]

Example:

>>> from pythainlp.tag import tag_provinces

>>> text = ["หนองคาย", "น่าอยู่"]
>>> tag_provinces(text)
[('หนองคาย', 'B-LOCATION'), ('น่าอยู่', 'O')]
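The exact-match behavior noted above can be illustrated with a simple set lookup. This is only a sketch of the idea, not the library's implementation: `PROVINCES` is a hypothetical three-item list and `tag_provinces_sketch` is a hypothetical name.

```python
# Hypothetical, tiny province list for illustration only.
PROVINCES = {"หนองคาย", "เชียงใหม่", "ภูเก็ต"}


def tag_provinces_sketch(tokens):
    """Tag a token as B-LOCATION only if it exactly equals a province name."""
    return [
        (token, "B-LOCATION" if token in PROVINCES else "O")
        for token in tokens
    ]


print(tag_provinces_sketch(["หนองคาย", "น่าอยู่"]))
# A token that merely contains a province name is not an exact match.
print(tag_provinces_sketch(["จังหวัดหนองคาย"]))
```

In this sketch, a token must equal a province name exactly, so the result depends on how the text was tokenized.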

pythainlp.tag.chunk_parse(sent: list[tuple[str, str]], engine: str = 'crf', corpus: str = 'orchidpp') → list[str]

Parse a Thai sentence into phrase-structure chunks (IOB format).

Deprecated since version 5.3.2: Use pythainlp.chunk.chunk_parse() instead.

Parameters:

sent (list[tuple[str, str]]) – list of (word, POS-tag) pairs.
engine (str) – chunking engine (default: "crf").
corpus (str) – corpus name (default: "orchidpp").

Returns:

list of IOB chunk labels, one per token.

Return type:

list[str]

class pythainlp.tag.NER(engine: str = 'thainer-v2', corpus: str = 'thainer')

Named-entity recognizer class.

Parameters:

engine (str) – engine of named-entity recognizer
corpus (str) – corpus

Options for engine

phayathaibert - PhayaThaiBERT-based Thai NER engine
thainer - Thai NER engine
thai-nner - Thai Nested NER engine
thainer-v2 - Thai NER engine v2.0 for Thai NER 2.0 (default)
tltk - wrapper for TLTK.
wangchanberta - WangchanBERTa-based Thai NER engine

Options for corpus

thainer - Thai NER corpus (default)
thainer-v2 - Thai NER v2 corpus

Note: The tltk engine supports NER models from tltk only.

The thai-nner engine supports nested NER and ignores the corpus parameter.

name_engine: str

engine: NEREngineType

__init__(engine: str = 'thainer-v2', corpus: str = 'thainer') → None

load_engine(engine: str, corpus: str) → None

tag(text: str, pos: bool = False, tag: bool = False) → list[tuple[str, str]] | list[tuple[str, str, str]] | str

This function tags named entities in text in IOB format.

Parameters:

text (str) – text in Thai to be tagged
pos (bool) – include part-of-speech tags in the output (not supported by the wangchanberta engine)
tag (bool) – output HTML-like tags.

Returns:

a list of tuples of tokenized words and NER tags; if pos is True, each tuple also includes the POS tag; if tag is True, a string with HTML-like tags is returned instead

Return type:

Union[list[tuple[str, str]], list[tuple[str, str, str]], str]

Example:

>>> from pythainlp.tag import NER
>>>
>>> ner = NER("thainer")
>>> ner.tag("ทดสอบ นายวรรณพงษ์ ภัททิยไพบูลย์")
[('ทดสอบ', 'O'),
(' ', 'O'),
('นาย', 'B-PERSON'),
('วรรณ', 'I-PERSON'),
('พงษ์', 'I-PERSON'),
(' ', 'I-PERSON'),
('ภัททิย', 'I-PERSON'),
('ไพบูลย์', 'I-PERSON')]
>>> ner.tag("ทดสอบ นายวรรณพงษ์ ภัททิยไพบูลย์", tag=True)
'ทดสอบ <PERSON>นายวรรณพงษ์ ภัททิยไพบูลย์</PERSON>'
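The relationship between the two outputs above can be sketched in plain Python: the HTML-like string is the token sequence with an opening tag before each B- token and a closing tag where the chunk ends. The function `iob_to_markup` is a hypothetical helper written for this illustration, not the code PyThaiNLP uses.

```python
def iob_to_markup(tagged):
    """Render (token, IOB tag) pairs as a string with HTML-like entity tags."""
    parts = []
    open_type = None  # entity type of the chunk currently open, if any
    for token, tag in tagged:
        if tag.startswith("B-"):
            # Close the previous chunk (if any), then open a new one.
            if open_type is not None:
                parts.append(f"</{open_type}>")
            open_type = tag[2:]
            parts.append(f"<{open_type}>")
        elif tag == "O" and open_type is not None:
            # An O token ends the open chunk.
            parts.append(f"</{open_type}>")
            open_type = None
        parts.append(token)
    if open_type is not None:
        parts.append(f"</{open_type}>")
    return "".join(parts)


# Token list copied from the first example above.
tagged = [
    ("ทดสอบ", "O"), (" ", "O"), ("นาย", "B-PERSON"), ("วรรณ", "I-PERSON"),
    ("พงษ์", "I-PERSON"), (" ", "I-PERSON"), ("ภัททิย", "I-PERSON"),
    ("ไพบูลย์", "I-PERSON"),
]
print(iob_to_markup(tagged))
```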

class pythainlp.tag.NNER(engine: str = 'thai_nner')

Nested Named Entity Recognition

Parameters:

engine (str) – engine of nested named entity recognizer

Options for engine

thai_nner - Thai Nested NER engine

engine: ThaiNNER

__init__(engine: str = 'thai_nner') → None

load_engine(engine: str = 'thai_nner') → None

tag(text: str, top_level_only: bool = False) → tuple[list[str], list[EntitySpan]]

This function tags nested named entities.

Parameters:

text (str) – text in Thai to be tagged
top_level_only (bool) – If True, return only top-level (outermost) entities. If False, return all nested entities. Default is False.

Returns:

a tuple of (tokens, entities) where tokens is a list of tokenized strings and entities is a list of dictionaries containing ‘text’, ‘span’, and ‘entity_type’ keys.

Return type:

tuple[list[str], list[EntitySpan]]

Note

The tokenized output may include empty strings as part of the tokenization process from the underlying Thai-NNER model.

Example:

>>> from pythainlp.tag.named_entity import NNER
>>> nner = NNER()
>>> nner.tag("แมวทำอะไรตอนห้าโมงเช้า")
([
    '<s>',
    '',
    'แมว',
    'ทํา',
    '',
    'อะไร',
    'ตอน',
    '',
    'ห้า',
    '',
    'โมง',
    '',
    'เช้า',
    '</s>'
],
[
    {
        'text': ['', 'ห้า'],
        'span': [7, 9],
        'entity_type': 'cardinal'
    },
    {
        'text': ['', 'ห้า', '', 'โมง'],
        'span': [7, 11],
        'entity_type': 'time'
    },
    {
        'text': ['', 'โมง'],
        'span': [9, 11],
        'entity_type': 'unit'
    }
])
>>> # Get only top-level entities (outermost entities)
>>> nner.tag("แมวทำอะไรตอนห้าโมงเช้า", top_level_only=True)
([...], [{'text': ['', 'ห้า', '', 'โมง'], 'span': [7, 11], 'entity_type': 'time'}])
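One way to read "top-level" is that an entity is kept only if its span is not contained in a larger entity's span. The sketch below applies that interpretation to the documented output; it is an assumption about the semantics, not the library's actual implementation, and `top_level` is a hypothetical helper name.

```python
def top_level(entities):
    """Keep entities whose span is not contained in a different, larger span."""
    result = []
    for ent in entities:
        start, end = ent["span"]
        contained = any(
            other["span"] != ent["span"]
            and other["span"][0] <= start
            and end <= other["span"][1]
            for other in entities
        )
        if not contained:
            result.append(ent)
    return result


# Entities copied from the nested example above.
entities = [
    {"text": ["", "ห้า"], "span": [7, 9], "entity_type": "cardinal"},
    {"text": ["", "ห้า", "", "โมง"], "span": [7, 11], "entity_type": "time"},
    {"text": ["", "โมง"], "span": [9, 11], "entity_type": "unit"},
]
print(top_level(entities))
```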

class pythainlp.tag.thainer.ThaiNameTagger(version: str = '1.4')

Thai named-entity recognizer (Thai NER). This class supports Thai NER versions 1.4 and 1.5 only.

Parameters:

version (str) – Thai NER version; 1.4 and 1.5 are supported. The default value is 1.4.

Example:

>>> from pythainlp.tag.thainer import ThaiNameTagger

>>> thainer14 = ThaiNameTagger(version="1.4")
>>> thainer14.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")

__init__(version: str = '1.4') → None

Thai named-entity recognizer.

Parameters:

version (str) – Thai NER version. Thai NER 1.4 and 1.5 are supported. The default value is 1.4.

crf: CRFTagger

pos_tag_name: str

get_ner(text: str, pos: bool = True, tag: bool = False) → list[tuple[str, str]] | list[tuple[str, str, str]] | str

This function tags named-entities in text in IOB format.

Parameters:

text (str) – text in Thai to be tagged
pos (bool) – To include POS tags in the results (True) or exclude (False). The default value is True
tag (bool) – output HTML-like tags.

Returns:

a list of tuples of tokenized words, POS tags (if pos is True), and NER tags; if tag is True, a string with HTML-like tags is returned instead

Return type:

Union[list[tuple[str, str]], list[tuple[str, str, str]], str]

Note:

For the POS tags to be included in the results, this function uses pythainlp.tag.pos_tag() with engine perceptron and corpus orchid_ud.

Example:

>>> from pythainlp.tag.thainer import ThaiNameTagger
>>>
>>> ner = ThaiNameTagger()
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")
[('วันที่', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('15', 'NUM', 'B-DATE'), (' ', 'PUNCT', 'I-DATE'),
('ก.ย.', 'NOUN', 'I-DATE'), (' ', 'PUNCT', 'I-DATE'),
('61', 'NUM', 'I-DATE'), (' ', 'PUNCT', 'O'),
('ทดสอบ', 'VERB', 'O'), ('ระบบ', 'NOUN', 'O'),
('เวลา', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('14', 'NOUN', 'B-TIME'), (':', 'PUNCT', 'I-TIME'),
('49', 'NUM', 'I-TIME'), (' ', 'PUNCT', 'I-TIME'),
('น.', 'NOUN', 'I-TIME')]
>>>
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.",
...             pos=False)
[('วันที่', 'O'), (' ', 'O'),
('15', 'B-DATE'), (' ', 'I-DATE'),
('ก.ย.', 'I-DATE'), (' ', 'I-DATE'),
('61', 'I-DATE'), (' ', 'O'),
('ทดสอบ', 'O'), ('ระบบ', 'O'),
('เวลา', 'O'), (' ', 'O'),
('14', 'B-TIME'), (':', 'I-TIME'),
('49', 'I-TIME'), (' ', 'I-TIME'),
('น.', 'I-TIME')]
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.",
...             tag=True)
'วันที่ <DATE>15 ก.ย. 61</DATE> ทดสอบระบบเวลา <TIME>14:49 น.</TIME>'

Tagger Engines

perceptron

The perceptron tagger is a part-of-speech tagger that uses the averaged, structured perceptron algorithm.

unigram

The unigram tagger tags each word independently and does not take the order of the words in the list into account.
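The idea behind a unigram tagger can be sketched in a few lines: for each word, remember the tag it received most often in the training data, and look words up one at a time. This is a generic illustration with hypothetical names (`train_unigram`, `unigram_tag`) and toy training data, not PyThaiNLP's implementation; returning None for unseen words is an assumption made here to mirror the None values in the pos_tag example output.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def unigram_tag(model, words):
    """Tag each word on its own; unseen words get None (assumption)."""
    return [(word, model.get(word)) for word in words]


# Toy training data, made up for this example.
train = [
    [("นก", "NOUN"), ("บิน", "VERB")],
    [("นก", "NOUN"), ("กลับ", "VERB"), ("รัง", "NOUN")],
]
model = train_unigram(train)
print(unigram_tag(model, ["รัง", "นก", "ปลา"]))
```

Because each word is looked up in isolation, shuffling the input list changes only the order of the output, never the tag assigned to a given word.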

References