pythainlp.tag

The pythainlp.tag module contains functions for tagging the parts of a text, such as part-of-speech tagging and named-entity recognition.

Modules

pythainlp.tag.pos_tag(words: List[str], engine: str = 'perceptron', corpus: str = 'orchid') → List[Tuple[str, str]]

Tags each word in a list of tokenized words with its part of speech.

Parameters

words (list) – a list of tokenized words
engine (str) – name of the tagger engine:
- unigram - unigram tagger
- perceptron - perceptron tagger (default)
- artagger - RDR POS tagger
corpus (str) – name of the annotated corpus:
- orchid - annotated Thai academic articles (default)
- orchid_ud - annotated Thai academic articles, using Universal Dependencies tags
- pud - Parallel Universal Dependencies (PUD) treebanks

Returns

a list of (word, tag) tuples, where each tag is the part-of-speech label assigned to that word

pythainlp.tag.pos_tag_sents(sentences: List[List[str]], engine: str = 'perceptron', corpus: str = 'orchid') → List[List[Tuple[str, str]]]

Tags each word in a list of tokenized sentences with its part of speech.

Parameters

sentences (list) – a list of sentences, each given as a list of tokenized words
engine (str) – name of the tagger engine:
- unigram - unigram tagger
- perceptron - perceptron tagger (default)
- artagger - RDR POS tagger
corpus (str) – name of the annotated corpus:
- orchid - annotated Thai academic articles (default)
- orchid_ud - annotated Thai academic articles, using Universal Dependencies tags
- pud - Parallel Universal Dependencies (PUD) treebanks

Returns

a list of tagged sentences, each a list of (word, tag) tuples, where each tag is the part-of-speech label assigned to that word
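The two signatures suggest that sentence-level tagging has the shape of word-level tagging applied to each sentence in turn. The sketch below illustrates only that shape relationship. `tag_sentences` and `placeholder_tagger` are hypothetical names, the tag "X" is a dummy value rather than real tagger output, and nothing here is taken from how pos_tag_sents is actually implemented.

```python
from typing import Callable, List, Tuple

TaggedWord = Tuple[str, str]
WordTagger = Callable[[List[str]], List[TaggedWord]]


def tag_sentences(sentences: List[List[str]], tag_words: WordTagger) -> List[List[TaggedWord]]:
    """Apply a word-level tagger to every sentence, keeping sentence boundaries."""
    return [tag_words(sentence) for sentence in sentences]


def placeholder_tagger(words: List[str]) -> List[TaggedWord]:
    """Stand-in for a real tagger: labels every word with the dummy tag "X"."""
    return [(word, "X") for word in words]


sentences = [["หนองคาย", "น่าอยู่"], ["ทดสอบ", "ระบบ"]]
tagged = tag_sentences(sentences, placeholder_tagger)
# `tagged` holds one inner list per sentence and one (word, tag) tuple per word.
```

With the real library, a word-level tagger such as pos_tag would presumably take the place of `placeholder_tagger`.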

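pythainlp.tag.tag_provinces(tokens: List[str]) → List[Tuple[str, str]]

Recognizes the names of Thai provinces in a tokenized text.

Takes a list of words and returns a list of (word, tag) tuples. In the example below, the province name is tagged B-LOCATION and the other word is tagged O.

Example:

>>> from pythainlp.tag import tag_provinces
>>> text = ['หนองคาย', 'น่าอยู่']
>>> tag_provinces(text)
[('หนองคาย', 'B-LOCATION'), ('น่าอยู่', 'O')]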
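The behavior in this example is consistent with a simple lookup of each token in a list of province names. The following is a minimal sketch of that idea, not the library's code: `tag_provinces_sketch` and `SAMPLE_PROVINCES` are hypothetical names, and the three-entry province set is a deliberately tiny stand-in for a complete list.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical, deliberately tiny sample; a real list would cover every province.
SAMPLE_PROVINCES: FrozenSet[str] = frozenset({"หนองคาย", "เชียงใหม่", "ภูเก็ต"})


def tag_provinces_sketch(
    tokens: List[str], provinces: FrozenSet[str] = SAMPLE_PROVINCES
) -> List[Tuple[str, str]]:
    """Tag tokens found in `provinces` as B-LOCATION and everything else as O."""
    return [
        (token, "B-LOCATION" if token in provinces else "O")
        for token in tokens
    ]


result = tag_provinces_sketch(["หนองคาย", "น่าอยู่"])
# result == [('หนองคาย', 'B-LOCATION'), ('น่าอยู่', 'O')]
```

Because the lookup is exact, a province name only matches when the tokenizer has kept it as a single token.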

class pythainlp.tag.named_entity.ThaiNameTagger

get_ner(text: str, pos: bool = True) → Union[List[Tuple[str, str]], List[Tuple[str, str, str]]]

Gets the named entities in a text.

Parameters

text (str) – Thai text
pos (bool) – whether to include the part-of-speech tag of each token (True) or omit it (False)

Returns

a list of (word, named-entity tag) tuples, or of (word, part-of-speech tag, named-entity tag) tuples when pos is True

Example:

>>> from pythainlp.tag.named_entity import ThaiNameTagger
>>> ner = ThaiNameTagger()
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.")
[('วันที่', 'NOUN', 'O'), (' ', 'PUNCT', 'O'), ('15', 'NUM', 'B-DATE'),
(' ', 'PUNCT', 'I-DATE'), ('ก.ย.', 'NOUN', 'I-DATE'),
(' ', 'PUNCT', 'I-DATE'), ('61', 'NUM', 'I-DATE'),
(' ', 'PUNCT', 'O'), ('ทดสอบ', 'VERB', 'O'),
('ระบบ', 'NOUN', 'O'), ('เวลา', 'NOUN', 'O'), (' ', 'PUNCT', 'O'),
('14', 'NOUN', 'B-TIME'), (':', 'PUNCT', 'I-TIME'), ('49', 'NUM', 'I-TIME'),
(' ', 'PUNCT', 'I-TIME'), ('น.', 'NOUN', 'I-TIME')]
>>> ner.get_ner("วันที่ 15 ก.ย. 61 ทดสอบระบบเวลา 14:49 น.", pos=False)
[('วันที่', 'O'), (' ', 'O'), ('15', 'B-DATE'), (' ', 'I-DATE'),
('ก.ย.', 'I-DATE'), (' ', 'I-DATE'), ('61', 'I-DATE'), (' ', 'O'),
('ทดสอบ', 'O'), ('ระบบ', 'O'), ('เวลา', 'O'), (' ', 'O'), ('14', 'B-TIME'),
(':', 'I-TIME'), ('49', 'I-TIME'), (' ', 'I-TIME'), ('น.', 'I-TIME')]
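The tags in this output follow a B-/I-/O pattern: B- opens an entity, I- continues it, and O marks tokens outside any entity. To work with whole entities rather than per-token tags, the tokens can be grouped as in the sketch below. `group_entities` is a hypothetical helper, not part of pythainlp, and the input is the pos=False output copied from the example above.

```python
from typing import List, Optional, Tuple


def group_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (label, text) entity spans."""
    entities: List[Tuple[str, str]] = []
    label: Optional[str] = None
    parts: List[str] = []
    for token, tag in tagged:
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if label is not None:
                entities.append((label, "".join(parts)))
            label, parts = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the entity that is currently open.
            parts.append(token)
        else:
            # "O" or a stray I- tag: close the open entity, if any.
            if label is not None:
                entities.append((label, "".join(parts)))
            label, parts = None, []
    if label is not None:
        entities.append((label, "".join(parts)))
    return entities


ner_output = [
    ('วันที่', 'O'), (' ', 'O'), ('15', 'B-DATE'), (' ', 'I-DATE'),
    ('ก.ย.', 'I-DATE'), (' ', 'I-DATE'), ('61', 'I-DATE'), (' ', 'O'),
    ('ทดสอบ', 'O'), ('ระบบ', 'O'), ('เวลา', 'O'), (' ', 'O'), ('14', 'B-TIME'),
    (':', 'I-TIME'), ('49', 'I-TIME'), (' ', 'I-TIME'), ('น.', 'I-TIME'),
]
entities = group_entities(ner_output)
# entities == [('DATE', '15 ก.ย. 61'), ('TIME', '14:49 น.')]
```

Tokens are joined without a separator because the tagger output already includes the spaces as tokens.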

Tagger Engines

perceptron

The perceptron tagger performs part-of-speech tagging using the averaged, structured perceptron algorithm.
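To give a feel for the averaged perceptron idea, here is a simplified, greedy sketch. It is not the library's implementation: the feature set (current word, previous tag, bias), the class and function names, and the tiny English training set are all invented for illustration. Weights are nudged toward the correct tag after each mistake, and the final weights are the average over all training steps, which tends to make the model less sensitive to the last few updates.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

START = "<s>"  # pseudo-tag standing in for "no previous word"


def extract_features(words: List[str], i: int, prev_tag: str) -> List[str]:
    """Invented minimal feature set: current word, previous tag, and a bias."""
    return [f"word={words[i]}", f"prev_tag={prev_tag}", "bias"]


class AveragedPerceptron:
    def __init__(self, tags: Iterable[str]) -> None:
        self.tags = sorted(tags)
        self.weights: Dict[Tuple[str, str], float] = defaultdict(float)
        self._totals: Dict[Tuple[str, str], float] = defaultdict(float)
        self._steps = 0

    def predict(self, feats: List[str]) -> str:
        def score(tag: str) -> float:
            return sum(self.weights.get((f, tag), 0.0) for f in feats)
        return max(self.tags, key=score)

    def update(self, feats: List[str], gold: str, guess: str) -> None:
        if guess != gold:
            for f in feats:
                self.weights[(f, gold)] += 1.0
                self.weights[(f, guess)] -= 1.0
        # Naive averaging: add the current weights to a running total each step.
        self._steps += 1
        for key, weight in self.weights.items():
            self._totals[key] += weight

    def average(self) -> None:
        for key in self.weights:
            self.weights[key] = self._totals[key] / self._steps


def train(sentences: List[List[Tuple[str, str]]], epochs: int = 5) -> AveragedPerceptron:
    model = AveragedPerceptron({tag for sent in sentences for _, tag in sent})
    for _ in range(epochs):
        for sent in sentences:
            words = [word for word, _ in sent]
            prev = START
            for i, (_, gold) in enumerate(sent):
                feats = extract_features(words, i, prev)
                model.update(feats, gold, model.predict(feats))
                prev = gold  # use the gold tag as history while training
    model.average()
    return model


def tag(model: AveragedPerceptron, words: List[str]) -> List[Tuple[str, str]]:
    prev, tagged = START, []
    for i, word in enumerate(words):
        prev = model.predict(extract_features(words, i, prev))
        tagged.append((word, prev))
    return tagged


# Invented toy data, for illustration only.
TRAIN = [
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("cats", "NOUN"), ("sleep", "VERB")],
]
model = train(TRAIN)
prediction = tag(model, ["cats", "bark"])
```

Unlike a unigram lookup, the previous-tag feature lets the prediction for a word depend on what came before it.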

unigram

The unigram tagger tags each word without taking the order of the words in the list into account.
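A unigram tagger in the usual sense assigns each word the tag it carried most often in the training data, looking at nothing but the word itself. The sketch below shows that idea under stated assumptions: `train_unigram` and `unigram_tag` are hypothetical names, the English training data is invented, and the library's own tagger may differ in details such as how unknown words are handled.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def train_unigram(tagged_sentences: List[List[Tuple[str, str]]]) -> Dict[str, str]:
    """Map each word to the tag it was given most often in the training data."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(
    words: List[str], model: Dict[str, str], default: Optional[str] = None
) -> List[Tuple[str, Optional[str]]]:
    """Tag each word by lookup alone; unknown words get `default`."""
    return [(word, model.get(word, default)) for word in words]


# Invented toy data: "fish" appears twice as NOUN and once as VERB.
TRAIN = [
    [("fish", "NOUN"), ("swim", "VERB")],
    [("I", "PRON"), ("fish", "VERB")],
    [("fish", "NOUN")],
]
model = train_unigram(TRAIN)
forward = unigram_tag(["I", "fish"], model)
backward = unigram_tag(["fish", "I"], model)
# "fish" is tagged NOUN in both orders, even after "I", where it is a verb.
```

This is the trade-off of ignoring word order: lookup is fast and simple, but a word with more than one possible tag always receives the same one.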

References

1: Virach Sornlertlamvanich, Naoto Takahashi and Hitoshi Isahara. Building a Thai Part-Of-Speech Tagged Corpus (ORCHID). The Journal of the Acoustical Society of Japan (E), Vol. 20, No. 3, pp. 189-198, May 1999.