pythainlp.chunk

The pythainlp.chunk module provides phrase structure chunking for Thai text, following the NLTK nltk.chunk naming convention.

Chunking groups POS-tagged tokens into phrase-structure chunks. The result is returned as one label per token, in Inside-Outside-Beginning (IOB) format.

  • The B- prefix marks the first token of a chunk.

  • The I- prefix marks a token inside (continuing) a chunk.

  • O marks a token that does not belong to any chunk.
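The labeling scheme above can be read back into phrases with a few lines of plain Python. The following is a minimal sketch, not part of pythainlp: `iob_to_chunks` is a hypothetical helper, the sample labels are hand-written for illustration, and treating a stray I- label as the start of a new chunk is an assumption made here for leniency.

```python
def iob_to_chunks(tokens, labels):
    """Group tokens into (chunk_type, [tokens]) spans from IOB labels.

    Hypothetical helper for illustration; not a pythainlp function.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    chunks = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        # "B-NP" -> ("B", "-", "NP"); "O" -> ("O", "", "")
        prefix, _, chunk_type = label.partition("-")
        # An I- label continues the open chunk only if the type matches.
        if prefix == "I" and chunk_type == current_type:
            current_tokens.append(token)
            continue
        # Any other label closes the chunk that is currently open.
        if current_type is not None:
            chunks.append((current_type, current_tokens))
            current_type, current_tokens = None, []
        # B- opens a new chunk; a stray I- is treated like B- (assumption).
        # O has no chunk type, so nothing is opened.
        if prefix in ("B", "I") and chunk_type:
            current_type, current_tokens = chunk_type, [token]
    if current_type is not None:
        chunks.append((current_type, current_tokens))
    return chunks


# Hand-written sample data, for illustration only.
tokens = ["ผม", "รัก", "คุณ"]
labels = ["B-NP", "B-VP", "I-VP"]
chunks = iob_to_chunks(tokens, labels)
print(chunks)
```

Tokens labeled O are dropped from the result; a caller that needs them could emit them as single-token spans instead.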

Modules

chunk_parse

pythainlp.chunk.chunk_parse(sent: list[tuple[str, str]], engine: str = 'crf', corpus: str = 'orchidpp') -> list[str]

Parse a Thai sentence into phrase-structure chunks (IOB format).

Parameters:
  • sent (list[tuple[str, str]]) – list of (word, POS-tag) pairs.

  • engine (str) – chunking engine; currently only "crf" is supported.

  • corpus (str) – corpus name for the CRF model; currently only "orchidpp" is supported.

Returns:

list of IOB chunk labels, one per token.

Return type:

list[str]

Example:

from pythainlp.chunk import chunk_parse
from pythainlp.tag import pos_tag

tokens = ["ผม", "รัก", "คุณ"]
tokens_pos = pos_tag(tokens, engine="perceptron", corpus="orchid")

print(chunk_parse(tokens_pos))
# output: ['B-NP', 'B-VP', 'I-VP']
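Because chunk_parse returns one label per input token, the labels can be zipped back onto the (word, POS-tag) pairs. The sketch below shows one possible way to build tab-separated word/POS/chunk rows. `to_rows` is a hypothetical helper, and the POS tags in the sample data are placeholders chosen for illustration, not actual pos_tag output.

```python
def to_rows(tokens_pos, labels):
    """Build 'word<TAB>pos<TAB>chunk' rows from parallel sequences.

    Hypothetical helper for illustration; not a pythainlp function.
    """
    if len(tokens_pos) != len(labels):
        raise ValueError("expected exactly one chunk label per token")
    return [
        f"{word}\t{pos}\t{label}"
        for (word, pos), label in zip(tokens_pos, labels)
    ]


# Placeholder POS tags; real tags come from pythainlp.tag.pos_tag.
tokens_pos = [("ผม", "PPRS"), ("รัก", "VACT"), ("คุณ", "PPRS")]
# Labels in the shape returned by chunk_parse, written by hand here.
labels = ["B-NP", "B-VP", "I-VP"]

rows = to_rows(tokens_pos, labels)
print("\n".join(rows))
```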

CRFChunkParser

class pythainlp.chunk.CRFChunkParser(corpus: str = 'orchidpp')

CRF-based chunk parser for Thai text.

Parses a POS-tagged sentence into phrase-structure chunks (IOB format), following the NLTK nltk.chunk.ChunkParserI convention.

This class supports the context manager protocol for deterministic resource cleanup:

from pythainlp.chunk import CRFChunkParser

# tokens_pos is a list of (word, POS-tag) pairs,
# for example the output of pythainlp.tag.pos_tag
with CRFChunkParser() as parser:
    result = parser.parse(tokens_pos)

Parameters:

corpus (str) – corpus name for the CRF model (default: "orchidpp").

tagger: CRFTagger

xseq: list[dict[str, str | bool]]

__init__(corpus: str = 'orchidpp') -> None

corpus: str

load_model(corpus: str) -> None

Load the CRF model for the given corpus.

Parameters:

corpus (str) – corpus name.

parse(token_pos: list[tuple[str, str]]) -> list[str]

Parse a POS-tagged sentence into IOB chunk labels.

Parameters:

token_pos (list[tuple[str, str]]) – list of (word, POS-tag) pairs.

Returns:

list of IOB chunk labels, one per token.

Return type:

list[str]
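Since parse takes one sentence at a time, a single parser instance can be reused across several sentences inside one with block. The following sketch illustrates that pattern under stated assumptions: `parse_sentences` and `StubParser` are hypothetical names, and `StubParser` is a stand-in that labels every token O so the example runs without pythainlp or its model files. It does not reflect how CRFChunkParser is implemented.

```python
from typing import Protocol


class ChunkParser(Protocol):
    """Anything with a parse() method shaped like CRFChunkParser.parse."""

    def parse(self, token_pos: list[tuple[str, str]]) -> list[str]: ...


def parse_sentences(
    parser: ChunkParser, sentences: list[list[tuple[str, str]]]
) -> list[list[str]]:
    """Parse several POS-tagged sentences with one parser instance."""
    return [parser.parse(sent) for sent in sentences]


class StubParser:
    """Hypothetical stand-in for a real parser; labels every token 'O'."""

    def __enter__(self) -> "StubParser":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # A real parser would release its model resources here.
        return None

    def parse(self, token_pos: list[tuple[str, str]]) -> list[str]:
        return ["O"] * len(token_pos)


# Placeholder POS tags, for illustration only.
sentences = [
    [("ผม", "PPRS"), ("รัก", "VACT"), ("คุณ", "PPRS")],
    [("ฝน", "NCMN"), ("ตก", "VACT")],
]

with StubParser() as parser:
    results = parse_sentences(parser, sentences)

print(results)
```

With pythainlp installed, `StubParser()` could be swapped for `CRFChunkParser()`, which is documented above as supporting the same context manager usage.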