pythainlp.chunk
The pythainlp.chunk module provides phrase structure chunking for Thai
text, following the NLTK nltk.chunk naming convention.
Chunking groups POS-tagged tokens into phrase structure chunks and returns labels in Inside-Outside-Beginning (IOB) format.
- The B- prefix marks the first token of a chunk.
- The I- prefix marks a token inside (continuing) a chunk.
- O marks a token that does not belong to any chunk.
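To make the IOB scheme concrete, the following is a minimal sketch of how a flat label sequence can be folded back into chunks. The helper `iob_to_chunks` is hypothetical and not part of pythainlp; it only assumes labels of the form `B-X`, `I-X`, or `O` as described above.

```python
def iob_to_chunks(tokens, labels):
    """Group tokens into (chunk_type, [tokens]) pairs based on IOB labels.

    Hypothetical helper for illustration; not part of pythainlp.
    """
    chunks = []
    current_type = None
    current_tokens = []
    for token, label in zip(tokens, labels):
        if label == "O":
            # Outside any chunk: close whatever chunk is currently open.
            if current_tokens:
                chunks.append((current_type, current_tokens))
            current_type, current_tokens = None, []
            continue
        prefix, _, chunk_type = label.partition("-")
        if prefix == "B" or chunk_type != current_type:
            # A B- label (or an I- label with no matching open chunk)
            # starts a new chunk.
            if current_tokens:
                chunks.append((current_type, current_tokens))
            current_type, current_tokens = chunk_type, [token]
        else:
            # An I- label of the same type continues the open chunk.
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_type, current_tokens))
    return chunks


tokens = ["ผม", "รัก", "คุณ"]
labels = ["B-NP", "B-VP", "I-VP"]
chunks = iob_to_chunks(tokens, labels)
print(chunks)
```

With these labels, the first token forms a one-token NP chunk and the remaining two tokens form a VP chunk.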
Modules
chunk_parse
- pythainlp.chunk.chunk_parse(sent: list[tuple[str, str]], engine: str = 'crf', corpus: str = 'orchidpp') -> list[str]
Parse a Thai sentence into phrase-structure chunks (IOB format).
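- Parameters:
sent (list[tuple[str, str]]) – POS-tagged sentence, as a list of (token, POS tag) tuples.
engine (str) – chunking engine (default: "crf").
corpus (str) – corpus name for the model (default: "orchidpp").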
- Returns:
list of IOB chunk labels, one per token.
- Return type:
list[str]
- Example:
    from pythainlp.chunk import chunk_parse
    from pythainlp.tag import pos_tag

    tokens = ["ผม", "รัก", "คุณ"]
    tokens_pos = pos_tag(tokens, engine="perceptron", corpus="orchid")
    print(chunk_parse(tokens_pos))
    # output: ['B-NP', 'B-VP', 'I-VP']
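Because the result is one label per token, it can be aligned with the POS-tagged input to produce column-style rows. The sketch below does this without calling pythainlp. The helper `align_chunks` is hypothetical, the chunk labels are copied from the example output above, and the POS tags are illustrative placeholders rather than actual tagger output.

```python
def align_chunks(tokens_pos, labels):
    """Return (token, pos, chunk_label) triples, requiring one label per token.

    Hypothetical helper for illustration; not part of pythainlp.
    """
    if len(tokens_pos) != len(labels):
        raise ValueError(
            f"expected {len(tokens_pos)} labels, got {len(labels)}"
        )
    return [(token, pos, label) for (token, pos), label in zip(tokens_pos, labels)]


# Illustrative input: the POS tags are placeholders, not real pos_tag output.
tokens_pos = [("ผม", "PPRS"), ("รัก", "VACT"), ("คุณ", "PPRS")]
# Chunk labels as shown in the chunk_parse example output.
labels = ["B-NP", "B-VP", "I-VP"]

rows = align_chunks(tokens_pos, labels)
for token, pos, label in rows:
    # One tab-separated row per token: token, POS tag, chunk label.
    print(f"{token}\t{pos}\t{label}")
```

The length check is a design choice: a mismatch between tokens and labels usually signals a bug upstream, so failing loudly is safer than silently truncating with `zip`.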
CRFChunkParser
- class pythainlp.chunk.CRFChunkParser(corpus: str = 'orchidpp')
CRF-based chunk parser for Thai text.
Parses a POS-tagged sentence into phrase-structure chunks (IOB format), following the NLTK nltk.chunk.ChunkParserI convention.
This class supports the context manager protocol for deterministic resource cleanup:
    from pythainlp.chunk import CRFChunkParser

    # tokens_pos: list of (token, POS tag) tuples, e.g. from pythainlp.tag.pos_tag
    with CRFChunkParser() as parser:
        result = parser.parse(tokens_pos)
- Parameters:
corpus (str) – corpus name for the CRF model (default: "orchidpp").
- tagger: CRFTagger
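The context manager protocol mentioned for CRFChunkParser is the standard Python `__enter__`/`__exit__` pair. The sketch below illustrates the general pattern with a hypothetical stand-in class, `ResourceHolder`. It is not CRFChunkParser's actual implementation, and how that class releases its resources internally is not assumed here.

```python
class ResourceHolder:
    """Hypothetical stand-in showing the context manager protocol.

    Not pythainlp code; it only tracks whether cleanup has run.
    """

    def __init__(self):
        self.closed = False

    def close(self):
        # Release whatever the object holds; here we only flip a flag.
        self.closed = True

    def __enter__(self):
        # The value returned here is bound by "with ... as name".
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the block exits, whether normally or via an exception.
        self.close()
        return False  # do not suppress exceptions raised in the block


with ResourceHolder() as holder:
    open_inside_block = not holder.closed

print(open_inside_block, holder.closed)
```

Cleanup runs as soon as the `with` block ends, rather than whenever the garbage collector reclaims the object, which is what makes it deterministic.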