pythainlp.summarize

The pythainlp.summarize module provides functions for Thai text summarization and keyword extraction.

Functions

pythainlp.summarize.summarize(text: str, n: int = 1, engine: str = 'frequency', tokenizer: str = 'newmm') → list[str]

Summarizes text. The default frequency engine ranks sentences by word frequency.

Under the hood, the frequency engine first splits the given text into sentences with pythainlp.tokenize.sent_tokenize(). It then tokenizes each sentence into words with pythainlp.tokenize.word_tokenize(), counts word frequencies across all sentences, and normalizes them by the maximum word frequency. Words whose normalized frequency is less than 0.1 or greater than 0.9 are removed from the frequency dictionary. Finally, it picks the n sentences with the highest score, where a sentence's score is the sum of the normalized frequencies of its words that remain in the frequency dictionary.
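The scoring steps described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name summarize_by_frequency is hypothetical, the input is assumed to be already split into sentences and words, and the handling of values exactly equal to 0.1 or 0.9 is an assumption.

```python
from collections import Counter


def summarize_by_frequency(sentences, n=1, low=0.1, high=0.9):
    """Pick the n sentences with the highest summed normalized word frequency.

    `sentences` is a list of token lists (hypothetical input format): the
    text is assumed to be already split into sentences and words.
    """
    # Count every token across all sentences.
    counts = Counter(token for sent in sentences for token in sent)
    if not counts:
        return []
    max_count = max(counts.values())

    # Normalize by the maximum frequency, then drop words whose normalized
    # frequency is below `low` or above `high` (boundary handling assumed).
    freq = {}
    for word, count in counts.items():
        normalized = count / max_count
        if low <= normalized <= high:
            freq[word] = normalized

    # Score each sentence by summing the frequencies of its retained words.
    scored = [
        (sum(freq.get(token, 0.0) for token in sent), index)
        for index, sent in enumerate(sentences)
    ]
    # Highest score first; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Thai is written without spaces, so tokens are joined directly.
    return ["".join(sentences[index]) for _, index in scored[:n]]
```

In this sketch the most frequent word always normalizes to 1.0 and is therefore dropped, so very common words do not dominate the sentence scores.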

Parameters:

text (str) – text to be summarized
n (int) – number of sentences to include in the summary. Default: 1 (used by the frequency engine only).
engine (str) – text summarization engine. Default: 'frequency'.
tokenizer (str) – word tokenizer engine name (see pythainlp.tokenize.word_tokenize()). Default: 'newmm' (used by the frequency engine only).

Returns:

list of selected sentences

Return type:

list[str]

Options for engine

frequency (default) - word frequency
mt5 - mT5-small model
mt5-small - mT5-small model
mt5-base - mT5-base model
mt5-large - mT5-large model
mt5-xl - mT5-xl model
mt5-xxl - mT5-xxl model
mt5-cpe-kmutt-thai-sentence-sum - mT5 Thai sentence summarization by CPE KMUTT

Example:

>>> from pythainlp.summarize import summarize

>>> text = '''
...         ทำเนียบท่าช้าง หรือ วังถนนพระอาทิตย์
...         ตั้งอยู่บนถนนพระอาทิตย์ เขตพระนคร กรุงเทพมหานคร
...         เดิมเป็นบ้านของเจ้าพระยามหาโยธา (ทอเรียะ คชเสนี)
...         บุตรเจ้าพระยามหาโยธานราธิบดีศรีพิชัยณรงค์ (พญาเจ่ง)
...         ต้นสกุลคชเสนี เชื้อสายมอญ เจ้าพระยามหาโยธา (ทอเรีย)
...         เป็นปู่ของเจ้าจอมมารดากลิ่นในพระบาทสมเด็จพระจอมเกล้าเจ้าอยู่หัว
...         และเป็นมรดกตกทอดมาถึง พระเจ้าบรมวงศ์เธอ กรมพระนเรศรวรฤทธิ์
...         (พระองค์เจ้ากฤดาภินิหาร)
...         ต่อมาในรัชสมัยพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัวโปรดเกล้าฯ
...         ให้สร้างตำหนัก 2 ชั้น
...         เป็นที่ประทับของพระเจ้าบรมวงศ์เธอ
...         กรมพระนเรศวรฤทิธิ์และเจ้าจอมมารดา
...         ต่อมาเรียกอาคารหลักนี้ว่า ตำหนักเดิม
...     '''

>>> summarize(text, n=1)
['บุตรเจ้าพระยามหาโยธานราธิบดีศรีพิชัยณรงค์']

>>> summarize(text, n=3)
['บุตรเจ้าพระยามหาโยธานราธิบดีศรีพิชัยณรงค์',
'เดิมเป็นบ้านของเจ้าพระยามหาโยธา',
'เจ้าพระยามหาโยธา']

>>> summarize(text, engine="mt5-small")
['<extra_id_0> ท่าช้าง หรือ วังถนนพระอาทิตย์
เขตพระนคร กรุงเทพมหานคร ฯลฯ ดังนี้:
ที่อยู่ - ศิลปวัฒนธรรม']

>>> text = "ถ้าพูดถึงขนมหวานในตำนานที่ชื่นใจที่สุดแล้วละก็ต้องไม่พ้น น้ำแข็งใส แน่ๆ เพราะว่าเป็นอะไรที่ชื่นใจสุดๆ"
>>> summarize(text, engine="mt5-cpe-kmutt-thai-sentence-sum")
['น้ําแข็งใสเป็นอะไรที่ชื่นใจที่สุด']

pythainlp.summarize.extract_keywords(text: str, keyphrase_ngram_range: tuple[int, int] = (1, 2), max_keywords: int = 5, min_df: int = 1, engine: str = 'keybert', tokenizer: str = 'newmm', stop_words: Iterable[str] | None = None) → list[str]

Return the most relevant keywords (and keyphrases) from a document.

Each algorithm may produce completely different keywords, so choose the algorithm carefully.

Note

Calling extract_keywords() is expensive. For repeated use of KeyBERT (the default engine), creating a KeyBERT object directly is recommended.
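A minimal sketch of the reuse pattern recommended in the note: build one KeyBERT object and share it across documents. The helper names build_extractor and extract_for_all are hypothetical, and the assumption that model loading accounts for most of the cost is not stated in this reference.

```python
def build_extractor():
    """Create one KeyBERT object to be shared across calls."""
    # Imported lazily so the helper below can be used with any object
    # that exposes an extract_keywords() method.
    from pythainlp.summarize.keybert import KeyBERT

    return KeyBERT()


def extract_for_all(extractor, documents, max_keywords=5):
    """Run keyword extraction over many documents with one shared extractor."""
    return [
        extractor.extract_keywords(doc, max_keywords=max_keywords)
        for doc in documents
    ]
```

A typical call would be extract_for_all(build_extractor(), documents), so the model is set up once rather than once per document.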

Parameters:

text (str) – text to extract keywords from
keyphrase_ngram_range (tuple[int, int]) – token range for keywords. (1, 1) allows unigrams only (e.g. “เสา”, “ไฟฟ้า”); (1, 2) allows unigrams and bigrams (e.g. “เสา”, “ไฟฟ้า”, “เสาไฟฟ้า”). Default: (1, 2).
max_keywords (int) – maximum number of keywords to return. Default: 5.
min_df (int) – minimum term frequency to qualify as keyword. Default: 1.
engine (str) – keyword extraction algorithm. Default: 'keybert'.
tokenizer (str) – tokenizer engine name. See pythainlp.tokenize.word_tokenize() for options. Default: 'newmm'.
stop_words (collections.abc.Iterable[str] or None) – words to ignore. If None, pythainlp.corpus.thai_stopwords() is used. Default: None.

Returns:

list of keywords

Return type:

list[str]
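To illustrate what keyphrase_ngram_range controls, the sketch below builds candidate keyphrases from a token list. It is an illustration only: the function name ngram_candidates is hypothetical, and joining tokens without a separator is an assumption based on the "เสาไฟฟ้า" example in the parameter description.

```python
def ngram_candidates(tokens, ngram_range=(1, 2)):
    """Return unique n-grams of the allowed sizes, in first-seen order."""
    low, high = ngram_range
    seen = set()
    candidates = []
    for size in range(low, high + 1):
        # Slide a window of `size` tokens over the token list.
        for start in range(len(tokens) - size + 1):
            phrase = "".join(tokens[start:start + size])
            if phrase not in seen:
                seen.add(phrase)
                candidates.append(phrase)
    return candidates
```

With tokens ["เสา", "ไฟฟ้า"], the range (1, 1) yields only the two unigrams, while (1, 2) also yields the bigram "เสาไฟฟ้า".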

Options for engine

keybert (default) - KeyBERT keyword extraction
frequency - word frequency

Example:

>>> from pythainlp.summarize import extract_keywords

>>> text = '''
...     อาหาร หมายถึง ของแข็งหรือของเหลว
...     ที่กินหรือดื่มเข้าสู่ร่างกายแล้ว
...     จะทำให้เกิดพลังงานและความร้อนแก่ร่างกาย
...     ทำให้ร่างกายเจริญเติบโต
...     ซ่อมแซมส่วนที่สึกหรอ ควบคุมการเปลี่ยนแปลงต่างๆ ในร่างกาย
...     ช่วยทำให้อวัยวะต่างๆ ทำงานได้อย่างปกติ
...     อาหารจะต้องไม่มีพิษและไม่เกิดโทษต่อร่างกาย
... '''

>>> keywords = extract_keywords(text)
>>> keywords
['อวัยวะต่างๆ', 'ซ่อมแซมส่วน', 'เจริญเติบโต', 'ควบคุมการเปลี่ยนแปลง', 'มีพิษ']

>>> keywords = extract_keywords(text, max_keywords=10)
>>> keywords
['อวัยวะต่างๆ', 'ซ่อมแซมส่วน', 'เจริญเติบโต', 'ควบคุมการเปลี่ยนแปลง', 'มีพิษ', 'ทำให้ร่างกาย', 'ร่างกายเจริญเติบโต', 'จะทำให้เกิด', 'มีพิษและ', 'เกิดโทษ']

Keyword extraction engines

KeyBERT

Minimal re-implementation of KeyBERT.

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

https://github.com/MaartenGr/KeyBERT
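The core idea, ranking candidate keyphrases by how similar their embeddings are to the document embedding, can be sketched as follows. This is a simplified illustration with hypothetical names (cosine, rank_candidates) and plain Python lists as vectors; it assumes cosine similarity as the measure and does not reproduce the module's actual code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(doc_vector, candidate_vectors, max_keywords=5):
    """Return up to max_keywords (candidate, score) pairs, best first.

    `candidate_vectors` maps each candidate phrase to its embedding.
    """
    scored = [
        (candidate, cosine(doc_vector, vector))
        for candidate, vector in candidate_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_keywords]
```

In a real pipeline the vectors would come from a BERT-style model; here they are supplied by the caller so the ranking step can be read on its own.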

class pythainlp.summarize.keybert.KeyBERT(model_name: str = 'airesearch/wangchanberta-base-att-spm-uncased')

__init__(model_name: str = 'airesearch/wangchanberta-base-att-spm-uncased') → None

ft_pipeline: Pipeline

extract_keywords(text: str, keyphrase_ngram_range: tuple[int, int] = (1, 2), max_keywords: int = 5, min_df: int = 1, tokenizer: str = 'newmm', return_similarity: bool = False, stop_words: Iterable[str] | None = None) → list[str] | list[tuple[str, float]]

Extract Thai keywords and/or keyphrases with KeyBERT algorithm. See https://github.com/MaartenGr/KeyBERT.

Parameters:

text (str) – text to extract keywords from
keyphrase_ngram_range (tuple[int, int]) – range of token counts allowed in a keyword. The token unit depends on the tokenizer. For instance, (1, 1) means each token (unigram) can be a keyword (e.g. “เสา”, “ไฟฟ้า”); (1, 2) means one or two consecutive tokens (unigram and bigram) can be keywords (e.g. “เสา”, “ไฟฟ้า”, “เสาไฟฟ้า”). (default: (1, 2))
max_keywords (int) – maximum number of keywords to return. (default: 5)
min_df (int) – minimum frequency required to be a keyword. (default: 1)
tokenizer (str) – name of the tokenizer engine to use. See pythainlp.tokenize.word_tokenize() for options. (default: 'newmm')
return_similarity (bool) – if True, return keyword scores. (default: False)
stop_words (Optional[Iterable[str]]) – words to ignore. If not specified, pythainlp.corpus.thai_stopwords() is used. (default: None)

Returns:

list of keywords, or list of (keyword, score) tuples if return_similarity is True

Example:

>>> from pythainlp.summarize.keybert import KeyBERT

>>> text = '''
...     อาหาร หมายถึง ของแข็งหรือของเหลว
...     ที่กินหรือดื่มเข้าสู่ร่างกายแล้ว
...     จะทำให้เกิดพลังงานและความร้อนแก่ร่างกาย
...     ทำให้ร่างกายเจริญเติบโต
...     ซ่อมแซมส่วนที่สึกหรอ ควบคุมการเปลี่ยนแปลงต่างๆ ในร่างกาย
...     ช่วยทำให้อวัยวะต่างๆ ทำงานได้อย่างปกติ
...     อาหารจะต้องไม่มีพิษและไม่เกิดโทษต่อร่างกาย
... '''

>>> kb = KeyBERT()

>>> keywords = kb.extract_keywords(text)
>>> keywords
['อวัยวะต่างๆ', 'ซ่อมแซมส่วน', 'เจริญเติบโต', 'ควบคุมการเปลี่ยนแปลง', 'มีพิษ']

>>> keywords = kb.extract_keywords(
...     text, max_keywords=10, return_similarity=True
... )
>>> keywords
[('อวัยวะต่างๆ', 0.3228477063109462), ('ซ่อมแซมส่วน', 0.31320597838000375), ('เจริญเติบโต', 0.29115434699705506), ('ควบคุมการเปลี่ยนแปลง', 0.2678430841321016), ('มีพิษ', 0.24996827960821494), ('ทำให้ร่างกาย', 0.23876962942443258), ('ร่างกายเจริญเติบโต', 0.23191285218852364), ('จะทำให้เกิด', 0.22425422716846247), ('มีพิษและ', 0.22162962875299588), ('เกิดโทษ', 0.20773497763458507)]

embed(docs: str | list[str]) → NDArray[np.float32]

Create embeddings by averaging vectors from the last hidden layer.

Parameters:

docs (str or list[str]) – input document or documents

Returns:

embeddings as a float32 array with one row per input document

Return type:

numpy.typing.NDArray[numpy.float32]
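The averaging step can be illustrated with a small mean-pooling sketch. It assumes the per-token vectors from the last hidden layer are already available as plain lists; the names mean_pool and embed_all are hypothetical, and details such as padding or special-token handling in the real method are not covered here.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    # zip(*...) walks the vectors dimension by dimension.
    return [sum(column) / count for column in zip(*token_vectors)]


def embed_all(docs_token_vectors):
    """Return one pooled vector (one row) per input document."""
    return [mean_pool(vectors) for vectors in docs_token_vectors]
```

The real method returns a NumPy float32 array rather than nested lists; the sketch only shows how one row per document arises from per-token vectors.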
