pythainlp.summarize

The pythainlp.summarize module provides Thai text summarization and keyword extraction.

Modules

pythainlp.summarize.summarize(text: str, n: int = 1, engine: str = 'frequency', tokenizer: str = 'newmm') → List[str]

This function summarizes text. With the default frequency engine, the summary is based on the frequency of words.

Under the hood, the frequency engine first splits the given text into sentences with pythainlp.tokenize.sent_tokenize(). It then tokenizes the words in all sentences (with pythainlp.tokenize.word_tokenize()), computes their frequencies, and normalizes them by the maximum word frequency. Words whose normalized frequency is less than 0.1 or greater than 0.9 are filtered out of the frequency dictionary. Finally, it picks the n sentences with the highest score, where a sentence's score is the sum of the normalized frequencies of its words that also appear in the frequency dictionary.
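The scoring steps described above can be sketched in plain Python. This is only an illustration that follows the description, not the library's actual implementation: the function name rank_sentences and its low/high parameters are hypothetical, and the input is assumed to be already split into sentences and words.

```python
from collections import Counter
from heapq import nlargest
from typing import Dict, List


def rank_sentences(
    tokenized_sentences: List[List[str]],
    n: int = 1,
    low: float = 0.1,
    high: float = 0.9,
) -> List[int]:
    """Return the indices of the n highest-scoring sentences (hypothetical helper)."""
    # Count every word across all sentences.
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    if not counts:
        return []

    # Normalize by the maximum frequency and keep only words in [low, high].
    max_count = max(counts.values())
    freq: Dict[str, float] = {}
    for word, count in counts.items():
        normalized = count / max_count
        if low <= normalized <= high:
            freq[word] = normalized

    # A sentence's score is the sum of the kept frequencies of its words.
    scores = [sum(freq.get(w, 0.0) for w in sent) for sent in tokenized_sentences]
    return nlargest(n, range(len(scores)), key=scores.__getitem__)


sentences = [["a", "b", "b"], ["a", "c"], ["a", "a", "b", "d", "d"]]
best = rank_sentences(sentences, n=2)
```

Note that the most frequent word always has a normalized frequency of 1.0, so under this reading of the thresholds it is always filtered out.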

Parameters:

text (str) – text to be summarized
n (int) – number of sentences to be included in the summary. By default, n is 1 (effective for frequency engine only).
engine (str) – text summarization engine. By default, engine is frequency.
tokenizer (str) – word tokenizer engine name (refer to pythainlp.tokenize.word_tokenize()). By default, tokenizer is newmm (effective for frequency engine only).

Returns:

list of selected sentences

Options for engine

frequency (default) - frequency of words
mt5 - mT5-small model
mt5-small - mT5-small model
mt5-base - mT5-base model
mt5-large - mT5-large model
mt5-xl - mT5-xl model
mt5-xxl - mT5-xxl model
mt5-cpe-kmutt-thai-sentence-sum - mT5 Thai sentence summarization by CPE KMUTT

Example:

from pythainlp.summarize import summarize

text = '''
        ทำเนียบท่าช้าง หรือ วังถนนพระอาทิตย์
        ตั้งอยู่บนถนนพระอาทิตย์ เขตพระนคร กรุงเทพมหานคร
        เดิมเป็นบ้านของเจ้าพระยามหาโยธา (ทอเรียะ คชเสนี)
        บุตรเจ้าพระยามหาโยธานราธิบดีศรีพิชัยณรงค์ (พญาเจ่ง)
        ต้นสกุลคชเสนี เชื้อสายมอญ เจ้าพระยามหาโยธา (ทอเรีย)
        เป็นปู่ของเจ้าจอมมารดากลิ่นในพระบาทสมเด็จพระจอมเกล้าเจ้าอยู่หัว
        และเป็นมรดกตกทอดมาถึง พระเจ้าบรมวงศ์เธอ กรมพระนเรศรวรฤทธิ์
        (พระองค์เจ้ากฤดาภินิหาร)
        ต่อมาในรัชสมัยพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัวโปรดเกล้าฯ
        ให้สร้างตำหนัก 2 ชั้น
        เป็นที่ประทับของพระเจ้าบรมวงศ์เธอ
        กรมพระนเรศวรฤทิธิ์และเจ้าจอมมารดา
        ต่อมาเรียกอาคารหลักนี้ว่า ตำหนักเดิม
    '''

summarize(text, n=1)
# output: ['บุตรเจ้าพระยามหาโยธานราธิบดีศรีพิชัยณรงค์']

summarize(text, n=3)
# output: ['บุตรเจ้าพระยามหาโยธานราธิบดีศรีพิชัยณรงค์',
# 'เดิมเป็นบ้านของเจ้าพระยามหาโยธา',
# 'เจ้าพระยามหาโยธา']

summarize(text, engine="mt5-small")
# output: ['<extra_id_0> ท่าช้าง หรือ วังถนนพระอาทิตย์
# เขตพระนคร กรุงเทพมหานคร ฯลฯ ดังนี้:
# ที่อยู่ - ศิลปวัฒนธรรม']

text = "ถ้าพูดถึงขนมหวานในตำนานที่ชื่นใจที่สุดแล้วละก็ต้องไม่พ้น น้ำแข็งใส แน่ๆ เพราะว่าเป็นอะไรที่ชื่นใจสุดๆ"
summarize(text, engine="mt5-cpe-kmutt-thai-sentence-sum")
# output: ['น้ําแข็งใสเป็นอะไรที่ชื่นใจที่สุด']

pythainlp.summarize.extract_keywords(text: str, keyphrase_ngram_range: Tuple[int, int] = (1, 2), max_keywords: int = 5, min_df: int = 1, engine: str = 'keybert', tokenizer: str = 'newmm', stop_words: Iterable[str] | None = None) → List[str]

This function returns the most relevant keywords (and/or keyphrases) from the input document. Different engines may produce completely different keywords, so choose the engine carefully.

Note: Calling extract_keywords() is expensive. For repeated use of KeyBERT (the default engine), creating a KeyBERT object once and reusing it is highly recommended.
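One way to follow this recommendation is to build the KeyBERT object once and share it across documents. The sketch below is an assumption about how one might organize this: get_keybert and keywords_for_documents are hypothetical helper names, and only the KeyBERT class and its extract_keywords() method come from this page.

```python
from functools import lru_cache
from typing import Iterable, List


@lru_cache(maxsize=1)
def get_keybert():
    """Build the KeyBERT object once; later calls return the cached instance."""
    # Imported lazily so the model is only loaded when it is first needed.
    from pythainlp.summarize.keybert import KeyBERT

    return KeyBERT()


def keywords_for_documents(docs: Iterable[str], extractor=None) -> List[list]:
    """Extract keywords for each document with a single shared extractor."""
    kb = extractor if extractor is not None else get_keybert()
    return [kb.extract_keywords(doc) for doc in docs]
```

Calling keywords_for_documents(list_of_texts) then loads the model a single time, however many texts are processed. The optional extractor argument exists only so that another object with an extract_keywords() method can be supplied, for example in tests.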

Parameters:

text (str) – text to extract keywords from
keyphrase_ngram_range (Tuple[int, int]) – Number of token units to be defined as keyword. The token unit varies w.r.t. tokenizer. For instance, (1, 1) means each token (unigram) can be a keyword (e.g. “เสา”, “ไฟฟ้า”), (1, 2) means one and two consecutive tokens (unigram and bigram) can be keywords (e.g. “เสา”, “ไฟฟ้า”, “เสาไฟฟ้า”) (default: (1, 2))
max_keywords (int) – Maximum number of keywords to be returned. (default: 5)
min_df (int) – Minimum frequency required to be a keyword. (default: 1)
engine (str) – Name of algorithm to use for keyword extraction. (default: ‘keybert’)
tokenizer (str) – Name of tokenizer engine to use. Refer to options in pythainlp.tokenize.word_tokenize(). (default: ‘newmm’)
stop_words (Optional[Iterable[str]]) – A list of stop words (i.e. words to be ignored). If not specified, pythainlp.corpus.thai_stopwords() is used. (default: None)

Returns:

list of keywords

Options for engine

keybert (default) - KeyBERT keyword extraction algorithm
frequency - frequency of words

Example:

from pythainlp.summarize import extract_keywords

text = '''
    อาหาร หมายถึง ของแข็งหรือของเหลว
    ที่กินหรือดื่มเข้าสู่ร่างกายแล้ว
    จะทำให้เกิดพลังงานและความร้อนแก่ร่างกาย
    ทำให้ร่างกายเจริญเติบโต
    ซ่อมแซมส่วนที่สึกหรอ ควบคุมการเปลี่ยนแปลงต่างๆ ในร่างกาย
    ช่วยทำให้อวัยวะต่างๆ ทำงานได้อย่างปกติ
    อาหารจะต้องไม่มีพิษและไม่เกิดโทษต่อร่างกาย
'''

keywords = extract_keywords(text)

# output: ['อวัยวะต่างๆ',
# 'ซ่อมแซมส่วน',
# 'เจริญเติบโต',
# 'ควบคุมการเปลี่ยนแปลง',
# 'มีพิษ']

keywords = extract_keywords(text, max_keywords=10)

# output: ['อวัยวะต่างๆ',
# 'ซ่อมแซมส่วน',
# 'เจริญเติบโต',
# 'ควบคุมการเปลี่ยนแปลง',
# 'มีพิษ',
# 'ทำให้ร่างกาย',
# 'ร่างกายเจริญเติบโต',
# 'จะทำให้เกิด',
# 'มีพิษและ',
# 'เกิดโทษ']
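To make the keyphrase_ngram_range parameter more concrete, the following sketch builds candidate keyphrases from a list of tokens. It is an illustration under stated assumptions, not the library's code: ngram_candidates is a hypothetical helper, and tokens are joined without spaces because Thai is written without spaces between words.

```python
from typing import List, Tuple


def ngram_candidates(
    tokens: List[str], ngram_range: Tuple[int, int] = (1, 2)
) -> List[str]:
    """Return unique candidates made of consecutive tokens (hypothetical helper)."""
    low, high = ngram_range
    seen = set()
    candidates = []
    for size in range(low, high + 1):
        # Slide a window of `size` tokens over the token list.
        for start in range(len(tokens) - size + 1):
            candidate = "".join(tokens[start:start + size])
            if candidate not in seen:
                seen.add(candidate)
                candidates.append(candidate)
    return candidates


unigrams = ngram_candidates(["เสา", "ไฟฟ้า"], (1, 1))
# → ['เสา', 'ไฟฟ้า']
up_to_bigrams = ngram_candidates(["เสา", "ไฟฟ้า"], (1, 2))
# → ['เสา', 'ไฟฟ้า', 'เสาไฟฟ้า']
```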

Keyword Extraction Engines

KeyBERT

Minimal re-implementation of KeyBERT.

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

https://github.com/MaartenGr/KeyBERT
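The core idea, ranking candidates by how similar their embeddings are to the document embedding, can be sketched with toy vectors. This is a minimal illustration that assumes cosine similarity as the measure; the function names are hypothetical and the vectors are made up rather than produced by a BERT model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(
    doc_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (candidate, similarity) pairs, most similar first."""
    scored = [(name, cosine(doc_vec, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings" for illustration only.
ranked = rank_by_similarity(
    [1.0, 0.0],
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]},
    top_k=2,
)
```

Here "a" points in the same direction as the document vector and ranks first, followed by "c"; "b" is orthogonal to the document and is dropped by top_k=2.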

class pythainlp.summarize.keybert.KeyBERT(model_name: str = 'airesearch/wangchanberta-base-att-spm-uncased')

__init__(model_name: str = 'airesearch/wangchanberta-base-att-spm-uncased')

extract_keywords(text: str, keyphrase_ngram_range: Tuple[int, int] = (1, 2), max_keywords: int = 5, min_df: int = 1, tokenizer: str = 'newmm', return_similarity=False, stop_words: Iterable[str] | None = None) → List[str] | List[Tuple[str, float]]

Extract Thai keywords and/or keyphrases with KeyBERT algorithm. See https://github.com/MaartenGr/KeyBERT.

Parameters:

text (str) – text to extract keywords from
keyphrase_ngram_range (Tuple[int, int]) – Number of token units to be defined as keyword. The token unit varies w.r.t. tokenizer. For instance, (1, 1) means each token (unigram) can be a keyword (e.g. “เสา”, “ไฟฟ้า”), (1, 2) means one and two consecutive tokens (unigram and bigram) can be keywords (e.g. “เสา”, “ไฟฟ้า”, “เสาไฟฟ้า”) (default: (1, 2))
max_keywords (int) – Maximum number of keywords to be returned. (default: 5)
min_df (int) – Minimum frequency required to be a keyword. (default: 1)
tokenizer (str) – Name of tokenizer engine to use. Refer to options in pythainlp.tokenize.word_tokenize(). (default: ‘newmm’)
return_similarity (bool) – If True, return keyword scores. (default: False)
stop_words (Optional[Iterable[str]]) – A list of stop words (i.e. words to be ignored). If not specified, pythainlp.corpus.thai_stopwords() is used. (default: None)

Returns:

list of keywords, or list of (keyword, score) tuples if return_similarity is True

Example:

from pythainlp.summarize.keybert import KeyBERT

text = '''
    อาหาร หมายถึง ของแข็งหรือของเหลว
    ที่กินหรือดื่มเข้าสู่ร่างกายแล้ว
    จะทำให้เกิดพลังงานและความร้อนแก่ร่างกาย
    ทำให้ร่างกายเจริญเติบโต
    ซ่อมแซมส่วนที่สึกหรอ ควบคุมการเปลี่ยนแปลงต่างๆ ในร่างกาย
    ช่วยทำให้อวัยวะต่างๆ ทำงานได้อย่างปกติ
    อาหารจะต้องไม่มีพิษและไม่เกิดโทษต่อร่างกาย
'''

kb = KeyBERT()

keywords = kb.extract_keywords(text)

# output: ['อวัยวะต่างๆ',
# 'ซ่อมแซมส่วน',
# 'เจริญเติบโต',
# 'ควบคุมการเปลี่ยนแปลง',
# 'มีพิษ']

keywords = kb.extract_keywords(text, max_keywords=10, return_similarity=True)

# output: [('อวัยวะต่างๆ', 0.3228477063109462),
# ('ซ่อมแซมส่วน', 0.31320597838000375),
# ('เจริญเติบโต', 0.29115434699705506),
# ('ควบคุมการเปลี่ยนแปลง', 0.2678430841321016),
# ('มีพิษ', 0.24996827960821494),
# ('ทำให้ร่างกาย', 0.23876962942443258),
# ('ร่างกายเจริญเติบโต', 0.23191285218852364),
# ('จะทำให้เกิด', 0.22425422716846247),
# ('มีพิษและ', 0.22162962875299588),
# ('เกิดโทษ', 0.20773497763458507)]

embed(docs: str | List[str]) → ndarray

Create an embedding of each input in docs by averaging vectors from the last hidden layer.
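Averaging the vectors of the last hidden layer is often called mean pooling. The sketch below shows the arithmetic on plain Python lists; mean_pool is a hypothetical helper, the token vectors are made up, and details such as how the library treats padding or special tokens are not covered here.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-size vector (hypothetical helper)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    # Average each dimension independently across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Two tokens with 2-dimensional hidden states.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
# → [2.0, 3.0]
```

The result has the same dimensionality as a single token vector regardless of how many tokens the input contains, which is what makes documents and candidate keyphrases of different lengths comparable.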
