pythainlp.ulmfit
Universal Language Model Fine-tuning for Text Classification (ULMFiT).
Modules
- class pythainlp.ulmfit.ThaiTokenizer(lang: str = 'th')[source]
Wrapper around a frozen newmm tokenizer to make it a fastai.BaseTokenizer (see: https://docs.fast.ai/text.transform#BaseTokenizer).
- static tokenizer(text: str) List[str] [source]
Tokenize text with a frozen newmm engine, using the dictionary specific to ULMFiT-related functions (see: Dictionary file (.txt)).
- Parameters
text (str) – text to tokenize
- Returns
tokenized text
- Return type
list[str]
- Example
Using pythainlp.ulmfit.ThaiTokenizer.tokenizer() is similar to pythainlp.tokenize.word_tokenize() with the ulmfit engine.
>>> from pythainlp.ulmfit import ThaiTokenizer
>>> from pythainlp.tokenize import word_tokenize
>>>
>>> text = "อาภรณ์, จินตมยปัญญา ภาวนามยปัญญา"
>>> ThaiTokenizer.tokenizer(text)
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา', ' ', 'ภาวนามยปัญญา']
>>>
>>> word_tokenize(text, engine='ulmfit')
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา', ' ', 'ภาวนามยปัญญา']
- pythainlp.ulmfit.document_vector(text: str, learn, data, agg: str = 'mean')[source]
This function vectorizes Thai input text into a 400-dimension vector, using a fastai language model and data bunch.
- Parameters
text (str) – text to be vectorized
learn – fastai language model learner
data – fastai data bunch
agg (str) – aggregation method for the encoder outputs (default: 'mean')
- Returns
numpy.array of document vector sized 400, based on the encoder of the model
- Return type
numpy.ndarray((1, 400))
- Example
>>> from pythainlp.ulmfit import document_vector
>>> from fastai import *
>>> from fastai.text import *
>>>
>>> # Load Data Bunch
>>> data = load_data(MODEL_PATH, 'thwiki_lm_data.pkl')
>>>
>>> # Initialize language_model_learner
>>> config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1,
...               qrnn=False, tie_weights=True, out_bias=True, output_p=0.25,
...               hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
>>> trn_args = dict(drop_mult=0.9, clip=0.12, alpha=2, beta=1)
>>> learn = language_model_learner(data, AWD_LSTM, config=config,
...                                pretrained=False, **trn_args)
>>> document_vector('วันนี้วันดีปีใหม่', learn, data)
- See Also
A Jupyter notebook showing how to train a ULMFiT language model and how to use it.
- pythainlp.ulmfit.fix_html(text: str) str [source]
Apply a list of replacements to fix leftover HTML strings in text. (code from fastai)
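The idea can be sketched as a few string replacements followed by standard HTML unescaping. This is a minimal hypothetical sketch, not fastai's exact replacement list:

```python
import html

def fix_html_sketch(text: str) -> str:
    # Undo common escaped fragments left over from scraped HTML,
    # then unescape any remaining entities.
    for src, dst in [("amp;", "&"), ("#39;", "'"), ("nbsp;", " "), ("<br />", "\n")]:
        text = text.replace(src, dst)
    return html.unescape(text)

print(fix_html_sketch("PyThaiNLP amp; friends"))  # PyThaiNLP & friends
```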
- pythainlp.ulmfit.lowercase_all(toks: Collection[str]) List[str] [source]
Lowercase all English words; English words in Thai texts don’t usually have nuances of capitalization.
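Since str.lower() is a no-op on Thai characters, the behavior can be sketched by lowercasing every token; only cased scripts such as Latin are affected. A minimal sketch with a hypothetical function name:

```python
from typing import Collection, List

def lowercase_all_sketch(toks: Collection[str]) -> List[str]:
    # str.lower() only changes cased scripts such as Latin,
    # so Thai tokens pass through unchanged.
    return [tok.lower() for tok in toks]

print(lowercase_all_sketch(["PyThaiNLP", "ภาษา", "NLP"]))  # ['pythainlp', 'ภาษา', 'nlp']
```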
- pythainlp.ulmfit.merge_wgts(em_sz, wgts, itos_pre, itos_new)[source]
This function inserts a new vocab into an existing set of model weights (wgts): items already present in the pretrained vocab keep their pretrained weights, while new items are initialized with the average embedding.
- Parameters
em_sz – embedding size
wgts – pretrained model weights
itos_pre – pretrained vocab (index-to-string list)
itos_new – new vocab (index-to-string list)
- Returns
merged torch model weights
- Example
from pythainlp.ulmfit import merge_wgts
import torch

wgts = {'0.encoder.weight': torch.randn(5, 3)}
itos_pre = ["แมว", "คน", "หนู"]
itos_new = ["ปลา", "เต่า", "นก"]
em_sz = 3
merge_wgts(em_sz, wgts, itos_pre, itos_new)
# output:
# {'0.encoder.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]]),
#  '0.encoder_dp.emb.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]]),
#  '1.decoder.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]])}
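The average-embedding initialization can be illustrated with a small NumPy sketch. The real function operates on torch weight dictionaries; all names below are hypothetical:

```python
import numpy as np

# Hypothetical sketch of average-embedding initialization:
# words found in the pretrained vocab keep their rows; unseen words
# are initialized with the mean of all pretrained rows.
em_sz = 3
pretrained = np.random.randn(3, em_sz)      # rows for ["แมว", "คน", "หนู"]
itos_pre = ["แมว", "คน", "หนู"]
itos_new = ["ปลา", "คน", "นก"]

stoi_pre = {w: i for i, w in enumerate(itos_pre)}
row_mean = pretrained.mean(axis=0)
merged = np.stack([pretrained[stoi_pre[w]] if w in stoi_pre else row_mean
                   for w in itos_new])

print(merged.shape)  # (3, 3)
```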
- pythainlp.ulmfit.process_thai(text: str, pre_rules: ~typing.Collection = [<function fix_html>, <function reorder_vowels>, <function spec_add_spaces>, <function rm_useless_spaces>, <function rm_useless_newlines>, <function rm_brackets>, <function replace_url>, <function replace_rep_nonum>], tok_func: ~typing.Callable = <bound method Tokenizer.word_tokenize of <pythainlp.tokenize.core.Tokenizer object>>, post_rules: ~typing.Collection = [<function ungroup_emoji>, <function lowercase_all>, <function replace_wrep_post_nonum>, <function remove_space>]) Collection[str] [source]
Process Thai texts for models (with sparse features as default)
- Parameters
text (str) – text to be cleaned
pre_rules (list[func]) – rules to apply before tokenization
tok_func (func) – tokenization function (by default, tok_func is pythainlp.tokenize.word_tokenize())
post_rules (list[func]) – rules to apply after tokenization
- Returns
a list of cleaned tokenized texts
- Return type
Collection[str]
- Note
The default pre-rules consist of fix_html(), reorder_vowels(), spec_add_spaces(), rm_useless_spaces(), rm_useless_newlines(), rm_brackets(), replace_url(), and replace_rep_nonum().
The default post-rules consist of ungroup_emoji(), lowercase_all(), replace_wrep_post_nonum(), and remove_space().
- Example
Use default pre-rules and post-rules:
>>> from pythainlp.ulmfit import process_thai
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
>>> process_thai(text)
['บ้าน', 'xxrep', ' ', 'อยู่', 'xxwrep', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'pythainlp', '&']
Modify the pre_rules and post_rules arguments with rules provided in pythainlp.ulmfit:
>>> from pythainlp.ulmfit import (process_thai, replace_rep_after, fix_html,
...                               ungroup_emoji, replace_wrep_post, remove_space)
>>>
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
>>> process_thai(text,
...              pre_rules=[replace_rep_after, fix_html],
...              post_rules=[ungroup_emoji, replace_wrep_post, remove_space])
['บ้าน', 'xxrep', '5', '()', 'อยู่', 'xxwrep', '2', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'PyThaiNLP', '&']
- pythainlp.ulmfit.rm_brackets(text: str) str [source]
Remove all empty brackets and artifacts within brackets from text.
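The behavior can be sketched with a few regex substitutions that strip empty bracket pairs and tidy up the leftover spaces. A hypothetical sketch, not the library code:

```python
import re

def rm_empty_brackets_sketch(text: str) -> str:
    # Strip empty (), [], {} pairs, then collapse leftover double spaces.
    text = re.sub(r"\(\s*\)", " ", text)
    text = re.sub(r"\[\s*\]", " ", text)
    text = re.sub(r"\{\s*\}", " ", text)
    return re.sub(r" {2,}", " ", text).strip()

print(rm_empty_brackets_sketch("บ้าน () อยู่"))  # บ้าน อยู่
```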
- pythainlp.ulmfit.rm_useless_spaces(text: str) str [source]
Remove multiple spaces in text. (code from fastai)
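The documented behavior amounts to collapsing runs of spaces into a single space. A minimal sketch:

```python
import re

def rm_useless_spaces_sketch(text: str) -> str:
    # Collapse runs of two or more spaces into one.
    return re.sub(" {2,}", " ", text)

print(rm_useless_spaces_sketch("สวัสดี   ครับ"))  # สวัสดี ครับ
```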
- pythainlp.ulmfit.remove_space(toks: Collection[str]) List[str] [source]
Remove space tokens, which carry no signal for bag-of-words models.
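Dropping whitespace-only tokens from a token list can be sketched as a simple filter (hypothetical function name):

```python
from typing import Collection, List

def remove_space_sketch(toks: Collection[str]) -> List[str]:
    # Drop tokens that are empty or whitespace-only.
    return [tok for tok in toks if tok.strip()]

print(remove_space_sketch(["กา", " ", "น้ำ"]))  # ['กา', 'น้ำ']
```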
- pythainlp.ulmfit.replace_rep_after(text: str) str [source]
Replace character-level repetitions in text, placing the marker after the repeated character. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep 8 ย’; instead it retains the word as ‘น้อย xxrep 8’.
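The rule can be sketched with a regex backreference: a non-space character followed by two or more copies of itself collapses to the character, an 'xxrep' marker, and the total count. A hypothetical re-implementation, not the library code:

```python
import re

def replace_rep_after_sketch(text: str) -> str:
    # A character repeated n times (n >= 3) becomes '<char> xxrep <n> '.
    def _sub(m: "re.Match") -> str:
        char, reps = m.groups()
        return f"{char} xxrep {len(reps) + 1} "
    return re.sub(r"(\S)(\1{2,})", _sub, text)

print(replace_rep_after_sketch("น้อ" + "ย" * 8))  # น้อย xxrep 8
```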
- pythainlp.ulmfit.replace_rep_nonum(text: str) str [source]
Replace character-level repetitions in text, placing the marker after the repeated character, without the count. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep ย’; instead it retains the word as ‘น้อย xxrep ‘.
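This variant uses the same character-repetition rule but drops the count. A hypothetical sketch:

```python
import re

def replace_rep_nonum_sketch(text: str) -> str:
    # A repeated run collapses to the character plus an 'xxrep' marker, no count.
    return re.sub(r"(\S)(\1{2,})", lambda m: f"{m.group(1)} xxrep ", text)

print(replace_rep_nonum_sketch("น้อ" + "ย" * 8))  # น้อย xxrep
```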
- pythainlp.ulmfit.replace_wrep_post(toks: Collection[str]) List[str] [source]
Replace repetitive words post-tokenization; fastai's replace_wrep does not work well with Thai.
- Parameters
toks (Collection[str]) – list of tokens
- Returns
list of tokens in which an xxwrep token and a counter are added in front of repeated words
- Return type
List[str]
- Example
>>> from pythainlp.ulmfit import replace_wrep_post
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post(toks)
['กา', 'xxwrep', '3', 'น้ำ']
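The behavior in the example can be sketched as a run-length pass over the token list: a token repeated n times (n > 1) becomes ['xxwrep', str(n - 1), token]. A hypothetical re-implementation, not the library code:

```python
from typing import Collection, List, Optional

def replace_wrep_post_sketch(toks: Collection[str]) -> List[str]:
    out: List[str] = []
    prev: Optional[str] = None
    count = 0
    for tok in list(toks) + [None]:   # sentinel flushes the last run
        if tok == prev:
            count += 1
            continue
        if count > 1:
            out.extend(["xxwrep", str(count - 1), prev])
        elif prev is not None:
            out.append(prev)
        prev, count = tok, 1
    return out

print(replace_wrep_post_sketch(["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]))
# ['กา', 'xxwrep', '3', 'น้ำ']
```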
- pythainlp.ulmfit.replace_wrep_post_nonum(toks: Collection[str]) List[str] [source]
Replace repetitive words post-tokenization; fastai's replace_wrep does not work well with Thai.
- Parameters
toks (Collection[str]) – list of tokens
- Returns
list of tokens in which an xxwrep token is added in front of repeated words
- Return type
List[str]
- Example
>>> from pythainlp.ulmfit import replace_wrep_post_nonum
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post_nonum(toks)
['กา', 'xxwrep', 'น้ำ']
- pythainlp.ulmfit.spec_add_spaces(text: str) str [source]
Add spaces around / and # in text. (code from fastai)
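The rule amounts to padding the two characters with spaces so they tokenize separately. A minimal sketch:

```python
import re

def spec_add_spaces_sketch(text: str) -> str:
    # Pad '/' and '#' with spaces so they become separate tokens.
    return re.sub(r"([/#])", r" \1 ", text)

print(spec_add_spaces_sketch("ไทย/อังกฤษ"))  # ไทย / อังกฤษ
```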
- pythainlp.ulmfit.ungroup_emoji(toks: Collection[str]) List[str] [source]
Ungroup emojis while keeping Zero Width Joiner (ZWJ) emoji sequences intact.
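The core idea can be sketched as splitting a run of emoji into separate tokens while keeping ZWJ (U+200D) sequences, such as family emoji, together. A simplified hypothetical sketch; the real function also detects which characters are emoji:

```python
from typing import List

def split_emoji_run_sketch(tok: str) -> List[str]:
    out: List[str] = []
    current = ""
    for ch in tok:
        if ch == "\u200d" or current.endswith("\u200d"):
            current += ch          # continue a ZWJ sequence
        else:
            if current:
                out.append(current)
            current = ch
    if current:
        out.append(current)
    return out

print(split_emoji_run_sketch("😂🤣😃"))      # ['😂', '🤣', '😃']
print(split_emoji_run_sketch("👨\u200d👩"))  # ['👨\u200d👩']
```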