pythainlp.ulmfit
The pythainlp.ulmfit.utils module provides utilities for the ULMFiT model.
Modules
pythainlp.ulmfit.utils.get_texts(df)
Get a tuple of tokenized texts and labels.
- Parameters
df (pandas.DataFrame) – pandas.DataFrame with labels as the first column and texts as the second column
- Returns
tok – lists of tokenized texts, with the beginning-of-sentence tag xbos as the first element of each list
labels – list of labels
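To illustrate the shape of the return value, here is a minimal sketch (a hypothetical helper, not the pythainlp implementation): plain whitespace splitting stands in for the Thai tokenizer, and rows of (label, text) pairs stand in for the two-column DataFrame.

```python
BOS = "xbos"  # beginning-of-sentence tag

def get_texts_sketch(rows):
    """rows: list of (label, text) pairs, mirroring a two-column DataFrame."""
    # prepend the xbos tag to each tokenized text
    tok = [[BOS] + text.split() for _, text in rows]
    labels = [label for label, _ in rows]
    return tok, labels

tok, labels = get_texts_sketch([(1, "good movie"), (0, "bad plot")])
print(tok)     # [['xbos', 'good', 'movie'], ['xbos', 'bad', 'plot']]
print(labels)  # [1, 0]
```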
pythainlp.ulmfit.utils.get_all(df)
Iterate get_texts over the entire pandas.DataFrame.
- Parameters
df (pandas.DataFrame) – pandas.DataFrame with labels as the first column and texts as the second column
- Returns
tok – lists of tokenized texts, with the beginning-of-sentence tag xbos as the first element of each list
labels – list of labels
pythainlp.ulmfit.utils.numericalizer(df, itos=None, max_vocab=60000, min_freq=2, pad_tok='_pad_', unk_tok='_unk_')
Numericalize tokenized texts: keep tokens whose word frequency is at least min_freq, up to a maximum vocabulary size of max_vocab; place the unknown token _unk_ and padding token _pad_ in the first and second positions; reuse an integer-to-string list itos if available, e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...].
- Parameters
df (pandas.DataFrame) – pandas.DataFrame with labels as the first column and texts as the second column
itos (list) – integer-to-string list
max_vocab (int) – maximum vocabulary size (default 60000)
min_freq (int) – minimum word frequency to be included (default 2)
pad_tok (str) – padding token
unk_tok (str) – unknown token
- Returns
lm – numpy.array of numericalized texts
tok – lists of tokenized texts, with the beginning-of-sentence tag xbos as the first element of each list
labels – list of labels
itos – integer-to-string list, e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...]
stoi – string-to-integer dict, e.g. {'_unk_': 0, '_pad_': 1, 'first_word': 2, 'second_word': 3, ...}
freq – collections.Counter of word frequencies
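The numericalization step can be sketched in a few lines of plain Python (a conceptual illustration, not the pythainlp code): build itos/stoi from token frequencies, honouring min_freq and max_vocab, with _unk_ and _pad_ reserved in the first two positions.

```python
from collections import Counter

def numericalize_sketch(tok, max_vocab=60000, min_freq=2,
                        pad_tok="_pad_", unk_tok="_unk_"):
    # count every token across all tokenized texts
    freq = Counter(t for sent in tok for t in sent)
    # keep the most common tokens that meet the frequency threshold
    itos = [w for w, c in freq.most_common(max_vocab) if c >= min_freq]
    # reserve positions 0 and 1 for the unknown and padding tokens
    itos = [unk_tok, pad_tok] + itos
    stoi = {w: i for i, w in enumerate(itos)}
    # out-of-vocabulary tokens map to index 0 (_unk_)
    lm = [[stoi.get(t, 0) for t in sent] for sent in tok]
    return lm, itos, stoi, freq

tok = [["xbos", "ดี", "มาก", "ดี"], ["xbos", "แย่", "ดี"]]
lm, itos, stoi, freq = numericalize_sketch(tok)
print(itos)   # ['_unk_', '_pad_', 'ดี', 'xbos']  (มาก, แย่ fall below min_freq)
print(lm[0])  # [3, 2, 0, 2]
```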
pythainlp.ulmfit.utils.merge_wgts(em_sz, wgts, itos_pre, itos_cls)
Merge the pretrained model's weights with the vocabulary of the current dataset.
- Parameters
em_sz (int) – size of embedding vectors (the pretrained model uses 300)
wgts – saved PyTorch weights of the pretrained model
itos_pre (list) – integer-to-string list of the pretrained model
itos_cls (list) – integer-to-string list of the current dataset
- Returns
merged weights of the model for the current dataset
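Conceptually, the merge builds a new embedding matrix for the current vocabulary: rows for words the pretrained model knows are copied over, and unseen words get a fallback row. A minimal sketch using plain lists (a hypothetical helper; the real function operates on saved PyTorch weight tensors, and the mean-row fallback is an assumption of this sketch):

```python
def merge_wgts_sketch(em_sz, wgts_pre, itos_pre, itos_cls):
    stoi_pre = {w: i for i, w in enumerate(itos_pre)}
    # mean of all pretrained rows, used for words the pretrained model lacks
    mean = [sum(col) / len(wgts_pre) for col in zip(*wgts_pre)]
    merged = []
    for w in itos_cls:
        i = stoi_pre.get(w, -1)
        merged.append(wgts_pre[i] if i >= 0 else mean)
    return merged

wgts_pre = [[1.0, 2.0], [3.0, 4.0]]  # two pretrained words, em_sz=2
merged = merge_wgts_sketch(2, wgts_pre, ["a", "b"], ["b", "c"])
print(merged)  # [[3.0, 4.0], [2.0, 3.0]] — "b" copied, "c" gets the mean row
```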
pythainlp.ulmfit.utils.document_vector(ss, m, stoi, tok_engine='newmm')
Get a document vector using the pretrained ULMFiT model.
- Parameters
ss (str) – document text
m – pretrained ULMFiT model
stoi (dict) – string-to-integer dict of the model's vocabulary
tok_engine (str) – tokenization engine (default 'newmm')
- Returns
numpy.array of the document vector, sized 300
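As a rough intuition for what a document vector is, the sketch below averages per-token vectors into one fixed-size vector (a hypothetical helper: the real function runs the pretrained ULMFiT encoder rather than averaging embedding rows).

```python
def document_vector_sketch(tokens, embeddings, stoi, unk_idx=0):
    # look up one embedding row per token; unknown tokens use the _unk_ row
    rows = [embeddings[stoi.get(t, unk_idx)] for t in tokens]
    n = len(rows)
    # element-wise mean over all token vectors
    return [sum(vals) / n for vals in zip(*rows)]

emb = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0]]  # _unk_ plus two known words
stoi = {"ดี": 1, "มาก": 2}
print(document_vector_sketch(["ดี", "มาก"], emb, stoi))  # [3.0, 2.0]
```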
class pythainlp.ulmfit.utils.ThaiTokenizer(engine='newmm')
static proc_all(ss)
Run proc_text on multiple texts.
- Parameters
ss (list) – texts to process
- Returns
processed and tokenized texts
static proc_all_mp(ss)
Run proc_text on multiple texts using multiple CPUs.
- Parameters
ss (list) – texts to process
- Returns
processed and tokenized texts
-
proc_text
(text)[source]¶ - Meth
proc_text procss and tokenize text removing repetitions, special characters, double spaces
- Parameters
text (str) – text to process
- Returns
processed and tokenized text
static replace_rep(text)
Replace runs of three or more identical characters with tkrep followed by the number of repetitions.
- Parameters
text (str) – text to process
- Returns
processed text in which each repetition is replaced by tkrep, the repetition count, and the character
Example:
>>> from pythainlp.ulmfit.utils import ThaiTokenizer
>>> tt = ThaiTokenizer()
>>> tt.replace_rep('คือดียยยยยย')
คือดีtkrep6ย
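The repetition rule can be reproduced with a single regular expression. A self-contained sketch (mirroring the documented behaviour, not the pythainlp source):

```python
import re

TK_REP = "tkrep"

def replace_rep_sketch(text):
    # collapse runs of 3+ identical non-space characters into
    # tkrep + run length + the character itself
    def _repl(m):
        char, run = m.group(1), m.group(0)
        return f"{TK_REP}{len(run)}{char}"
    return re.sub(r"(\S)\1{2,}", _repl, text)

print(replace_rep_sketch("คือดียยยยยย"))  # คือดีtkrep6ย
```

Runs of only two characters are left untouched, since the rule requires three or more repetitions.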