pythainlp.ulmfit
The pythainlp.ulmfit.utils module provides utilities for the ULMFiT model.
Modules
pythainlp.ulmfit.utils.get_texts(df)
Get a tuple of tokenized texts and labels.
- Parameters
df (pandas.DataFrame) – pandas.DataFrame with labels as the first column and texts as the second column
- Returns
tok – lists of tokenized texts, with the beginning-of-sentence tag xbos as the first element of each list
labels – list of labels
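To illustrate the shape of the return value, here is a minimal sketch (a hypothetical helper, not the pythainlp implementation): plain whitespace splitting stands in for the Thai tokenizer, and rows of (label, text) pairs stand in for the two-column DataFrame.

```python
BOS = "xbos"  # beginning-of-sentence tag

def get_texts_sketch(rows):
    """rows: list of (label, text) pairs, mirroring a two-column DataFrame."""
    # prepend the xbos tag to each tokenized text
    tok = [[BOS] + text.split() for _, text in rows]
    labels = [label for label, _ in rows]
    return tok, labels

tok, labels = get_texts_sketch([(1, "good movie"), (0, "bad plot")])
print(tok)     # [['xbos', 'good', 'movie'], ['xbos', 'bad', 'plot']]
print(labels)  # [1, 0]
```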
pythainlp.ulmfit.utils.get_all(df)
Iterate get_texts over the entire pandas.DataFrame.
- Parameters
df (pandas.DataFrame) – pandas.DataFrame with labels as the first column and texts as the second column
- Returns
tok – lists of tokenized texts, with the beginning-of-sentence tag xbos as the first element of each list
labels – list of labels
pythainlp.ulmfit.utils.numericalizer(df, itos=None, max_vocab=60000, min_freq=2, pad_tok='_pad_', unk_tok='_unk_')
Numericalize tokenized texts: keep tokens whose word frequency is at least min_freq, up to a maximum vocabulary size of max_vocab; place the unknown token _unk_ and padding token _pad_ in the first and second positions; reuse an integer-to-string list itos if available, e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...].
- Parameters
df (pandas.DataFrame) – pandas.DataFrame with labels as the first column and texts as the second column
itos (list) – integer-to-string list
max_vocab (int) – maximum vocabulary size (default 60000)
min_freq (int) – minimum word frequency to be included (default 2)
pad_tok (str) – padding token
unk_tok (str) – unknown token
- Returns
lm – numpy.array of numericalized texts
tok – lists of tokenized texts, with the beginning-of-sentence tag xbos as the first element of each list
labels – list of labels
itos – integer-to-string list, e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...]
stoi – string-to-integer dict, e.g. {'_unk_': 0, '_pad_': 1, 'first_word': 2, 'second_word': 3, ...}
freq – collections.Counter of word frequencies
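The numericalization step can be sketched in a few lines of plain Python (a conceptual illustration, not the pythainlp code): build itos/stoi from token frequencies, honouring min_freq and max_vocab, with _unk_ and _pad_ reserved in the first two positions.

```python
from collections import Counter

def numericalize_sketch(tok, max_vocab=60000, min_freq=2,
                        pad_tok="_pad_", unk_tok="_unk_"):
    # count every token across all tokenized texts
    freq = Counter(t for sent in tok for t in sent)
    # keep the most common tokens that meet the frequency threshold
    itos = [w for w, c in freq.most_common(max_vocab) if c >= min_freq]
    # reserve positions 0 and 1 for the unknown and padding tokens
    itos = [unk_tok, pad_tok] + itos
    stoi = {w: i for i, w in enumerate(itos)}
    # out-of-vocabulary tokens map to index 0 (_unk_)
    lm = [[stoi.get(t, 0) for t in sent] for sent in tok]
    return lm, itos, stoi, freq

tok = [["xbos", "ดี", "มาก", "ดี"], ["xbos", "แย่", "ดี"]]
lm, itos, stoi, freq = numericalize_sketch(tok)
print(itos)   # ['_unk_', '_pad_', 'ดี', 'xbos']  (มาก, แย่ fall below min_freq)
print(lm[0])  # [3, 2, 0, 2]
```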
pythainlp.ulmfit.utils.merge_wgts(em_sz, wgts, itos_pre, itos_cls)
Merge the pretrained model's weights with the vocabulary of the current dataset.
- Parameters
em_sz (int) – size of embedding vectors (the pretrained model uses 300)
wgts – saved PyTorch weights of the pretrained model
itos_pre (list) – integer-to-string list of the pretrained model
itos_cls (list) – integer-to-string list of the current dataset
- Returns
merged weights of the model for the current dataset
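Conceptually, the merge builds a new embedding matrix for the current vocabulary: rows for words the pretrained model knows are copied over, and unseen words get a fallback row. A minimal sketch using plain lists (a hypothetical helper; the real function operates on saved PyTorch weight tensors, and the mean-row fallback is an assumption of this sketch):

```python
def merge_wgts_sketch(em_sz, wgts_pre, itos_pre, itos_cls):
    stoi_pre = {w: i for i, w in enumerate(itos_pre)}
    # mean of all pretrained rows, used for words the pretrained model lacks
    mean = [sum(col) / len(wgts_pre) for col in zip(*wgts_pre)]
    merged = []
    for w in itos_cls:
        i = stoi_pre.get(w, -1)
        merged.append(wgts_pre[i] if i >= 0 else mean)
    return merged

wgts_pre = [[1.0, 2.0], [3.0, 4.0]]  # two pretrained words, em_sz=2
merged = merge_wgts_sketch(2, wgts_pre, ["a", "b"], ["b", "c"])
print(merged)  # [[3.0, 4.0], [2.0, 3.0]] — "b" copied, "c" gets the mean row
```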
pythainlp.ulmfit.utils.document_vector(ss, m, stoi, tok_engine='newmm')
Get a document vector using the pretrained ULMFiT model.
- Parameters
ss (str) – document text
m – pretrained ULMFiT model
stoi (dict) – string-to-integer dict of the model's vocabulary
tok_engine (str) – tokenization engine (default 'newmm')
- Returns
numpy.array of the document vector, sized 300
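As a rough intuition for what a document vector is, the sketch below averages per-token vectors into one fixed-size vector (a hypothetical helper: the real function runs the pretrained ULMFiT encoder rather than averaging embedding rows).

```python
def document_vector_sketch(tokens, embeddings, stoi, unk_idx=0):
    # look up one embedding row per token; unknown tokens use the _unk_ row
    rows = [embeddings[stoi.get(t, unk_idx)] for t in tokens]
    n = len(rows)
    # element-wise mean over all token vectors
    return [sum(vals) / n for vals in zip(*rows)]

emb = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0]]  # _unk_ plus two known words
stoi = {"ดี": 1, "มาก": 2}
print(document_vector_sketch(["ดี", "มาก"], emb, stoi))  # [3.0, 2.0]
```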
class pythainlp.ulmfit.utils.ThaiTokenizer(engine='newmm')
static proc_all(ss)
Run proc_text on multiple texts.
- Parameters
ss (list) – texts to process
- Returns
processed and tokenized texts
static proc_all_mp(ss)
Run proc_text on multiple texts using multiple CPUs.
- Parameters
ss (list) – texts to process
- Returns
processed and tokenized texts
-
proc_text
(text)[source]¶ - Meth
proc_text procss and tokenize text removing repetitions, special characters, double spaces
- Parameters
text (str) – text to process
- Returns
processed and tokenized text
static replace_rep(text)
Replace runs of three or more identical characters with tkrep followed by the number of repetitions.
- Parameters
text (str) – text to process
- Returns
processed text in which each repetition is replaced by tkrep, the repetition count, and the character
Example:
>>> from pythainlp.ulmfit.utils import ThaiTokenizer
>>> tt = ThaiTokenizer()
>>> tt.replace_rep('คือดียยยยยย')
คือดีtkrep6ย
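The repetition rule can be reproduced with a single regular expression. A self-contained sketch (mirroring the documented behaviour, not the pythainlp source):

```python
import re

TK_REP = "tkrep"

def replace_rep_sketch(text):
    # collapse runs of 3+ identical non-space characters into
    # tkrep + run length + the character itself
    def _repl(m):
        char, run = m.group(1), m.group(0)
        return f"{TK_REP}{len(run)}{char}"
    return re.sub(r"(\S)\1{2,}", _repl, text)

print(replace_rep_sketch("คือดียยยยยย"))  # คือดีtkrep6ย
```

Runs of only two characters are left untouched, since the rule requires three or more repetitions.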