pythainlp.ulmfit
The ulmfit.utils module provides utilities for the ULMFit model.
Modules
pythainlp.ulmfit.utils.get_texts(df)[source]
- get_texts returns a tuple of tokenized texts and labels 
- Parameters
- df (pandas.DataFrame) – pandas.DataFrame with label as first column and text as second column 
- Returns
- tok - lists of tokenized texts with beginning-of-sentence tag xbos as first element of each list 
- labels - list of labels 
 
 
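A minimal sketch of the contract described above, using plain (label, text) rows instead of a pandas.DataFrame and `str.split` as a stand-in for a real Thai tokenizer (this is a hypothetical re-implementation, not the library's own code):

```python
# Sketch of get_texts: labels come from the first column, text from the
# second; each tokenized text starts with the beginning-of-sentence tag.
def get_texts(rows):
    tok, labels = [], []
    for label, text in rows:
        # str.split stands in for a Thai word tokenizer here
        tok.append(["xbos"] + text.split())
        labels.append(label)
    return tok, labels

rows = [(1, "good movie"), (0, "bad movie")]
tok, labels = get_texts(rows)
print(tok[0])    # ['xbos', 'good', 'movie']
print(labels)    # [1, 0]
```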
pythainlp.ulmfit.utils.get_all(df)[source]
- get_all iterates get_texts over the entire pandas.DataFrame 
- Parameters
- df (pandas.DataFrame) – pandas.DataFrame with label as first column and text as second column 
- Returns
- tok - lists of tokenized texts with beginning-of-sentence tag xbos as first element of each list 
- labels - list of labels 
 
 
pythainlp.ulmfit.utils.numericalizer(df, itos=None, max_vocab=60000, min_freq=2, pad_tok='_pad_', unk_tok='_unk_')[source]
- numericalizer numericalizes tokenized texts: keeps tokens with word frequency of at least min_freq, up to a maximum vocabulary size of max_vocab; adds the unknown token _unk_ and padding token _pad_ in the first and second positions; uses the integer-to-string list itos if available, e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...] 
- Parameters
- df (pandas.DataFrame) – pandas.DataFrame with label as first column and text as second column 
- itos (list) – integer-to-string list 
- max_vocab (int) – maximum number of vocabulary (default 60000) 
- min_freq (int) – minimum word frequency to be included (default 2) 
- pad_tok (str) – padding token 
- unk_tok (str) – unknown token 
 
- Returns
- lm - numpy.array of numericalized texts 
- tok - lists of tokenized texts with beginning-of-sentence tag xbos as first element of each list 
- labels - list of labels 
- itos - integer-to-string list e.g. ['_unk_', '_pad_', 'first_word', 'second_word', ...] 
- stoi - string-to-integer dict e.g. {'_unk_': 0, '_pad_': 1, 'first_word': 2, 'second_word': 3, ...} 
- freq - collections.Counter for word frequency 
 
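The vocabulary-building scheme above can be sketched as follows. This is a hypothetical re-implementation working on plain token lists rather than a DataFrame; the helper name `numericalize` and the tie-breaking order of equally frequent tokens are assumptions, not the library's own code:

```python
from collections import Counter

# Sketch of the numericalization scheme: frequent tokens get ids,
# _unk_ and _pad_ occupy positions 0 and 1, rare tokens map to _unk_.
def numericalize(tok, itos=None, max_vocab=60000, min_freq=2,
                 pad_tok="_pad_", unk_tok="_unk_"):
    freq = Counter(t for sent in tok for t in sent)
    if itos is None:
        # keep tokens with frequency >= min_freq, capped at max_vocab
        itos = [unk_tok, pad_tok] + [
            t for t, c in freq.most_common(max_vocab) if c >= min_freq
        ]
    stoi = {t: i for i, t in enumerate(itos)}
    # unseen tokens fall back to index 0, the _unk_ token
    lm = [[stoi.get(t, 0) for t in sent] for sent in tok]
    return lm, itos, stoi, freq

tok = [["xbos", "ดี", "ดี"], ["xbos", "แย่"]]
lm, itos, stoi, freq = numericalize(tok, min_freq=2)
print(itos)   # ['_unk_', '_pad_', 'xbos', 'ดี']
print(lm)     # [[2, 3, 3], [2, 0]]
```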
 
pythainlp.ulmfit.utils.merge_wgts(em_sz, wgts, itos_pre, itos_cls)[source]
- merge_wgts merges the pretrained model's weights with the vocabulary of the current dataset 
- Parameters
- em_sz (int) – size of embedding vectors (the pretrained model uses 300) 
- wgts – saved PyTorch weights of the pretrained model 
- itos_pre (list) – integer-to-string list of pretrained model 
- itos_cls (list) – integer-to-string list of current dataset 
 
- Returns
- merged weights of the model for current dataset 
 
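The weight-merging idea can be sketched in plain Python. This hypothetical helper (`merge_embeddings`, plain lists standing in for PyTorch tensors) copies a pretrained embedding row when the word exists in the pretrained vocabulary and falls back to the mean pretrained row otherwise; the mean-row fallback is an assumption about the strategy, not confirmed by this page:

```python
# Sketch of merging pretrained embedding rows into a new vocabulary.
def merge_embeddings(em_sz, emb_pre, itos_pre, itos_cls):
    stoi_pre = {t: i for i, t in enumerate(itos_pre)}
    # words unseen in pretraining get the mean of all pretrained rows
    mean_row = [sum(col) / len(emb_pre) for col in zip(*emb_pre)]
    merged = []
    for t in itos_cls:
        i = stoi_pre.get(t, -1)
        merged.append(list(emb_pre[i]) if i >= 0 else list(mean_row))
    return merged

emb_pre = [[1.0, 2.0], [3.0, 4.0]]       # 2 pretrained words, em_sz = 2
merged = merge_embeddings(2, emb_pre, ["a", "b"], ["b", "c"])
print(merged)   # [[3.0, 4.0], [2.0, 3.0]]
```

"b" keeps its pretrained row; "c" is new, so it gets the mean row.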
pythainlp.ulmfit.utils.document_vector(ss, m, stoi, tok_engine='newmm')[source]
- document_vector gets a document vector using the pretrained ULMFit model 
- Parameters
- ss (str) – text to process 
- m – pretrained ULMFit model 
- stoi (dict) – string-to-integer dict of the model vocabulary 
- tok_engine (str) – tokenization engine (default 'newmm') 
- Returns
- numpy.array document vector of size 300 
 
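One common way to build such a vector is to average embedding rows looked up through stoi. The sketch below is a hypothetical stand-in for the ULMFit model (plain lists instead of model weights, mean pooling assumed); it only illustrates the lookup-and-pool shape of the computation:

```python
# Sketch: document vector as the mean of word-embedding rows,
# with out-of-vocabulary tokens mapped to the _unk_ row (index 0).
def document_vector(tokens, emb, stoi, unk=0):
    rows = [emb[stoi.get(t, unk)] for t in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]

emb = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0]]   # row 0 is _unk_
stoi = {"ดี": 1, "มาก": 2}
vec = document_vector(["ดี", "มาก"], emb, stoi)
print(vec)   # [3.0, 2.0]
```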
class pythainlp.ulmfit.utils.ThaiTokenizer(engine='newmm')[source]

static proc_all(ss)[source]
- proc_all runs proc_text for multiple sentences 
- Parameters
- ss – texts to process 
- Returns
- processed and tokenized texts 
 
static proc_all_mp(ss)[source]
- proc_all_mp runs proc_text for multiple sentences using multiple CPUs 
- Parameters
- ss – texts to process 
- Returns
- processed and tokenized texts 
 
proc_text(text)[source]
- proc_text processes and tokenizes text, removing repetitions, special characters, and double spaces 
- Parameters
- text (str) – text to process 
- Returns
- processed and tokenized text 
 
static replace_rep(text)[source]
- replace_rep replaces 3 or more repeated characters with tkrep 
- Parameters
- text (str) – text to process 
- Returns
- processed text where repetitions are replaced by tkrep followed by the number of repetitions 

Example:
>>> from pythainlp.ulmfit.utils import ThaiTokenizer
>>> tt = ThaiTokenizer()
>>> tt.replace_rep('คือดียยยยยย')
คือดีtkrep6ย
 
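The behavior shown in the example above can be reproduced with a short regex: a run of 3 or more identical non-space characters becomes tkrep, the run length, then one copy of the character. This is a hypothetical re-implementation for illustration, not the library's own code:

```python
import re

# A character repeated 3+ times becomes 'tkrep<count><char>',
# e.g. 'ยยยยยย' -> 'tkrep6ย'.
def replace_rep(text, tk_rep="tkrep"):
    def _sub(m):
        ch, rest = m.groups()
        return f"{tk_rep}{len(rest) + 1}{ch}"
    # (\S) captures the character, \1{2,} matches 2+ further copies
    return re.sub(r"(\S)(\1{2,})", _sub, text)

print(replace_rep("คือดียยยยยย"))   # คือดีtkrep6ย
```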
- 
static