pythainlp.ulmfit¶
Universal Language Model Fine-tuning for Text Classification (ULMFiT).
Modules¶
- class pythainlp.ulmfit.ThaiTokenizer(lang: str = 'th')[source]¶
  Wrapper around a frozen newmm tokenizer to make it a fastai.BaseTokenizer. (see: https://docs.fast.ai/text.transform#BaseTokenizer)
- pythainlp.ulmfit.document_vector(text: str, learn, data, agg: str = 'mean')[source]¶
  Vectorize Thai input text into a 400-dimension vector, using a fastai language model and data bunch.
  - Parameters
    text (str) – input text to be vectorized
    learn – fastai language model learner
    data – fastai data bunch
    agg (str) – how to aggregate the encoder outputs ("mean" by default)
  - Returns
    numpy.array of document vector sized 400, based on the encoder of the model
  - Return type
    numpy.ndarray((1, 400))
  - Example
    >>> from pythainlp.ulmfit import document_vector
    >>> from fastai import *
    >>> from fastai.text import *
    >>>
    >>> # Load data bunch
    >>> data = load_data(MODEL_PATH, 'thwiki_lm_data.pkl')
    >>>
    >>> # Initialize language_model_learner
    >>> config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1,
    ...               qrnn=False, tie_weights=True, out_bias=True, output_p=0.25,
    ...               hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
    >>> trn_args = dict(drop_mult=0.9, clip=0.12, alpha=2, beta=1)
    >>> learn = language_model_learner(data, AWD_LSTM, config=config,
    ...                                pretrained=False, **trn_args)
    >>> document_vector('วันนี้วันดีปีใหม่', learn, data)
- See Also
A notebook showing how to train a ULMFiT language model and its usage: Jupyter Notebook
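The agg parameter controls how the encoder's per-token hidden states are pooled into the single 400-dimension document vector. A minimal sketch of that pooling step with NumPy, assuming "mean" and "sum" are the supported aggregations (the actual function runs the fastai encoder first to obtain the states):

```python
import numpy as np

def pool_hidden_states(hidden: np.ndarray, agg: str = "mean") -> np.ndarray:
    """Pool per-token encoder states of shape (seq_len, emb_sz) into one vector."""
    if agg == "mean":
        return hidden.mean(axis=0)
    if agg == "sum":
        return hidden.sum(axis=0)
    raise ValueError(f"Unknown aggregation: {agg}")

# Toy example: 3 tokens with 4-dimension states (the real model uses 400)
states = np.array([[1.0, 2.0, 3.0, 4.0],
                   [3.0, 2.0, 1.0, 0.0],
                   [2.0, 2.0, 2.0, 2.0]])
doc_vec = pool_hidden_states(states, agg="mean")
```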
- pythainlp.ulmfit.fix_html(text: str) → str[source]¶
  Apply a list of replacements to HTML artifacts in text. (code from fastai)
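These replacements undo artifacts left over from HTML-stripped text, such as the "amp; " in the process_thai example further down becoming "&". A rough re-implementation sketch using the standard library, not the exact fastai rule list:

```python
import html

def fix_html_sketch(text: str) -> str:
    # Repair common entities whose leading '&' was lost upstream
    text = (text.replace("amp;", "&")
                .replace("nbsp;", " ")
                .replace("#39;", "'"))
    # Unescape any remaining well-formed HTML entities
    return html.unescape(text)
```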
- pythainlp.ulmfit.lowercase_all(toks: Collection[str]) → List[str][source]¶
  Lowercase all English words; English words in Thai texts don’t usually have nuances of capitalization.
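The behaviour amounts to plain per-token lowercasing; a trivial sketch (Thai characters have no letter case, so only embedded English tokens are affected):

```python
def lowercase_all_sketch(toks):
    # Thai has no letter case, so .lower() only changes embedded English tokens
    return [tok.lower() for tok in toks]
```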
- pythainlp.ulmfit.merge_wgts(em_sz, wgts, itos_pre, itos_new)[source]¶
  Insert a new vocab into an existing set of model weights wgts, updating the weights for new vocab entries with the average embedding.
  - Parameters
    em_sz (int) – embedding size
    wgts – torch model weights of the pretrained model
    itos_pre – pretrained vocab (index-to-string)
    itos_new – new vocab (index-to-string)
  - Returns
    merged torch model weights
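The key idea — copy the embedding row for any token already in the pretrained vocab, and fall back to the row-wise average embedding for tokens that are new — can be sketched with NumPy. The real function operates on torch weight dicts; the names below are illustrative:

```python
import numpy as np

def merge_embeddings(em_sz, old_emb, itos_pre, itos_new):
    """Build a (len(itos_new), em_sz) matrix from a pretrained embedding matrix."""
    row_mean = old_emb.mean(axis=0)              # average embedding for unseen tokens
    stoi_pre = {s: i for i, s in enumerate(itos_pre)}
    new_emb = np.empty((len(itos_new), em_sz), dtype=old_emb.dtype)
    for i, s in enumerate(itos_new):
        idx = stoi_pre.get(s)
        new_emb[i] = old_emb[idx] if idx is not None else row_mean
    return new_emb

# Pretrained vocab ["a", "b"]; new vocab ["b", "c"] ("c" gets the average row)
pretrained = np.array([[1.0, 1.0], [3.0, 3.0]])
merged = merge_embeddings(2, pretrained, ["a", "b"], ["b", "c"])
```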
- pythainlp.ulmfit.process_thai(text: str, pre_rules: Collection = [fix_html, reorder_vowels, spec_add_spaces, rm_useless_spaces, rm_useless_newlines, rm_brackets, replace_url, replace_rep_nonum], tok_func: Callable = Tokenizer.word_tokenize, post_rules: Collection = [ungroup_emoji, lowercase_all, replace_wrep_post_nonum, remove_space]) → Collection[str][source]¶
  Process Thai texts for models (with sparse features as default).
  - Parameters
    text (str) – text to be cleaned
    pre_rules (list[func]) – rules to apply before tokenization
    tok_func (func) – tokenization function (by default, tok_func is pythainlp.tokenize.word_tokenize())
    post_rules (list[func]) – rules to apply after tokenization
  - Returns
    a list of cleaned and tokenized texts
  - Return type
    list[str]
  - Note
    The default pre-rules consist of fix_html(), reorder_vowels(), spec_add_spaces(), rm_useless_spaces(), rm_useless_newlines(), rm_brackets(), replace_url(), and replace_rep_nonum().
    The default post-rules consist of ungroup_emoji(), lowercase_all(), replace_wrep_post_nonum(), and remove_space().
  - Example
    Use default pre-rules and post-rules:
    >>> from pythainlp.ulmfit import process_thai
    >>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
    >>> process_thai(text)
    ['บ้าน', 'xxrep', ' ', 'อยู่', 'xxwrep', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'pythainlp', '&']
    Modify the pre_rules and post_rules arguments with rules provided in pythainlp.ulmfit:
    >>> from pythainlp.ulmfit import (
    ...     process_thai, replace_rep_after, fix_html, ungroup_emoji,
    ...     replace_wrep_post, remove_space)
    >>>
    >>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
    >>> process_thai(text,
    ...              pre_rules=[replace_rep_after, fix_html],
    ...              post_rules=[ungroup_emoji, replace_wrep_post, remove_space])
    ['บ้าน', 'xxrep', '5', '()', 'อยู่', 'xxwrep', '2', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'PyThaiNLP', '&']
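Conceptually, process_thai is a three-stage pipeline: apply the pre-rules to the raw string, tokenize, then apply the post-rules to the token list. A self-contained sketch of that composition, with stand-in rules and a whitespace tokenizer in place of the real newmm tokenizer:

```python
def run_pipeline(text, pre_rules, tok_func, post_rules):
    for rule in pre_rules:      # string -> string rules, applied in order
        text = rule(text)
    toks = tok_func(text)       # string -> list of tokens
    for rule in post_rules:     # list -> list rules, applied in order
        toks = rule(toks)
    return toks

# Stand-in rules for illustration only
strip_text = str.strip
lower_toks = lambda toks: [t.lower() for t in toks]

result = run_pipeline("  Hello WORLD  ", [strip_text], str.split, [lower_toks])
```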
- pythainlp.ulmfit.rm_brackets(text: str) → str[source]¶
  Remove all empty brackets and artifacts within brackets from text.
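A minimal regex sketch of the empty-bracket case; the actual rule also strips artifacts found inside brackets:

```python
import re

def rm_empty_brackets(text: str) -> str:
    # Drop (), [], {} pairs that contain only whitespace
    return re.sub(r"\(\s*\)|\[\s*\]|\{\s*\}", "", text)
```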
- pythainlp.ulmfit.rm_useless_spaces(text: str) → str[source]¶
  Remove multiple spaces in text. (code from fastai)
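The fastai rule this comes from is a one-line regex substitution; a sketch:

```python
import re

def rm_useless_spaces_sketch(text: str) -> str:
    # Collapse runs of two or more spaces into a single space
    return re.sub(" {2,}", " ", text)
```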
- pythainlp.ulmfit.remove_space(toks: Collection[str]) → List[str][source]¶
  Remove space tokens, which are not useful for bag-of-words models.
- pythainlp.ulmfit.replace_rep_after(text: str) → str[source]¶
  Replace character-level repetitions in text, placing the token after the repetition. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep 8 ย’; instead, the word is retained as ‘น้อย xxrep 8’.
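A regex sketch of the numbered variant, assuming a character repeated four or more times in a row is flagged (the exact threshold in the library may differ). Because the backreference keeps the first occurrence of the character in place, the word stays intact and the xxrep marker lands after it:

```python
import re

def replace_rep_after_sketch(text: str) -> str:
    def _rep(m):
        char, extra = m.groups()
        # Keep one copy of the character, then emit the marker and total count
        return f"{char} xxrep {len(extra) + 1} "
    # Flag any non-space character repeated four or more times in a row
    return re.sub(r"(\S)(\1{3,})", _rep, text)
```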
- pythainlp.ulmfit.replace_rep_nonum(text: str) → str[source]¶
  Replace character-level repetitions in text, placing the token after the repetition, without a count. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep ย’; instead, the word is retained as ‘น้อย xxrep’.
- pythainlp.ulmfit.replace_wrep_post(toks: Collection[str]) → List[str][source]¶
  Replace repetitive words after tokenization; fastai’s replace_wrep does not work well with Thai.
  - Parameters
    toks (list[str]) – list of tokens
  - Returns
    list of tokens in which an xxwrep token and a counter are inserted in front of each repetitive word
  - Return type
    list[str]
  - Example
    >>> from pythainlp.ulmfit import replace_wrep_post
    >>>
    >>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
    >>> replace_wrep_post(toks)
    ['กา', 'xxwrep', '3', 'น้ำ']
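The token-level logic can be sketched as a run-length scan over the token list. Matching the example above, the counter records the number of extra repeats beyond the first occurrence:

```python
def replace_wrep_post_sketch(toks):
    out, i = [], 0
    while i < len(toks):
        j = i
        while j + 1 < len(toks) and toks[j + 1] == toks[i]:
            j += 1                        # extend the run of identical tokens
        extra = j - i                     # repeats beyond the first occurrence
        if extra:
            out += ["xxwrep", str(extra), toks[i]]
        else:
            out.append(toks[i])
        i = j + 1
    return out
```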
- pythainlp.ulmfit.replace_wrep_post_nonum(toks: Collection[str]) → List[str][source]¶
  Replace repetitive words after tokenization, without a counter; fastai’s replace_wrep does not work well with Thai.
  - Parameters
    toks (list[str]) – list of tokens
  - Returns
    list of tokens in which an xxwrep token is inserted in front of each repetitive word
  - Return type
    list[str]
  - Example
    >>> from pythainlp.ulmfit import replace_wrep_post_nonum
    >>>
    >>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
    >>> replace_wrep_post_nonum(toks)
    ['กา', 'xxwrep', 'น้ำ']
- pythainlp.ulmfit.spec_add_spaces(text: str) → str[source]¶
  Add spaces around / and # in text. (code from fastai)
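A sketch of the rule as described; the fastai original may cover additional characters:

```python
import re

def spec_add_spaces_sketch(text: str) -> str:
    # Surround '/' and '#' with spaces so they become separate tokens
    return re.sub(r"([/#])", r" \1 ", text)
```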
- pythainlp.ulmfit.ungroup_emoji(toks: Collection[str]) → List[str][source]¶
  Ungroup Zero Width Joiner (ZWJ) emojis.
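A rough sketch of the idea: a token that is a run of standalone emoji is split into single characters, while sequences joined by a Zero Width Joiner (such as family emoji) are kept intact. The codepoint-range check below is a simplification for illustration, not the library's actual emoji detection:

```python
ZWJ = "\u200d"  # Zero Width Joiner

def ungroup_emoji_sketch(toks):
    out = []
    for tok in toks:
        # Simplified check: every char in the common emoji block, and no ZWJ inside
        if tok and ZWJ not in tok and all(ord(c) >= 0x1F300 for c in tok):
            out.extend(tok)   # split the grouped emoji into single characters
        else:
            out.append(tok)
    return out
```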