pythainlp.ulmfit
Universal Language Model Fine-tuning for Text Classification (ULMFiT).
Modules
- class pythainlp.ulmfit.ThaiTokenizer(lang: str = 'th')[source]
Wrapper around a frozen newmm tokenizer to make it a fastai.BaseTokenizer (see: https://docs.fast.ai/text.transform#BaseTokenizer).
- static tokenizer(text: str) List[str] [source]
Tokenize text with a frozen newmm engine, using the dictionary specific to ULMFiT-related functions (see: Dictionary file (.txt)).
- Parameters
text (str) – text to tokenize
- Returns
tokenized text
- Return type
list[str]
- Example
Using pythainlp.ulmfit.ThaiTokenizer.tokenizer() is similar to pythainlp.tokenize.word_tokenize() with the ulmfit engine.
>>> from pythainlp.ulmfit import ThaiTokenizer
>>> from pythainlp.tokenize import word_tokenize
>>>
>>> text = "อาภรณ์, จินตมยปัญญา ภาวนามยปัญญา"
>>> ThaiTokenizer.tokenizer(text)
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา', ' ', 'ภาวนามยปัญญา']
>>>
>>> word_tokenize(text, engine='ulmfit')
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา', ' ', 'ภาวนามยปัญญา']
- pythainlp.ulmfit.document_vector(text: str, learn, data, agg: str = 'mean')[source]
This function vectorizes Thai input text into a 400-dimension vector, using a fastai language model and data bunch.
- Parameters
text (str) – text to be vectorized
learn – fastai language model learner
data – fastai data bunch
agg (str) – aggregation method for the encoder outputs (default: 'mean')
- Returns
numpy.array of document vector sized 400, based on the encoder of the model
- Return type
numpy.ndarray((1, 400))
- Example
>>> from pythainlp.ulmfit import document_vector
>>> from fastai import *
>>> from fastai.text import *
>>>
>>> # Load Data Bunch
>>> data = load_data(MODEL_PATH, 'thwiki_lm_data.pkl')
>>>
>>> # Initialize language_model_learner
>>> config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1,
...               qrnn=False, tie_weights=True, out_bias=True, output_p=0.25,
...               hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
>>> trn_args = dict(drop_mult=0.9, clip=0.12, alpha=2, beta=1)
>>> learn = language_model_learner(data, AWD_LSTM, config=config,
...                                pretrained=False, **trn_args)
>>> document_vector('วันนี้วันดีปีใหม่', learn, data)
- See Also
A Jupyter notebook showing how to train a ULMFiT language model and how to use it.
- pythainlp.ulmfit.fix_html(text: str) str [source]
Apply a list of replacements to fix leftover HTML strings in text. (code from fastai)
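The idea can be sketched as a few string replacements followed by standard HTML unescaping. This is a minimal hypothetical sketch, not fastai's exact replacement list:

```python
import html

def fix_html_sketch(text: str) -> str:
    # Undo common escaped fragments left over from scraped HTML,
    # then unescape any remaining entities.
    for src, dst in [("amp;", "&"), ("#39;", "'"), ("nbsp;", " "), ("<br />", "\n")]:
        text = text.replace(src, dst)
    return html.unescape(text)

print(fix_html_sketch("PyThaiNLP amp; friends"))  # PyThaiNLP & friends
```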
- pythainlp.ulmfit.lowercase_all(toks: Collection[str]) List[str] [source]
Lowercase all English words; English words in Thai texts don’t usually have nuances of capitalization.
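Since str.lower() is a no-op on Thai characters, the behavior can be sketched by lowercasing every token; only cased scripts such as Latin are affected. A minimal sketch with a hypothetical function name:

```python
from typing import Collection, List

def lowercase_all_sketch(toks: Collection[str]) -> List[str]:
    # str.lower() only changes cased scripts such as Latin,
    # so Thai tokens pass through unchanged.
    return [tok.lower() for tok in toks]

print(lowercase_all_sketch(["PyThaiNLP", "ภาษา", "NLP"]))  # ['pythainlp', 'ภาษา', 'nlp']
```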
- pythainlp.ulmfit.merge_wgts(em_sz, wgts, itos_pre, itos_new)[source]
This function inserts a new vocab into an existing set of model weights (wgts): items already present in the pretrained vocab keep their pretrained weights, while new items are initialized with the average embedding.
- Parameters
em_sz – embedding size
wgts – pretrained model weights
itos_pre – pretrained vocab (index-to-string list)
itos_new – new vocab (index-to-string list)
- Returns
merged torch model weights
- Example
from pythainlp.ulmfit import merge_wgts
import torch

wgts = {'0.encoder.weight': torch.randn(5, 3)}
itos_pre = ["แมว", "คน", "หนู"]
itos_new = ["ปลา", "เต่า", "นก"]
em_sz = 3
merge_wgts(em_sz, wgts, itos_pre, itos_new)
# output:
# {'0.encoder.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]]),
#  '0.encoder_dp.emb.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]]),
#  '1.decoder.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]])}
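The average-embedding initialization can be illustrated with a small NumPy sketch. The real function operates on torch weight dictionaries; all names below are hypothetical:

```python
import numpy as np

# Hypothetical sketch of average-embedding initialization:
# words found in the pretrained vocab keep their rows; unseen words
# are initialized with the mean of all pretrained rows.
em_sz = 3
pretrained = np.random.randn(3, em_sz)      # rows for ["แมว", "คน", "หนู"]
itos_pre = ["แมว", "คน", "หนู"]
itos_new = ["ปลา", "คน", "นก"]

stoi_pre = {w: i for i, w in enumerate(itos_pre)}
row_mean = pretrained.mean(axis=0)
merged = np.stack([pretrained[stoi_pre[w]] if w in stoi_pre else row_mean
                   for w in itos_new])

print(merged.shape)  # (3, 3)
```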
- pythainlp.ulmfit.process_thai(text: str, pre_rules: ~typing.Collection = [<function fix_html>, <function reorder_vowels>, <function spec_add_spaces>, <function rm_useless_spaces>, <function rm_useless_newlines>, <function rm_brackets>, <function replace_url>, <function replace_rep_nonum>], tok_func: ~typing.Callable = <bound method Tokenizer.word_tokenize of <pythainlp.tokenize.core.Tokenizer object>>, post_rules: ~typing.Collection = [<function ungroup_emoji>, <function lowercase_all>, <function replace_wrep_post_nonum>, <function remove_space>]) Collection[str] [source]
Process Thai texts for models (with sparse features as default)
- Parameters
text (str) – text to be cleaned
pre_rules (list[func]) – rules to apply before tokenization
tok_func (func) – tokenization function (by default, tok_func is pythainlp.tokenize.word_tokenize())
post_rules (list[func]) – rules to apply after tokenization
- Returns
a list of cleaned tokenized texts
- Return type
Collection[str]
- Note
The default pre-rules consist of fix_html(), reorder_vowels(), spec_add_spaces(), rm_useless_spaces(), rm_useless_newlines(), rm_brackets(), replace_url(), and replace_rep_nonum().
The default post-rules consist of ungroup_emoji(), lowercase_all(), replace_wrep_post_nonum(), and remove_space().
- Example
Use default pre-rules and post-rules:
>>> from pythainlp.ulmfit import process_thai
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
>>> process_thai(text)
['บ้าน', 'xxrep', ' ', 'อยู่', 'xxwrep', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'pythainlp', '&']
Modify the pre_rules and post_rules arguments with rules provided in pythainlp.ulmfit:
>>> from pythainlp.ulmfit import (process_thai, replace_rep_after, fix_html,
...                               ungroup_emoji, replace_wrep_post, remove_space)
>>>
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
>>> process_thai(text,
...              pre_rules=[replace_rep_after, fix_html],
...              post_rules=[ungroup_emoji, replace_wrep_post, remove_space])
['บ้าน', 'xxrep', '5', '()', 'อยู่', 'xxwrep', '2', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'PyThaiNLP', '&']
- pythainlp.ulmfit.rm_brackets(text: str) str [source]
Remove all empty brackets and artifacts within brackets from text.
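The behavior can be sketched with a few regex substitutions that strip empty bracket pairs and tidy up the leftover spaces. A hypothetical sketch, not the library code:

```python
import re

def rm_empty_brackets_sketch(text: str) -> str:
    # Strip empty (), [], {} pairs, then collapse leftover double spaces.
    text = re.sub(r"\(\s*\)", " ", text)
    text = re.sub(r"\[\s*\]", " ", text)
    text = re.sub(r"\{\s*\}", " ", text)
    return re.sub(r" {2,}", " ", text).strip()

print(rm_empty_brackets_sketch("บ้าน () อยู่"))  # บ้าน อยู่
```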
- pythainlp.ulmfit.rm_useless_spaces(text: str) str [source]
Remove multiple spaces in text. (code from fastai)
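The documented behavior amounts to collapsing runs of spaces into a single space. A minimal sketch:

```python
import re

def rm_useless_spaces_sketch(text: str) -> str:
    # Collapse runs of two or more spaces into one.
    return re.sub(" {2,}", " ", text)

print(rm_useless_spaces_sketch("สวัสดี   ครับ"))  # สวัสดี ครับ
```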
- pythainlp.ulmfit.remove_space(toks: Collection[str]) List[str] [source]
Remove space tokens, which carry no signal for bag-of-words models.
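Dropping whitespace-only tokens from a token list can be sketched as a simple filter (hypothetical function name):

```python
from typing import Collection, List

def remove_space_sketch(toks: Collection[str]) -> List[str]:
    # Drop tokens that are empty or whitespace-only.
    return [tok for tok in toks if tok.strip()]

print(remove_space_sketch(["กา", " ", "น้ำ"]))  # ['กา', 'น้ำ']
```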
- pythainlp.ulmfit.replace_rep_after(text: str) str [source]
Replace character-level repetitions in text, placing the marker after the repeated character. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep 8 ย’; instead it retains the word as ‘น้อย xxrep 8’.
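The rule can be sketched with a regex backreference: a non-space character followed by two or more copies of itself collapses to the character, an 'xxrep' marker, and the total count. A hypothetical re-implementation, not the library code:

```python
import re

def replace_rep_after_sketch(text: str) -> str:
    # A character repeated n times (n >= 3) becomes '<char> xxrep <n> '.
    def _sub(m: "re.Match") -> str:
        char, reps = m.groups()
        return f"{char} xxrep {len(reps) + 1} "
    return re.sub(r"(\S)(\1{2,})", _sub, text)

print(replace_rep_after_sketch("น้อ" + "ย" * 8))  # น้อย xxrep 8
```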
- pythainlp.ulmfit.replace_rep_nonum(text: str) str [source]
Replace character-level repetitions in text, placing the marker after the repeated character, without the count. This prevents a case such as ‘น้อยยยยยยยย’ becoming ‘น้อ xxrep ย’; instead it retains the word as ‘น้อย xxrep ‘.
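This variant uses the same character-repetition rule but drops the count. A hypothetical sketch:

```python
import re

def replace_rep_nonum_sketch(text: str) -> str:
    # A repeated run collapses to the character plus an 'xxrep' marker, no count.
    return re.sub(r"(\S)(\1{2,})", lambda m: f"{m.group(1)} xxrep ", text)

print(replace_rep_nonum_sketch("น้อ" + "ย" * 8))  # น้อย xxrep
```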
- pythainlp.ulmfit.replace_wrep_post(toks: Collection[str]) List[str] [source]
Replace repetitive words post-tokenization; fastai's replace_wrep does not work well with Thai.
- Parameters
toks (Collection[str]) – list of tokens
- Returns
list of tokens in which an xxwrep token and a counter are added in front of repeated words
- Return type
List[str]
- Example
>>> from pythainlp.ulmfit import replace_wrep_post
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post(toks)
['กา', 'xxwrep', '3', 'น้ำ']
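The behavior in the example can be sketched as a run-length pass over the token list: a token repeated n times (n > 1) becomes ['xxwrep', str(n - 1), token]. A hypothetical re-implementation, not the library code:

```python
from typing import Collection, List, Optional

def replace_wrep_post_sketch(toks: Collection[str]) -> List[str]:
    out: List[str] = []
    prev: Optional[str] = None
    count = 0
    for tok in list(toks) + [None]:   # sentinel flushes the last run
        if tok == prev:
            count += 1
            continue
        if count > 1:
            out.extend(["xxwrep", str(count - 1), prev])
        elif prev is not None:
            out.append(prev)
        prev, count = tok, 1
    return out

print(replace_wrep_post_sketch(["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]))
# ['กา', 'xxwrep', '3', 'น้ำ']
```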
- pythainlp.ulmfit.replace_wrep_post_nonum(toks: Collection[str]) List[str] [source]
Replace repetitive words post-tokenization; fastai's replace_wrep does not work well with Thai.
- Parameters
toks (Collection[str]) – list of tokens
- Returns
list of tokens in which an xxwrep token is added in front of repeated words
- Return type
List[str]
- Example
>>> from pythainlp.ulmfit import replace_wrep_post_nonum
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post_nonum(toks)
['กา', 'xxwrep', 'น้ำ']
- pythainlp.ulmfit.spec_add_spaces(text: str) str [source]
Add spaces around / and # in text. (code from fastai)
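The rule amounts to padding the two characters with spaces so they tokenize separately. A minimal sketch:

```python
import re

def spec_add_spaces_sketch(text: str) -> str:
    # Pad '/' and '#' with spaces so they become separate tokens.
    return re.sub(r"([/#])", r" \1 ", text)

print(spec_add_spaces_sketch("ไทย/อังกฤษ"))  # ไทย / อังกฤษ
```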
- pythainlp.ulmfit.ungroup_emoji(toks: Collection[str]) List[str] [source]
Ungroup emojis while keeping Zero Width Joiner (ZWJ) emoji sequences intact.
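The core idea can be sketched as splitting a run of emoji into separate tokens while keeping ZWJ (U+200D) sequences, such as family emoji, together. A simplified hypothetical sketch; the real function also detects which characters are emoji:

```python
from typing import List

def split_emoji_run_sketch(tok: str) -> List[str]:
    out: List[str] = []
    current = ""
    for ch in tok:
        if ch == "\u200d" or current.endswith("\u200d"):
            current += ch          # continue a ZWJ sequence
        else:
            if current:
                out.append(current)
            current = ch
    if current:
        out.append(current)
    return out

print(split_emoji_run_sketch("😂🤣😃"))      # ['😂', '🤣', '😃']
print(split_emoji_run_sketch("👨\u200d👩"))  # ['👨\u200d👩']
```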