pythainlp.ulmfit
Welcome to the pythainlp.ulmfit module, which provides tools for Universal Language Model Fine-tuning for Text Classification (ULMFiT). ULMFiT is a technique for pretraining a deep learning language model on a large text corpus and then fine-tuning it for specific text classification tasks.
Modules
- class pythainlp.ulmfit.ThaiTokenizer(lang: str = 'th')[source]
Wrapper around a frozen newmm tokenizer to make it a fastai.BaseTokenizer (see: https://docs.fast.ai/text.transform#BaseTokenizer).
The ThaiTokenizer class is a key component of ULMFiT preprocessing: it breaks Thai text into individual tokens using a frozen newmm engine and a dictionary specific to ULMFiT-related functions, so that tokenization stays consistent with the pretrained model.
- static tokenizer(text: str) → List[str] [source]
Tokenize text using a frozen newmm engine and the dictionary specific to ULMFiT-related functions (see: Dictionary file (.txt)).
- Parameters:
text (str) – text to tokenize
- Returns:
tokenized text
- Return type:
list[str]
- Example:
Using pythainlp.ulmfit.ThaiTokenizer.tokenizer() is similar to pythainlp.tokenize.word_tokenize() with the ulmfit engine.
>>> from pythainlp.ulmfit import ThaiTokenizer
>>> from pythainlp.tokenize import word_tokenize
>>>
>>> text = "อาภรณ์, จินตมยปัญญา ภาวนามยปัญญา"
>>> ThaiTokenizer.tokenizer(text)
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา', ' ', 'ภาวนามยปัญญา']
>>>
>>> word_tokenize(text, engine='ulmfit')
['อาภรณ์', ',', ' ', 'จิน', 'ตม', 'ย', 'ปัญญา', ' ', 'ภาวนามยปัญญา']
- pythainlp.ulmfit.document_vector(text: str, learn, data, agg: str = 'mean')[source]
This function vectorizes Thai input text into a 400-dimension vector using a fastai language model and data bunch.
- Parameters:
- text (str) – text to be vectorized with the fastai language model
- learn – fastai language model learner
- data – fastai data bunch
- agg (str) – how to aggregate the word vectors: 'mean' (default) or 'sum'
- Returns:
numpy.array of document vector sized 400, based on the encoder of the model
- Return type:
numpy.ndarray((1, 400))
- Example:
>>> from pythainlp.ulmfit import document_vector
>>> from fastai import *
>>> from fastai.text import *
>>>
>>> # Load Data Bunch
>>> data = load_data(MODEL_PATH, 'thwiki_lm_data.pkl')
>>>
>>> # Initialize language_model_learner
>>> config = dict(emb_sz=400, n_hid=1550, n_layers=4, pad_token=1,
...     qrnn=False, tie_weights=True, out_bias=True, output_p=0.25,
...     hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15)
>>> trn_args = dict(drop_mult=0.9, clip=0.12, alpha=2, beta=1)
>>> learn = language_model_learner(data, AWD_LSTM, config=config,
...     pretrained=False, **trn_args)
>>> document_vector('วันนี้วันดีปีใหม่', learn, data)
- See Also:
A notebook demonstrating how to train a ULMFiT language model and how to use it: Jupyter Notebook
The document_vector function computes document vectors for text data. This is often used in text classification tasks, where documents must be represented as numerical vectors for machine learning models.
- pythainlp.ulmfit.fix_html(text: str) → str [source]
Replace HTML strings in text. (Code from fastai.)
- Parameters:
text (str) – text to replace HTML strings in
- Returns:
text with HTML strings replaced
- Return type:
str
- Example:
>>> from pythainlp.ulmfit import fix_html
>>> fix_html("Anbsp;amp;nbsp;B @.@ ")
A & B.
The fix_html function is a text preprocessing utility that handles HTML-specific characters, making text cleaner and more suitable for text classification.
- pythainlp.ulmfit.lowercase_all(toks: Collection[str]) → List[str] [source]
Lowercase all English words; English words in Thai texts don’t usually have nuances of capitalization.
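- Example:
A minimal illustrative call (the expected output assumes each token is simply lowercased, with Thai characters unaffected):
>>> from pythainlp.ulmfit import lowercase_all
>>> lowercase_all(["PyThaiNLP", "รัก", "NLP"])
['pythainlp', 'รัก', 'nlp']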
The lowercase_all function converts all tokens to lowercase. This ensures uniformity in text data and reduces the vocabulary size for text classification tasks.
- pythainlp.ulmfit.merge_wgts(em_sz, wgts, itos_pre, itos_new)[source]
This function inserts pretrained weights and vocabulary into a new set of weights and vocabulary, initializing the embedding of any vocabulary item not found in the pretrained vocabulary with the average embedding.
- Parameters:
- em_sz (int) – embedding size
- wgts – torch model weights
- itos_pre – integer-to-string mapping of the pretrained vocabulary
- itos_new – integer-to-string mapping of the new vocabulary
- Returns:
merged torch model weights
- Example:
from pythainlp.ulmfit import merge_wgts
import torch

wgts = {'0.encoder.weight': torch.randn(5, 3)}
itos_pre = ["แมว", "คน", "หนู"]
itos_new = ["ปลา", "เต่า", "นก"]
em_sz = 3

merge_wgts(em_sz, wgts, itos_pre, itos_new)
# output:
# {'0.encoder.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]]),
#  '0.encoder_dp.emb.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]]),
#  '1.decoder.weight': tensor([[0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011],
#         [0.5952, 0.4453, 0.0011]])}
The merge_wgts function merges pretrained weights and vocabulary with a new vocabulary, which is crucial when adapting a pretrained ULMFiT model to a new corpus.
- pythainlp.ulmfit.process_thai(text: str, pre_rules: Collection = [fix_html, reorder_vowels, spec_add_spaces, rm_useless_spaces, rm_useless_newlines, rm_brackets, replace_url, replace_rep_nonum], tok_func: Callable = Tokenizer.word_tokenize, post_rules: Collection = [ungroup_emoji, lowercase_all, replace_wrep_post_nonum, remove_space]) → Collection[str] [source]
Process Thai texts for models (with sparse features as default)
- Parameters:
- text (str) – text to be cleaned
- pre_rules (Collection) – rules to apply to the text before tokenization
- tok_func (Callable) – tokenization function; by default, the word_tokenize method of a pythainlp.tokenize.core.Tokenizer
- post_rules (Collection) – rules to apply to the tokens after tokenization
- Returns:
a list of cleaned tokenized texts
- Return type:
Collection[str]
- Note:
The default pre-rules consist of fix_html(), reorder_vowels(), spec_add_spaces(), rm_useless_spaces(), rm_useless_newlines(), rm_brackets(), replace_url() and replace_rep_nonum(). The default post-rules consist of ungroup_emoji(), lowercase_all(), replace_wrep_post_nonum() and remove_space().
- Example:
Use default pre-rules and post-rules:
>>> from pythainlp.ulmfit import process_thai
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
>>> process_thai(text)
['บ้าน', 'xxrep', ' ', 'อยู่', 'xxwrep', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'pythainlp', '&']
Modify the pre_rules and post_rules arguments with rules provided in pythainlp.ulmfit:
>>> from pythainlp.ulmfit import (process_thai, replace_rep_after, fix_html,
...     ungroup_emoji, replace_wrep_post, remove_space)
>>>
>>> text = "บ้านนนนน () อยู่นานนานนาน 😂🤣😃😄😅 PyThaiNLP amp; "
>>> process_thai(text,
...     pre_rules=[replace_rep_after, fix_html],
...     post_rules=[ungroup_emoji, replace_wrep_post, remove_space])
['บ้าน', 'xxrep', '5', '()', 'อยู่', 'xxwrep', '2', 'นาน', '😂', '🤣', '😃', '😄', '😅', 'PyThaiNLP', '&']
The process_thai function is designed for preprocessing Thai text data, a vital step in preparing text for ULMFiT-based text classification.
- pythainlp.ulmfit.rm_brackets(text: str) → str [source]
Remove all empty brackets and artifacts within brackets from text.
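- Example:
A minimal illustrative call (the expected output assumes empty bracket pairs such as (), [], and {} are stripped while surrounding text is kept; exact whitespace handling may differ):
>>> from pythainlp.ulmfit import rm_brackets
>>> rm_brackets("วันนี้() อากาศ[]ดี{}")
'วันนี้ อากาศดี'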
The rm_brackets function removes empty brackets and bracket artifacts from text, making it more suitable for text classification tasks that don't require bracket information.
- pythainlp.ulmfit.rm_useless_newlines(text: str) → str [source]
Remove multiple newlines in text.
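- Example:
A minimal illustrative call (the expected output assumes runs of two or more newlines are collapsed; whether they collapse to a space or a single newline depends on the implementation):
>>> from pythainlp.ulmfit import rm_useless_newlines
>>> rm_useless_newlines("ผม\n\n\nรัก\n\nเธอ")
'ผม รัก เธอ'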
The rm_useless_newlines function eliminates unnecessary newlines in text data, ensuring that text is more compact and easier to work with in ULMFiT-based text classification.
- pythainlp.ulmfit.rm_useless_spaces(text: str) → str [source]
Remove multiple spaces in text. (Code from fastai.)
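- Example:
A minimal illustrative call (the expected output assumes runs of two or more spaces collapse to a single space, as in the fastai rule):
>>> from pythainlp.ulmfit import rm_useless_spaces
>>> rm_useless_spaces("ผม   รัก  เธอ")
'ผม รัก เธอ'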
The rm_useless_spaces function removes extraneous spaces from text, making it cleaner and more efficient for ULMFiT-based text classification.
- pythainlp.ulmfit.remove_space(toks: Collection[str]) → List[str] [source]
Remove space tokens, which should not be included for bag-of-words models.
- Parameters:
toks (list[str]) – list of tokens
- Returns:
list of tokens where space tokens (" ") are filtered out
- Return type:
list[str]
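- Example:
A minimal illustrative call (the expected output assumes tokens consisting only of spaces are dropped and all other tokens pass through unchanged):
>>> from pythainlp.ulmfit import remove_space
>>> remove_space(['วัน', ' ', 'นี้', ' ', 'ดี'])
['วัน', 'นี้', 'ดี']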
The remove_space function removes space tokens from tokenized text, streamlining the token list for classification purposes.
- pythainlp.ulmfit.replace_rep_after(text: str) → str [source]
Replace repetitions at the character level in text, placing the marker after the repeated character. This prevents a case such as 'น้อยยยยยยยย' becoming 'น้อ xxrep 8 ย'; instead the word is retained as 'น้อย xxrep 8'.
- Parameters:
text (str) – input text to replace character repetitions in
- Returns:
text with repetitive token xxrep and the counter after the repeated character
- Return type:
str
- Example:
>>> from pythainlp.ulmfit import replace_rep_after
>>>
>>> text = "กาาาาาาา"
>>> replace_rep_after(text)
'กาxxrep7 '
The replace_rep_after function replaces a run of repeated characters with a single occurrence followed by an xxrep token and the repetition count. This helps standardize text data for text classification.
- pythainlp.ulmfit.replace_rep_nonum(text: str) → str [source]
Replace repetitions at the character level in text, placing the marker after the repetition. This prevents a case such as 'น้อยยยยยยยย' becoming 'น้อ xxrep ย'; instead the word is retained as 'น้อย xxrep '.
- Parameters:
text (str) – input text to replace character repetition
- Returns:
text with repetitive token xxrep after character repetition
- Return type:
str
- Example:
>>> from pythainlp.ulmfit import replace_rep_nonum
>>>
>>> text = "กาาาาาาา"
>>> replace_rep_nonum(text)
'กา xxrep '
The replace_rep_nonum function is similar to replace_rep_after, but it inserts the xxrep token without the repetition count.
- pythainlp.ulmfit.replace_wrep_post(toks: Collection[str]) → List[str] [source]
Replace repetitive words after tokenization; fastai replace_wrep does not work well with Thai.
- Parameters:
toks (list[str]) – list of tokens
- Returns:
list of tokens where the xxwrep token and a counter are added before repeated words
- Return type:
list[str]
- Example:
>>> from pythainlp.ulmfit import replace_wrep_post
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post(toks)
['กา', 'xxwrep', '3', 'น้ำ']
The replace_wrep_post function replaces consecutive repeated words with a single occurrence preceded by an xxwrep token and a repetition counter. This reduces redundancy in text data, making it more efficient for text classification tasks.
- pythainlp.ulmfit.replace_wrep_post_nonum(toks: Collection[str]) → List[str] [source]
Replace repetitive words after tokenization; fastai replace_wrep does not work well with Thai.
- Parameters:
toks (list[str]) – list of tokens
- Returns:
list of tokens where the xxwrep token is added in front of repeated words
- Return type:
list[str]
- Example:
>>> from pythainlp.ulmfit import replace_wrep_post_nonum
>>>
>>> toks = ["กา", "น้ำ", "น้ำ", "น้ำ", "น้ำ"]
>>> replace_wrep_post_nonum(toks)
['กา', 'xxwrep', 'น้ำ']
Similar to replace_wrep_post, the replace_wrep_post_nonum function collapses repeated words but inserts the xxwrep token without a repetition counter.
- pythainlp.ulmfit.spec_add_spaces(text: str) → str [source]
Add spaces around / and # in text. (Code from fastai.)
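- Example:
A minimal illustrative call (the expected output assumes a space is inserted on both sides of each / and #, as in the fastai rule):
>>> from pythainlp.ulmfit import spec_add_spaces
>>> spec_add_spaces("ราคา#1/ชิ้น")
'ราคา # 1 / ชิ้น'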
The spec_add_spaces function adds spaces around the special characters / and #, helping standardize text for ULMFiT-based text classification.
- pythainlp.ulmfit.ungroup_emoji(toks: Collection[str]) → List[str] [source]
Ungroup Zero Width Joiner (ZWJ) emojis.
See https://emojipedia.org/emoji-zwj-sequence/
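- Example:
A minimal illustrative call (the expected output assumes a token made up entirely of emoji is split into individual emoji, while other tokens pass through unchanged):
>>> from pythainlp.ulmfit import ungroup_emoji
>>> ungroup_emoji(["ดีใจ", "😂🤣"])
['ดีใจ', '😂', '🤣']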
The ungroup_emoji function splits grouped emoji sequences in tokenized text into individual emojis, which can be crucial for emoji recognition and classification tasks.