WangchanBERTa
This notebook demonstrates pythainlp.wangchanberta.
Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. WangchanBERTa: Pretraining transformer-based Thai Language Models. arXiv preprint arXiv:2101.09635. 2021 Jan 24.
[1]:
#!pip install pythainlp[full]
Successfully installed pythainlp-2.3.0.dev0 python-crfsuite-0.9.7 tinydb-4.4.0
[2]:
#!pip install transformers sentencepiece
Successfully installed sacremoses-0.0.43 sentencepiece-0.1.95 tokenizers-0.10.1 transformers-4.3.3
[3]:
from pythainlp.wangchanberta import ThaiNameTagger, pos_tag
Named Entity Recognition
Supported datasets:
thainer
lst20
[4]:
t = ThaiNameTagger(dataset_name="thainer")
[5]:
t.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=True)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[5]:
'ทดสอบผมมีชื่อว่า <PERSON>นายวรรณพงษ์ ภัททิยไพบูลย์</PERSON>'
[6]:
t.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=False)
[6]:
[('ทดสอบผมมีชื่อว่า ', 'O'), ('นายวรรณพงษ์ ภัททิยไพบูลย์', 'B-PERSON')]
[7]:
t.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=False)
[7]:
[('โรงเรียน', 'B-ORGANIZATION'),
('สวนกุหลาบ', 'I-ORGANIZATION'),
('เป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ', 'O')]
[8]:
t.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=True)
[8]:
'<ORGANIZATION>โรงเรียนสวนกุหลาบ</ORGANIZATION>เป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ'
[9]:
t2 = ThaiNameTagger(dataset_name="lst20", grouped_entities=True)
[10]:
t2.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=True)
[10]:
'ทดสอบผมมีชื่อว่า <TTL>นาย</TTL><PER>วรรณพงษ์ ภัททิยไพบูลย์</PER>'
[11]:
t2.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",tag=False)
[11]:
[('ทดสอบผมมีชื่อว่า ', 'O'),
('นาย', 'B-TTL'),
('วรรณพงษ์', 'B-PER'),
(' ', 'I-PER'),
('ภัททิยไพบูลย์', 'I-PER')]
[12]:
t2.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=False)
[12]:
[('โรงเรียนสวนกุหลาบ', 'B-ORG'),
('เป็นโรงเรียนที่ดี แต่ไม่มี', 'O'),
('สวนกุหลาบ', 'B-ORG')]
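Because the lst20 tagger also returns plain (text, tag) tuples, filtering for a single entity type is a one-liner. A minimal sketch using the 'ORG' label from the output above (filter_tag is a hypothetical name, not a PyThaiNLP function):

def filter_tag(tagged_tokens, label):
    """Return the text of every chunk whose tag ends with the given label."""
    return [text for text, tag in tagged_tokens if tag != "O" and tag.endswith(label)]

filter_tag(t2.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ", tag=False), "ORG")
# expected, given the output above: ['โรงเรียนสวนกุหลาบ', 'สวนกุหลาบ']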
Part of speech
It uses the lst20 dataset.
[13]:
pos_tag("ผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์")
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[13]:
[('ผม', 'PR'),
('มีชื่อว่า', 'NN'),
(' ', 'NN'),
('นาย', 'NN'),
('วรรณ', 'NN'),
('พงษ์', 'NN'),
(' ', 'NN'),
('ภั', 'NN'),
('ท', 'NN'),
('ทิ', 'NN'),
('ย', 'NN'),
('ไพบูลย์', 'NN')]
[14]:
pos_tag("ผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์",grouped_word=True)
[14]:
[('ผม', 'PR'), ('มีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์', 'NN')]
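pos_tag returns (word, LST20 POS tag) pairs, so the result can be processed like any list of tuples. A minimal sketch, using tag values taken from the output above:

from collections import Counter

tagged = pos_tag("ผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์", grouped_word=True)

# Count how often each POS tag appears in the sentence.
print(Counter(tag for _, tag in tagged))           # Counter({'PR': 1, 'NN': 1}) for the output above

# Keep only the pronouns ('PR' in the LST20 tag set).
print([word for word, tag in tagged if tag == "PR"])   # ['ผม']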
Subword
[15]:
from pythainlp.tokenize import subword_tokenize
[16]:
subword_tokenize("ทดสอบตัดคำย่อย", engine="wangchanberta")
[16]:
['▁', 'ทดสอบ', 'ตัด', 'คํา', 'ย่อย']
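The '▁' piece is SentencePiece's marker for the start of a word (it stands in for a preceding space). A minimal sketch of approximately recovering the surface text by joining the pieces:

pieces = subword_tokenize("ทดสอบตัดคำย่อย", engine="wangchanberta")

# Join the subwords and turn the SentencePiece whitespace marker back into spaces.
text = "".join(pieces).replace("▁", " ").strip()
print(text)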