PyThaiNLP

KhanomTanLLM: Open Source Thai LLM

2024-09-12T00:00:00+00:00

Image generated with FLUX.1 [dev]

Today we are pleased to introduce KhanomTanLLM (ขนมตาล LLM), the first open-source English-Thai language model trained on open datasets. We are releasing all of the data used to train the LLM, together with the training pipeline and models that can be used commercially. We are also releasing the model in both 1B and 3B sizes, making it the first open-source small LM for Thai that discloses the pretraining dataset, the pretraining pipeline, and the models.

The release of the Phi models sparked interest in using LLMs smaller than 7B in real-world applications. However, there are still few 1B and 3B models that support Thai, for example gemma-2b, Qwen2-1.5B, XGLM, mGPT, and RWKV. None of them has made its pretraining data publicly accessible, and gemma-2b does not count as open source because of its model terms of use. We therefore started collecting Thai-English datasets to build the small open-source LM we wanted, one that discloses the pretraining dataset, the pretraining pipeline, and the models.

GitHub KhanomTanLLM: https://github.com/PyThaiNLP/KhanomTanLLM

Dataset

We have released the dataset used to pretrain this LLM at:

Pretraining dataset: https://huggingface.co/datasets/wannaphong/KhanomTanLLM-pretrained-dataset

Thai subset only: https://huggingface.co/datasets/wannaphong/KhanomTanLLM-pretrained-dataset-thai-subset
List Thai subset: https://huggingface.co/collections/pythainlp/datasets-for-pretrained-thai-llm-65db96ab730386b492889a98

The full dataset contains 53,376,211,711 tokens:

English: 31,629,984,243 Tokens
Thai: 12,785,565,497 Tokens
Code: 8,913,084,300 Tokens
Parallel data: 190,310,686 Tokens

Token counts are based on the Typhoon-7B (https://huggingface.co/scb10x/typhoon-7b) tokenizer.

For English, we use synthetic data created by following Cosmopedia from HuggingFace, which synthesized an English dataset (https://huggingface.co/datasets/HuggingFaceTB/cosmopedia). We also use openwebtext (web data), epfl-llm/guidelines, MathPile_Commercial (mathematics data), minipile (a reduced version of The Pile), goodwiki (wiki data in markdown), and the bigscience datasets used to train the Bloom LM.

For dataset details, see https://huggingface.co/datasets/wannaphong/KhanomTanLLM-pretrained-dataset

Tokenizer

We decided to use the Typhoon-7B tokenizer (https://huggingface.co/scb10x/typhoon-7b) for our models, to save the resources that training our own tokenizer would require.
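As a rough illustration of reusing an existing tokenizer, the sketch below counts tokens over a list of texts. The helper names `count_tokens` and `load_typhoon_encoder` are hypothetical, and the sketch assumes the `transformers` package is installed and that the Typhoon-7B tokenizer can be downloaded from the Hugging Face Hub. It is not the script used to produce the token counts reported for the dataset.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List[int]]) -> int:
    """Sum the number of tokens that `encode` produces over all texts."""
    return sum(len(encode(text)) for text in texts)


def load_typhoon_encoder() -> Callable[[str], List[int]]:
    """Return an encoder backed by the Typhoon-7B tokenizer.

    Assumption: `transformers` is installed and the tokenizer files can be
    fetched from the Hugging Face Hub.
    """
    from transformers import AutoTokenizer  # imported lazily on purpose

    tokenizer = AutoTokenizer.from_pretrained("scb10x/typhoon-7b")
    # Special tokens are excluded so that only the text itself is counted.
    return lambda text: tokenizer.encode(text, add_special_tokens=False)
```

With network access, `count_tokens(["สวัสดี", "hello"], load_typhoon_encoder())` would return the combined token count of the two strings. Any other callable that maps a string to a list of tokens can be passed in the same way.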

Pretraining

Our LLM training pipeline is based on the EasyLM project, the pipeline used for the OpenLLaMA models. We applied for TPU support through Google's TPU Research Cloud program and used free Google Cloud credits for pretraining, so training the models cost us nothing.

We trained both the 1B and 3B models on the same dataset, using the Llama 2 architecture, for only 1 epoch so that no data is repeated.
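To give a feel for what a single pass over the data means in optimizer steps, here is a small arithmetic sketch. The token count is the dataset total stated above. The batch size and sequence length are hypothetical placeholders, not the settings used for KhanomTanLLM.

```python
def steps_for_one_epoch(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Number of full optimizer steps needed to see every token once."""
    tokens_per_step = batch_size * seq_len
    return total_tokens // tokens_per_step


TOTAL_TOKENS = 53_376_211_711  # dataset size stated in the Dataset section
BATCH_SIZE = 1024              # hypothetical: sequences per optimizer step
SEQ_LEN = 2048                 # hypothetical: tokens per sequence

steps = steps_for_one_epoch(TOTAL_TOKENS, BATCH_SIZE, SEQ_LEN)
print(f"{steps:,} steps for one epoch")
```

Stopping after this many steps means each token is seen at most once, which is the "no repeat" condition described above.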

The pretraining pipeline is available at https://github.com/wannaphong/EasyLM/tree/KhanomTanLLM-pretraining

Pretrained Models:

Model

After pretraining, we applied supervised fine-tuning (SFT) to the models. The resulting models and the dataset are listed below.

Instruct Models:

Instruct dataset: wannaphong/KhanomTanLLM-Instruct-dataset
SFT Script: https://github.com/PyThaiNLP/KhanomTanLLM/tree/main/finetuning
1B: https://huggingface.co/pythainlp/KhanomTanLLM-1B-Instruct
3B: https://huggingface.co/pythainlp/KhanomTanLLM-3B-Instruct/
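A minimal sketch of trying the instruct models with the Hugging Face `transformers` library follows. It assumes the checkpoints load through `AutoModelForCausalLM` (plausible given the Llama 2 architecture, but not confirmed here) and passes the prompt as raw text, since the expected prompt format is not described in this post. The `generate` helper and the `MODEL_IDS` table are hypothetical names.

```python
MODEL_IDS = {
    "1B": "pythainlp/KhanomTanLLM-1B-Instruct",
    "3B": "pythainlp/KhanomTanLLM-3B-Instruct",
}


def generate(prompt: str, size: str = "1B", max_new_tokens: int = 64) -> str:
    """Generate a continuation of `prompt` with one of the instruct models.

    Assumptions: `transformers` and `torch` are installed, and the model
    files can be downloaded from the Hugging Face Hub.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = MODEL_IDS[size]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    # The decoded text contains the prompt followed by the generated tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("สวัสดี", size="1B")` downloads the 1B model on first use. Check the model cards for the recommended prompt format before relying on the output.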

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). We used a TPU v4-64 to train the models.

Thank you to TPU Research Cloud and the EasyLM project! We used EasyLM to pretrain the models.

Closing remarks

If you evaluate the models, you will find that they perform rather poorly on many benchmarks. This is because the models are small, only 1B and 3B, and because we did not have enough resources to also train on large English datasets such as FineWeb, Dolma, or The Pile. We trained this LLM on only 53B tokens of text; trained on more than 1T tokens, it would likely perform better. In addition, the Thai data (12B tokens) is still too small to train an LLM to its best performance. The best remedies are to release more datasets to the public and to ask the Thai community to support open datasets. Synthetic data generation may also be one way to address the problem.

Finally, we hope that the pretraining dataset, pipeline, and models we have released will be useful to anyone interested in pretraining Thai LLMs and will help advance open-source AI in Thailand.

Written by Wannaphong Phatthiyaphaibun

PyThaiNLP 5.0 Released!

2024-02-10T00:00:00+00:00

We are excited to announce the latest release of PyThaiNLP: version 5.0! PyThaiNLP is a Python library for Thai natural language processing (NLP). With PyThaiNLP 5.0, you can expect improved performance and accuracy on Thai NLP tasks. We have also added new functions to make your NLP tasks easier and more efficient.

Documentation: https://pythainlp.github.io/docs/5.0
Report bug: https://github.com/PyThaiNLP/pythainlp/issues

See more: https://github.com/PyThaiNLP/pythainlp/releases/tag/v5.0.0

We build Thai NLP. #PyThaiNLP #ThaiNLP

PyThaiNLP Joined NLP-OSS at EMNLP 2023!

2023-12-08T00:00:00+00:00

PyThaiNLP was presented by Peerat Limkonchotiwat at the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS) on 6 Dec 2023 at EMNLP 2023 in Singapore.

You can read our paper, "PyThaiNLP: Thai Natural Language Processing in Python".

Poster: https://github.com/nlposs/NLP-OSS/blob/master/nlposs-2023/08-PyThaiNLP-Poster.pdf

Slide: https://github.com/nlposs/NLP-OSS/blob/master/nlposs-2023/08-PyThaiNLP-Slide.pdf

PyThaiNLP Joined Hacktoberfest 2023!

2023-09-23T00:00:00+00:00

PyThaiNLP has joined Hacktoberfest 2023! You can contribute to PyThaiNLP and get a free gift from Hacktoberfest 2023. Just write code and open a pull request!

Contributing to PyThaiNLP: https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md
GitHub: https://github.com/PyThaiNLP/pythainlp
Hacktoberfest: https://hacktoberfest.com

#Hacktoberfest2023 #Hacktoberfest #PyThaiNLP #ThaiNLP

Han-solo - Thai syllable segmenter Released!

2023-07-30T00:00:00+00:00

🪿 Han-solo: Thai syllable segmenter

This work aims to create a Thai syllable segmenter that works in the Thai social media domain. It uses data from the Wisesight Sentiment Corpus.

This work uses 2 datasets:

Nutcha Dataset (Thai news domain). See more data_nutcha/
Han-solo: Thai syllable segmenter dataset (Thai social media domain). See more Han-solo: Thai syllable segmenter

We train a CRF model that uses the same features as ssg.
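Syllable segmentation with a CRF is usually framed as labeling each character as either starting a syllable or continuing one. The sketch below shows that framing with a simple character-window feature function. The feature set here is illustrative only and is not the ssg feature set; `boundary_labels` and `char_features` are hypothetical helper names.

```python
from typing import Dict, List


def boundary_labels(syllables: List[str]) -> List[str]:
    """Label each character: "B" if it starts a syllable, otherwise "I"."""
    labels: List[str] = []
    for syllable in syllables:
        if not syllable:
            continue  # skip empty strings so labels stay aligned with characters
        labels.extend(["B"] + ["I"] * (len(syllable) - 1))
    return labels


def char_features(text: str, i: int, window: int = 2) -> Dict[str, str]:
    """Characters around position i, keyed by their relative offset."""
    features: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        features[f"char[{offset}]"] = text[j] if 0 <= j < len(text) else "<pad>"
    return features


def sentence_features(text: str, window: int = 2) -> List[Dict[str, str]]:
    """One feature dict per character, the usual input shape for a CRF."""
    return [char_features(text, i, window) for i in range(len(text))]
```

A training pair is then `sentence_features("".join(syllables))` together with `boundary_labels(syllables)`. At prediction time, each predicted "B" marks where a new syllable begins.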

This project is developed by 🪿 Wannaphong Phatthiyaphaibun.

GitHub: PyThaiNLP/Han-solo

Han-Coref Thai Coreference resolution by PyThaiNLP Released!

2023-05-24T00:00:00+00:00

Han-Coref: Thai Coreference resolution by PyThaiNLP

This project aims to create a Thai coreference resolution system.

This project is developed by 🪿 Wannaphong Phatthiyaphaibun.

GitHub: PyThaiNLP/han-coref

WangChanGLM Model Released!

2023-04-29T00:00:00+00:00

WangChanGLM is a multilingual, instruction-finetuned version of Facebook XGLM-7.5B, trained on open-source, commercially permissible datasets and released under CC BY-SA 4.0.

GitHub: WangChanGLM - The Multilingual Instruction-Following Model
Blog: Medium

PyThaiNLP 4.0 Released!

2023-04-14T00:00:00+00:00

PyThaiNLP published its first version, 0.0.4, to PyPI 6 years ago, so PyThaiNLP 4.0 has a special codename. The codename for PyThaiNLP 4.0 is PyThaiNLP 4.0 (Real).

See 4.0 Milestone.

Documentation: https://pythainlp.github.io/docs/4.0

Report bug: https://github.com/PyThaiNLP/pythainlp/issues

See PyThaiNLP 4.0 Change Log

If you want to contribute to PyThaiNLP, you can read Contributing to PyThaiNLP.

GitHub: https://github.com/PyThaiNLP/pythainlp/releases/tag/v4.0.0

Let's build a Thai conversation dataset for training a ChatGPT-like chatbot!

2023-02-19T00:00:00+00:00

The PyThaiNLP project has set up a documentation website so that anyone interested can help build a Thai conversation dataset for training a ChatGPT-like chatbot, as part of the Open Assistant project by LAION-AI.

You can read more at "Let's build a Thai conversation dataset for training a ChatGPT-like chatbot!"

pythainlp.github.io/Open-Assistant-Thailand/

We build Thai NLP.

PyThaiNLP

PyThaiNLP v3.1.1 Released!

2022-10-31T00:00:00+00:00

PyThaiNLP v3.1.1 is an update release for PyThaiNLP v3.1.0.

What’s Changed

pythainlp.tools.misspell changed to pythainlp.tools.misspell.misspell.
Add Reduce import time to PyThaiNLP 3.1.1 #753
Doc: Lst20 deprecation warning for 3.1.1 #752

You can install or upgrade with: pip install pythainlp==3.1.1
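The import-path change listed above can be sketched as follows. The wrapper `misspell_sentence` is a hypothetical helper. It assumes `misspell` accepts a sentence and a `ratio` argument, as in the PyThaiNLP documentation, and that PyThaiNLP and its optional dependencies are installed.

```python
from typing import Optional


def misspell_sentence(sentence: str, ratio: float = 0.05) -> Optional[str]:
    """Return a copy of `sentence` with simulated misspellings.

    Returns None when PyThaiNLP (or a dependency of the misspell tool)
    is not installed.
    """
    try:
        # Since v3.1.1 the function is imported from its own module.
        from pythainlp.tools.misspell import misspell
    except ImportError:
        return None
    return misspell(sentence, ratio=ratio)
```

Code that previously relied on `pythainlp.tools.misspell` being the function itself would need to switch to the import shown inside the helper.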

GitHub: https://github.com/PyThaiNLP/pythainlp/releases/tag/v3.1.1