Phupha-Word-freq

Phupha: Thai Word Frequency Dataset

Phupha is a Thai word frequency dataset built from the Common Crawl corpus.

GitHub: https://github.com/PyThaiNLP/Phupha-Word-freq

We use the Infini-gram mini API to query word counts from the Common Crawl corpus (July 2025 crawl).
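As a rough illustration of how per-word counts can be collected into a frequency table, here is a minimal Python sketch. It is not the code used to build Phupha: `build_frequency_table` and `count_fn` are hypothetical names, `count_fn` stands in for whatever call actually retrieves a count from the Infini-gram mini API, and the counts shown are made up.

```python
from collections.abc import Callable, Iterable


def build_frequency_table(
    words: Iterable[str],
    count_fn: Callable[[str], int],
) -> list[tuple[str, int]]:
    """Return (word, count) pairs sorted by descending count.

    `count_fn` is a hypothetical stand-in for a corpus count lookup.
    """
    # dict.fromkeys drops duplicate words while keeping first-seen order,
    # so each word is looked up only once.
    unique_words = list(dict.fromkeys(words))
    table = [(word, count_fn(word)) for word in unique_words]
    # Sort by count descending; ties are broken by the word itself.
    table.sort(key=lambda pair: (-pair[1], pair[0]))
    return table


# Made-up counts for illustration only; these are NOT values from the dataset.
fake_counts = {"ไทย": 120, "ภาษา": 45, "คำ": 300}
table = build_frequency_table(
    ["ไทย", "ภาษา", "คำ", "ไทย"],
    lambda word: fake_counts.get(word, 0),
)
```

Passing the lookup as a function keeps the sketch independent of any particular API client, so a real count query can be swapped in without changing the table-building logic.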

File:

Code license: Apache-2.0

Dataset license: Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0).

Citation

If you use Phupha in your project or publication, please cite the dataset as follows:

Phatthiyaphaibun, W. (2026). Phupha: Thai Word Frequency Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18490474

or by BibTeX entry:

@dataset{phatthiyaphaibun_2026_18490474,
  author       = {Phatthiyaphaibun, Wannaphong},
  title        = {Phupha: Thai Word Frequency Dataset},
  month        = feb,
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.18490474},
  url          = {https://doi.org/10.5281/zenodo.18490474},
}