Corpus test
This directory contains tests that verify the integrity, format,
parseability, and catalog functionality of the corpora in PyThaiNLP.
Purpose
These tests are separate from regular unit tests because:
- They test actual file loading and parsing (not mocked)
- Downloadable corpus tests require network access and can be slow
- They verify corpus format and structure
- They test corpus catalog download and query functionality
- They should only run when corpus files or corpus code change
Test categories
Corpus catalog tests (test_catalog.py)
Tests corpus catalog functionality:
- Catalog download from remote server
- Catalog URL and path validation
- Catalog JSON structure verification
- Querying specific corpus details
- Version information validation
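As an illustration of what a catalog structure check can look like, here is a minimal sketch. The schema (a `latest_version` field and a `versions` mapping per entry), the function `validate_catalog`, and the sample data are all assumptions made for this example; they are not taken from test_catalog.py or from the real catalog file.

```python
import json
from typing import Any, Dict, List

# Assumed schema, for illustration only: every catalog entry is an object
# with a "latest_version" string and a "versions" mapping.
REQUIRED_ENTRY_KEYS = ("latest_version", "versions")


def validate_catalog(catalog: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    for name, entry in catalog.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        for key in REQUIRED_ENTRY_KEYS:
            if key not in entry:
                problems.append(f"{name}: missing '{key}'")
        versions = entry.get("versions")
        latest = entry.get("latest_version")
        # The advertised latest version should be one of the listed versions.
        if isinstance(versions, dict) and latest is not None and latest not in versions:
            problems.append(f"{name}: latest_version '{latest}' not listed in versions")
    return problems


# Hypothetical sample data: one consistent entry, one inconsistent entry.
sample = json.loads("""
{
  "example_corpus": {
    "latest_version": "1.0",
    "versions": {"1.0": {"filename": "example.txt"}}
  },
  "broken_corpus": {
    "latest_version": "2.0",
    "versions": {"1.0": {"filename": "broken.txt"}}
  }
}
""")
problems = validate_catalog(sample)
```

Returning a list of problems rather than raising on the first one lets a single test run report every malformed entry at once.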
Built-in corpus tests (test_builtin_corpus.py)
Tests corpus files that are included in the package:
- Text word lists (negations, stopwords, syllables, words, etc.)
- CSV files (provinces)
- Frequency data (TNC, TTC)
- Name lists (family names, person names)
Downloadable corpus tests (test_downloadable_corpus.py)
Tests corpus files that need to be downloaded:
- OSCAR word frequencies (96MB)
- TNC bigram/trigram frequencies (41MB + 145MB)
These tests take longer than the others because of
the size of the downloads.
Running tests
Run all corpus tests:
python -m unittest discover -s tests/corpus -v
Run only catalog tests:
python -m unittest tests.corpus.test_catalog -v
Run only built-in corpus tests:
python -m unittest tests.corpus.test_builtin_corpus -v
Run only downloadable corpus tests:
python -m unittest tests.corpus.test_downloadable_corpus -v
CI integration
The corpus tests run automatically via the
GitHub Actions workflow (.github/workflows/corpus.yml) when:
- Changes are made to pythainlp/corpus/**
- Changes are made to tests/corpus/**
- The workflow file itself is modified
What is tested
Each test verifies:
- Loadability: File can be loaded without errors
- Type correctness: Returns expected data type (frozenset, list, dict)
- Non-empty: Contains actual data
- Format validity: Data structure matches expected format
- Content validity: Contains expected content (e.g., Thai characters)
- Catalog functionality: Catalog can be downloaded and queried correctly
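The checks above can be sketched as a single unittest case. This is a minimal illustration, not code from this directory: `load_word_list`, `has_thai_char`, and the temporary sample file are hypothetical stand-ins, so the example runs without PyThaiNLP installed. A real test would call the corresponding loader in pythainlp.corpus instead.

```python
import tempfile
import unittest
from pathlib import Path
from typing import FrozenSet


def load_word_list(path: Path) -> FrozenSet[str]:
    """Hypothetical loader: one word per line, blank lines ignored."""
    with open(path, encoding="utf-8") as f:
        return frozenset(line.strip() for line in f if line.strip())


def has_thai_char(text: str) -> bool:
    """True if text has at least one character in the Thai Unicode block."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


class WordListTest(unittest.TestCase):
    def setUp(self) -> None:
        # Stand-in for a packaged corpus file (hypothetical sample data).
        self._tmp = tempfile.TemporaryDirectory()
        self.path = Path(self._tmp.name) / "words_sample.txt"
        self.path.write_text("ไม่\nมิ\nหามิได้\n", encoding="utf-8")

    def tearDown(self) -> None:
        self._tmp.cleanup()

    def test_word_list(self) -> None:
        words = load_word_list(self.path)         # loadability
        self.assertIsInstance(words, frozenset)   # type correctness
        self.assertGreater(len(words), 0)         # non-empty
        for word in words:
            self.assertIsInstance(word, str)      # format validity
            self.assertTrue(has_thai_char(word))  # content validity
```

Saved in a test module, the case is picked up by the same `python -m unittest` commands shown under "Running tests".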
Adding new tests
When adding a new corpus file or function to pythainlp.corpus:
- Add a test to test_builtin_corpus.py if it’s included in the package
- Add a test to test_downloadable_corpus.py if it requires download
- Add a test to test_catalog.py if it involves catalog operations
- Verify the test catches format errors by temporarily breaking the corpus
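One way to carry out the last step without touching the real corpus is to break a temporary copy and confirm the check fails. The sketch below assumes a tab-separated "word, count" frequency format; `parse_frequency_file` and `check_detects_corruption` are hypothetical names used only for this example.

```python
import tempfile
from pathlib import Path
from typing import Dict


def parse_frequency_file(path: Path) -> Dict[str, int]:
    """Hypothetical parser: each non-blank line is 'word<TAB>count'."""
    freqs: Dict[str, int] = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            parts = line.split("\t")
            if len(parts) != 2:
                raise ValueError(
                    f"line {lineno}: expected 2 fields, got {len(parts)}"
                )
            word, count = parts
            freqs[word] = int(count)  # int() raises ValueError on a bad count
    return freqs


def check_detects_corruption() -> bool:
    """Return True if a good file parses and a broken copy is rejected."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.tsv"
        good.write_text("คน\t120\nบ้าน\t85\n", encoding="utf-8")
        if parse_frequency_file(good) != {"คน": 120, "บ้าน": 85}:
            return False

        # Deliberately break the format: the tab on line 2 becomes a space.
        broken = Path(tmp) / "broken.tsv"
        broken.write_text("คน\t120\nบ้าน 85\n", encoding="utf-8")
        try:
            parse_frequency_file(broken)
        except ValueError:
            return True   # the corruption was caught
        return False      # the corruption went unnoticed
```

If `check_detects_corruption()` returned False, the format check would be too permissive to protect the corpus.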
Relationship to unit tests
- Unit tests (tests/core/test_corpus.py): Use mocks for speed, test code logic
- Corpus tests (this directory): Use real data, test file integrity and catalog
Both test suites are important and complementary.
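The difference between the two styles can be sketched with the standard library alone. `count_entries` is a hypothetical helper invented for this example; the unit-style check replaces `open()` with `unittest.mock.mock_open`, while the corpus-style check reads a real file from disk.

```python
import tempfile
from pathlib import Path
from unittest import mock


def count_entries(path: str) -> int:
    """Hypothetical helper: count non-blank lines in a word list."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f.read().splitlines() if line.strip())


def unit_style_check() -> int:
    # No file is touched: open() is replaced by a mock returning canned text.
    fake_open = mock.mock_open(read_data="ก\nข\n\nค\n")
    with mock.patch("builtins.open", fake_open):
        return count_entries("does/not/exist.txt")


def corpus_style_check() -> int:
    # A real file is written and read back, so disk I/O and decoding run.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_text("ก\nข\n\nค\n", encoding="utf-8")
        return count_entries(str(path))
```

Both functions exercise the same counting logic, but only the corpus-style check would notice a file that is missing, truncated, or saved in the wrong encoding.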