pythainlp

Corpus test

This directory contains tests that verify the integrity, format, parseability, and catalog functionality of corpus in PyThaiNLP.

Purpose

These tests are separate from regular unit tests because:

They test actual file loading and parsing (not mocked)
Downloadable corpus tests require network access and can be slow
They verify corpus format and structure
They test corpus catalog download and query functionality
They should only run when corpus files or corpus code changes

Test categories

Corpus catalog tests (`test_catalog.py`)

Tests corpus catalog functionality:

Catalog download from remote server
Catalog URL and path validation
Catalog JSON structure verification
Querying specific corpus details
Version information validation

Built-in corpus tests (`test_builtin_corpus.py`)

Tests corpus files that are included in the package:

Text word lists (negations, stopwords, syllables, words, etc.)
CSV files (provinces)
Frequency data (TNC, TTC)
Name lists (family names, person names)

Downloadable corpus tests (`test_downloadable_corpus.py`)

Tests corpus files that need to be downloaded:

OSCAR word frequencies (96MB)
TNC bigram/trigram frequencies (41MB + 145MB)

This test will take longer time than others due to size of the downloads.

Running tests

Run all corpus tests:

python -m unittest discover -s tests/corpus -v

Run only catalog tests:

python -m unittest tests.corpus.test_catalog -v

Run only built-in corpus tests:

python -m unittest tests.corpus.test_builtin_corpus -v

Run only downloadable corpus tests:

python -m unittest tests.corpus.test_downloadable_corpus -v

CI integration

The corpus test runs automatically via GitHub Actions workflow (.github/workflows/corpus.yml) when:

Changes are made to pythainlp/corpus/**
Changes are made to tests/corpus/**
The workflow file itself is modified

What is tested

Each test verifies:

Loadability: File can be loaded without errors
Type correctness: Returns expected data type (frozenset, list, dict)
Non-empty: Contains actual data
Format validity: Data structure matches expected format
Content validity: Contains expected content (e.g., Thai characters)
Catalog functionality: Catalog can be downloaded and queried correctly

Adding new tests

When adding a new corpus file or function to pythainlp.corpus:

Add a test to test_builtin_corpus.py if it’s included in the package
Add a test to test_downloadable_corpus.py if it requires download
Add a test to test_catalog.py if it involves catalog operations
Verify the test catches format errors by temporarily breaking the corpus

Relationship to unit tests

Unit tests (tests/core/test_corpus.py): Use mocks for speed, test code logic
Corpus test (this directory): Use real data, test file integrity and catalog

Both test suites are important and complementary.

This site is open source. Improve this page.