pythainlp

Corpus test

This directory contains tests that verify the integrity, format, parseability, and catalog functionality of corpus in PyThaiNLP.

Purpose

These tests are separate from regular unit tests because:

  1. They test actual file loading and parsing (not mocked)
  2. Downloadable corpus tests require network access and can be slow
  3. They verify corpus format and structure
  4. They test corpus catalog download and query functionality
  5. They should only run when corpus files or corpus code changes

Test categories

Corpus catalog tests (test_catalog.py)

Tests corpus catalog functionality:

Built-in corpus tests (test_builtin_corpus.py)

Tests corpus files that are included in the package:

Downloadable corpus tests (test_downloadable_corpus.py)

Tests corpus files that need to be downloaded:

This test will take longer time than others due to size of the downloads.

Running tests

Run all corpus tests:

python -m unittest discover -s tests/corpus -v

Run only catalog tests:

python -m unittest tests.corpus.test_catalog -v

Run only built-in corpus tests:

python -m unittest tests.corpus.test_builtin_corpus -v

Run only downloadable corpus tests:

python -m unittest tests.corpus.test_downloadable_corpus -v

CI integration

The corpus test runs automatically via GitHub Actions workflow (.github/workflows/corpus.yml) when:

What is tested

Each test verifies:

  1. Loadability: File can be loaded without errors
  2. Type correctness: Returns expected data type (frozenset, list, dict)
  3. Non-empty: Contains actual data
  4. Format validity: Data structure matches expected format
  5. Content validity: Contains expected content (e.g., Thai characters)
  6. Catalog functionality: Catalog can be downloaded and queried correctly

Adding new tests

When adding a new corpus file or function to pythainlp.corpus:

  1. Add a test to test_builtin_corpus.py if it’s included in the package
  2. Add a test to test_downloadable_corpus.py if it requires download
  3. Add a test to test_catalog.py if it involves catalog operations
  4. Verify the test catches format errors by temporarily breaking the corpus

Relationship to unit tests

Both test suites are important and complementary.