Half-Precision Support
This page documents which kornia modules support half-precision floating-point dtypes
(torch.float16 and torch.bfloat16) and what limitations to expect.
| Module | float16 | bfloat16 | Notes |
|---|---|---|---|
|  | ⚠️ Partial | ⚠️ Partial | Most color space conversions work for both half-precision dtypes. FFT-based operations may fail on CUDA. |
|  | ⚠️ Partial | ⚠️ Partial | Basic convolution-based filters (Gaussian, Sobel, Median, Box) work for both dtypes. FFT-based operations ( |
|  | ⚠️ Partial | ⚠️ Partial | Histogram equalization, CLAHE, gamma correction, and ZCA whitening work for both dtypes. ZCA linalg ops go through |
|  | ✅ Yes | ✅ Yes | Uses only convolution and pooling; no dtype restrictions. |
|  | ⚠️ Partial | ⚠️ Partial | Both dtypes are accepted by |
|  | ⚠️ Partial | ⚠️ Partial | Affine, homography, resize, and warp operations use |
|  | ⚠️ Partial | ⚠️ Partial | Pinhole camera model and most projection ops work for both dtypes. |
|  | ❌ No | ❌ No |  |
|  | ⚠️ Partial | ⚠️ Partial | SVD and solve operations use |
|  | ⚠️ Partial | ⚠️ Partial | Uses |
|  | ⚠️ Partial | ⚠️ Partial | Most rotation/translation operations (SO2, SO3, SE2, SE3) work for both dtypes via cast helpers. A few code paths may still fail. |
|  | ⚠️ Partial | ⚠️ Partial | RANSAC-based solvers use |
|  | ⚠️ Partial | ⚠️ Partial | Soft-argmax and weighted softmax work for both dtypes. Precision-sensitive ops may produce inaccurate results. |
|  | ⚠️ Partial | ⚠️ Partial | Photometric losses (SSIM, PSNR, MS-SSIM) work for both dtypes. Losses based on linalg operations (Hausdorff, etc.) may not. |
|  | ⚠️ Partial | ⚠️ Partial | Local feature detectors and descriptors (SIFT, HardNet, DISK, DeDoDe) work for inference. Feature matching uses a manual |
|  | ⚠️ Partial | ⚠️ Partial | Simple pixel-level metrics work for both dtypes. Metrics involving linalg operations may not. |
|  | ⚠️ Partial | ⚠️ Partial | Conv-based models work for both dtypes. Attention-based models (e.g. VLMs, ViTs) may have internal dtype mismatches. |
Legend
✅ Yes: Works correctly; results are accurate at the given precision.
⚠️ Partial: Some operations work; others fail at runtime or produce inaccurate results due to limited numerical range/precision.
❌ No: Not supported; raises a RuntimeError or TypeError at runtime (explicit dtype check in the implementation).
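The "limited numerical range/precision" mentioned in the legend can be illustrated without any tensor library. The sketch below is not kornia or PyTorch code: it emulates float16 with the standard-library `struct` format `e`, and approximates bfloat16 by truncating a float32 to its top 16 bits. The helper names `to_float16` and `to_bfloat16_truncated` are hypothetical, and the truncation is a simplification, since real conversions typically round to nearest.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip x through IEEE 754 binary16 (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16_truncated(x: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Assumption: truncation stands in for the round-to-nearest that real
    conversions typically use, so the last bit may differ.
    """
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


# float16 keeps 10 fraction bits, so 4097 collapses onto 4096.
print(to_float16(4097.0))             # 4096.0

# bfloat16 keeps only 7 fraction bits, so 257 already collapses onto 256.
print(to_bfloat16_truncated(257.0))   # 256.0

# float16 tops out at 65504; struct refuses larger magnitudes.
try:
    to_float16(70000.0)
except OverflowError as exc:
    print("float16 overflow:", exc)

# bfloat16 shares float32's exponent range, so 70000 fits, only coarsely.
print(to_bfloat16_truncated(70000.0))  # 69632.0
```

The two formats fail in different ways: float16 runs out of range first, while bfloat16 runs out of precision first. This is one reason a module can behave differently under the two dtypes even when both are marked Partial.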
Test Results
Measured on commit 6131e98 (2026-03-21), full test suite (no --runslow).
Pass% = passed ÷ (passed + failed); skipped and xfailed tests are excluded.
| Run | Passed | Failed | Skipped | Pass% |
|---|---|---|---|---|
| CPU float32 (baseline) | 7647 | 3 | 3269 | 99.9% |
| CUDA float32 (baseline) | 7634 | 3 | 3280 | 99.9% |
| CPU float16 | 6866 | 747 | 3306 | 90.1% |
| CPU bfloat16 | 6838 | 812 | 3269 | 89.3% |
| CUDA float16 (KORNIA_TEST_IN_SUBPROCESS=1) | 6727 | 643 | 3556 | 91.3% |
| CUDA bfloat16 (KORNIA_TEST_IN_SUBPROCESS=1) | 6695 | 713 | 3518 | 90.4% |
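As a worked example of the Pass% definition, the following sketch recomputes the half-precision rows from the passed and failed counts in the table. The function name `pass_rate` is illustrative only. Values are printed to two decimals, so the last digit can differ slightly from the table depending on whether a figure was rounded or truncated.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Pass% as defined above: skipped and xfailed tests are not counted."""
    return 100.0 * passed / (passed + failed)


# (passed, failed) pairs copied from the table above.
runs = {
    "CPU float16": (6866, 747),
    "CPU bfloat16": (6838, 812),
    "CUDA float16": (6727, 643),
    "CUDA bfloat16": (6695, 713),
}

for name, (passed, failed) in runs.items():
    print(f"{name}: {pass_rate(passed, failed):.2f}%")
```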
Note
CUDA half-precision results are measured with KORNIA_TEST_IN_SUBPROCESS=1,
which bypasses the skip_half_precision_on_cuda fixture. Each test then
runs in the same process, with the cuda_device_assert_guard fixture
synchronising CUDA before and after it. For full isolation, the
--isolate-half-precision flag instead runs each test in a fresh process
via subprocess.run, with no shared CUDA state.
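The bypass described in this note can be pictured as a small predicate. The sketch below is an assumption about the decision logic, not the code in kornia's conftest.py: the function name `should_skip_half_on_cuda` and its string-based arguments are hypothetical, and only the environment variable name comes from this page.

```python
import os

HALF_DTYPES = {"float16", "bfloat16"}


def should_skip_half_on_cuda(dtype_name, device_type, environ=None):
    """Return True when a half-precision CUDA test should be skipped.

    Hypothetical stand-in for the skip fixture's decision:
    - with KORNIA_TEST_IN_SUBPROCESS=1 set, the skip is bypassed;
    - otherwise float16/bfloat16 tests on CUDA are skipped.
    """
    environ = os.environ if environ is None else environ
    if environ.get("KORNIA_TEST_IN_SUBPROCESS") == "1":
        return False
    return device_type == "cuda" and dtype_name in HALF_DTYPES


# Combined run: half precision on CUDA is skipped, everything else runs.
print(should_skip_half_on_cuda("float16", "cuda", environ={}))   # True
print(should_skip_half_on_cuda("float32", "cuda", environ={}))   # False
print(should_skip_half_on_cuda("bfloat16", "cpu", environ={}))   # False

# Measurement run: the environment variable bypasses the skip.
env = {"KORNIA_TEST_IN_SUBPROCESS": "1"}
print(should_skip_half_on_cuda("float16", "cuda", environ=env))  # False
```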
Test Suite Behaviour
Half-precision tests live in the same directories and files as their
float32/float64 counterparts. They are run as separate, isolated pytest
invocations rather than being mixed into a combined --dtype=all run.
This prevents a CUDA device-side assert in a half-precision test from
corrupting the CUDA context and causing unrelated float32 tests to fail.
```shell
# Standard precision (default CI)
pixi run test tests/ --dtype=float32,float64

# Half-precision: run in isolation, per directory
pytest tests/color/ --dtype=float16,bfloat16
pytest tests/geometry/ --dtype=float16,bfloat16 --device=cuda
```
Two autouse fixtures in the root conftest.py enforce safe behaviour:
``skip_half_precision_on_cuda``: skips float16/bfloat16 tests on CUDA in combined runs, so no half-precision kernel is ever launched and no device-side assert can fire.
``cuda_device_assert_guard``: synchronises CUDA before and after each CUDA test, so that an asynchronous device-side assert is reported in the test that caused it rather than in the next one. If the context is already corrupted, the test is skipped rather than allowed to fail spuriously.
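The synchronise-before-and-after pattern behind the guard fixture can be sketched independently of pytest and PyTorch. The code below is an illustration under assumptions, not the fixture's implementation: `device_assert_guard` and `ContextCorrupted` are hypothetical names, and the `synchronize` argument stands in for a call such as `torch.cuda.synchronize`.

```python
from contextlib import contextmanager


class ContextCorrupted(Exception):
    """The device was already in an error state before the test started."""


@contextmanager
def device_assert_guard(synchronize):
    """Synchronise before and after a block so async errors are attributed correctly.

    `synchronize` is any callable that raises RuntimeError when the device
    has a pending error (an assumption made for this sketch).
    """
    try:
        # A failure here was left behind by earlier work, not by this test.
        synchronize()
    except RuntimeError as exc:
        raise ContextCorrupted(str(exc)) from exc
    yield
    # A failure here was caused by the guarded block itself.
    synchronize()


# Usage with a fake device that records each synchronisation.
events = []
with device_assert_guard(lambda: events.append("sync")):
    events.append("test body")
print(events)  # ['sync', 'test body', 'sync']
```

In a real fixture, `ContextCorrupted` would be translated into a skip, matching the behaviour described above, while an error from the second synchronisation would fail the test that triggered it.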
With --isolate-half-precision, each float16/bfloat16 CUDA test is
intercepted by a custom pytest_runtest_protocol hook and executed in a
completely fresh Python process via subprocess.run. There is no shared
CUDA context between tests, so a device-side assert in one test cannot affect
any other.
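The isolation property described above, that a failure in one process cannot affect another, can be demonstrated with `subprocess.run` alone. This is a minimal sketch and not the `pytest_runtest_protocol` hook itself: the helper `run_isolated` is hypothetical, and a process exiting with a non-zero status stands in for a test that hits a device-side assert.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a brand-new interpreter process."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# The first snippet kills its own process; the second starts from a clean state.
crashed = run_isolated("import os; os._exit(3)")
healthy = run_isolated("print('ok')")

print(crashed.returncode)                         # 3
print(healthy.returncode, healthy.stdout.strip())  # 0 ok
```

The cost of this design is one interpreter start-up per test, which is why it suits an opt-in flag rather than the default run.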
See TESTING.md in the repository root for a full description of the
contamination mechanism and fixture implementation.