kornia.contrib

Models

Base

class kornia.models.base.ModelBase(*args, **kwargs)[source]

Abstract model class with some utility functions.

compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options=None, disable=False)[source]

Compile this model with torch.compile().

Parameters:
  • fullgraph (bool, optional) – Whether Dynamo should require a single full graph. Default: False

  • dynamic (bool, optional) – Whether dynamic shape tracing is enabled. Default: False

  • backend (str, optional) – Compilation backend name passed to torch.compile(). Default: "inductor"

  • mode (Optional[str], optional) – Optional backend-specific compilation mode. Default: None

  • options (Optional[dict[Any, Any]], optional) – Optional backend-specific option dictionary. Default: None

  • disable (bool, optional) – If True, return an uncompiled model wrapper according to PyTorch’s compile semantics. Default: False

Return type:

ModelBase[TypeVar(ModelConfig)]

Returns:

Compiled model object with the same high-level interface as this instance.

abstractmethod static from_config(config)[source]

Build/load the model.

Parameters:

config (TypeVar(ModelConfig)) – The specification of the model to be built/loaded.

Return type:

ModelBase[TypeVar(ModelConfig)]

load_checkpoint(checkpoint, device=None)[source]

Load checkpoint from a given url or file.

Parameters:
  • checkpoint (str) – The url or filepath for the respective checkpoint

  • device (Optional[device], optional) – The device on which to load the weights and to which the model is moved. Default: None

Return type:

None

EfficientViT

class kornia.models.efficient_vit.EfficientViT(backbone)[source]

EfficientViT backbone model.

__init__(backbone)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(images)[source]

Extract features from the input images.

Parameters:

images (Tensor) – input images tensor of shape \((B, C, H, W)\).

Returns:

a dictionary containing the features.

Return type:

Dict[str, torch.Tensor]

static from_config(config)[source]

Build the EfficientViT model from a configuration object.

Parameters:

config (EfficientViTConfig) – EfficientViT configuration object. See EfficientViTConfig.

Returns:

the EfficientViT model.

Return type:

EfficientViT

load_checkpoint(checkpoint, device=None)[source]

Load checkpoint from a given url or file.

Parameters:
  • checkpoint (str) – The url or filepath for the respective checkpoint

  • device (Optional[device], optional) – The device on which to load the weights and to which the model is moved. Default: None

Return type:

None

class kornia.models.efficient_vit.EfficientViTConfig(checkpoint=<factory>)[source]

Configuration to construct EfficientViT model.

Model weights can be loaded from a checkpoint URL or local path. The model weights are hosted on HuggingFace’s model hub: https://huggingface.co/kornia.

Parameters:

checkpoint (str, optional) – URL or local path of model weights. Default: <factory>

checkpoint: str
classmethod from_pretrained(model_type, resolution)[source]

Return a configuration object from a pre-trained model.

Parameters:
  • model_type (Literal['b1', 'b2', 'b3']) – model type, one of "b1", "b2", "b3".

  • resolution (Literal[224, 256, 288]) – input resolution, one of 224, 256, 288.

Return type:

EfficientViTConfig

Backbones

class kornia.models.efficient_vit.backbone.EfficientViTBackbone(width_list, depth_list, in_channels=3, dim=32, expand_ratio=4, norm='bn2d', act_func='hswish')[source]

Implement the EfficientViT backbone architecture.

EfficientViT is a high-speed vision transformer designed for efficient inference on mobile and edge devices by optimizing the attention mechanism and structural blocks.

Parameters:
  • width_list (list[int]) – List of widths for each stage.

  • depth_list (list[int]) – List of depths (number of blocks) for each stage.

  • in_channels (int, optional) – Number of input image channels. Default: 3.

  • dim (int, optional) – Dimension of the query, key, and value tensors in the attention mechanism. Default: 32.

  • expand_ratio (float, optional) – Expansion ratio for the MBConv blocks. Default: 4.

  • norm (str, optional) – Normalization layer type. Default: “bn2d”.

  • act_func (str, optional) – Activation function type. Default: “hswish”.

static build_local_block(in_channels, out_channels, stride, expand_ratio, norm, act_func, fewer_norm=False)[source]

Build the local convolution block used between EfficientViT stages.

Parameters:
  • in_channels (int) – Number of input feature channels.

  • out_channels (int) – Number of output feature channels.

  • stride (int) – Spatial stride for the block.

  • expand_ratio (float) – Expansion ratio used by MBConv-style blocks.

  • norm (str) – Normalization layer name.

  • act_func (str) – Activation function name.

  • fewer_norm (bool, optional) – If True, omit selected normalization layers. Default: False

Return type:

Module

Returns:

Depthwise-separable or inverted-bottleneck convolution block.

forward(x)[source]

Run the EfficientViT backbone and collect stage outputs.

Parameters:

x (Tensor) – Image tensor with shape \((B, C, H, W)\).

Return type:

dict[str, Tensor]

Returns:

Dictionary containing the input, each stage output, and "stage_final" for the final feature map.

kornia.models.efficient_vit.backbone.efficientvit_backbone_b0(**kwargs)[source]

Create EfficientViT B0.

Return type:

EfficientViTBackbone

kornia.models.efficient_vit.backbone.efficientvit_backbone_b1(**kwargs)[source]

Create EfficientViT B1.

Return type:

EfficientViTBackbone

kornia.models.efficient_vit.backbone.efficientvit_backbone_b2(**kwargs)[source]

Create EfficientViT B2.

Return type:

EfficientViTBackbone

kornia.models.efficient_vit.backbone.efficientvit_backbone_b3(**kwargs)[source]

Create EfficientViT B3.

Return type:

EfficientViTBackbone

class kornia.models.efficient_vit.backbone.EfficientViTLargeBackbone(width_list, depth_list, in_channels=3, qkv_dim=32, norm='bn2d', act_func='gelu')[source]

Implement the large-scale variant of the EfficientViT backbone.

This backbone is designed for high-resolution dense prediction tasks. It utilizes multi-scale linear attention to achieve a global receptive field while maintaining linear computational complexity relative to the input resolution.

Parameters:
  • width_list (list[int]) – List of channel widths for each stage of the backbone.

  • depth_list (list[int]) – List of number of blocks for each stage.

  • in_channels (int, optional) – Number of input image channels. Default: 3.

  • qkv_dim (int, optional) – The internal dimension for query, key, and value projections in the attention layers. Default: 32.

  • norm (str, optional) – Normalization layer type to use (e.g., “bn2d”, “ln”). Default: “bn2d”.

  • act_func (str, optional) – Activation function type to use (e.g., “gelu”, “relu”). Default: “gelu”.

static build_local_block(stage_id, in_channels, out_channels, stride, expand_ratio, norm, act_func, fewer_norm=False)[source]

Build a local block for an EfficientViT large stage.

Parameters:
  • stage_id (int) – Index of the stage being constructed.

  • in_channels (int) – Number of input feature channels.

  • out_channels (int) – Number of output feature channels.

  • stride (int) – Spatial stride for the block.

  • expand_ratio (float) – Expansion ratio controlling intermediate channels.

  • norm (str) – Normalization layer name.

  • act_func (str) – Activation function name.

  • fewer_norm (bool, optional) – If True, use the reduced-normalization variant. Default: False

Return type:

Module

Returns:

Residual, fused-MBConv, or MBConv block chosen for the stage.

forward(x)[source]

Run the EfficientViT large backbone and collect stage outputs.

Parameters:

x (Tensor) – Image tensor with shape \((B, C, H, W)\).

Return type:

dict[str, Tensor]

Returns:

Dictionary with entries for each stage and "stage_final" for the final feature map.

kornia.models.efficient_vit.backbone.efficientvit_backbone_l0(**kwargs)[source]

Create EfficientViT L0.

Return type:

EfficientViTLargeBackbone

kornia.models.efficient_vit.backbone.efficientvit_backbone_l1(**kwargs)[source]

Create EfficientViT L1.

Return type:

EfficientViTLargeBackbone

kornia.models.efficient_vit.backbone.efficientvit_backbone_l2(**kwargs)[source]

Create EfficientViT L2.

Return type:

EfficientViTLargeBackbone

kornia.models.efficient_vit.backbone.efficientvit_backbone_l3(**kwargs)[source]

Create EfficientViT L3.

Return type:

EfficientViTLargeBackbone

Structures

class kornia.models.structures.SegmentationResults(logits, scores, mask_threshold=0.0, _original_res_logits=None)[source]

Encapsulate the results obtained by a Segmentation model.

Parameters:
  • logits (Tensor) – Results logits with shape \((B, C, H, W)\), where \(C\) refers to the number of predicted masks

  • scores (Tensor) – The scores from the logits. Shape \((B, C)\)

  • mask_threshold (float, optional) – The threshold value to generate the binary_masks from the logits Default: 0.0

property binary_masks: Tensor

Binary mask generated from logits considering the mask_threshold.

The shape is the same as that of the logits, \((B, C, H, W)\), where \(C\) is the number of masks predicted.

Note

If you run original_res_logits, this will generate the masks based on the original resolution logits. Otherwise, this will use the low resolution logits (self.logits).
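As a rough illustration of the thresholding described above, the following plain-Python sketch turns a small grid of logits into a binary mask. It assumes a strict greater-than comparison against mask_threshold, and the helper name binarize is hypothetical (it is not part of kornia).

```python
def binarize(logits, mask_threshold=0.0):
    """Return a boolean mask that is True where a logit exceeds the threshold."""
    # Assumption: strict comparison (logit > threshold).
    return [[value > mask_threshold for value in row] for row in logits]


logits = [[-1.5, 0.2], [3.0, -0.1]]
mask = binarize(logits)
```

With the default threshold of 0.0, positive logits map to True and the rest to False.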

logits: Tensor
mask_threshold: float = 0.0
original_res_logits(input_size, original_size, image_size_encoder)[source]

Remove padding and upscale the logits to the original image size.

Resize to image encoder input -> remove padding (bottom and right) -> Resize to original size

Note

This method sets an internal original_res_logits, which is used for the binary masks when available.

Parameters:
  • input_size (tuple[int, int]) – The size of the image input to the model, in (H, W) format. Used to remove padding.

  • original_size (tuple[int, int]) – The original size of the image before resizing for input to the model, in (H, W) format.

  • image_size_encoder (Optional[tuple[int, int]]) – The size of the input image for the image encoder, in (H, W) format. Used to resize the logits back to encoder resolution before removing the padding.

Return type:

Tensor

Returns:

Batched logits in \((K, C, H, W)\) format, where (H, W) is given by original_size.
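The three steps described above (resize to encoder size, crop the bottom/right padding, resize to the original size) can be sketched on a single 2D grid in plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and nearest-neighbour sampling is used for brevity, which may differ from the interpolation kornia applies.

```python
def resize_nearest(grid, out_h, out_w):
    """Resize a 2D list with nearest-neighbour sampling."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_original_resolution(logits, input_size, original_size, encoder_size):
    """Resize to encoder size, crop bottom/right padding, resize to original."""
    enc_h, enc_w = encoder_size
    in_h, in_w = input_size
    upscaled = resize_nearest(logits, enc_h, enc_w)
    # Padding was added at the bottom and right, so keep the top-left region.
    unpadded = [row[:in_w] for row in upscaled[:in_h]]
    return resize_nearest(unpadded, *original_size)


low_res = [[0.0, 1.0], [2.0, 3.0]]
restored = to_original_resolution(
    low_res, input_size=(2, 4), original_size=(4, 8), encoder_size=(4, 4)
)
```

Here the bottom half of the encoder-sized grid is treated as padding and discarded, so only the values 0.0 and 1.0 survive in the restored grid.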

scores: Tensor
squeeze(dim=0)[source]

Squeeze the given dimension of all properties.

Return type:

SegmentationResults

class kornia.models.structures.Prompts(points=None, boxes=None, masks=None)[source]

Encapsulate the prompts inputs for a Model.

Parameters:
  • points (Optional[tuple[Tensor, Tensor]], optional) – A tuple with the keypoints (coordinates x, y) and their respective labels. Shape \((K, N, 2)\) for the keypoints, and \((K, N)\) for the labels. Default: None

  • boxes (Optional[Tensor], optional) – Batched box inputs, with shape \((K, 4)\). Expected to be in xyxy format. Default: None

  • masks (Optional[Tensor], optional) – Batched mask prompts to the model with shape \((K, 1, H, W)\) Default: None

boxes: Tensor | None = None
property keypoints: Tensor | None

The keypoints from the points.

property keypoints_labels: Tensor | None

The keypoints labels from the points.

masks: Tensor | None = None
points: tuple[Tensor, Tensor] | None = None

VisualPrompter

class kornia.contrib.visual_prompter.VisualPrompter(config=None, device=None, dtype=None)[source]

Allow the user to run multiple queries with multiple prompts for a model.

At the moment, only the SAM model is supported. The model is loaded based on the given config.

By default, the images are resized so that their long side matches image_encoder.img_size. This class transforms both the images and the prompts before prediction. The image is also passed automatically through preprocess_image, which normalizes the image and pads it to the size expected by the SAM model \((\text{image_encoder.img_size}, \text{image_encoder.img_size})\). By default, the image is normalized with the mean and standard deviation of the SAM dataset.
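To make the long-side resizing concrete, here is a small plain-Python sketch of the size arithmetic and of how a point prompt would be rescaled to follow the image. The helper names are hypothetical, the rounding rule is an assumption, and the target size of 1024 is used purely as an illustrative encoder size.

```python
def long_side_size(height, width, target):
    """Scale (height, width) so that the longer side equals target."""
    scale = target / max(height, width)
    # Assumption: round to the nearest integer pixel count.
    return int(height * scale + 0.5), int(width * scale + 0.5)


def scale_keypoint(x, y, old_size, new_size):
    """Map an (x, y) pixel coordinate from old_size to new_size, both (H, W)."""
    old_h, old_w = old_size
    new_h, new_w = new_size
    return x * new_w / old_w, y * new_h / old_h


new_size = long_side_size(25, 30, target=1024)
# Amount of padding needed to reach a square target: (bottom, right).
padding = (1024 - new_size[0], 1024 - new_size[1])
point = scale_keypoint(15, 10, (25, 30), new_size)
```

For a 25 x 30 image, the width becomes the long side, so all of the padding ends up at the bottom.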

Parameters:
  • config (Optional[SamConfig], optional) – A model config to generate the model. Now just the SAM model is supported. Default: None

  • device (Optional[device], optional) – The desired device to use the model. Default: None

  • dtype (Optional[dtype], optional) – The desired dtype to use the model. Default: None

Example

>>> # prompter = VisualPrompter() # Will load the vit h for default
>>> # You can load a custom SAM type for modifying the config
>>> prompter = VisualPrompter(SamConfig('vit_b'))
>>> image = torch.rand(3, 25, 30)
>>> prompter.set_image(image)
>>> boxes = Boxes(
...    torch.tensor(
...         [[[[0, 0], [0, 10], [10, 0], [10, 10]]]],
...         device=prompter.device,
...         dtype=torch.float32
...    ),
...    mode='xyxy'
... )
>>> prediction = prompter.predict(boxes=boxes)
>>> prediction.logits.shape
torch.Size([1, 3, 256, 256])
compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options=None, disable=False)[source]

Apply the torch.compile(…)/Dynamo API to the VisualPrompter.

Note

For more information about the dynamo API check the official docs https://pytorch.org/docs/stable/generated/torch.compile.html

Parameters:
  • fullgraph (bool, optional) – If True, require the whole model to be captured as a single graph; if False, it may be broken into several subgraphs. Default: False

  • dynamic (bool, optional) – Use dynamic shape tracing Default: False

  • backend (str, optional) – backend to be used Default: "inductor"

  • mode (Optional[str], optional) – Can be either “default”, “reduce-overhead” or “max-autotune” Default: None

  • options (Optional[dict[Any, Any]], optional) – A dictionary of options to pass to the backend. Default: None

  • disable (bool, optional) – Turn torch.compile() into a no-op for testing Default: False

Return type:

None

Example

>>> # prompter = VisualPrompter()
>>> # prompter.compile() # You should have torch >= 2.0.0 installed
>>> # Use the prompter methods ...
predict(keypoints=None, keypoints_labels=None, boxes=None, masks=None, multimask_output=True, output_original_size=True)[source]

Predict masks for the given image based on the input prompts.

Parameters:
  • keypoints (Union[Keypoints, Tensor, None], optional) – Point prompts to the model. Each point is in (X,Y) in pixels. Shape \((K, N, 2)\). Where N is the number of points and K the number of prompts. Default: None

  • keypoints_labels (Optional[Tensor], optional) – Labels for the point prompts. 1 indicates a foreground point and 0 indicates a background point. Shape \((K, N)\). Where N is the number of points, and K the number of prompts. Default: None

  • boxes (Union[Boxes, Tensor, None], optional) – A box prompt to the model. If a torch.Tensor, it should be in xyxy mode. Shape \((K, 4)\). Default: None

  • masks (Optional[Tensor], optional) – A low resolution mask input to the model, typically coming from a previous prediction iteration. Has shape \((K, 1, H, W)\), where for SAM, H=W=256. Default: None

  • multimask_output (bool, optional) – If true, the model will return three masks. For ambiguous input prompts (such as a single click), this will often produce better masks than a single prediction. If only a single mask is needed, the model’s predicted quality score can be used to select the best mask. For non-ambiguous prompts, such as multiple input prompts, multimask_output=False can give better results. Default: True

  • output_original_size (bool, optional) – If true, the logits of SegmentationResults will be post-processed to match the original input image size. Default: True

Return type:

SegmentationResults

Returns:

A prediction with the logits and scores (IoU of each predicted mask)
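When multimask_output=True and only one mask per prompt is needed, the scores can be used to pick it, as noted in the parameter description above. A minimal plain-Python sketch, using made-up scores and a hypothetical helper name:

```python
def best_mask_index(scores):
    """Return, for each prompt, the index of the highest-scoring mask."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Hypothetical scores with shape (K=2 prompts, C=3 masks).
scores = [[0.71, 0.93, 0.88], [0.65, 0.40, 0.52]]
best = best_mask_index(scores)
```

The resulting indices can then be used to select one mask per prompt along the \(C\) dimension of the logits.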

preprocess_image(x, mean=None, std=None)[source]

Normalize and pad a torch.Tensor.

Normalization uses the mean and std passed as arguments; if they are None, the default SAM dataset values are used.

Padding is applied on the right and bottom of the torch.Tensor to match the size of self.model.image_encoder.img_size.

Parameters:
  • x (Tensor) – The image to be preprocessed

  • mean (Optional[Tensor], optional) – Mean for each channel. Default: None

  • std (Optional[Tensor], optional) – Standard deviations for each channel. Default: None

Return type:

Tensor

Returns:

The preprocessed image (normalized if mean and std are available, and padded to the encoder size).
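The two steps, per-channel normalization followed by zero padding on the right and bottom, can be sketched on nested lists as follows. This is an illustration only: the helper name is hypothetical and the mean, std, and size values are made up rather than the SAM defaults.

```python
def preprocess(image, mean, std, size):
    """Normalize a (C, H, W) nested list per channel, then zero-pad to size x size."""
    out = []
    for channel, m, s in zip(image, mean, std):
        rows = [[(v - m) / s for v in row] for row in channel]
        # Pad on the right of each row, then add zero rows at the bottom.
        rows = [row + [0.0] * (size - len(row)) for row in rows]
        rows += [[0.0] * size for _ in range(size - len(rows))]
        out.append(rows)
    return out


image = [[[0.5, 1.0]]]  # one channel, H=1, W=2
padded = preprocess(image, mean=[0.5], std=[0.5], size=3)
```

The original pixels stay in the top-left corner, so prompt coordinates remain valid after padding.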

preprocess_prompts(keypoints=None, keypoints_labels=None, boxes=None, masks=None)[source]

Validate and preprocess the given prompts to be aligned with the input image.

Return type:

Prompts

reset_image()[source]

Clear cached image state and prompt-transform metadata.

This method invalidates previously computed image embeddings and resets all size/transform bookkeeping so a new call to set_image() starts from a clean state.

In practice, this resets:

  • transformed-image parameters,

  • original/input/encoder spatial sizes,

  • cached image embeddings,

  • the is_image_set status flag.

Return type:

None

set_image(image, mean=None, std=None)[source]

Compute and store the embeddings of the given image using the image_encoder of the model.

Prepare the given image with the selected transforms and the preprocess method.

Parameters:
  • image (Tensor) – RGB image, normally with values in the range [0, 1]. The preprocessing step normalizes the pixel values with the mean and std defined at initialization. Expected to be of float32 dtype. Shape \((3, H, W)\).

  • mean (Optional[Tensor], optional) – mean value of dataset for normalization. Default: None

  • std (Optional[Tensor], optional) – standard deviation of dataset for normalization. Default: None

Return type:

None

Edge Detection

class kornia.contrib.EdgeDetector(model, pre_processor, post_processor, name=None)[source]

EdgeDetector is a module that wraps an edge detection model.

This is a high-level API that wraps edge detection models like kornia.models.DexiNed.

Parameters:
  • model (Module) – The edge detection model.

  • pre_processor (Module) – Pre-processing module (e.g., ResizePreProcessor).

  • post_processor (Module) – Post-processing module (e.g., ResizePostProcessor).

  • name (Optional[str], optional) – Optional name for the detector. Default: None

Example

>>> from kornia.models.dexined import DexiNed
>>> from kornia.models.processors import ResizePreProcessor, ResizePostProcessor
>>> model = DexiNed(pretrained=True)
>>> detector = EdgeDetector(model, ResizePreProcessor(352, 352), ResizePostProcessor())
>>> img = torch.rand(1, 3, 320, 320)
>>> out = detector(img)

Face Detection

class kornia.contrib.FaceDetector(top_k=5000, confidence_threshold=0.3, nms_threshold=0.3, keep_top_k=750)[source]

Detect faces in a given image using the YuNet model.

This is a high-level API that wraps the kornia.models.YuNet model for face detection. By default, it uses the method described in [FYP+21].

Parameters:
  • top_k (int, optional) – the maximum number of detections to return before the nms. Default: 5000

  • confidence_threshold (float, optional) – the threshold used to discard detections. Default: 0.3

  • nms_threshold (float, optional) – the threshold used by the nms for iou. Default: 0.3

  • keep_top_k (int, optional) – the maximum number of detections to return after the nms. Default: 750

Returns:

A list of B tensors with shape \((N,15)\) to be used with kornia.contrib.FaceDetectorResult.
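The four parameters above interact in a fixed order: low-confidence detections are discarded, at most top_k candidates go into non-maximum suppression (NMS), and at most keep_top_k survive. The plain-Python sketch below shows one way such a pipeline can look. It is not kornia's implementation; the function names, the exact ordering of the steps, and the comparison operators are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, top_k=5000, confidence_threshold=0.3,
                      nms_threshold=0.3, keep_top_k=750):
    """dets: list of (box, score). Returns the detections that survive NMS."""
    dets = [d for d in dets if d[1] > confidence_threshold]
    dets = sorted(dets, key=lambda d: d[1], reverse=True)[:top_k]
    kept = []
    for box, score in dets:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(box, k[0]) <= nms_threshold for k in kept):
            kept.append((box, score))
    return kept[:keep_top_k]


dets = [((0, 0, 10, 10), 0.9), ((1, 1, 10, 10), 0.8),
        ((20, 20, 30, 30), 0.7), ((0, 0, 5, 5), 0.1)]
kept = filter_detections(dets)
```

In this toy input, the second box overlaps the first with an IoU of 0.81 and is suppressed, while the last box is dropped by the confidence threshold.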

Example

>>> img = torch.rand(1, 3, 320, 320)
>>> detect = FaceDetector()
>>> res = detect(img)
class kornia.contrib.FaceKeypoint(value)[source]

Define the keypoints detected in a face.

The left/right convention is based on the screen viewer.

EYE_LEFT = 0
EYE_RIGHT = 1
MOUTH_LEFT = 3
MOUTH_RIGHT = 4
NOSE = 2
class kornia.contrib.FaceDetectorResult(data)[source]

Encapsulate the results obtained by the kornia.contrib.FaceDetector.

Parameters:

data (Tensor) – the encoded results coming from the face detector with shape \((15,)\).

property bottom_left: Tensor

The [x y] position of the bottom-left coordinate of the bounding box.

property bottom_right: Tensor

The [x y] position of the bottom-right coordinate of the bounding box.

get_keypoint(keypoint)[source]

Get the [x y] position of a given facial keypoint.

Parameters:

keypoint (FaceKeypoint) – the keypoint type to return the position.

Return type:

Tensor

property height: Tensor

The bounding box height.

property score: Tensor

The detection score.

to(device=None, dtype=None)[source]

Like torch.nn.Module.to() method.

Return type:

FaceDetectorResult

property top_left: Tensor

The [x y] position of the top-left coordinate of the bounding box.

property top_right: Tensor

The [x y] position of the top-right coordinate of the bounding box.

property width: Tensor

The bounding box width.

property xmax: Tensor

The bounding box bottom-right x-coordinate.

property xmin: Tensor

The bounding box top-left x-coordinate.

property ymax: Tensor

The bounding box bottom-right y-coordinate.

property ymin: Tensor

The bounding box top-left y-coordinate.
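The corner and size properties above all follow from the four values xmin, ymin, xmax, and ymax. The sketch below spells out that relationship with a hypothetical stand-in class (Box is not part of kornia), assuming image coordinates where y grows downwards.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for the bounding-box part of a detection."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def width(self):
        return self.xmax - self.xmin

    @property
    def height(self):
        return self.ymax - self.ymin

    # Image coordinates: y grows downwards, so "top" uses ymin.
    @property
    def top_left(self):
        return (self.xmin, self.ymin)

    @property
    def top_right(self):
        return (self.xmax, self.ymin)

    @property
    def bottom_left(self):
        return (self.xmin, self.ymax)

    @property
    def bottom_right(self):
        return (self.xmax, self.ymax)


box = Box(xmin=10.0, ymin=20.0, xmax=50.0, ymax=80.0)
```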

Interactive Demo

Visit the Kornia face detection demo on Hugging Face Spaces.

Object Detection

class kornia.contrib.object_detection.BoundingBoxDataFormat(value)[source]

Enum class that maps bounding box data format.

XYWH = 0
XYXY = 1
CXCYWH = 2
CENTER_XYWH = 2
class kornia.contrib.object_detection.BoundingBox(data, data_format)[source]

Bounding box data class.

Useful for representing bounding boxes in different formats for object detection.

data: tuple[float, float, float, float]
data_format: BoundingBoxDataFormat
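The three formats describe the same box in different ways. As a rough sketch, the conversion to corner form could look like the following, assuming XYWH means top-left corner plus width and height and CXCYWH means box centre plus width and height. The function name and the string format tags are hypothetical.

```python
def to_xyxy(data, data_format):
    """Convert a 4-tuple in "xywh", "xyxy" or "cxcywh" format to (x1, y1, x2, y2)."""
    a, b, c, d = data
    if data_format == "xyxy":
        return (a, b, c, d)
    if data_format == "xywh":  # top-left corner plus width and height
        return (a, b, a + c, b + d)
    if data_format == "cxcywh":  # box centre plus width and height
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"Unknown format: {data_format}")


corners = to_xyxy((10.0, 20.0, 30.0, 40.0), "xywh")
from_center = to_xyxy((25.0, 40.0, 30.0, 40.0), "cxcywh")
```

Both inputs describe the same 30 x 40 box, so they convert to the same corners.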
class kornia.contrib.object_detection.ObjectDetectorResult(class_id, confidence, bbox)[source]

Object detection result.

Parameters:
  • class_id (int) – class id of the detected object.

  • confidence (float) – confidence score of the detected object.

  • bbox (BoundingBox) – bounding box of the detected object in xywh format.

bbox: BoundingBox
class_id: int
confidence: float
class kornia.contrib.object_detection.ObjectDetector(model, pre_processor, post_processor)[source]

Wrap an object detection model and perform pre-processing and post-processing.

__init__(model, pre_processor, post_processor)[source]

Initialize ObjectDetector.

Parameters:
  • model (Module) – The object detection model.

  • pre_processor (Module) – Pre-processing module (e.g., ResizePreProcessor).

  • post_processor (Module) – Post-processing module (e.g., DETRPostProcessor).

compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options=None, disable=False)[source]

Compile the internal object detection model with torch.compile().

Return type:

None

forward(images)[source]

Detect objects in a given list of images.

Parameters:

images (Union[Tensor, list[Tensor]]) – Either a list of RGB images, each a torch.Tensor with shape \((3, H, W)\), or a single batched torch.Tensor with shape \((B, 3, H, W)\).

Return type:

Union[Tensor, list[Tensor]]

Returns:

list of detections found in each image. For each item in the batch, the shape is \((D, 6)\), where \(D\) is the number of detections in the given image, and the \(6\) values represent class id, score, and xywh bounding box.

static from_config(config)[source]

Build ObjectDetector from config.

This is a placeholder to satisfy the abstract method requirement. Use kornia.contrib.object_detection.RTDETRDetectorBuilder.build() or instantiate ObjectDetector directly.

Parameters:

config (Any) – Configuration object (not used, kept for interface compatibility).

Return type:

ObjectDetector

Returns:

ObjectDetector instance.

name: str = 'detection'
save(images, detections=None, directory=None)[source]

Save the output image(s) to a directory.

Parameters:
  • images (Union[Tensor, list[Tensor]]) – input torch.Tensor.

  • detections (Optional[Tensor], optional) – detection torch.Tensor. Default: None

  • directory (Optional[str], optional) – directory to save the images. Default: None

Return type:

None

to_onnx(onnx_name=None, image_size=640, include_pre_and_post_processor=True, save=True, additional_metadata=None, **kwargs)[source]

Export an RT-DETR object detection model to ONNX format.

Either model_name or config must be provided. If neither is provided, a default pretrained model (rtdetr_r18vd) will be built.

Parameters:
  • onnx_name (Optional[str], optional) – The name of the output ONNX file. If not provided, a default name in the format “Kornia-<ClassName>.onnx” will be used. Default: None

  • image_size (Optional[int], optional) – The size to which input images will be resized during preprocessing. If None, image_size will be dynamic. For RTDETR, recommended scales include [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]. Default: 640

  • include_pre_and_post_processor (bool, optional) – Whether to include the pre-processor and post-processor in the exported model. Default: True

  • save (bool, optional) – Whether to save the exported ONNX model to disk. Default: True

  • additional_metadata (Optional[list[tuple[str, str]]], optional) – Additional metadata to add to the ONNX model. Default: None

  • kwargs (Any) – Additional arguments to convert to onnx.

Return type:

ModelProto

visualize(images, detections=None, output_type='torch')[source]

Draw the detections on the images.

The drawing is currently very basic.

Return type:

Union[Tensor, list[Tensor], list[Image]]

class kornia.contrib.object_detection.ResizePreProcessor(height, width, interpolation_mode='bilinear')[source]

Resize a list of image tensors to the given size.

Additionally, also returns the original image sizes for further post-processing.

Parameters:
  • height (int) – Height of the resized image.

  • width (int) – Width of the resized image.

  • interpolation_mode (str, optional) – Interpolation mode for image resizing. Supported values: nearest, bilinear, bicubic, area, nearest-exact. Default: “bilinear”.

Example

>>> import torch
>>> from kornia.models.processors import ResizePreProcessor
>>> processor = ResizePreProcessor(height=224, width=224)
>>> imgs = torch.randn(2, 3, 480, 640)
>>> resized, sizes = processor(imgs)
>>> print(resized.shape, sizes.shape)
torch.Size([2, 3, 224, 224]) torch.Size([2, 2])
forward(imgs)[source]

Resize input images to the target size.

Parameters:

imgs (Union[Tensor, List[Tensor]]) – Input images, either a tensor of shape \((B, C, H, W)\) or a list of tensors of shape \((C, H, W)\).

Returns:

A tuple containing:

  • resized_imgs: Resized images as a tensor of shape \((B, C, H_{\text{new}}, W_{\text{new}})\).

  • original_sizes: Original image sizes of shape \((B, 2)\) containing (height, width).

Return type:

Tuple[Tensor, Tensor]

kornia.contrib.object_detection.results_from_detections(detections, format)[source]

Convert a detection torch.Tensor to a list of ObjectDetectorResult.

Parameters:
  • detections (Tensor) – torch.Tensor with shape \((D, 6)\), where \(D\) is the number of detections in the given image, \(6\) represents class id, score, and xywh bounding box.

  • format (str | BoundingBoxDataFormat) – detection format.

Return type:

list[ObjectDetectorResult]

Returns:

list of ObjectDetectorResult.
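To illustrate the \((D, 6)\) row layout described above, the sketch below unpacks plain-Python rows of (class_id, score, x, y, w, h) into small result objects. Detection and parse_detections are hypothetical stand-ins, not kornia classes or functions, and the input rows are made up.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical plain-Python counterpart of ObjectDetectorResult."""
    class_id: int
    confidence: float
    bbox: tuple  # (x, y, w, h)


def parse_detections(rows):
    """Convert rows of (class_id, score, x, y, w, h) into Detection objects."""
    return [
        Detection(class_id=int(r[0]), confidence=float(r[1]), bbox=tuple(r[2:6]))
        for r in rows
    ]


rows = [[17.0, 0.92, 10.0, 20.0, 30.0, 40.0], [0.0, 0.55, 5.0, 5.0, 8.0, 12.0]]
results = parse_detections(rows)
```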

Real-Time Detection Transformer (RT-DETR)

class kornia.models.rt_detr.RTDETRModelType(value)[source]

Enum class that maps RT-DETR model type.

resnet18d = 0
resnet34d = 1
resnet50d = 2
resnet101d = 3
hgnetv2_l = 4
hgnetv2_x = 5
resnet50d_m = 6
class kornia.models.rt_detr.RTDETRConfig(model_type, num_classes, input_size=640, checkpoint=None, neck_hidden_dim=None, neck_dim_feedforward=None, neck_expansion=None, head_hidden_dim=256, head_num_queries=300, head_num_decoder_layers=None, confidence_threshold=0.3)[source]

Configuration to construct RT-DETR model.

checkpoint: str | None = None
confidence_threshold: float = 0.3
static from_name(model_name, num_classes=80)[source]

Load model without pretrained weights.

Parameters:
  • model_name (str) – ‘rtdetr_r18vd’, ‘rtdetr_r34vd’, ‘rtdetr_r50vd_m’, ‘rtdetr_r50vd’, ‘rtdetr_r101vd’.

  • num_classes (int, optional) – Number of classes to detect. Default: 80

Return type:

RTDETRConfig

head_hidden_dim: int = 256
head_num_decoder_layers: int | None = None
head_num_queries: int = 300
input_size: int = 640
model_type: RTDETRModelType | str | int
neck_dim_feedforward: int | None = None
neck_expansion: float | None = None
neck_hidden_dim: int | None = None
num_classes: int
class kornia.models.rt_detr.RTDETR(backbone, encoder, decoder)[source]

RT-DETR Object Detection model, as described in https://arxiv.org/abs/2304.08069.

__init__(backbone, encoder, decoder)[source]

Construct RT-DETR Object Detection model.

Parameters:
  • backbone (ResNetD | PPHGNetV2) – backbone network for feature extraction.

  • encoder (HybridEncoder) – neck network for feature fusion.

  • decoder (RTDETRHead) – head network to decode features into detection results.

forward(images)[source]

Detect objects in an image.

Parameters:

images (Tensor) – images to be detected. Shape \((N, C, H, W)\).

Return type:

tuple[Tensor, Tensor]

Returns:

  • logits - Tensor of shape \((N, Q, K)\), where \(Q\) is the number of queries, \(K\) is the number of classes.

  • boxes - Tensor of shape \((N, Q, 4)\), where \(Q\) is the number of queries.

static from_config(config)[source]

Construct RT-DETR Object Detection model from a config object.

Parameters:

config (RTDETRConfig) – configuration object for RT-DETR.

Return type:

RTDETR

Note

For config.neck_hidden_dim, config.neck_dim_feedforward, config.neck_expansion, and config.head_num_decoder_layers, if they are None, their values will be replaced with the default values depending on the config.model_type. See the source code for the default values.

load_checkpoint(checkpoint, device=None)[source]

Load checkpoint from a given url or file.

Parameters:
  • checkpoint (str) – The url or filepath for the respective checkpoint

  • device (Optional[device], optional) – The device on which to load the weights and to which the model is moved. Default: None

Return type:

None

class kornia.models.rt_detr.DETRPostProcessor(confidence_threshold=None, num_classes=80, num_top_queries=300, confidence_filtering=True, filter_as_zero=False)[source]

Convert raw DETR model outputs into final bounding box detections.

This module applies the softmax function to scores and transforms normalized bounding box coordinates into the pixel coordinate system of the input image.

Parameters:
  • num_classes (int, optional) – The number of object classes. Default: 80

  • confidence_threshold (Optional[float], optional) – The threshold to filter out low-confidence detections. Default: None

  • num_top_queries (int, optional) – The number of top queries to consider for each image. Default: 300

  • confidence_filtering (bool, optional) – Whether to apply confidence-based filtering. Default: True

  • filter_as_zero (bool, optional) – If True, boxes below the confidence threshold are set to zero instead of being removed. Default: False

forward(logits, boxes, original_sizes)[source]

Post-process outputs from DETR.

Parameters:
  • logits (Tensor) – tensor with shape \((N, Q, K)\), where \(N\) is the batch size, \(Q\) is the number of queries, \(K\) is the number of classes.

  • boxes (Tensor) – tensor with shape \((N, Q, 4)\), where \(N\) is the batch size, \(Q\) is the number of queries.

  • original_sizes (Tensor) – tensor with shape \((N, 2)\), where \(N\) is the batch size and each element represents the image size of (img_height, img_width).

Return type:

Union[Tensor, list[Tensor]]

Returns:

Processed detections. For each image, the detections have shape (D, 6), where D is the number of detections in that image and the 6 values are (class_id, confidence_score, x, y, w, h).
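The coordinate transformation and the two filtering modes can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's implementation: it assumes the raw boxes are normalized (cx, cy, w, h), takes scores and class ids as already computed, uses a greater-or-equal comparison, and the function name is hypothetical.

```python
def postprocess(boxes, scores, class_ids, image_size,
                confidence_threshold=0.3, filter_as_zero=False):
    """Scale normalized (cx, cy, w, h) boxes to pixels and filter by score.

    Returns rows of (class_id, score, x, y, w, h) with (x, y) the top-left corner.
    """
    img_h, img_w = image_size
    rows = []
    for (cx, cy, w, h), score, cls in zip(boxes, scores, class_ids):
        row = [cls, score, (cx - w / 2) * img_w, (cy - h / 2) * img_h,
               w * img_w, h * img_h]
        if score >= confidence_threshold:
            rows.append(row)
        elif filter_as_zero:
            # Keep the slot but zero it, so the output length stays fixed.
            rows.append([0.0] * 6)
    return rows


boxes = [(0.5, 0.5, 0.5, 0.5), (0.25, 0.25, 0.5, 0.5)]
kept = postprocess(boxes, scores=[0.9, 0.1], class_ids=[3, 7], image_size=(100, 200))
zeroed = postprocess(boxes, scores=[0.9, 0.1], class_ids=[3, 7],
                     image_size=(100, 200), filter_as_zero=True)
```

Zeroing instead of removing keeps the same number of rows for every image, which is what makes a single batched tensor output possible.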

Image Segmentation

kornia.contrib.connected_components(image, num_iterations=100)[source]

Compute the Connected-component labelling (CCL) algorithm.

https://github.com/kornia/data/raw/main/cells_segmented.png

The implementation is an adaptation of the following repository:

https://gist.github.com/efirdc/5d8bd66859e574c683a504a4690ae8bc

Warning

This is an experimental API subject to changes and optimization improvements.

Note

See a working example here.

Parameters:
  • image (Tensor) – the binarized input image with shape \((*, 1, H, W)\). The image must be in floating point with range [0, 1].

  • num_iterations (int, optional) – the number of iterations to run so that the algorithm converges. Default: 100

Return type:

Tensor

Returns:

The labels image, with the same shape as the input image.

Example

>>> img = torch.rand(2, 1, 4, 5)
>>> img_labels = connected_components(img, num_iterations=100)
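To make the role of num_iterations concrete, here is a minimal pure-Python sketch of iterative label propagation. It is an illustration under stated assumptions, not the library code: the helper name label_components is hypothetical, 8-connectivity (a 3x3 neighbourhood) is assumed, and nested lists stand in for tensors.

```python
def label_components(mask, num_iterations=100):
    """Label foreground pixels by repeatedly taking the max over a 3x3 window."""
    h, w = len(mask), len(mask[0])
    # Every foreground pixel starts with a unique positive label; background is 0.
    labels = [[(r * w + c + 1) if mask[r][c] else 0 for c in range(w)] for r in range(h)]
    for _ in range(num_iterations):
        new = [row[:] for row in labels]
        for r in range(h):
            for c in range(w):
                if not mask[r][c]:
                    continue  # background never receives a label
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            new[r][c] = max(new[r][c], labels[rr][cc])
        if new == labels:
            break  # converged: no label changed in this pass
        labels = new
    return labels


mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
labels = label_components(mask)
```

A label travels at most one pixel per pass, so long or winding components need more iterations than compact ones.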

Segment Anything (SAM)

class kornia.models.sam.SamModelType(value)[source]

Map the SAM model types.

vit_h = 0
vit_l = 1
vit_b = 2
mobile_sam = 3
class kornia.models.sam.SamConfig(model_type=None, checkpoint=None, pretrained=False, encoder_embed_dim=None, encoder_depth=None, encoder_num_heads=None, encoder_global_attn_indexes=None)[source]

Encapsulate the Config to build a SAM model.

Parameters:
  • model_type (Union[str, int, SamModelType, None], optional) –

    the SAM model type. Default: None. The available models are:

    • 0, ‘vit_h’ or kornia.contrib.sam.SamModelType.vit_h()

    • 1, ‘vit_l’ or kornia.contrib.sam.SamModelType.vit_l()

    • 2, ‘vit_b’ or kornia.contrib.sam.SamModelType.vit_b()

    • 3, ‘mobile_sam’, or kornia.contrib.sam.SamModelType.mobile_sam()

  • checkpoint (Optional[str], optional) – URL or path to a file with the weights of the model. Default: None

  • encoder_embed_dim (Optional[int], optional) – Patch embedding dimension. Default: None

  • encoder_depth (Optional[int], optional) – Depth of ViT. Default: None

  • encoder_num_heads (Optional[int], optional) – Number of attention heads in each ViT block. Default: None

  • encoder_global_attn_indexes (Optional[tuple[int, ...]], optional) – Encoder indexes for blocks using global attention. Default: None

checkpoint: str | None = None
encoder_depth: int | None = None
encoder_embed_dim: int | None = None
encoder_global_attn_indexes: tuple[int, ...] | None = None
encoder_num_heads: int | None = None
model_type: str | int | SamModelType | None = None
pretrained: bool = False
class kornia.models.sam.Sam(image_encoder, prompt_encoder, mask_decoder)[source]

Implement the Segment Anything Model (SAM) wrapper.

This class coordinates the image encoder, prompt encoder, and mask decoder.

__init__(image_encoder, prompt_encoder, mask_decoder)[source]

SAM predicts object masks from an image and input prompts.

Parameters:
  • image_encoder (ImageEncoderViT | TinyViT) – The backbone used to encode the image into image embeddings that allow for efficient mask prediction.

  • prompt_encoder (PromptEncoder) – Encodes various types of input prompts.

  • mask_decoder (MaskDecoder) – Predicts masks from the image embeddings and encoded prompts.

forward(images, batched_prompts, multimask_output)[source]

Predicts masks end-to-end from provided images and prompts.

This method expects that the images have already been pre-processed: at least normalized, resized, and padded to be compatible with self.image_encoder.

Note

For each image \((3, H, W)\), it is possible to input a batch (\(K\)) of \(N\) prompts; the results are batched by the prompt batch size. So given prompts with \(K=5\) and \(N=10\), the results will have shape \(5 \times C \times H \times W\), where \(C\) is determined by multimask_output. Within each of these \(5 \times C\) masks, it should be possible to find \(N\) instances if the model succeeds.

Parameters:
  • images (Tensor) – The image as a torch tensor in \((B, 3, H, W)\) format, already transformed for input to the model.

  • batched_prompts (list[dict[str, Any]]) –

    A list over the batch of images (list length should be \(B\)), each a dictionary with the following keys. If a prompt type is not used, its key should not be included in the dictionary. The options are:

    • “points”: tuple of (torch.Tensor, torch.Tensor) with the keypoint coordinates and their respective labels. The tuple should look like (keypoints, labels), where keypoints is a tensor of batched point prompts for this image, with shape \((K, N, 2)\), already transformed to the input frame of the model, and labels is a tensor of batched labels for the point prompts, with shape \((K, N)\), where 1 indicates a foreground point and 0 indicates a background point.

    • “boxes”: (torch.Tensor) Batched box inputs, with shape \((K, 4)\). Already transformed to the input frame of the model.

    • “mask_inputs”: (torch.Tensor) Batched mask inputs to the model, in the form \((K, 1, H, W)\).

  • multimask_output (bool) – Whether the model should predict multiple disambiguating masks, or return a single mask.

Returns:

A list over input images, where each element is a SegmentationResults with the following:

  • logits: Low resolution logits with shape \((K, C, H, W)\). Can be passed as mask input to subsequent iterations of prediction. Here \(K\) is the number of input prompts, \(C\) is determined by multimask_output, and \(H=W=256\) is the model output size.

  • scores: The model’s predictions of mask quality (iou prediction), in shape BxC.

static from_config(config)[source]

Build/load the SAM model based on its config.

Parameters:

config (SamConfig) – The SamConfig data structure. If the model_type is available, build from it, otherwise will use the parameters set.

Return type:

Sam

Returns:

The respective SAM model

Example

>>> from kornia.models.sam import SamConfig
>>> sam_model = Sam.from_config(SamConfig('vit_b'))
load_checkpoint(checkpoint, device=None)[source]

Load checkpoint from a given url or file.

Parameters:
  • checkpoint (str) – The url or filepath for the respective checkpoint

  • device (Optional[device], optional) – The desired device on which to load the weights and to which the model is moved. Default: None

Return type:

None

Image Patches

kornia.contrib.compute_padding(original_size, window_size, stride=None)[source]

Compute required padding to ensure chaining of extract_tensor_patches() and combine_tensor_patches() produces expected result.

Parameters:
  • original_size (Union[int, Tuple[int, int]]) – the size of the original torch.Tensor.

  • window_size (Union[int, Tuple[int, int]]) – the size of the sliding window used while extracting patches.

  • stride (Union[int, Tuple[int, int], None], optional) – The stride of the sliding window. Optional: if not specified, window_size will be used. Default: None

Returns:

The required padding as a tuple of four ints: (top, bottom, left, right).

Example

>>> image = torch.arange(12).view(1, 1, 4, 3)
>>> padding = compute_padding((4,3), (3,3))
>>> out = extract_tensor_patches(image, window_size=(3, 3), stride=(3, 3), padding=padding)
>>> combine_tensor_patches(out, original_size=(4, 3), window_size=(3, 3), stride=(3, 3), unpadding=padding)
tensor([[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8],
          [ 9, 10, 11]]]])

Note

This function will be implicitly used in extract_tensor_patches() and combine_tensor_patches() if allow_auto_(un)padding is set to True.
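The arithmetic behind such a padding can be sketched as follows. This is a hedged illustration rather than the library's code: the name sketch_padding is hypothetical, it accepts only 2-tuples, and the way an odd amount of padding is split between the two sides is an assumption.

```python
def sketch_padding(original_size, window_size, stride=None):
    """Return (top, bottom, left, right) so that windows tile the padded size exactly."""
    stride = window_size if stride is None else stride
    pads = []
    for size, window, step in zip(original_size, window_size, stride):
        # Pixels left over after the last window that fits completely.
        remainder = (size - window) % step
        total = 0 if remainder == 0 else step - remainder
        # Assumption: any odd pixel goes to the bottom/right side.
        pads += [total // 2, total - total // 2]
    return tuple(pads)


padding = sketch_padding((4, 3), (3, 3))
```

For the 4x3 input of the example above, the height grows from 4 to 6 so that two 3x3 windows fit, while the width needs no padding.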

kornia.contrib.extract_tensor_patches(input, window_size, stride=1, padding=0, allow_auto_padding=False)[source]

Extract patches from tensors and stack them.

See ExtractTensorPatches for details.

Parameters:
  • input (Tensor) – torch.Tensor image where to extract the patches with shape \((B, C, H, W)\).

  • window_size (Union[int, Tuple[int, int]]) – the size of the sliding window and the output patch size.

  • stride (Union[int, Tuple[int, int]], optional) – stride of the sliding window. Default: 1

  • padding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – Zero-padding added to both sides of the input. Default: 0

  • allow_auto_padding (bool, optional) – whether to allow automatic padding if the window and stride do not fit into the image. Default: False

Return type:

Tensor

Returns:

the torch.Tensor with the extracted patches with shape \((B, N, C, H_{out}, W_{out})\).

Examples

>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
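The sliding-window order can be reproduced on a single-channel image with plain nested lists. This is a minimal sketch, assuming no padding and a left-to-right, top-to-bottom traversal; the helper name extract_patches is hypothetical and is not part of kornia.

```python
def extract_patches(image, window, stride):
    """Collect window-sized crops in left-right, top-bottom order."""
    h, w = len(image), len(image[0])
    win_h, win_w = window
    step_h, step_w = stride
    patches = []
    for top in range(0, h - win_h + 1, step_h):
        for left in range(0, w - win_w + 1, step_w):
            patches.append([row[left:left + win_w] for row in image[top:top + win_h]])
    return patches


image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
patches = extract_patches(image, window=(2, 3), stride=(1, 1))
```

With a (2, 3) window and stride 1 on a 3x3 image, the window fits in two vertical positions and one horizontal position, so two patches are produced and the last one covers the bottom two rows, matching the example above.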
kornia.contrib.combine_tensor_patches(patches, original_size, window_size, stride, allow_auto_unpadding=False, unpadding=0, eps=1e-8)[source]

Restore input from patches.

See CombineTensorPatches for details.

Parameters:
  • patches (Tensor) – patched torch.Tensor with shape \((B, N, C, H_{out}, W_{out})\).

  • original_size (Union[int, Tuple[int, int]]) – the size of the original torch.Tensor and the output size.

  • window_size (Union[int, Tuple[int, int]]) – the size of the sliding window used while extracting patches.

  • stride (Union[int, Tuple[int, int]]) – stride of the sliding window.

  • unpadding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – remove the padding added to both sides of the input. Default: 0

  • allow_auto_unpadding (bool, optional) – whether to allow automatic unpadding of the input if the window and stride do not fit into the original_size. Default: False

  • eps (float, optional) – small value used to prevent division by zero. Default: 1e-8

Return type:

Tensor

Returns:

The combined patches in an image torch.Tensor with shape \((B, C, H, W)\).

Example

>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])

Note

This function is supposed to be used in conjunction with extract_tensor_patches().
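The following pure-Python sketch shows one way overlapping patches can be merged back into an image. It assumes that overlapping pixels are averaged and that eps only guards pixels no patch covers; the helper name combine_patches is hypothetical and the library's actual normalization may differ.

```python
def combine_patches(patches, original_size, window, stride, eps=1e-8):
    """Average overlapping patches back into an image of original_size."""
    h, w = original_size
    win_h, win_w = window
    step_h, step_w = stride
    sums = [[0.0] * w for _ in range(h)]
    counts = [[0.0] * w for _ in range(h)]
    index = 0
    # Visit window positions in the same order used for extraction.
    for top in range(0, h - win_h + 1, step_h):
        for left in range(0, w - win_w + 1, step_w):
            patch = patches[index]
            index += 1
            for r in range(win_h):
                for c in range(win_w):
                    sums[top + r][left + c] += patch[r][c]
                    counts[top + r][left + c] += 1.0
    # eps prevents a division by zero where no patch contributed.
    return [[s / max(n, eps) for s, n in zip(s_row, n_row)] for s_row, n_row in zip(sums, counts)]


# The four 2x2 patches of a 3x3 image taken with stride 1.
patches = [
    [[0, 1], [3, 4]],
    [[1, 2], [4, 5]],
    [[3, 4], [6, 7]],
    [[4, 5], [7, 8]],
]
restored = combine_patches(patches, original_size=(3, 3), window=(2, 2), stride=(1, 1))
```

Because every overlapping patch carries the same pixel values, averaging recovers the original image exactly in this case.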

class kornia.contrib.ExtractTensorPatches(window_size, stride=1, padding=0, allow_auto_padding=False)[source]

nn.Module that extracts patches from tensors and stacks them.

In the simplest case, the output shape of the operator with input size \((B, C, H, W)\) is \((B, N, C, H_{out}, W_{out})\).

where
  • \(B\) is the batch size.

  • \(N\) denotes the total number of extracted patches, stacked in left-right and top-bottom order.

  • \(C\) denotes the number of input channels.

  • \(H\), \(W\) denote the height and width of the input in pixels.

  • \(H_{out}\), \(W_{out}\) denote the patch size defined in the function signature.

  • window_size is the size of the sliding window and controls the shape of the output torch.Tensor and defines the shape of the output patch.

  • stride controls the stride to apply to the sliding window and regulates the overlapping between the extracted patches.

  • padding controls the amount of implicit zero-padding on both sides of each dimension.

  • allow_auto_padding allows automatic calculation of the padding required to fit the window and stride into the image.

The parameters window_size, stride and padding can be either:

  • a single int – in which case the same value is used for the height and width dimension.

  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.

padding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.
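The three accepted forms can be normalized to a single four-value layout as sketched below. The helper name normalize_padding is hypothetical, and the (top, bottom, left, right) output order is assumed from the return value documented for compute_padding().

```python
def normalize_padding(padding):
    """Expand an int, 2-tuple or 4-tuple into (top, bottom, left, right)."""
    if isinstance(padding, int):
        return (padding, padding, padding, padding)
    if len(padding) == 2:
        pad_h, pad_w = padding  # one value per dimension, applied to both sides
        return (pad_h, pad_h, pad_w, pad_w)
    if len(padding) == 4:
        return tuple(padding)  # first two: height, last two: width
    raise ValueError(f"padding must be an int or a tuple of 2 or 4 ints, got {padding!r}")


uniform = normalize_padding(1)
per_dim = normalize_padding((1, 2))
per_side = normalize_padding((1, 0, 2, 3))
```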

Parameters:
  • input – torch.Tensor image from which to extract the patches, with shape \((B, C, H, W)\).

  • window_size (Union[int, Tuple[int, int]]) – the size of the sliding window and the output patch size.

  • stride (Union[int, Tuple[int, int]], optional) – stride of the sliding window. Default: 1

  • padding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – Zero-padding added to both sides of the input. Default: 0

  • allow_auto_padding – whether to allow automatic padding if the window and stride do not fit into the image.

Shape:
  • Input: \((B, C, H, W)\)

  • Output: \((B, N, C, H_{out}, W_{out})\)

Returns:

the torch.Tensor with the extracted patches.

Examples

>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
class kornia.contrib.CombineTensorPatches(original_size, window_size, stride=None, unpadding=0, allow_auto_unpadding=False)[source]

nn.Module that combines patches back into full tensors.

In the simplest case, the output shape of the operator with input size \((B, N, C, H_{out}, W_{out})\) is \((B, C, H, W)\).

where
  • \(B\) is the batch size.

  • \(N\) denotes the total number of extracted patches, stacked in left-right and top-bottom order.

  • \(C\) denotes the number of input channels.

  • \(H\), \(W\) denote the height and width of the input in pixels.

  • \(H_{out}\), \(W_{out}\) denote the patch size defined in the function signature.

  • original_size is the size of the original image prior to extracting torch.Tensor patches and defines the shape of the output tensor.

  • window_size is the size of the sliding window used while extracting torch.Tensor patches.

  • stride controls the stride to apply to the sliding window and regulates the overlapping between the extracted patches.

  • unpadding is the amount of padding to be removed. If specified, this value must be the same as padding used while extracting torch.Tensor patches.

  • allow_auto_unpadding allows automatic calculation of the padding required to fit the window and stride into the image. This must be used if the allow_auto_padding flag was used for extracting the patches.

The parameters original_size, window_size, stride, and unpadding can be either:

  • a single int – in which case the same value is used for the height and width dimension.

  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.

unpadding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.

Parameters:
  • patches – patched torch.Tensor with shape \((B, N, C, H_{out}, W_{out})\).

  • original_size (Tuple[int, int]) – the size of the original torch.Tensor and the output size.

  • window_size (Union[int, Tuple[int, int]]) – the size of the sliding window used while extracting patches.

  • stride (Union[int, Tuple[int, int], None], optional) – stride of the sliding window. Default: None

  • unpadding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – remove the padding added to both sides of the input. Default: 0

  • allow_auto_unpadding (bool, optional) – whether to allow automatic unpadding of the input if the window and stride do not fit into the original_size. Default: False

  • eps – small value used to prevent division by zero.

Shape:
  • Input: \((B, N, C, H_{out}, W_{out})\)

  • Output: \((B, C, H, W)\)

Example

>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])

Note

This module is intended to be used in conjunction with ExtractTensorPatches.

Image Classification

class kornia.models.vit.VisionTransformer(image_size=224, patch_size=16, in_channels=3, embed_dim=768, depth=12, num_heads=12, dropout_rate=0.0, dropout_attn=0.0, backbone=None)[source]

Vision transformer (ViT) module.

The module is expected to be used as an operator for different vision tasks.

The method is inspired by existing implementations of the paper [DBK+21].

Warning

This is an experimental API subject to changes in favor of flexibility.

Parameters:
  • image_size (int, optional) – the size of the input image. Default: 224

  • patch_size (int, optional) – the size of the patch to compute the embedding. Default: 16

  • in_channels (int, optional) – the number of channels for the input. Default: 3

  • embed_dim (int, optional) – the embedding dimension inside the transformer encoder. Default: 768

  • depth (int, optional) – the depth of the transformer. Default: 12

  • num_heads (int, optional) – the number of attention heads. Default: 12

  • dropout_rate (float, optional) – dropout rate. Default: 0.0

  • dropout_attn (float, optional) – attention dropout rate. Default: 0.0

  • backbone (Module | None, optional) – an nn.Module to compute the image patch embeddings. Default: None

Example

>>> img = torch.rand(1, 3, 224, 224)
>>> vit = VisionTransformer(image_size=224, patch_size=16)
>>> vit(img).shape
torch.Size([1, 197, 768])
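The 197 in the output shape above can be derived from the image and patch sizes. This is a small sketch; the helper name num_tokens is hypothetical, and the extra token is assumed to be the usual ViT class token, which is consistent with the printed shape.

```python
def num_tokens(image_size=224, patch_size=16, class_token=True):
    """Number of tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size  # patches along one side
    return per_side * per_side + (1 if class_token else 0)


tokens = num_tokens(224, 16)
```

With the defaults, 224 / 16 = 14 patches per side gives 14 * 14 = 196 patch tokens, plus one class token.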
property encoder_results: list[Tensor]

Return intermediate outputs captured by the transformer encoder.

Returns:

List of tensors produced by the encoder blocks. Each tensor stores token embeddings for a layer, typically shaped \((B, N, D)\), where \(B\) is batch size, \(N\) is token count, and \(D\) is embedding dimension.

forward(x)[source]

Encode an image batch into Vision Transformer token embeddings.

Parameters:

x (Tensor) – Image tensor with shape \((B, C, H, W)\), where \(B\) is batch size, \(C\) must match self.in_channels, and \(H\) and \(W\) are expected to match self.image_size.

Return type:

Tensor

Returns:

Normalized token embedding tensor produced by patch embedding and the transformer encoder. The output shape follows the encoder layout, usually \((B, N, D)\).

static from_config(variant, pretrained=False, **kwargs)[source]

Build ViT model based on the given config string.

The format is vit_{size}/{patch_size}. E.g. vit_b/16 means ViT-Base, patch size 16x16. If pretrained=True, AugReg weights are loaded. The weights are hosted on HuggingFace’s model hub: https://huggingface.co/kornia.

Note

The available weights are: vit_l/16, vit_b/16, vit_s/16, vit_ti/16, vit_b/32, vit_s/32.

Parameters:
  • variant (str) – ViT model variant e.g. vit_b/16.

  • pretrained (bool, optional) – whether to load pre-trained AugReg weights. Default: False

  • kwargs (Any) – other keyword arguments that will be passed to kornia.models.vit.VisionTransformer().

Return type:

VisionTransformer

Returns:

The respective ViT model

Example

>>> from kornia.models.vit import VisionTransformer
>>> vit_model = VisionTransformer.from_config("vit_b/16", pretrained=True)
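The vit_{size}/{patch_size} naming scheme can be split into its two parts as sketched below. The helper name parse_variant is hypothetical and only illustrates the string format; it is not how from_config is implemented.

```python
def parse_variant(variant):
    """Split a string such as 'vit_b/16' into ('b', 16)."""
    name, _, patch = variant.partition("/")
    prefix, _, size = name.partition("_")
    if prefix != "vit" or not size or not patch.isdigit():
        raise ValueError(f"unexpected variant string: {variant!r}")
    return size, int(patch)


size, patch_size = parse_variant("vit_b/16")
```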
class kornia.models.vit_mobile.MobileViT(mode='xxs', in_channels=3, patch_size=(2, 2), dropout=0.0)[source]

MobileViT module. The default arguments are for MobileViT XXS.

Paper: https://arxiv.org/abs/2110.02178

Based on: https://github.com/chinhsuanwu/mobilevit-pytorch

Parameters:
  • mode (str, optional) – the model size: ‘xxs’, ‘xs’ or ‘s’. Default: "xxs"

  • in_channels (int, optional) – the number of channels for the input image. Default: 3

  • patch_size (Tuple[int, int], optional) – image_size must be divisible by patch_size. Default: (2, 2)

  • dropout (float, optional) – dropout ratio in Transformer. Default: 0.0

Example

>>> img = torch.rand(1, 3, 256, 256)
>>> mvit = MobileViT(mode='xxs')
>>> mvit(img).shape
torch.Size([1, 320, 8, 8])
class kornia.contrib.TinyViT(img_size=224, in_chans=3, num_classes=1000, embed_dims=(96, 192, 384, 768), depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24), window_sizes=(7, 7, 14, 7), mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.0, use_checkpoint=False, mbconv_expand_ratio=4.0, local_conv_size=3, activation=nn.GELU, mobile_sam=False)[source]

TinyViT model, as described in https://arxiv.org/abs/2207.10666.

Parameters:
  • img_size (int, optional) – Size of input image. Default: 224

  • in_chans (int, optional) – Number of input image’s channels. Default: 3

  • num_classes (int, optional) – Number of output classes. Default: 1000

  • embed_dims (Sequence[int], optional) – List of embedding dimensions. Default: (96, 192, 384, 768)

  • depths (Sequence[int], optional) – List of block counts for each downsampling stage. Default: (2, 2, 6, 2)

  • num_heads (Sequence[int], optional) – List of attention heads used in self-attention for each downsampling stage. Default: (3, 6, 12, 24)

  • window_sizes (Sequence[int], optional) – List of self-attention’s window size for each downsampling stage. Default: (7, 7, 14, 7)

  • mlp_ratio (float, optional) – Ratio of MLP dimension to embedding dimension in self-attention. Default: 4.0

  • drop_rate (float, optional) – Dropout rate. Default: 0.0

  • drop_path_rate (float, optional) – Stochastic depth rate. Default: 0.0

  • use_checkpoint (bool, optional) – Whether to use activation checkpointing to trade compute for memory. Default: False

  • mbconv_expand_ratio (float, optional) – Expansion ratio used in MBConv block. Default: 4.0

  • local_conv_size (int, optional) – Kernel size of the convolution used in TinyViTBlock. Default: 3

  • activation (type[Module], optional) – activation function. Default: nn.GELU

  • mobile_sam – Whether to use modifications for MobileSAM.

forward(x)[source]

Classify images if mobile_sam=False, produce feature maps if mobile_sam=True.

Return type:

Tensor

static from_config(variant, pretrained=False, **kwargs)[source]

Create a TinyViT model from pre-defined variants.

Parameters:
  • variant (str) – TinyViT variant. Possible values: '5m', '11m', '21m'.

  • pretrained (bool | str, optional) – whether to use pre-trained weights. Possible values: False, True, 'in22k', 'in1k'. For TinyViT-21M (variant='21m'), 'in1k_384', 'in1k_512' are also available. Default: False

  • **kwargs (Any) – other keyword arguments that will be passed to TinyViT.

Return type:

TinyViT

Note

When img_size is different from the pre-trained size, bicubic interpolation will be performed on attention biases. When using pretrained=True, ImageNet-1k checkpoint ('in1k') is used. For feature extraction or fine-tuning, ImageNet-22k checkpoint ('in22k') is preferred.

Image Stitching

class kornia.contrib.ImageStitcher(matcher, estimator='ransac', blending_method='naive')[source]

Stitch two images with overlapping fields of view.

Parameters:
  • matcher (Module) – image feature matching module.

  • estimator (str, optional) – method to compute homography, either “vanilla” or “ransac”. “ransac” is slower but more accurate. Default: "ransac"

  • blending_method (str, optional) – method to blend two images together. Only “naive” is currently supported. Default: "naive"

Note

Current implementation requires strict image ordering from left to right.

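import matplotlib.pyplot as plt
import torch
import kornia as K
import kornia.feature as KF
from kornia.contrib import ImageStitcher
# img_left and img_right are assumed to be image tensors already on the GPU.
IS = ImageStitcher(KF.LoFTR(pretrained='outdoor'), estimator='ransac').cuda()
# Compute the stitched result with less GPU memory cost.
with torch.inference_mode():
    out = IS(img_left, img_right)
# Show the result
plt.imshow(K.tensor_to_image(out))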

Lambda

class kornia.contrib.Lambda(func)[source]

Applies user-defined lambda as a transform.

Parameters:

func (Callable[..., Tensor]) – Callable function.

Returns:

The output of the user-defined lambda.

Example

>>> import kornia
>>> x = torch.rand(1, 3, 5, 5)
>>> f = Lambda(lambda x: kornia.color.rgb_to_grayscale(x))
>>> f(x).shape
torch.Size([1, 1, 5, 5])

Distance Transform

kornia.contrib.distance_transform(image, kernel_size=3, h=0.35)[source]

Approximates the Euclidean distance transform of images/volumes using cascaded convolution operations.

The value at each pixel/voxel represents the distance to the nearest non-zero element. It uses the method described in [PDP20]. The transformation is applied independently across the channel dimension.

Parameters:
  • image (Tensor) – Image or volume with shape \((B,C,H,W)\) or \((B,C,D,H,W)\).

  • kernel_size (int, optional) – size of the convolution kernel. Must be an odd number. Default: 3

  • h (float, optional) – value that influences the approximation of the min function. Default: 0.35

Return type:

Tensor

Returns:

tensor with the same shape as input.

Example

>>> # 2D example:
>>> tensor = torch.zeros(1, 1, 5, 5)
>>> tensor[:,:, 1, 2] = 1
>>> dt = distance_transform(tensor)
>>> # 3D example:
>>> volume = torch.zeros(1, 1, 5, 5, 5)
>>> volume[:, :, 2, 2, 2] = 1
>>> dt = distance_transform(volume)
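For comparison, an exact brute-force reference can be written in a few lines for small single-channel 2D inputs. This sketch is not the convolutional approximation the function uses; it only computes the quantity being approximated, and the helper name exact_distance_transform is hypothetical.

```python
import math


def exact_distance_transform(image):
    """Distance from every pixel to the nearest non-zero pixel (brute force)."""
    sources = [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v != 0]
    if not sources:
        raise ValueError("image has no non-zero element")
    return [
        [min(math.hypot(r - sr, c - sc) for sr, sc in sources) for c in range(len(row))]
        for r, row in enumerate(image)
    ]


image = [[0] * 5 for _ in range(5)]
image[1][2] = 1  # same layout as the 2D example above
dt = exact_distance_transform(image)
```

The cost grows with the number of pixels times the number of non-zero pixels, so this is only practical as a check on tiny inputs.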
kornia.contrib.diamond_square(output_size, roughness=0.5, random_scale=1.0, random_fn=torch.rand, normalize_range=None, device=None, dtype=None)[source]

Generate Plasma Fractal Images using the diamond square algorithm.

See: https://en.wikipedia.org/wiki/Diamond-square_algorithm

Parameters:
  • output_size (Tuple[int, int, int, int]) – a tuple of integers with the BxCxHxW of the image to be generated.

  • roughness (Union[float, Tensor], optional) – the scale value to apply at each recursion step. Default: 0.5

  • random_scale (Union[float, Tensor], optional) – the initial value of the scale for recursion. Default: 1.0

  • random_fn (Callable[..., Tensor], optional) – the callable function to use to sample a random torch.Tensor. Default: torch.rand

  • normalize_range (Optional[Tuple[float, float]], optional) – whether to normalize the output map using min-max normalization. If a range is specified, min-max normalization is applied to the provided range. Default: None

  • device (Optional[device], optional) – the torch device to place the output map. Default: None

  • dtype (Optional[dtype], optional) – the torch dtype of the output map. Default: None

Return type:

Tensor

Returns:

A torch.Tensor with shape \((B,C,H,W)\) containing the fractal image.
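The diamond-square recursion can be sketched on a single height map with nested lists. This is a simplified illustration, not the library implementation: the function below is a standalone sketch with its own hypothetical parameters (n, seed), it assumes a square grid of side 2**n + 1, and it uses zero-centred uniform noise.

```python
import random


def diamond_square(n, roughness=0.5, random_scale=1.0, seed=0):
    """Return a (2**n + 1) x (2**n + 1) height map as nested lists."""
    rng = random.Random(seed)
    size = 2 ** n + 1
    grid = [[0.0] * size for _ in range(size)]
    for r in (0, size - 1):
        for c in (0, size - 1):
            grid[r][c] = rng.random()  # seed the four corners
    step, scale = size - 1, random_scale
    while step > 1:
        half = step // 2
        # Diamond step: centre of each square = mean of its four corners + noise.
        for r in range(half, size, step):
            for c in range(half, size, step):
                mean = (grid[r - half][c - half] + grid[r - half][c + half]
                        + grid[r + half][c - half] + grid[r + half][c + half]) / 4
                grid[r][c] = mean + (rng.random() - 0.5) * scale
        # Square step: each edge midpoint = mean of its in-bounds neighbours + noise.
        for r in range(0, size, half):
            for c in range((r + half) % step, size, step):
                nbrs = [grid[rr][cc]
                        for rr, cc in ((r - half, c), (r + half, c), (r, c - half), (r, c + half))
                        if 0 <= rr < size and 0 <= cc < size]
                grid[r][c] = sum(nbrs) / len(nbrs) + (rng.random() - 0.5) * scale
        # Shrink the noise amplitude at each recursion level.
        step, scale = half, scale * roughness
    return grid


height_map = diamond_square(3)
```

Lower roughness makes the noise shrink faster between levels, which produces smoother maps.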

class kornia.contrib.DistanceTransform(kernel_size=3, h=0.35)[source]

Module that approximates the Euclidean distance transform of images/volumes using convolutions.

Parameters:
  • kernel_size (int, optional) – size of the convolution kernel. Default: 3

  • h (float, optional) – value that influences the approximation of the min function. Default: 0.35
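To give a feel for how a parameter like h can control an approximation of the min function, the sketch below uses a log-sum-exp soft minimum. This formula is an assumption made for illustration; this section does not state which approximation the module uses, and the helper name soft_min is hypothetical.

```python
import math


def soft_min(values, h=0.35):
    """Smooth lower bound on min(values); it tightens as h approaches 0."""
    m = min(values)  # subtract the minimum for numerical stability
    return m - h * math.log(sum(math.exp(-(v - m) / h) for v in values))


loose = soft_min([1.0, 2.0, 3.0], h=0.35)
tight = soft_min([1.0, 2.0, 3.0], h=0.01)
```

Smaller h tracks the true minimum more closely, at the price of a sharper, less smooth function.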

KMeans

class kornia.contrib.KMeans(num_clusters, cluster_centers, tolerance=10e-4, max_iterations=0, seed=None)[source]

Implements the k-means clustering algorithm with Euclidean distance as the similarity measure.

Parameters:
  • num_clusters (int) – number of clusters the data has to be assigned to.

  • cluster_centers (Tensor | None) – torch.Tensor of starting cluster centers; can be passed instead of num_clusters.

  • tolerance (float, optional) – the algorithm terminates if the shift in centers is less than tolerance. Default: 10e-4

  • max_iterations (int, optional) – maximum number of iterations to run the algorithm for. Default: 0

  • seed (int | None, optional) – value used to set the torch manual seed for reproducibility. Default: None

Example

>>> kmeans = kornia.contrib.KMeans(3, None, 10e-4, 100, 0)
>>> kmeans.fit(torch.rand((1000, 5)))
>>> predictions = kmeans.predict(torch.rand((10, 5)))
property cluster_assignments: Tensor

Return cluster labels assigned during the most recent fit call.

Returns:

A 1D tensor with shape \((N,)\), where N is the number of samples given to fit(). Each value is the cluster index assigned to the corresponding sample.

Raises:

TypeError – If fit has not been run yet.

property cluster_centers: Tensor

Return the current cluster centers.

Returns:

A tensor with shape \((C, D)\), where:

  • C is the number of clusters.

  • D is the feature dimension of each sample.

If fit() has already been called, this returns the learned final centers. Otherwise, it returns the initialization provided during construction.

Raises:

TypeError – If no initial centers were provided and fit has not been run.

fit(X)[source]

Fit iterative KMeans clustering until the shift in cluster centers falls below a threshold or the maximum number of iterations is reached.

Parameters:

X (Tensor) – 2D input torch.Tensor to be clustered

Return type:

None

predict(x)[source]

Find the cluster center closest to each point in x.

Parameters:

x (Tensor) – 2D torch.Tensor

Return type:

Tensor

Returns:

1D torch.Tensor containing cluster id assigned to each data point in x
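The fit/predict loop can be illustrated with a short pure-Python sketch of Lloyd's iteration. It is a generic illustration, not the class implementation: the function names nearest and kmeans are hypothetical, the starting centers are supplied explicitly, and an empty cluster is assumed to keep its previous center.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def kmeans(points, centers, tolerance=1e-3, max_iterations=100):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            # Assumption: an empty cluster keeps its previous center.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old)
        shift = sum(math.dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift < tolerance:
            break  # centers moved less than the tolerance
    return centers, [nearest(p, centers) for p in points]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The final assignment step plays the role of predict(): each point is mapped to the index of its closest center.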