kornia.contrib#
Models#
Base#
- class kornia.contrib.models.base.ModelBase(*args, **kwargs)[source]#
Abstract model class with some utility functions.
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options={}, disable=False)[source]#
- Return type:
ModelBase[ModelConfig]
- load_checkpoint(checkpoint, device=None)[source]#
Load checkpoint from a given url or file.
- Parameters:
checkpoint (str) – The URL or filepath of the respective checkpoint.
device (torch.device | None, optional) – The desired device to load the weights onto and move the model to. Default:
None
- Return type:
None
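A minimal usage sketch, assuming a concrete ModelBase subclass such as the RTDETR model documented below; the checkpoint path is a placeholder:

import torch
from kornia.contrib.models.rt_detr import RTDETR, RTDETRConfig

# RTDETR is a concrete ModelBase subclass (see the RT-DETR section below).
model = RTDETR.from_config(RTDETRConfig(model_type="resnet50d", num_classes=80))

# Load weights from a URL or local file; the path below is a placeholder.
# model.load_checkpoint("/path/to/checkpoint.ckpt", device=torch.device("cpu"))

# Optionally wrap the model with the torch.compile()/dynamo API (torch >= 2.0).
# model = model.compile(backend="inductor")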
Structures#
- class kornia.contrib.models.SegmentationResults(logits, scores, mask_threshold=0.0)[source]#
Encapsulate the results obtained by a Segmentation model.
- Parameters:
- property binary_masks: Tensor#
Binary masks generated from the logits by applying mask_threshold.
Shape will be the same as logits, \((B, C, H, W)\), where \(C\) is the number of masks predicted.
Note
If original_res_logits has been called, the masks are generated from the original-resolution logits. Otherwise, the low-resolution logits (self.logits) are used.
- original_res_logits(input_size, original_size, image_size_encoder)[source]#
Remove padding and upscale the logits to the original image size.
Resize to image encoder input -> remove padding (bottom and right) -> Resize to original size
Note
This method sets an internal original_res_logits, which will be used (when available) to compute the binary masks.
- Parameters:
input_size – The size of the image input to the model, in (H, W) format. Used to remove padding.
original_size – The original size of the image before resizing for input to the model, in (H, W) format.
image_size_encoder – The size of the input image for the image encoder, in (H, W) format. Used to resize the logits back to the encoder resolution before removing the padding.
- Returns:
Batched logits in \((K, C, H, W)\) format, where (H, W) is given by original_size.
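A short sketch of how the structure can be consumed, assuming logits and scores already produced by a segmentation model; the sizes below are illustrative:

import torch
from kornia.contrib.models import SegmentationResults

# Hypothetical low-resolution outputs from a segmentation model: (K, C, H, W) and (K, C).
logits = torch.randn(2, 3, 256, 256)
scores = torch.rand(2, 3)

results = SegmentationResults(logits=logits, scores=scores, mask_threshold=0.0)
low_res_masks = results.binary_masks  # thresholded low-resolution logits, (2, 3, 256, 256)

# Upscale the logits back to the original image resolution before thresholding again.
results.original_res_logits(
    input_size=(768, 1024),           # model input size (H, W), used to strip padding
    original_size=(480, 640),         # original image size (H, W)
    image_size_encoder=(1024, 1024),  # image encoder input size (H, W)
)
full_res_masks = results.binary_masks  # now based on the original-resolution logits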
- class kornia.contrib.models.Prompts(points=None, boxes=None, masks=None)[source]#
Encapsulate the prompt inputs for a model.
- Parameters:
points (optional) – A tuple with the keypoints (x, y coordinates) and their respective labels. Shape \((K, N, 2)\) for the keypoints and \((K, N)\) for the labels. Default:
None
boxes (optional) – Batched box inputs, with shape \((K, 4)\). Expected to be in xyxy format. Default:
None
masks (optional) – Batched mask prompts to the model, with shape \((K, 1, H, W)\). Default:
None
- boxes: Tensor | None = None#
- masks: Tensor | None = None#
- points: tuple[Tensor, Tensor] | None = None#
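A minimal sketch of building the prompt container; the coordinates and box below are illustrative:

import torch
from kornia.contrib.models import Prompts

keypoints = torch.tensor([[[10.0, 20.0], [30.0, 40.0]]])  # (K=1, N=2, 2), xy coordinates
labels = torch.tensor([[1.0, 1.0]])                        # (K=1, N=2), 1 = foreground

prompts = Prompts(
    points=(keypoints, labels),
    boxes=torch.tensor([[5.0, 5.0, 50.0, 50.0]]),  # (K=1, 4), xyxy
    masks=None,
)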
VisualPrompter#
- class kornia.contrib.visual_prompter.VisualPrompter(config=SamConfig(model_type='vit_h', pretrained=True), device=None, dtype=None)[source]#
This class allows the user to run multiple queries with multiple prompts against a model.
At the moment, only the SAM model is supported. The model is loaded based on the given config.
By default, images are transformed so that their long side matches image_encoder.img_size. This Prompter class ensures that the images and the prompts are transformed before prediction. The image is also passed automatically to the preprocess_image method, which normalizes the image and pads it to the size expected by the SAM model, \((\text{image\_encoder.img\_size}, \text{image\_encoder.img\_size})\). By default, the image is normalized with the mean and standard deviation of the SAM dataset values.
- Parameters:
config (SamConfig, optional) – A model config used to build the model. Currently only the SAM model is supported. Default:
SamConfig(model_type='vit_h', pretrained=True)
device (torch.device | None, optional) – The desired device to run the model on. Default:
None
dtype (torch.dtype | None, optional) – The desired dtype for the model. Default:
None
Example
>>> # prompter = VisualPrompter()  # Will load the vit_h by default
>>> # You can load a custom SAM type by modifying the config
>>> prompter = VisualPrompter(SamConfig('vit_b'))
>>> image = torch.rand(3, 25, 30)
>>> prompter.set_image(image)
>>> boxes = Boxes(
...     torch.tensor(
...         [[[[0, 0], [0, 10], [10, 0], [10, 10]]]],
...         device=prompter.device,
...         dtype=torch.float32
...     ),
...     mode='xyxy'
... )
>>> prediction = prompter.predict(boxes=boxes)
>>> prediction.logits.shape
torch.Size([1, 3, 256, 256])
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options={}, disable=False)[source]#
Applies the torch.compile(…) / dynamo API to the VisualPrompter API.
Note
For more information about the dynamo API check the official docs https://pytorch.org/docs/stable/generated/torch.compile.html
- Parameters:
fullgraph (bool, optional) – Whether it is OK to break the model into several subgraphs. Default:
False
dynamic (bool, optional) – Use dynamic shape tracing. Default:
False
backend (str, optional) – Backend to be used. Default:
'inductor'
mode (str | None, optional) – Can be either “default”, “reduce-overhead” or “max-autotune”. Default:
None
options (dict[Any, Any], optional) – A dictionary of options to pass to the backend. Default:
{}
disable (bool, optional) – Turn torch.compile() into a no-op for testing. Default:
False
- Return type:
None
Example
>>> # prompter = VisualPrompter()
>>> # prompter.compile()  # You should have torch >= 2.0.0 installed
>>> # Use the prompter methods ...
- predict(keypoints=None, keypoints_labels=None, boxes=None, masks=None, multimask_output=True, output_original_size=True)[source]#
Predict masks for the given image based on the input prompts.
- Parameters:
keypoints (Keypoints | Tensor | None, optional) – Point prompts to the model. Each point is in (X,Y) in pixels. Shape \((K, N, 2)\). Where N is the number of points and K the number of prompts. Default:
None
keypoints_labels – Labels for the point prompts. 1 indicates a foreground point and 0 indicates a background point. Shape \((K, N)\), where N is the number of points and K the number of prompts.
boxes (Boxes | Tensor | None, optional) – A box prompt to the model. If a tensor, it should be in xyxy format. Shape \((K, 4)\). Default:
None
masks (Tensor | None, optional) – A low resolution mask input to the model, typically coming from a previous prediction iteration. Has shape \((K, 1, H, W)\), where for SAM, H=W=256. Default:
None
multimask_output (bool, optional) – If true, the model will return three masks. For ambiguous input prompts (such as a single click), this will often produce better masks than a single prediction. If only a single mask is needed, the model’s predicted quality score can be used to select the best mask. For non-ambiguous prompts, such as multiple input prompts, multimask_output=False can give better results. Default:
True
output_original_size (bool, optional) – If true, the logits of SegmentationResults will be post-processed to match the original input image size. Default:
True
- Return type:
SegmentationResults
- Returns:
A prediction with the logits and scores (the IoU of each predicted mask).
- preprocess_image(x, mean=None, std=None)[source]#
Normalize and pad a tensor.
For normalization: the mean and std passed as arguments take priority; if they are None, the default SAM dataset values are used.
For padding: the tensor is padded on the right and bottom to match the size of self.model.image_encoder.img_size.
- Parameters:
x (Tensor) – The image to be preprocessed
mean (Tensor | None, optional) – Mean for each channel. Default:
None
std (Tensor | None, optional) – Standard deviations for each channel. Default:
None
- Return type:
Tensor
- Returns:
The preprocessed image (normalized if mean and std are available, and padded to the encoder size).
- preprocess_prompts(keypoints=None, keypoints_labels=None, boxes=None, masks=None)[source]#
Validate and preprocess the given prompts to be aligned with the input image.
- Return type:
- set_image(image, mean=None, std=None)[source]#
Set the embeddings from the given image using the image_encoder of the model.
Prepare the given image with the selected transforms and the preprocess method.
- Parameters:
image (Tensor) – RGB image. Normally images are in the range [0, 1]; the model preprocessing normalizes the pixel values with the mean and std defined in its initialization. Expected to be of float32 dtype. Shape \((3, H, W)\).
- Return type:
None
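A sketch of a keypoint-driven query, mirroring the box example above; shapes and coordinates are illustrative, and SamConfig('vit_b') uses randomly initialized weights since pretrained defaults to False:

import torch
from kornia.contrib.models.sam import SamConfig
from kornia.contrib.visual_prompter import VisualPrompter

prompter = VisualPrompter(SamConfig('vit_b'))

image = torch.rand(3, 480, 640)  # RGB image in [0, 1], float32, (3, H, W)
prompter.set_image(image)

# A single foreground click at pixel (x=320, y=240): shapes (K=1, N=1, 2) and (K=1, N=1).
keypoints = torch.tensor([[[320.0, 240.0]]], device=prompter.device)
labels = torch.ones(1, 1, device=prompter.device)

prediction = prompter.predict(
    keypoints=keypoints,
    keypoints_labels=labels,
    multimask_output=True,
)
print(prediction.logits.shape, prediction.scores.shape)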
Edge Detection#
- class kornia.contrib.EdgeDetector[source]#
Detect edges in a given image using a CNN.
By default, it uses the method described in [SRS20].
- Returns:
A tensor of shape \((B,1,H,W)\).
Example
>>> img = torch.rand(1, 3, 320, 320)
>>> detect = EdgeDetector()
>>> out = detect(img)
>>> out.shape
torch.Size([1, 1, 320, 320])
Face Detection#
- class kornia.contrib.FaceDetector(top_k=5000, confidence_threshold=0.3, nms_threshold=0.3, keep_top_k=750)[source]#
Detect faces in a given image using a CNN.
By default, it uses the method described in [FYP+21].
- Parameters:
top_k (int, optional) – the maximum number of detections to return before the nms. Default: 5000
confidence_threshold (float, optional) – the threshold used to discard detections. Default: 0.3
nms_threshold (float, optional) – the threshold used by the nms for iou. Default: 0.3
keep_top_k (int, optional) – the maximum number of detections to return after the nms. Default: 750
- Returns:
A list of B tensors with shape \((N, 15)\) to be used with kornia.contrib.FaceDetectorResult.
Example
>>> img = torch.rand(1, 3, 320, 320)
>>> detect = FaceDetector()
>>> res = detect(img)
- class kornia.contrib.FaceKeypoint(value)[source]#
Define the keypoints detected in a face.
The left/right convention is based on the screen viewer.
- EYE_LEFT = 0#
- EYE_RIGHT = 1#
- MOUTH_LEFT = 3#
- MOUTH_RIGHT = 4#
- NOSE = 2#
- class kornia.contrib.FaceDetectorResult(data)[source]#
Encapsulate the results obtained by the kornia.contrib.FaceDetector.
- Parameters:
data (Tensor) – the encoded results coming from the feature detector with shape \((14,)\).
- property bottom_right: Tensor#
The [x y] position of the bottom-right coordinate of the bounding box.
- get_keypoint(keypoint)[source]#
The [x y] position of a given facial keypoint.
- Parameters:
keypoint (FaceKeypoint) – the keypoint type to return the position.
- Return type:
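A sketch of decoding the detector output with this class, assuming each detection row of the returned \((N, 15)\) tensors can be wrapped in a FaceDetectorResult, as the Returns section above suggests:

import torch
from kornia.contrib import FaceDetector, FaceDetectorResult, FaceKeypoint

img = torch.rand(1, 3, 320, 320)
detect = FaceDetector()

with torch.no_grad():
    dets = detect(img)  # list of B tensors, each with shape (N, 15)

for det in dets[0]:  # detections of the first image
    res = FaceDetectorResult(det)
    print(res.bottom_right)                     # [x, y] of the bottom-right box corner
    print(res.get_keypoint(FaceKeypoint.NOSE))  # [x, y] of the nose keypoint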
Interactive Demo#
Visit the Kornia face detection demo on Hugging Face Spaces.
Object Detection#
- class kornia.contrib.object_detection.BoundingBoxDataFormat(value)[source]#
Enum class that maps bounding box data format.
- XYWH = 0#
- XYXY = 1#
- CXCYWH = 2#
- CENTER_XYWH = 2#
- class kornia.contrib.object_detection.BoundingBox(data, data_format)[source]#
Bounding box data class.
Useful for representing bounding boxes in different formats for object detection.
- Parameters:
data – tuple of bounding box data. The length of the tuple depends on the data format.
data_format – bounding box data format.
- data: tuple[float, float, float, float]#
- data_format: BoundingBoxDataFormat#
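A minimal construction sketch with illustrative values:

from kornia.contrib.object_detection import BoundingBox, BoundingBoxDataFormat

# A box stored as (x, y, width, height).
bbox = BoundingBox(data=(10.0, 20.0, 50.0, 80.0), data_format=BoundingBoxDataFormat.XYWH)
print(bbox.data, bbox.data_format)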
- class kornia.contrib.object_detection.ObjectDetectorResult(class_id, confidence, bbox)[source]#
Object detection result.
- Parameters:
class_id (int) – class id of the detected object.
confidence (float) – confidence score of the detected object.
bbox (BoundingBox) – bounding box of the detected object in xywh format.
- bbox: BoundingBox#
- class kornia.contrib.object_detection.ObjectDetector(model, pre_processor, post_processor)[source]#
This class wraps an object detection model and performs pre-processing and post-processing.
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options=None, disable=False)[source]#
Compile the internal object detection model with torch.compile().
- Return type:
None
- forward(images)[source]#
Detect objects in a given list of images.
- Parameters:
images – list of RGB images. Each image is a Tensor with shape \((3, H, W)\).
- Returns:
list of detections found in each image. For each item in the batch, the shape is \((D, 6)\), where \(D\) is the number of detections in the given image and the \(6\) values represent class id, score, and the xywh bounding box.
- class kornia.contrib.object_detection.ResizePreProcessor(size, interpolation_mode='bilinear')[source]#
This module resizes a list of image tensors to the given size.
It also returns the original image sizes for further post-processing.
- forward(imgs)[source]#
Resize the given list of image tensors to the configured size and return the resized images together with their original sizes for later post-processing.
- kornia.contrib.object_detection.results_from_detections(detections, format)[source]#
Convert a detection tensor to a list of ObjectDetectorResult.
- Parameters:
detections (Tensor) – tensor with shape \((D, 6)\), where \(D\) is the number of detections in the given image, \(6\) represents class id, score, and xywh bounding box.
- Return type:
- Returns:
list of ObjectDetectorResult.
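A sketch of converting one image's detections, assuming the format argument accepts the BoundingBoxDataFormat enum and the detections tensor is laid out as described above:

import torch
from kornia.contrib.object_detection import BoundingBoxDataFormat, results_from_detections

# Hypothetical detections for one image: (D, 6) = (class_id, score, x, y, w, h).
detections = torch.tensor([
    [1.0, 0.92, 10.0, 20.0, 50.0, 80.0],
    [3.0, 0.75, 60.0, 40.0, 30.0, 30.0],
])

results = results_from_detections(detections, format=BoundingBoxDataFormat.XYWH)
for r in results:
    print(r.class_id, r.confidence, r.bbox.data)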
Real-Time Detection Transformer (RT-DETR)#
- class kornia.contrib.models.rt_detr.RTDETRModelType(value)[source]#
Enum class that maps RT-DETR model type.
- resnet18d = 0#
- resnet34d = 1#
- resnet50d = 2#
- resnet101d = 3#
- hgnetv2_l = 4#
- hgnetv2_x = 5#
- class kornia.contrib.models.rt_detr.RTDETRConfig(model_type, num_classes, checkpoint=None, neck_hidden_dim=None, neck_dim_feedforward=None, neck_expansion=None, head_hidden_dim=256, head_num_queries=300, head_num_decoder_layers=None, confidence_threshold=0.3)[source]#
Configuration to construct RT-DETR model.
- Parameters:
model_type (RTDETRModelType | str | int) – model variant. Available models are:
ResNetD-18: 0, 'resnet18d' or RTDETRModelType.resnet18d
ResNetD-34: 1, 'resnet34d' or RTDETRModelType.resnet34d
ResNetD-50: 2, 'resnet50d' or RTDETRModelType.resnet50d
ResNetD-101: 3, 'resnet101d' or RTDETRModelType.resnet101d
HGNetV2-L: 4, 'hgnetv2_l' or RTDETRModelType.hgnetv2_l
HGNetV2-X: 5, 'hgnetv2_x' or RTDETRModelType.hgnetv2_x
num_classes (int) – number of classes.
checkpoint (str | None, optional) – URL or local path of model weights. Default:
None
neck_hidden_dim (int | None, optional) – hidden dim for neck. Default:
None
neck_dim_feedforward (int | None, optional) – feed-forward network dim for neck. Default:
None
neck_expansion (float | None, optional) – expansion ratio for neck. Default:
None
head_hidden_dim (int, optional) – hidden dim for head. Default:
256
head_num_queries (int, optional) – number of queries for Deformable DETR transformer decoder. Default:
300
head_num_decoder_layers (int | None, optional) – number of decoder layers for Deformable DETR transformer decoder. Default:
None
- checkpoint: str | None = None#
- confidence_threshold: float = 0.3#
- head_num_decoder_layers: int | None = None#
- head_num_queries: int = 300#
- model_type: RTDETRModelType | str | int#
- neck_dim_feedforward: int | None = None#
- neck_expansion: float | None = None#
- num_classes: int#
- class kornia.contrib.models.rt_detr.RTDETR(backbone, neck, head)[source]#
RT-DETR Object Detection model, as described in https://arxiv.org/abs/2304.08069.
- __init__(backbone, neck, head)[source]#
Construct RT-DETR Object Detection model.
- Parameters:
backbone (ResNetD | PPHGNetV2) – backbone network for feature extraction.
neck (HybridEncoder) – neck network for feature fusion.
head (RTDETRHead) – head network to decode features into detection results.
- forward(images)[source]#
Detect objects in an image.
- Parameters:
images – images to be detected. Shape \((N, C, H, W)\).
- Returns:
logits - Tensor of shape \((N, Q, K)\), where \(Q\) is the number of queries, \(K\) is the number of classes.
boxes - Tensor of shape \((N, Q, 4)\), where \(Q\) is the number of queries.
- static from_config(config)[source]#
Construct RT-DETR Object Detection model from a config object.
- Parameters:
config (RTDETRConfig) – configuration object for RT-DETR.
- Return type:
Note
For config.neck_hidden_dim, config.neck_dim_feedforward, config.neck_expansion, and config.head_num_decoder_layers, if they are None, their values will be replaced with the default values depending on the config.model_type. See the source code for the default values.
- load_checkpoint(checkpoint, device=None)#
Load checkpoint from a given url or file.
- Parameters:
checkpoint (str) – The URL or filepath of the respective checkpoint.
device (torch.device | None, optional) – The desired device to load the weights onto and move the model to. Default:
None
- Return type:
None
- class kornia.contrib.models.rt_detr.DETRPostProcessor(confidence_threshold)[source]#
- forward(logits, boxes, original_sizes)[source]#
Post-process outputs from DETR.
- Parameters:
logits – tensor with shape \((N, Q, K)\), where \(N\) is the batch size, \(Q\) is the number of queries, \(K\) is the number of classes.
boxes – tensor with shape \((N, Q, 4)\), where \(N\) is the batch size, \(Q\) is the number of queries.
original_sizes – list of tuples, each tuple represent (img_height, img_width).
- Returns:
Processed detections. For each image, the detections have shape \((D, 6)\), where \(D\) is the number of detections in that image and the 6 values represent (class_id, confidence_score, x, y, w, h).
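A sketch of a full RT-DETR detection pipeline built from the components documented on this page; the ResizePreProcessor size argument is assumed to accept a single int (check your kornia version for the exact signature), and the model uses random weights unless a checkpoint is loaded:

import torch
from kornia.contrib.models.rt_detr import DETRPostProcessor, RTDETR, RTDETRConfig
from kornia.contrib.object_detection import (
    BoundingBoxDataFormat,
    ObjectDetector,
    ResizePreProcessor,
    results_from_detections,
)

# Build the RT-DETR model from a config (optionally load weights with model.load_checkpoint).
model = RTDETR.from_config(RTDETRConfig(model_type="resnet50d", num_classes=80))

# Wrap it with pre- and post-processing; 640 is a common RT-DETR input size.
detector = ObjectDetector(
    model,
    ResizePreProcessor(640),
    DETRPostProcessor(confidence_threshold=0.3),
)

images = [torch.rand(3, 480, 640), torch.rand(3, 720, 1280)]  # list of RGB images
with torch.inference_mode():
    detections = detector.forward(images)  # one (D, 6) tensor per image

results = results_from_detections(detections[0], format=BoundingBoxDataFormat.XYWH)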
Image Segmentation#
- kornia.contrib.connected_components(image, num_iterations=100)[source]#
Computes the Connected-component labelling (CCL) algorithm.
The implementation is an adaptation of the following repository:
https://gist.github.com/efirdc/5d8bd66859e574c683a504a4690ae8bc
Warning
This is an experimental API subject to changes and optimization improvements.
Note
See a working example here.
- Parameters:
- Return type:
- Returns:
The labels image with the same shape as the input image.
Example
>>> img = torch.rand(2, 1, 4, 5)
>>> img_labels = connected_components(img, num_iterations=100)
Segment Anything (SAM)#
- class kornia.contrib.models.sam.SamModelType(value)[source]#
Map the SAM model types.
- vit_h = 0#
- vit_l = 1#
- vit_b = 2#
- mobile_sam = 3#
- class kornia.contrib.models.sam.SamConfig(model_type=None, checkpoint=None, pretrained=False, encoder_embed_dim=None, encoder_depth=None, encoder_num_heads=None, encoder_global_attn_indexes=None)[source]#
Encapsulate the Config to build a SAM model.
- Parameters:
model_type (str | int | SamModelType | None, optional) – the available models are: Default: None
0, 'vit_h' or kornia.contrib.sam.SamModelType.vit_h
1, 'vit_l' or kornia.contrib.sam.SamModelType.vit_l
2, 'vit_b' or kornia.contrib.sam.SamModelType.vit_b
3, 'mobile_sam' or kornia.contrib.sam.SamModelType.mobile_sam
checkpoint (str | None, optional) – URL or a path to a file with the weights of the model. Default:
None
encoder_embed_dim (int | None, optional) – Patch embedding dimension. Default:
None
encoder_depth (int | None, optional) – Depth of ViT. Default:
None
encoder_num_heads (int | None, optional) – Number of attention heads in each ViT block. Default:
None
encoder_global_attn_indexes (tuple[int, ...] | None, optional) – Encoder indexes for blocks using global attention. Default:
None
- checkpoint: str | None = None#
- encoder_depth: int | None = None#
- encoder_embed_dim: int | None = None#
- encoder_global_attn_indexes: tuple[int, ...] | None = None#
- encoder_num_heads: int | None = None#
- model_type: str | int | SamModelType | None = None#
- pretrained: bool = False#
- class kornia.contrib.models.sam.Sam(image_encoder, prompt_encoder, mask_decoder)[source]#
- __init__(image_encoder, prompt_encoder, mask_decoder)[source]#
SAM predicts object masks from an image and input prompts.
- Parameters:
image_encoder (ImageEncoderViT | TinyViT) – The backbone used to encode the image into image embeddings that allow for efficient mask prediction.
prompt_encoder (PromptEncoder) – Encodes various types of input prompts.
mask_decoder (MaskDecoder) – Predicts masks from the image embeddings and encoded prompts.
- forward(images, batched_prompts, multimask_output)[source]#
Predicts masks end-to-end from provided images and prompts.
This method expects that the images have already been pre-processed: at least normalized, resized, and padded to be compatible with self.image_encoder.
Note
For each image \((3, H, W)\), it is possible to input a batch of \(K\) prompts with \(N\) points each; the results are batched by the number of prompts. So given a prompt batch with \(K=5\) and \(N=10\), the results will have shape \(5 \times C \times H \times W\), where \(C\) is determined by multimask_output. Within each of these \(5 \times C\) masks, it should be possible to find \(N\) instances if the model succeeds.
- Parameters:
images – The image as a torch tensor in \((B, 3, H, W)\) format, already transformed for input to the model.
batched_prompts – A list over the batch of images (list length should be \(B\)), where each element is a dictionary with the following keys. If an image does not have a given prompt, the corresponding key should not be included in its dictionary. The options are:
"points": tuple of (Tensor, Tensor) with the coordinate keypoints and their respective labels. The tuple should look like (keypoints, labels), where:
The keypoints (a tensor) are batched point prompts for this image, with shape \((K, N, 2)\). Already transformed to the input frame of the model.
The labels (a tensor) are batched labels for the point prompts, with shape \((K, N)\), where 1 indicates a foreground point and 0 indicates a background point.
"boxes": (Tensor) Batched box inputs, with shape \((K, 4)\). Already transformed to the input frame of the model.
"mask_inputs": (Tensor) Batched mask inputs to the model, in the form \((K, 1, H, W)\).
multimask_output – Whether the model should predict multiple disambiguating masks, or return a single mask.
- Returns:
A list over the input images, where each element is a SegmentationResults with the following fields:
logits: Low resolution logits with shape \((K, C, H, W)\). Can be passed as mask input to subsequent iterations of prediction, where \(K\) is the number of input prompts, \(C\) is determined by multimask_output, and \(H = W = 256\) is the model output size.
scores: The model's predictions of mask quality (IoU prediction), with shape \((K, C)\).
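A sketch of the batched_prompts structure described above, for a batch of two already pre-processed images (keypoint prompts for the first, a box prompt for the second); the call at the end is commented out because sam_model and images are placeholders:

import torch

# One dictionary per image in the batch; omit the keys for prompts you do not have.
batched_prompts = [
    {   # image 0: K=1 prompt with N=2 keypoints, already in the model input frame
        "points": (
            torch.tensor([[[128.0, 256.0], [300.0, 300.0]]]),  # keypoints, (K, N, 2)
            torch.tensor([[1.0, 0.0]]),                         # labels, (K, N)
        ),
    },
    {   # image 1: K=1 box prompt in xyxy, already in the model input frame
        "boxes": torch.tensor([[100.0, 100.0, 400.0, 400.0]]),  # (K, 4)
    },
]

# outputs = sam_model(images, batched_prompts, multimask_output=True)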
- static from_config(config)[source]#
Build/load the SAM model based on its config.
- Parameters:
config (SamConfig) – The SamConfig data structure. If the model_type is available, build from it; otherwise the parameters set will be used.
- Return type:
- Returns:
The respective SAM model
Example
>>> from kornia.contrib.models.sam import SamConfig
>>> sam_model = Sam.from_config(SamConfig('vit_b'))
- load_checkpoint(checkpoint, device=None)#
Load checkpoint from a given url or file.
- Parameters:
checkpoint (str) – The URL or filepath of the respective checkpoint.
device (torch.device | None, optional) – The desired device to load the weights onto and move the model to. Default:
None
- Return type:
None
Image Patches#
- kornia.contrib.compute_padding(original_size, window_size)[source]#
Compute required padding to ensure chaining of extract_tensor_patches() and combine_tensor_patches() produces the expected result.
- Parameters:
- Return type:
- Returns:
The required padding for (top, bottom, left, right) as a tuple of 4 ints.
Example
>>> image = torch.arange(12).view(1, 1, 4, 3)
>>> padding = compute_padding((4,3), (3,3))
>>> out = extract_tensor_patches(image, window_size=(3, 3), stride=(3, 3), padding=padding)
>>> combine_tensor_patches(out, original_size=(4, 3), window_size=(3, 3), stride=(3, 3), unpadding=padding)
tensor([[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8],
          [ 9, 10, 11]]]])
Note
This function is supposed to be used in conjunction with extract_tensor_patches() and combine_tensor_patches().
- kornia.contrib.extract_tensor_patches(input, window_size, stride=1, padding=0)[source]#
Function that extracts patches from tensors and stacks them.
See ExtractTensorPatches for details.
- Parameters:
input (Tensor) – tensor image where to extract the patches with shape \((B, C, H, W)\).
window_size (int | Tuple[int, int]) – the size of the sliding window and the output patch size.
stride (int | Tuple[int, int], optional) – stride of the sliding window. Default: 1
padding (int | Tuple[int, int] | Tuple[int, int, int, int], optional) – Zero-padding added to both sides of the input. Default: 0
- Return type:
- Returns:
the tensor with the extracted patches with shape \((B, N, C, H_{out}, W_{out})\).
Examples
>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
- kornia.contrib.combine_tensor_patches(patches, original_size, window_size, stride, unpadding=0)[source]#
Restore input from patches.
See CombineTensorPatches for details.
- Parameters:
patches (Tensor) – patched tensor with shape \((B, N, C, H_{out}, W_{out})\).
original_size (int | Tuple[int, int]) – the size of the original tensor and the output patch size.
window_size (int | Tuple[int, int]) – the size of the sliding window used while extracting patches.
stride (int | Tuple[int, int]) – stride of the sliding window.
unpadding (int | Tuple[int, int] | Tuple[int, int, int, int], optional) – remove the padding added to both sides of the input. Default: 0
- Return type:
- Returns:
The combined patches in an image tensor with shape \((B, C, H, W)\).
Example
>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])
Note
This function is supposed to be used in conjunction with extract_tensor_patches().
- class kornia.contrib.ExtractTensorPatches(window_size, stride=1, padding=0)[source]#
Module that extracts patches from tensors and stacks them.
In the simplest case, the output value of the operator with input size \((B, C, H, W)\) is \((B, N, C, H_{out}, W_{out})\).
- where
\(B\) is the batch size.
\(N\) denotes the total number of extracted patches stacked in left-right and top-bottom order.
\(C\) denotes the number of input channels.
\(H\), \(W\) denote the height and width of the input in pixels.
\(H_{out}\), \(W_{out}\) denote the patch size defined in the function signature.
window_size is the size of the sliding window and controls the shape of the output tensor; it defines the shape of the output patch.
stride controls the stride to apply to the sliding window and regulates the overlap between the extracted patches.
padding controls the amount of implicit zero-padding on both sides at each dimension.
The parameters window_size, stride and padding can be either:
a single int – in which case the same value is used for the height and width dimension.
a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.
padding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.
- Parameters:
window_size (int | Tuple[int, int]) – the size of the sliding window and the output patch size.
stride (int | Tuple[int, int] | None, optional) – stride of the sliding window. Default: 1
padding (int | Tuple[int, int] | Tuple[int, int, int, int] | None, optional) – Zero-padding added to both sides of the input. Default: 0
- Shape:
Input: \((B, C, H, W)\)
Output: \((B, N, C, H_{out}, W_{out})\)
- Returns:
the tensor with the extracted patches.
Examples
>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
- class kornia.contrib.CombineTensorPatches(original_size, window_size, unpadding=0)[source]#
Module that combines patches from tensors.
In the simplest case, the output value of the operator with input size \((B, N, C, H_{out}, W_{out})\) is \((B, C, H, W)\).
- where
\(B\) is the batch size.
\(N\) denotes the total number of extracted patches stacked in left-right and top-bottom order.
\(C\) denotes the number of input channels.
\(H\), \(W\) denote the height and width of the input in pixels.
\(H_{out}\), \(W_{out}\) denote the patch size defined in the function signature.
original_size is the size of the original image prior to extracting tensor patches and defines the shape of the output.
window_size is the size of the sliding window used while extracting tensor patches.
unpadding is the amount of padding to be removed. This value must be the same as the padding used while extracting tensor patches.
The parameters original_size, window_size, and unpadding can be either:
a single int – in which case the same value is used for the height and width dimension.
a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.
unpadding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.
- Parameters:
patches – patched tensor.
original_size (int | Tuple[int, int]) – the size of the original tensor and the output patch size.
window_size (int | Tuple[int, int]) – the size of the sliding window used.
unpadding (int | Tuple[int, int] | Tuple[int, int, int, int], optional) – remove the padding added to both sides of the input. Default: 0
- Shape:
Input: \((B, N, C, H_{out}, W_{out})\)
Output: \((B, C, H, W)\)
Example
>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])
Note
This module is supposed to be used in conjunction with ExtractTensorPatches.
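A sketch using the module interfaces instead of the functional ones, assuming non-overlapping patches (stride equal to the window size) so the round trip restores the input:

import torch
from kornia.contrib import CombineTensorPatches, ExtractTensorPatches

image = torch.arange(16.).view(1, 1, 4, 4)

extractor = ExtractTensorPatches(window_size=(2, 2), stride=(2, 2))
combiner = CombineTensorPatches(original_size=(4, 4), window_size=(2, 2))

patches = extractor(image)    # (1, 4, 1, 2, 2): four non-overlapping 2x2 patches
restored = combiner(patches)  # expected to be (1, 1, 4, 4) again
print(restored.shape)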
Image Classification#
- class kornia.contrib.VisionTransformer(image_size=224, patch_size=16, in_channels=3, embed_dim=768, depth=12, num_heads=12, dropout_rate=0.0, dropout_attn=0.0, backbone=None)[source]#
Vision transformer (ViT) module.
The module is expected to be used as an operator for different vision tasks.
The method is inspired from existing implementations of the paper [DBK+21].
Warning
This is an experimental API subject to changes in favor of flexibility.
- Parameters:
image_size (int, optional) – the size of the input image. Default: 224
patch_size (int, optional) – the size of the patch to compute the embedding. Default: 16
in_channels (int, optional) – the number of channels for the input. Default: 3
embed_dim (int, optional) – the embedding dimension inside the transformer encoder. Default: 768
depth (int, optional) – the depth of the transformer. Default: 12
num_heads (int, optional) – the number of attention heads. Default: 12
dropout_rate (float, optional) – dropout rate. Default: 0.0
dropout_attn (float, optional) – attention dropout rate. Default: 0.0
backbone (Module | None, optional) – an nn.Module to compute the image patches embeddings. Default: None
Example
>>> img = torch.rand(1, 3, 224, 224)
>>> vit = VisionTransformer(image_size=224, patch_size=16)
>>> vit(img).shape
torch.Size([1, 197, 768])
- class kornia.contrib.MobileViT(mode='xxs', in_channels=3, patch_size=(2, 2), dropout=0.0)[source]#
MobileViT module. Default arguments are for MobileViT XXS.
Paper: https://arxiv.org/abs/2110.02178 Based on: https://github.com/chinhsuanwu/mobilevit-pytorch
- Parameters:
mode (str, optional) – 'xxs', 'xs' or 's', defaults to 'xxs'. Default: 'xxs'
in_channels (int, optional) – the number of channels for the input image. Default: 3
patch_size (Tuple[int, int], optional) – image_size must be divisible by patch_size. Default: (2, 2)
dropout (float, optional) – dropout ratio in Transformer. Default: 0.0
Example
>>> img = torch.rand(1, 3, 256, 256)
>>> mvit = MobileViT(mode='xxs')
>>> mvit(img).shape
torch.Size([1, 320, 8, 8])
- class kornia.contrib.TinyViT(img_size=224, in_chans=3, num_classes=1000, embed_dims=[96, 192, 384, 768], depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_sizes=[7, 7, 14, 7], mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.0, use_checkpoint=False, mbconv_expand_ratio=4.0, local_conv_size=3, activation=nn.GELU, mobile_sam=False)[source]#
TinyViT model, as described in https://arxiv.org/abs/2207.10666
- Parameters:
img_size (optional) – Size of input image. Default:
224
in_chans (optional) – Number of input image’s channels. Default:
3
num_classes (optional) – Number of output classes. Default:
1000
embed_dims (optional) – List of embedding dimensions. Default:
[96, 192, 384, 768]
depths (optional) – List of block counts for each downsampling stage. Default:
[2, 2, 6, 2]
num_heads (optional) – List of attention heads used in self-attention for each downsampling stage. Default:
[3, 6, 12, 24]
window_sizes (optional) – List of self-attention’s window size for each downsampling stage. Default:
[7, 7, 14, 7]
mlp_ratio (optional) – Ratio of MLP dimension to embedding dimension in self-attention. Default:
4.0
drop_rate (optional) – Dropout rate. Default:
0.0
drop_path_rate (optional) – Stochastic depth rate. Default:
0.0
use_checkpoint (optional) – Whether to use activation checkpointing to trade compute for memory. Default:
False
mbconv_expand_ratio (optional) – Expansion ratio used in MBConv block. Default:
4.0
local_conv_size (optional) – Kernel size of convolution used in TinyViTBlock Default:
3
activation (optional) – activation function. Default:
nn.GELU
mobile_sam (optional) – Whether to use modifications for MobileSAM. Default: False
- forward(x)[source]#
Classify images if mobile_sam=False, produce feature maps if mobile_sam=True.
- Return type:
- static from_config(variant, pretrained=False, **kwargs)[source]#
Create a TinyViT model from pre-defined variants.
- Parameters:
variant (str) – TinyViT variant. Possible values: '5m', '11m', '21m'.
pretrained (bool | str, optional) – whether to use pre-trained weights. Possible values: False, True, 'in22k', 'in1k'. For TinyViT-21M (variant='21m'), 'in1k_384' and 'in1k_512' are also available. Default: False
**kwargs (Any) – other keyword arguments that will be passed to TinyViT.
- Return type:
Note
When img_size is different from the pre-trained size, bicubic interpolation will be performed on attention biases. When using pretrained=True, the ImageNet-1k checkpoint ('in1k') is used. For feature extraction or fine-tuning, the ImageNet-22k checkpoint ('in22k') is preferred.
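A short usage sketch; set pretrained=True (or one of the named checkpoints) to download weights instead of using random initialization:

import torch
from kornia.contrib import TinyViT

model = TinyViT.from_config('5m', pretrained=False)

img = torch.rand(1, 3, 224, 224)
logits = model(img)  # classification logits, since mobile_sam=False by default
print(logits.shape)  # expected torch.Size([1, 1000]) with the default num_classes=1000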
Image Stitching#
- class kornia.contrib.ImageStitcher(matcher, estimator='ransac', blending_method='naive')[source]#
Stitch two images with overlapping fields of view.
- Parameters:
matcher (Module) – image feature matching module.
estimator (str, optional) – method to compute homography, either "vanilla" or "ransac". "ransac" is slower but more accurate. Default: 'ransac'
blending_method (str, optional) – method to blend two images together. Only "naive" is currently supported. Default: 'naive'
Note
Current implementation requires strict image ordering from left to right.
import torch
import kornia as K
import kornia.feature as KF
import matplotlib.pyplot as plt
from kornia.contrib import ImageStitcher

# img_left, img_right: RGB image tensors on the GPU (not defined here).
IS = ImageStitcher(KF.LoFTR(pretrained='outdoor'), estimator='ransac').cuda()

# Compute the stitched result with less GPU memory cost.
with torch.inference_mode():
    out = IS(img_left, img_right)

# Show the result
plt.imshow(K.tensor_to_image(out))
Lambda#
- class kornia.contrib.Lambda(func)[source]#
Applies user-defined lambda as a transform.
- Parameters:
- Returns:
The output of the user-defined lambda.
Example
>>> import kornia
>>> x = torch.rand(1, 3, 5, 5)
>>> f = Lambda(lambda x: kornia.color.rgb_to_grayscale(x))
>>> f(x).shape
torch.Size([1, 1, 5, 5])
Distance Transform#
- kornia.contrib.distance_transform(image, kernel_size=3, h=0.35)[source]#
Approximates the Manhattan distance transform of images using cascaded convolution operations.
The value at each pixel in the output represents the distance to the nearest non-zero pixel in the input image. It uses the method described in [PDP20]. The transformation is applied independently across the channel dimension of the images.
- Parameters:
- Return type:
- Returns:
tensor with shape \((B,C,H,W)\).
Example
>>> tensor = torch.zeros(1, 1, 5, 5)
>>> tensor[:,:, 1, 2] = 1
>>> dt = kornia.contrib.distance_transform(tensor)
- kornia.contrib.diamond_square(output_size, roughness=0.5, random_scale=1.0, random_fn=torch.rand, normalize_range=None, device=None, dtype=None)[source]#
Generates Plasma Fractal Images using the diamond square algorithm.
See: https://en.wikipedia.org/wiki/Diamond-square_algorithm
- Parameters:
output_size (Tuple[int, int, int, int]) – a tuple of integers with the BxCxHxW of the image to be generated.
roughness (float | Tensor, optional) – the scale value to apply at each recursion step. Default: 0.5
random_scale (float | Tensor, optional) – the initial value of the scale for recursion. Default: 1.0
random_fn (Callable[..., Tensor], optional) – the callable function to use to sample a random tensor. Default: torch.rand
normalize_range (Tuple[float, float] | None, optional) – whether to min-max normalize the output map. If a range is specified, min-max normalization is applied between the provided bounds. Default: None
device (torch.device | None, optional) – the torch device to place the output map on. Default: None
dtype (torch.dtype | None, optional) – the torch dtype of the output map. Default: None
- Return type:
- Returns:
A tensor with shape \((B,C,H,W)\) containing the fractal image.
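A short usage sketch with illustrative sizes:

import torch
from kornia.contrib import diamond_square

# Generate two single-channel 64x64 plasma fractal maps, min-max normalized to [0, 1].
maps = diamond_square(
    output_size=(2, 1, 64, 64),
    roughness=0.5,
    normalize_range=(0.0, 1.0),
)
print(maps.shape)  # torch.Size([2, 1, 64, 64])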