kornia.contrib#
Models#
Base#
- class kornia.contrib.models.base.ModelBase(*args, **kwargs)#
Abstract model class with some utility functions.
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options={}, disable=False)#
Compile this Module’s forward using torch.compile().
This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().
See torch.compile() for details on the arguments for this function.
- abstract static from_config(config)#
This function should build/load the model.
EfficientViT#
- class kornia.contrib.models.efficient_vit.EfficientViT(backbone)#
EfficientViT backbone model.
- __init__(backbone)#
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(images)#
Extract features from the input images.
- static from_config(config)#
Build the EfficientViT model from a configuration object.
- Parameters:
config (EfficientViTConfig) – EfficientViT configuration object. See EfficientViTConfig.
- Returns:
the EfficientViT model.
- Return type:
- class kornia.contrib.models.efficient_vit.EfficientViTConfig(checkpoint=<factory>)#
Configuration to construct EfficientViT model.
Model weights can be loaded from a checkpoint URL or local path. The model weights are hosted on HuggingFace’s model hub: https://huggingface.co/kornia.
- Parameters:
checkpoint (str, optional) – URL or local path of model weights. Default: <factory>
- classmethod from_pretrained(model_type, resolution)#
Return a configuration object from a pre-trained model.
- Parameters:
- Return type:
Backbones#
- class kornia.contrib.models.efficient_vit.backbone.EfficientViTBackbone(width_list, depth_list, in_channels=3, dim=32, expand_ratio=4, norm='bn2d', act_func='hswish')#
- static build_local_block(in_channels, out_channels, stride, expand_ratio, norm, act_func, fewer_norm=False)#
- Return type:
- forward(x)#
Define the computation performed at every call.
Should be overridden by all subclasses.
- Return type:
dict[str, Tensor]
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_b0(**kwargs)#
- Return type:
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_b1(**kwargs)#
- Return type:
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_b2(**kwargs)#
- Return type:
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_b3(**kwargs)#
- Return type:
- class kornia.contrib.models.efficient_vit.backbone.EfficientViTLargeBackbone(width_list, depth_list, in_channels=3, qkv_dim=32, norm='bn2d', act_func='gelu')#
- static build_local_block(stage_id, in_channels, out_channels, stride, expand_ratio, norm, act_func, fewer_norm=False)#
- Return type:
- forward(x)#
Define the computation performed at every call.
Should be overridden by all subclasses.
- Return type:
dict[str, Tensor]
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_l0(**kwargs)#
- Return type:
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_l1(**kwargs)#
- Return type:
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_l2(**kwargs)#
- Return type:
- kornia.contrib.models.efficient_vit.backbone.efficientvit_backbone_l3(**kwargs)#
- Return type:
Structures#
- class kornia.contrib.models.SegmentationResults(logits, scores, mask_threshold=0.0)#
Encapsulate the results obtained by a Segmentation model.
- Parameters:
- property binary_masks: Tensor#
Binary mask generated from logits considering the mask_threshold.
The shape is the same as that of the logits, \((B, C, H, W)\), where \(C\) is the number of masks predicted.
Note
If you run original_res_logits, this will generate the masks based on the original resolution logits. Otherwise, this will use the low resolution logits (self.logits).
- original_res_logits(input_size, original_size, image_size_encoder)#
Remove padding and upscale the logits to the original image size.
Resize to image encoder input -> remove padding (bottom and right) -> Resize to original size
Note
This method sets an internal original_res_logits, which is used for the binary masks when available.
- Parameters:
input_size (tuple[int, int]) – The size of the image input to the model, in (H, W) format. Used to remove padding.
original_size (tuple[int, int]) – The original size of the image before resizing for input to the model, in (H, W) format.
image_size_encoder (Optional[tuple[int, int]]) – The size of the input image for the image encoder, in (H, W) format. Used to resize the logits back to encoder resolution before removing the padding.
- Return type:
- Returns:
Batched logits in \((K, C, H, W)\) format, where (H, W) is given by original_size.
- squeeze(dim=0)#
Squeeze the given dimension for all properties.
- Return type:
- class kornia.contrib.models.Prompts(points=None, boxes=None, masks=None)#
Encapsulate the prompts inputs for a Model.
- Parameters:
points (Optional[tuple[Tensor, Tensor]], optional) – A tuple with the keypoints (coordinates x, y) and their respective labels. Shape \((K, N, 2)\) for the keypoints, and \((K, N)\) for the labels. Default: None
boxes (Optional[Tensor], optional) – Batched box inputs, with shape \((K, 4)\). Expected to be in xyxy format. Default: None
masks (Optional[Tensor], optional) – Batched mask prompts to the model with shape \((K, 1, H, W)\). Default: None
VisualPrompter#
- class kornia.contrib.visual_prompter.VisualPrompter(config=SamConfig(model_type='vit_h', pretrained=True), device=None, dtype=None)#
This class allows the user to run multiple queries with multiple prompts for a model.
At the moment, only the SAM model is supported. The model is loaded based on the given config.
By default, the images are resized so that their long side matches image_encoder.img_size. This Prompter class takes care of transforming the images and the prompts before prediction. The image is also passed automatically through the method preprocess_image, which normalizes the image and pads it to the size expected by the SAM model, \((\text{image_encoder.img_size}, \text{image_encoder.img_size})\). By default, the image is normalized with the mean and standard deviation of the SAM dataset.
- Parameters:
config (SamConfig, optional) – A model config to generate the model. Now just the SAM model is supported. Default: SamConfig(model_type="vit_h", pretrained=True)
device (Optional[device], optional) – The desired device to use the model. Default: None
dtype (Optional[dtype], optional) – The desired dtype to use the model. Default: None
Example
>>> # prompter = VisualPrompter() # Will load the vit h for default
>>> # You can load a custom SAM type for modifying the config
>>> prompter = VisualPrompter(SamConfig('vit_b'))
>>> image = torch.rand(3, 25, 30)
>>> prompter.set_image(image)
>>> boxes = Boxes(
...     torch.tensor(
...         [[[[0, 0], [0, 10], [10, 0], [10, 10]]]],
...         device=prompter.device,
...         dtype=torch.float32
...     ),
...     mode='xyxy'
... )
>>> prediction = prompter.predict(boxes=boxes)
>>> prediction.logits.shape
torch.Size([1, 3, 256, 256])
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options={}, disable=False)#
Apply the torch.compile(…)/dynamo API to the VisualPrompter API.
Note
For more information about the dynamo API check the official docs https://pytorch.org/docs/stable/generated/torch.compile.html
- Parameters:
fullgraph (bool, optional) – Whether it is ok to break model into several subgraphs. Default: False
dynamic (bool, optional) – Use dynamic shape tracing. Default: False
backend (str, optional) – backend to be used. Default: "inductor"
mode (Optional[str], optional) – Can be either “default”, “reduce-overhead” or “max-autotune”. Default: None
options (dict[Any, Any], optional) – A dictionary of options to pass to the backend. Default: {}
disable (bool, optional) – Turn torch.compile() into a no-op for testing. Default: False
- Return type:
Example
>>> # prompter = VisualPrompter()
>>> # prompter.compile() # You should have torch >= 2.0.0 installed
>>> # Use the prompter methods ...
- predict(keypoints=None, keypoints_labels=None, boxes=None, masks=None, multimask_output=True, output_original_size=True)#
Predict masks for the given image based on the input prompts.
- Parameters:
keypoints (Union[Keypoints, Tensor, None], optional) – Point prompts to the model. Each point is in (X,Y) in pixels. Shape \((K, N, 2)\), where N is the number of points and K the number of prompts. Default: None
keypoints_labels – Labels for the point prompts. 1 indicates a foreground point and 0 indicates a background point. Shape \((K, N)\), where N is the number of points and K the number of prompts.
boxes (Union[Boxes, Tensor, None], optional) – A box prompt to the model. If a tensor, should be in xyxy mode. Shape \((K, 4)\). Default: None
masks (Optional[Tensor], optional) – A low resolution mask input to the model, typically coming from a previous prediction iteration. Has shape \((K, 1, H, W)\), where for SAM, H=W=256. Default: None
multimask_output (bool, optional) – If true, the model will return three masks. For ambiguous input prompts (such as a single click), this will often produce better masks than a single prediction. If only a single mask is needed, the model’s predicted quality score can be used to select the best mask. For non-ambiguous prompts, such as multiple input prompts, multimask_output=False can give better results. Default: True
output_original_size (bool, optional) – If true, the logits of SegmentationResults will be post-processed to match the original input image size. Default: True
- Return type:
- Returns:
A prediction with the logits and scores (IoU of each predicted mask)
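As a rough illustration of the multimask_output advice above, the sketch below picks, for each prompt, the candidate mask with the highest predicted quality score. It uses plain Python lists in place of a \((K, C)\) score tensor, and `select_best_mask` is a hypothetical helper, not part of kornia.

```python
def select_best_mask(scores):
    """Return, for each prompt, the index of the highest-scoring mask.

    `scores` stands in for a (K, C) score tensor: one row per prompt,
    one predicted quality score per candidate mask.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Two prompts, three candidate masks each (as with multimask_output=True).
scores = [[0.71, 0.93, 0.55], [0.88, 0.42, 0.60]]
best = select_best_mask(scores)
print(best)  # [1, 0]
```

With real tensors the same selection would typically be an argmax over the mask dimension of the scores.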
- preprocess_image(x, mean=None, std=None)#
Normalize and pad a tensor.
Normalization: the mean and std passed as arguments take priority; if they are None, the default SAM dataset values are used.
Padding: the tensor is padded on the right and bottom to match self.model.image_encoder.img_size.
- Parameters:
- Return type:
- Returns:
The preprocessed image (normalized if mean and std are available, and padded to the encoder size).
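The resize-then-pad geometry described for the prompter can be sketched with plain integer arithmetic. This is only an illustration under stated assumptions: the helper names are hypothetical, the rounding rule is a guess, and the encoder size of 1024 is assumed rather than taken from this reference (the real value is self.model.image_encoder.img_size).

```python
def resize_longest_side(height, width, target):
    """Scale (height, width) so that the longer side equals `target`."""
    scale = target / max(height, width)
    # Round to the nearest integer pixel count (assumed rounding rule).
    return int(height * scale + 0.5), int(width * scale + 0.5)


def bottom_right_padding(height, width, target):
    """Padding (bottom, right) needed to reach a target x target square."""
    return target - height, target - width


# Assumed encoder size; the real value is model.image_encoder.img_size.
IMG_SIZE = 1024

new_h, new_w = resize_longest_side(25, 30, IMG_SIZE)
pad_bottom, pad_right = bottom_right_padding(new_h, new_w, IMG_SIZE)
print((new_h, new_w), (pad_bottom, pad_right))  # (853, 1024) (171, 0)
```

Padding only on the bottom and right keeps the top-left origin fixed, so prompt coordinates only need to be scaled, not shifted.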
- preprocess_prompts(keypoints=None, keypoints_labels=None, boxes=None, masks=None)#
Validate and preprocess the given prompts to be aligned with the input image.
- Return type:
- set_image(image, mean=None, std=None)#
Set the embeddings from the given image with the image_encoder of the model.
Prepare the given image with the selected transforms and the preprocess method.
Edge Detection#
- class kornia.contrib.EdgeDetector#
Detect edges in a given image using a CNN.
By default, it uses the method described in [SRS20].
- Returns:
A tensor of shape \((B,1,H,W)\).
Example
>>> img = torch.rand(1, 3, 320, 320)
>>> detect = EdgeDetector()
>>> out = detect(img)
>>> out.shape
torch.Size([1, 1, 320, 320])
Face Detection#
- class kornia.contrib.FaceDetector(top_k=5000, confidence_threshold=0.3, nms_threshold=0.3, keep_top_k=750)#
Detect faces in a given image using a CNN.
By default, it uses the method described in [FYP+21].
- Parameters:
top_k (int, optional) – the maximum number of detections to return before the nms. Default: 5000
confidence_threshold (float, optional) – the threshold used to discard detections. Default: 0.3
nms_threshold (float, optional) – the threshold used by the nms for iou. Default: 0.3
keep_top_k (int, optional) – the maximum number of detections to return after the nms. Default: 750
- Returns:
A list of B tensors with shape \((N,15)\) to be used with kornia.contrib.FaceDetectorResult.
Example
>>> img = torch.rand(1, 3, 320, 320)
>>> detect = FaceDetector()
>>> res = detect(img)
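To make the roles of top_k, confidence_threshold, nms_threshold and keep_top_k concrete, here is a minimal plain-Python sketch of confidence filtering followed by greedy non-maximum suppression. It shows the generic technique, not kornia's implementation; `iou` and `filter_detections` are hypothetical helpers, and boxes are assumed to be in (x1, y1, x2, y2) format.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, top_k=5000, confidence_threshold=0.3,
                      nms_threshold=0.3, keep_top_k=750):
    """Return indices of detections that survive thresholding and greedy NMS."""
    # 1. Discard low-confidence detections.
    order = [i for i, s in enumerate(scores) if s > confidence_threshold]
    # 2. Keep at most top_k candidates, best first, before NMS.
    order = sorted(order, key=lambda i: scores[i], reverse=True)[:top_k]
    # 3. Greedy NMS: drop a box if it overlaps an already kept box too much.
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in kept):
            kept.append(i)
    # 4. Keep at most keep_top_k detections after NMS.
    return kept[:keep_top_k]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(filter_detections(boxes, scores))  # [0, 2]
```

In this example the second box is suppressed because it overlaps the first one heavily, and the fourth is discarded for low confidence.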
- class kornia.contrib.FaceKeypoint(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#
Define the keypoints detected in a face.
The left/right convention is based on the screen viewer.
- EYE_LEFT = 0#
- EYE_RIGHT = 1#
- MOUTH_LEFT = 3#
- MOUTH_RIGHT = 4#
- NOSE = 2#
- class kornia.contrib.FaceDetectorResult(data)#
Encapsulate the results obtained by the kornia.contrib.FaceDetector.
- Parameters:
data (Tensor) – the encoded results coming from the feature detector with shape \((14,)\).
- property bottom_right: Tensor#
The [x y] position of the bottom-right coordinate of the bounding box.
- get_keypoint(keypoint)#
The [x y] position of a given facial keypoint.
- Parameters:
keypoint (FaceKeypoint) – the keypoint type whose position is returned.
- Return type:
- to(device=None, dtype=None)#
Like the torch.nn.Module.to() method.
- Return type:
Interactive Demo#
Visit the Kornia face detection demo on Hugging Face Spaces.
Object Detection#
- class kornia.contrib.object_detection.BoundingBoxDataFormat(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#
Enum class that maps bounding box data format.
- XYWH = 0#
- XYXY = 1#
- CXCYWH = 2#
- CENTER_XYWH = 2#
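The enum members follow common naming conventions for box layouts; note that CXCYWH and CENTER_XYWH share the same value. The conversions below are a minimal sketch in plain Python, assuming XYWH means (top-left x, top-left y, width, height), XYXY means (x1, y1, x2, y2) and CXCYWH means (center x, center y, width, height). The helper functions are hypothetical and not part of kornia.

```python
def xywh_to_xyxy(box):
    """(x, y, w, h) -> (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """(x1, y1, x2, y2) -> (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


def cxcywh_to_xywh(box):
    """(cx, cy, w, h) -> (x, y, w, h), moving from the center to the top-left corner."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, w, h)


def xywh_to_cxcywh(box):
    """(x, y, w, h) -> (cx, cy, w, h), moving from the top-left corner to the center."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2, w, h)


print(xywh_to_xyxy((10, 20, 30, 40)))  # (10, 20, 40, 60)
```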
- class kornia.contrib.object_detection.BoundingBox(data, data_format)#
Bounding box data class.
Useful for representing bounding boxes in different formats for object detection.
- Parameters:
- data_format: BoundingBoxDataFormat#
- class kornia.contrib.object_detection.ObjectDetectorResult(class_id, confidence, bbox)#
Object detection result.
- Parameters:
class_id (int) – class id of the detected object.
confidence (float) – confidence score of the detected object.
bbox (BoundingBox) – bounding box of the detected object in xywh format.
- bbox: BoundingBox#
- class kornia.contrib.object_detection.ObjectDetector(model, pre_processor, post_processor)#
This class wraps an object detection model and performs pre-processing and post-processing.
- __init__(model, pre_processor, post_processor)#
Construct an Object Detector object.
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options=None, disable=False)#
Compile the internal object detection model with torch.compile().
- Return type:
- forward(images)#
Detect objects in a given list of images.
- Parameters:
images (list[Tensor]) – list of RGB images. Each image is a Tensor with shape \((3, H, W)\).
- Return type:
- Returns:
list of detections found in each image. For each item in the batch, the shape is \((D, 6)\), where \(D\) is the number of detections in the given image and \(6\) represents class id, score, and xywh bounding box.
- class kornia.contrib.object_detection.ResizePreProcessor(size, interpolation_mode='bilinear')#
This module resizes a list of image tensors to the given size.
Additionally, also returns the original image sizes for further post-processing.
- forward(imgs)#
Define the computation performed at every call.
Should be overridden by all subclasses.
- Return type:
tuple[Tensor, list[ImageSize]]
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- kornia.contrib.object_detection.results_from_detections(detections, format)#
Convert a detection tensor to a list of ObjectDetectorResult.
- Parameters:
detections (Tensor) – tensor with shape \((D, 6)\), where \(D\) is the number of detections in the given image, \(6\) represents class id, score, and xywh bounding box.
- Return type:
- Returns:
list of ObjectDetectorResult.
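The \((D, 6)\) row layout described above can be unpacked as in the following sketch. It uses plain lists and two hypothetical dataclasses (`Box`, `Detection`) as stand-ins for BoundingBox and ObjectDetectorResult, so it illustrates the data layout only and does not reproduce kornia's function.

```python
from dataclasses import dataclass


@dataclass
class Box:  # hypothetical stand-in for BoundingBox
    data: tuple
    data_format: str


@dataclass
class Detection:  # hypothetical stand-in for ObjectDetectorResult
    class_id: int
    confidence: float
    bbox: Box


def rows_to_results(detections):
    """Convert (D, 6) rows of [class_id, score, x, y, w, h] into objects."""
    results = []
    for class_id, score, x, y, w, h in detections:
        box = Box((x, y, w, h), "xywh")
        results.append(Detection(int(class_id), float(score), box))
    return results


rows = [[1.0, 0.92, 10.0, 20.0, 30.0, 40.0], [3.0, 0.55, 5.0, 5.0, 8.0, 12.0]]
results = rows_to_results(rows)
print(results[0].class_id, results[0].bbox.data)  # 1 (10.0, 20.0, 30.0, 40.0)
```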
Real-Time Detection Transformer (RT-DETR)#
- class kornia.contrib.models.rt_detr.RTDETRModelType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#
Enum class that maps RT-DETR model type.
- resnet18d = 0#
- resnet34d = 1#
- resnet50d = 2#
- resnet101d = 3#
- hgnetv2_l = 4#
- hgnetv2_x = 5#
- class kornia.contrib.models.rt_detr.RTDETRConfig(model_type, num_classes, checkpoint=None, neck_hidden_dim=None, neck_dim_feedforward=None, neck_expansion=None, head_hidden_dim=256, head_num_queries=300, head_num_decoder_layers=None, confidence_threshold=0.3)#
Configuration to construct RT-DETR model.
- Parameters:
model_type (RTDETRModelType | str | int) – model variant. Available models are:
ResNetD-18: 0, 'resnet18d' or RTDETRModelType.resnet18d
ResNetD-34: 1, 'resnet34d' or RTDETRModelType.resnet34d
ResNetD-50: 2, 'resnet50d' or RTDETRModelType.resnet50d
ResNetD-101: 3, 'resnet101d' or RTDETRModelType.resnet101d
HGNetV2-L: 4, 'hgnetv2_l' or RTDETRModelType.hgnetv2_l
HGNetV2-X: 5, 'hgnetv2_x' or RTDETRModelType.hgnetv2_x
num_classes (int) – number of classes.
checkpoint (Optional[str], optional) – URL or local path of model weights. Default: None
neck_hidden_dim (Optional[int], optional) – hidden dim for neck. Default: None
neck_dim_feedforward (Optional[int], optional) – feed-forward network dim for neck. Default: None
neck_expansion (Optional[float], optional) – expansion ratio for neck. Default: None
head_hidden_dim (int, optional) – hidden dim for head. Default: 256
head_num_queries (int, optional) – number of queries for Deformable DETR transformer decoder. Default: 300
head_num_decoder_layers (Optional[int], optional) – number of decoder layers for Deformable DETR transformer decoder. Default: None
- model_type: RTDETRModelType | str | int#
- class kornia.contrib.models.rt_detr.RTDETR(backbone, neck, head)#
RT-DETR Object Detection model, as described in https://arxiv.org/abs/2304.08069.
- __init__(backbone, neck, head)#
Construct RT-DETR Object Detection model.
- Parameters:
backbone (ResNetD | PPHGNetV2) – backbone network for feature extraction.
neck (HybridEncoder) – neck network for feature fusion.
head (RTDETRHead) – head network to decode features into detection results.
- forward(images)#
Detect objects in an image.
- Parameters:
images (Tensor) – images to be detected. Shape \((N, C, H, W)\).
- Return type:
- Returns:
logits - Tensor of shape \((N, Q, K)\), where \(Q\) is the number of queries, \(K\) is the number of classes.
boxes - Tensor of shape \((N, Q, 4)\), where \(Q\) is the number of queries.
- static from_config(config)#
Construct RT-DETR Object Detection model from a config object.
- Parameters:
config (RTDETRConfig) – configuration object for RT-DETR.
- Return type:
Note
For config.neck_hidden_dim, config.neck_dim_feedforward, config.neck_expansion, and config.head_num_decoder_layers, if they are None, their values will be replaced with the default values depending on the config.model_type. See the source code for the default values.
- class kornia.contrib.models.rt_detr.DETRPostProcessor(confidence_threshold)#
- forward(logits, boxes, original_sizes)#
Post-process outputs from DETR.
- Parameters:
logits (Tensor) – tensor with shape \((N, Q, K)\), where \(N\) is the batch size, \(Q\) is the number of queries, \(K\) is the number of classes.
boxes (Tensor) – tensor with shape \((N, Q, 4)\), where \(N\) is the batch size, \(Q\) is the number of queries.
original_sizes (list[ImageSize]) – list of tuples, where each tuple represents (img_height, img_width).
- Return type:
- Returns:
Processed detections. For each image, the detections have shape (D, 6), where D is the number of detections in that image and 6 represents (class_id, confidence_score, x, y, w, h).
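The shapes above suggest a per-query pipeline: score each class, keep the best class, drop low-confidence queries, and rescale the box to the original image. The sketch below illustrates that idea in plain Python. It rests on assumptions that this reference does not state: scores are taken as sigmoid activations of the logits, and boxes are assumed to be (cx, cy, w, h) normalized to [0, 1]. The function `postprocess` is hypothetical.

```python
import math


def postprocess(logits, boxes, original_sizes, confidence_threshold=0.3):
    """Build rows of (class_id, confidence_score, x, y, w, h) for each image.

    Assumptions (not stated in the reference): scores are sigmoid activations
    of the logits, and boxes are (cx, cy, w, h) normalized to [0, 1].
    """
    detections = []
    for img_logits, img_boxes, (img_h, img_w) in zip(logits, boxes, original_sizes):
        rows = []
        for query_logits, (cx, cy, w, h) in zip(img_logits, img_boxes):
            scores = [1.0 / (1.0 + math.exp(-v)) for v in query_logits]
            class_id = max(range(len(scores)), key=scores.__getitem__)
            score = scores[class_id]
            if score < confidence_threshold:
                continue
            # Convert the normalized center format to pixel xywh.
            rows.append((class_id, score,
                         (cx - w / 2) * img_w, (cy - h / 2) * img_h,
                         w * img_w, h * img_h))
        detections.append(rows)
    return detections


logits = [[[-4.0, 2.0], [-3.0, -5.0]]]  # N=1 image, Q=2 queries, K=2 classes
boxes = [[[0.5, 0.5, 0.25, 0.5], [0.1, 0.1, 0.1, 0.1]]]
out = postprocess(logits, boxes, original_sizes=[(100, 200)])
print(out[0][0][2:])  # (75.0, 25.0, 50.0, 50.0)
```

Only the first query survives here; the second has a best-class score below the threshold.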
Image Segmentation#
- kornia.contrib.connected_components(image, num_iterations=100)#
Computes the Connected-component labelling (CCL) algorithm.
The implementation is an adaptation of the following repository:
https://gist.github.com/efirdc/5d8bd66859e574c683a504a4690ae8bc
Warning
This is an experimental API subject to changes and optimization improvements.
Note
See a working example here.
- Parameters:
- Return type:
- Returns:
The label image, with the same shape as the input image.
Example
>>> img = torch.rand(2, 1, 4, 5)
>>> img_labels = connected_components(img, num_iterations=100)
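To show why num_iterations matters, here is a plain-Python sketch of iterative label propagation on a single binary grid. It assumes a max-propagation scheme over 3x3 neighbourhoods (8-connectivity) and is meant as an illustration of the idea, not as kornia's tensor implementation. `label_components` is a hypothetical helper.

```python
def label_components(mask, num_iterations=100):
    """Label 8-connected foreground regions of a binary 2D grid.

    Each foreground pixel starts with a unique label; on every iteration it
    takes the largest label found in its 3x3 neighbourhood. Background stays 0.
    """
    height, width = len(mask), len(mask[0])
    labels = [[(r * width + c + 1) if mask[r][c] else 0 for c in range(width)]
              for r in range(height)]
    for _ in range(num_iterations):
        updated = [row[:] for row in labels]
        for r in range(height):
            for c in range(width):
                if not mask[r][c]:
                    continue
                for rr in range(max(0, r - 1), min(height, r + 2)):
                    for cc in range(max(0, c - 1), min(width, c + 2)):
                        updated[r][c] = max(updated[r][c], labels[rr][cc])
        labels = updated
    return labels


mask = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
labels = label_components(mask, num_iterations=10)
for row in labels:
    print(row)
# [7, 7, 0, 0, 0]
# [0, 7, 0, 0, 15]
# [0, 0, 0, 15, 15]
```

A label can travel at most one pixel per iteration, so long or winding components need more iterations before every pixel carries the same label.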
Segment Anything (SAM)#
- class kornia.contrib.models.sam.SamModelType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#
Map the SAM model types.
- vit_h = 0#
- vit_l = 1#
- vit_b = 2#
- mobile_sam = 3#
- class kornia.contrib.models.sam.SamConfig(model_type=None, checkpoint=None, pretrained=False, encoder_embed_dim=None, encoder_depth=None, encoder_num_heads=None, encoder_global_attn_indexes=None)#
Encapsulate the Config to build a SAM model.
- Parameters:
model_type (Union[str, int, SamModelType, None], optional) – the model type. Default: None. The available models are:
0, ‘vit_h’ or kornia.contrib.sam.SamModelType.vit_h()
1, ‘vit_l’ or kornia.contrib.sam.SamModelType.vit_l()
2, ‘vit_b’ or kornia.contrib.sam.SamModelType.vit_b()
3, ‘mobile_sam’, or kornia.contrib.sam.SamModelType.mobile_sam()
checkpoint (Optional[str], optional) – URL or a path for a file with the weights of the model. Default: None
encoder_embed_dim (Optional[int], optional) – Patch embedding dimension. Default: None
encoder_depth (Optional[int], optional) – Depth of ViT. Default: None
encoder_num_heads (Optional[int], optional) – Number of attention heads in each ViT block. Default: None
encoder_global_attn_indexes (Optional[tuple[int, ...]], optional) – Encoder indexes for blocks using global attention. Default: None
- class kornia.contrib.models.sam.Sam(image_encoder, prompt_encoder, mask_decoder)#
- __init__(image_encoder, prompt_encoder, mask_decoder)#
SAM predicts object masks from an image and input prompts.
- Parameters:
image_encoder (ImageEncoderViT | TinyViT) – The backbone used to encode the image into image embeddings that allow for efficient mask prediction.
prompt_encoder (PromptEncoder) – Encodes various types of input prompts.
mask_decoder (MaskDecoder) – Predicts masks from the image embeddings and encoded prompts.
- forward(images, batched_prompts, multimask_output)#
Predicts masks end-to-end from provided images and prompts.
This method expects that the images have already been pre-processed, at least been normalized, resized and padded to be compatible with the self.image_encoder.
Note
For each image \((3, H, W)\), it is possible to input a batch (\(K\)) of \(N\) prompts; the results are batched by the number of prompt batches. So given a prompt with \(K=5\) and \(N=10\), the results will look like \(5xCxHxW\), where \(C\) is determined by multimask_output. Within each of these masks \((5xC)\), it should be possible to find \(N\) instances if the model succeeds.
- Parameters:
images (Tensor) – The image as a torch tensor in \((B, 3, H, W)\) format, already transformed for input to the model.
batched_prompts (list[dict[str, Any]]) – A list over the batch of images (list length should be \(B\)), each a dictionary with the following keys. If it does not have the respective prompt, it should not be included in this dictionary. The options are:
- "points": tuple of (Tensor, Tensor) with the keypoint coordinates and their respective labels. The tuple should look like (keypoints, labels), where:
The keypoints (a tensor) are batched point prompts for this image, with shape \((K, N, 2)\), already transformed to the input frame of the model.
The labels (a tensor) are batched labels for the point prompts, with shape \((K, N)\), where 1 indicates a foreground point and 0 indicates a background point.
- "boxes": (Tensor) Batched box inputs, with shape \((K, 4)\). Already transformed to the input frame of the model.
- "mask_inputs": (Tensor) Batched mask inputs to the model, in the form \((K, 1, H, W)\).
multimask_output (bool) – Whether the model should predict multiple disambiguating masks, or return a single mask.
- Return type:
- Returns:
A list over input images, where each element is a SegmentationResults with the following:
- logits: Low resolution logits with shape \((K, C, H, W)\). Can be passed as mask input to subsequent iterations of prediction. Here \(K\) is the number of input prompts, \(C\) is determined by multimask_output, and \(H=W=256\) is the model output size.
- scores: The model’s predictions of mask quality (iou prediction), in shape BxC.
- static from_config(config)#
Build/load the SAM model based on its config.
- Parameters:
config (SamConfig) – The SamConfig data structure. If the model_type is available, build from it, otherwise will use the parameters set.
- Return type:
- Returns:
The respective SAM model
Example
>>> from kornia.contrib.models.sam import SamConfig
>>> sam_model = Sam.from_config(SamConfig('vit_b'))
Image Patches#
- kornia.contrib.compute_padding(original_size, window_size, stride=None)#
Compute the padding required so that chaining extract_tensor_patches() and combine_tensor_patches() produces the expected result.
- Parameters:
original_size (Union[int, Tuple[int, int]]) – the size of the original tensor.
window_size (Union[int, Tuple[int, int]]) – the size of the sliding window used while extracting patches.
stride (Union[int, Tuple[int, int], None], optional) – The stride of the sliding window. Optional: if not specified, window_size will be used. Default: None
- Returns:
The required padding as a tuple of four ints: (top, bottom, left, right).
Example
>>> image = torch.arange(12).view(1, 1, 4, 3)
>>> padding = compute_padding((4,3), (3,3))
>>> out = extract_tensor_patches(image, window_size=(3, 3), stride=(3, 3), padding=padding)
>>> combine_tensor_patches(out, original_size=(4, 3), window_size=(3, 3), stride=(3, 3), unpadding=padding)
tensor([[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8],
          [ 9, 10, 11]]]])
Note
This function will be implicitly used in extract_tensor_patches() and combine_tensor_patches() if allow_auto_(un)padding is set to True.
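One plausible way to derive such a padding is to pad each axis until the sliding window tiles it exactly, then split the total between the two sides. The sketch below follows that reasoning in plain Python for tuple inputs only; `padding_for_axis` and `sketch_compute_padding` are hypothetical names, and the split rule (extra pixel on the bottom/right) is an assumption rather than a statement about kornia's code.

```python
def padding_for_axis(size, window, stride):
    """Total padding so that windows of `window` with `stride` tile `size` exactly."""
    remainder = (size - window) % stride
    return 0 if remainder == 0 else stride - remainder


def sketch_compute_padding(original_size, window_size, stride=None):
    """Return (top, bottom, left, right) padding; stride defaults to window_size."""
    stride = window_size if stride is None else stride
    vertical = padding_for_axis(original_size[0], window_size[0], stride[0])
    horizontal = padding_for_axis(original_size[1], window_size[1], stride[1])
    # Split each total as evenly as possible; the extra pixel goes to bottom/right.
    top, left = vertical // 2, horizontal // 2
    return (top, vertical - top, left, horizontal - left)


print(sketch_compute_padding((4, 3), (3, 3)))  # (1, 1, 0, 0)
```

For a 4x3 input and a 3x3 window with stride 3, the height is padded from 4 to 6 so that two windows fit exactly, while the width already fits.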
- kornia.contrib.extract_tensor_patches(input, window_size, stride=1, padding=0, allow_auto_padding=False)#
Function that extracts patches from tensors and stacks them.
See ExtractTensorPatches for details.
- Parameters:
input (Tensor) – tensor image where to extract the patches with shape \((B, C, H, W)\).
window_size (Union[int, Tuple[int, int]]) – the size of the sliding window and the output patch size.
stride (Union[int, Tuple[int, int]], optional) – stride of the sliding window. Default: 1
padding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – Zero-padding added to both sides of the input. Default: 0
allow_auto_padding – whether to allow automatic padding if the window and stride do not fit into the image.
- Return type:
- Returns:
the tensor with the extracted patches with shape \((B, N, C, H_{out}, W_{out})\).
Examples
>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
- kornia.contrib.combine_tensor_patches(patches, original_size, window_size, stride, allow_auto_unpadding=False, unpadding=0, eps=1e-8)#
Restore input from patches.
See CombineTensorPatches for details.
- Parameters:
patches (Tensor) – patched tensor with shape \((B, N, C, H_{out}, W_{out})\).
original_size (Union[int, Tuple[int, int]]) – the size of the original tensor and the output size.
window_size (Union[int, Tuple[int, int]]) – the size of the sliding window used while extracting patches.
stride (Union[int, Tuple[int, int]]) – stride of the sliding window.
unpadding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – remove the padding added to both sides of the input. Default: 0
allow_auto_unpadding (bool, optional) – whether to allow automatic unpadding of the input if the window and stride do not fit into the original_size. Default: False
eps (float, optional) – small value used to prevent division by zero. Default: 1e-8
- Return type:
- Returns:
The combined patches in an image tensor with shape \((B, C, H, W)\).
Example
>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])
Note
This function is supposed to be used in conjunction with extract_tensor_patches().
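The eps parameter hints that overlapping patches are normalized by how often each position is covered. A one-dimensional plain-Python sketch of that idea follows: patches are summed back into place, a coverage count is kept, and the sum is divided by the count plus eps. This is an assumed reading of the behaviour, and `combine_1d` is a hypothetical helper, not kornia's code.

```python
def combine_1d(patches, original_size, window_size, stride, eps=1e-8):
    """Rebuild a 1D signal from sliding-window patches, averaging overlaps."""
    total = [0.0] * original_size
    count = [0.0] * original_size
    for index, patch in enumerate(patches):
        start = index * stride
        for offset in range(window_size):
            total[start + offset] += patch[offset]
            count[start + offset] += 1.0
    # eps keeps the division defined for positions no patch covers.
    return [t / (c + eps) for t, c in zip(total, count)]


signal = [0.0, 1.0, 2.0, 3.0, 4.0]
patches = [signal[i:i + 3] for i in range(3)]  # window 3, stride 1
restored = combine_1d(patches, original_size=5, window_size=3, stride=1)
print([round(v, 6) for v in restored])
```

The restored values match the original signal up to the tiny bias introduced by eps.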
- class kornia.contrib.ExtractTensorPatches(window_size, stride=1, padding=0, allow_auto_padding=False)#
Module that extracts patches from tensors and stacks them.
In the simplest case, the output value of the operator with input size \((B, C, H, W)\) is \((B, N, C, H_{out}, W_{out})\).
- where
\(B\) is the batch size.
\(N\) denotes the total number of extracted patches, stacked in left-right and top-bottom order.
\(C\) denotes the number of input channels.
\(H\), \(W\) denote the height and width of the input in pixels.
\(H_{out}\), \(W_{out}\) denote the patch size defined in the function signature.
window_size is the size of the sliding window and controls the shape of the output tensor and defines the shape of the output patch.
stride controls the stride to apply to the sliding window and regulates the overlapping between the extracted patches.
padding controls the amount of implicit zero-padding on both sides at each dimension.
allow_auto_padding allows automatic calculation of the padding required to fit the window and stride into the image.
The parameters window_size, stride and padding can be either:
a single int – in which case the same value is used for the height and width dimension.
a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.
padding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.
- Parameters:
input – tensor image where to extract the patches with shape \((B, C, H, W)\).
window_size (Union[int, Tuple[int, int]]) – the size of the sliding window and the output patch size.
stride (Union[int, Tuple[int, int]], optional) – stride of the sliding window. Default: 1
padding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – Zero-padding added to both sides of the input. Default: 0
allow_auto_padding – whether to allow automatic padding if the window and stride do not fit into the image.
- Shape:
Input: \((B, C, H, W)\)
Output: \((B, N, C, H_{out}, W_{out})\)
- Returns:
the tensor with the extracted patches.
Examples
>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
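The number of patches \(N\) can be estimated with the usual sliding-window count along each axis. The helper below is a hypothetical plain-Python sketch that assumes the two-int, symmetric form of padding and that the standard formula applies here.

```python
def num_patches(input_size, window_size, stride=(1, 1), padding=(0, 0)):
    """Number of patches along height and width, and their product N."""
    (h, w), (wh, ww), (sh, sw), (ph, pw) = input_size, window_size, stride, padding
    n_h = (h + 2 * ph - wh) // sh + 1
    n_w = (w + 2 * pw - ww) // sw + 1
    return n_h, n_w, n_h * n_w


# Same setting as the example above: 3x3 input, (2, 3) window, stride 1.
print(num_patches((3, 3), (2, 3)))  # (2, 1, 2)
```

For the example above this gives two patches stacked top to bottom, which is why patches[:, -1] holds the last two rows of the input.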
- class kornia.contrib.CombineTensorPatches(original_size, window_size, stride=None, unpadding=0, allow_auto_unpadding=False)#
Module that combines patches back into full tensors.
In the simplest case, the output value of the operator with input size \((B, N, C, H_{out}, W_{out})\) is \((B, C, H, W)\).
- where
\(B\) is the batch size.
\(N\) denotes the total number of extracted patches stacked in
\(C\) denotes the number of input channels.
\(H\), \(W\) the input height and width of the input in pixels.
\(H_{out}\), \(W_{out}\) denote to denote to the patch size defined in the function signature. left-right and top-bottom order.
original_size is the size of the original image prior to extracting tensor patches and defines the shape of the output tensor.
window_size is the size of the sliding window used while extracting tensor patches.
stride controls the stride to apply to the sliding window and regulates the overlap between the extracted patches.
unpadding is the amount of padding to be removed. If specified, this value must be the same as the padding used while extracting tensor patches.
allow_auto_unpadding allows automatic calculation of the padding required to fit the window and stride into the image. This must be used if the allow_auto_padding flag was used for extracting the patches.
The parameters original_size, window_size, stride, and unpadding can be either:
a single int – in which case the same value is used for the height and width dimension.
a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.
unpadding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.
- Parameters:
patches – patched tensor with shape \((B, N, C, H_{out}, W_{out})\).
original_size (Tuple[int, int]) – the size of the original tensor and the output size.
window_size (Union[int, Tuple[int, int]]) – the size of the sliding window used while extracting patches.
stride (Union[int, Tuple[int, int], None], optional) – stride of the sliding window. Default: None
unpadding (Union[int, Tuple[int, int], Tuple[int, int, int, int]], optional) – remove the padding added to both sides of the input. Default: 0
allow_auto_unpadding (bool, optional) – whether to allow automatic unpadding of the input if the window and stride do not fit into the original_size. Default: False
eps – small value used to prevent division by zero.
- Shape:
Input: \((B, N, C, H_{out}, W_{out})\)
Output: \((B, C, H, W)\)
Example
>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])
Note
This function is supposed to be used in conjunction with ExtractTensorPatches.
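The round trip between extraction and recombination can be sketched in plain Python for a single 2D image. The sketch assumes that overlapping regions are averaged and that `eps` guards the division by the per-pixel patch count; that is an assumption suggested by the `eps` parameter, not a statement of how kornia implements it. Both helper names are hypothetical.

```python
def extract_patches_2d(image, window, stride):
    """Extract patches left to right, then top to bottom (hypothetical helper)."""
    (win_h, win_w), (stride_h, stride_w) = window, stride
    height, width = len(image), len(image[0])
    return [
        [row[left:left + win_w] for row in image[top:top + win_h]]
        for top in range(0, height - win_h + 1, stride_h)
        for left in range(0, width - win_w + 1, stride_w)
    ]


def combine_patches_2d(patches, original_size, window, stride, eps=1e-8):
    """Paste patches back and average overlaps (assumed behaviour)."""
    height, width = original_size
    (win_h, win_w), (stride_h, stride_w) = window, stride
    total = [[0.0] * width for _ in range(height)]
    count = [[0.0] * width for _ in range(height)]
    index = 0
    # Visit window positions in the same order used for extraction.
    for top in range(0, height - win_h + 1, stride_h):
        for left in range(0, width - win_w + 1, stride_w):
            patch = patches[index]
            index += 1
            for dr in range(win_h):
                for dc in range(win_w):
                    total[top + dr][left + dc] += patch[dr][dc]
                    count[top + dr][left + dc] += 1.0
    # eps keeps the division finite for pixels no patch covers.
    return [[total[r][c] / (count[r][c] + eps) for c in range(width)]
            for r in range(height)]


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = extract_patches_2d(image, window=(2, 2), stride=(2, 2))
restored = combine_patches_2d(patches, (4, 4), window=(2, 2), stride=(2, 2))
print([[round(v) for v in row] for row in restored])
```

The same window and stride must be passed to both steps, otherwise the patch positions no longer line up with the original image.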
Image Classification#
- class kornia.contrib.VisionTransformer(image_size=224, patch_size=16, in_channels=3, embed_dim=768, depth=12, num_heads=12, dropout_rate=0.0, dropout_attn=0.0, backbone=None)#
Vision transformer (ViT) module.
The module is expected to be used as an operator for different vision tasks.
The method is inspired by existing implementations of the paper [DBK+21].
Warning
This is an experimental API subject to changes in favor of flexibility.
- Parameters:
image_size (int, optional) – the size of the input image. Default: 224
patch_size (int, optional) – the size of the patch to compute the embedding. Default: 16
in_channels (int, optional) – the number of channels for the input. Default: 3
embed_dim (int, optional) – the embedding dimension inside the transformer encoder. Default: 768
depth (int, optional) – the depth of the transformer. Default: 12
num_heads (int, optional) – the number of attention heads. Default: 12
dropout_rate (float, optional) – dropout rate. Default: 0.0
dropout_attn (float, optional) – attention dropout rate. Default: 0.0
backbone (Module | None, optional) – an nn.Module to compute the image patches embeddings. Default: None
Example
>>> img = torch.rand(1, 3, 224, 224)
>>> vit = VisionTransformer(image_size=224, patch_size=16)
>>> vit(img).shape
torch.Size([1, 197, 768])
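The sequence length of 197 in the output shape above follows from the patch grid. A small sketch of that arithmetic, assuming the standard ViT layout of one extra class token in front of the patch tokens and assuming the image size must divide evenly by the patch size (the helper `vit_sequence_length` is hypothetical):

```python
def vit_sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens seen by the transformer encoder (hypothetical helper)."""
    if image_size % patch_size != 0:
        # Assumption: the image is tiled by whole patches only.
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if class_token else 0)


print(vit_sequence_length(224, 16))  # 197
print(vit_sequence_length(224, 32))  # 50
```

With the defaults, 224 / 16 = 14 patches per side, so 14 x 14 = 196 patch tokens plus one class token.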
- forward(x)#
Define the computation performed at every call.
Should be overridden by all subclasses.
- Return type:
Tensor
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- static from_config(variant, pretrained=False, **kwargs)#
Build ViT model based on the given config string. The format is vit_{size}/{patch_size}. E.g. vit_b/16 means ViT-Base, patch size 16x16. If pretrained=True, AugReg weights are loaded. The weights are hosted on HuggingFace’s model hub: https://huggingface.co/kornia.
Note
The available weights are: vit_l/16, vit_b/16, vit_s/16, vit_ti/16, vit_b/32, vit_s/32.
- Parameters:
variant (str) – ViT model variant e.g. vit_b/16.
pretrained (bool, optional) – whether to load pre-trained AugReg weights. Default: False
kwargs (Any) – other keyword arguments that will be passed to kornia.contrib.vit.VisionTransformer().
- Return type:
- Returns:
The respective ViT model
Example
>>> from kornia.contrib import VisionTransformer
>>> vit_model = VisionTransformer.from_config("vit_b/16", pretrained=True)
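The `vit_{size}/{patch_size}` format described above can be unpacked with a few lines of plain Python. This is a sketch for checking variant strings before calling `from_config`; the helper `parse_vit_variant` is hypothetical and does not reflect how kornia parses the string internally. The list of weights is copied from the note above.

```python
# Variants with hosted weights, as listed in the note above.
AVAILABLE_WEIGHTS = ("vit_l/16", "vit_b/16", "vit_s/16", "vit_ti/16", "vit_b/32", "vit_s/32")


def parse_vit_variant(variant):
    """Split 'vit_{size}/{patch_size}' into (size, patch_size) (hypothetical helper)."""
    prefix, _, patch = variant.partition("/")
    name, _, size = prefix.partition("_")
    if name != "vit" or not size or not patch.isdigit():
        raise ValueError(f"expected 'vit_{{size}}/{{patch_size}}', got {variant!r}")
    return size, int(patch)


print(parse_vit_variant("vit_b/16"))      # ('b', 16)
print("vit_b/16" in AVAILABLE_WEIGHTS)    # True
```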
- class kornia.contrib.MobileViT(mode='xxs', in_channels=3, patch_size=(2, 2), dropout=0.0)#
Module MobileViT. Default arguments are for MobileViT XXS.
Paper: https://arxiv.org/abs/2110.02178
Based on: https://github.com/chinhsuanwu/mobilevit-pytorch
- Parameters:
mode (str, optional) – one of ‘xxs’, ‘xs’ or ‘s’. Default: "xxs"
in_channels (int, optional) – the number of channels for the input image. Default: 3
patch_size (Tuple[int, int], optional) – image_size must be divisible by patch_size. Default: (2, 2)
dropout (float, optional) – dropout ratio in Transformer. Default: 0.0
Example
>>> img = torch.rand(1, 3, 256, 256)
>>> mvit = MobileViT(mode='xxs')
>>> mvit(img).shape
torch.Size([1, 320, 8, 8])
- class kornia.contrib.TinyViT(img_size=224, in_chans=3, num_classes=1000, embed_dims=[96, 192, 384, 768], depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_sizes=[7, 7, 14, 7], mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.0, use_checkpoint=False, mbconv_expand_ratio=4.0, local_conv_size=3, activation=nn.GELU, mobile_sam=False)#
TinyViT model, as described in https://arxiv.org/abs/2207.10666
- Parameters:
img_size (int, optional) – Size of input image. Default: 224
in_chans (int, optional) – Number of input image’s channels. Default: 3
num_classes (int, optional) – Number of output classes. Default: 1000
embed_dims (list[int], optional) – List of embedding dimensions. Default: [96, 192, 384, 768]
depths (list[int], optional) – List of block counts for each downsampling stage. Default: [2, 2, 6, 2]
num_heads (list[int], optional) – List of attention heads used in self-attention for each downsampling stage. Default: [3, 6, 12, 24]
window_sizes (list[int], optional) – List of self-attention’s window size for each downsampling stage. Default: [7, 7, 14, 7]
mlp_ratio (float, optional) – Ratio of MLP dimension to embedding dimension in self-attention. Default: 4.0
drop_rate (float, optional) – Dropout rate. Default: 0.0
drop_path_rate (float, optional) – Stochastic depth rate. Default: 0.0
use_checkpoint (bool, optional) – Whether to use activation checkpointing to trade compute for memory. Default: False
mbconv_expand_ratio (float, optional) – Expansion ratio used in MBConv block. Default: 4.0
local_conv_size (int, optional) – Kernel size of convolution used in TinyViTBlock. Default: 3
activation (type[Module], optional) – activation function. Default: nn.GELU
mobile_sam – Whether to use modifications for MobileSAM.
- forward(x)#
Classify images if mobile_sam=False, produce feature maps if mobile_sam=True.
- Return type:
- static from_config(variant, pretrained=False, **kwargs)#
Create a TinyViT model from pre-defined variants.
- Parameters:
variant (str) – TinyViT variant. Possible values: '5m', '11m', '21m'.
pretrained (bool | str, optional) – whether to use pre-trained weights. Possible values: False, True, 'in22k', 'in1k'. For TinyViT-21M (variant='21m'), 'in1k_384', 'in1k_512' are also available. Default: False
**kwargs (Any) – other keyword arguments that will be passed to TinyViT.
- Return type:
Note
When img_size is different from the pre-trained size, bicubic interpolation will be performed on attention biases. When using pretrained=True, ImageNet-1k checkpoint ('in1k') is used. For feature extraction or fine-tuning, ImageNet-22k checkpoint ('in22k') is preferred.
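The rules for `variant` and `pretrained` stated above can be collected into a small validation sketch. The helper `resolve_pretrained` is hypothetical and is not part of kornia; it only encodes the combinations listed in the parameter description and the note.

```python
VARIANTS = ("5m", "11m", "21m")
COMMON_CHECKPOINTS = ("in22k", "in1k")
EXTRA_21M_CHECKPOINTS = ("in1k_384", "in1k_512")


def resolve_pretrained(variant, pretrained):
    """Map (variant, pretrained) to a checkpoint name or None (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown TinyViT variant: {variant!r}")
    if pretrained is False:
        return None
    if pretrained is True:
        # Per the note above, pretrained=True selects the ImageNet-1k checkpoint.
        return "in1k"
    allowed = COMMON_CHECKPOINTS + (EXTRA_21M_CHECKPOINTS if variant == "21m" else ())
    if pretrained not in allowed:
        raise ValueError(f"{pretrained!r} is not available for variant {variant!r}")
    return pretrained


print(resolve_pretrained("5m", True))         # in1k
print(resolve_pretrained("21m", "in1k_512"))  # in1k_512
```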
- class kornia.contrib.ClassificationHead(embed_size=768, num_classes=10)#
Module to be used as a classification head.
- Parameters:
Example
>>> feat = torch.rand(1, 256, 256)
>>> head = ClassificationHead(256, 10)
>>> head(feat).shape
torch.Size([1, 10])
Image Stitching#
- class kornia.contrib.ImageStitcher(matcher, estimator='ransac', blending_method='naive')#
Stitch two images with overlapping fields of view.
- Parameters:
matcher (Module) – image feature matching module.
estimator (str, optional) – method to compute homography, either “vanilla” or “ransac”. “ransac” is slower but more accurate. Default: "ransac"
blending_method (str, optional) – method to blend two images together. Only “naive” is currently supported. Default: "naive"
Note
Current implementation requires strict image ordering from left to right.
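import matplotlib.pyplot as plt
import torch
import kornia as K
import kornia.feature as KF
from kornia.contrib import ImageStitcher
# img_left and img_right are assumed to be image tensors already loaded on the GPU.
IS = ImageStitcher(KF.LoFTR(pretrained='outdoor'), estimator='ransac').cuda()
# Compute the stitched result with less GPU memory cost.
with torch.inference_mode():
    out = IS(img_left, img_right)
# Show the result
plt.imshow(K.tensor_to_image(out))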
Lambda#
- class kornia.contrib.Lambda(func)#
Applies a user-defined lambda as a transform.
- Parameters:
- Returns:
The output of the user-defined lambda.
Example
>>> import kornia
>>> x = torch.rand(1, 3, 5, 5)
>>> f = Lambda(lambda x: kornia.color.rgb_to_grayscale(x))
>>> f(x).shape
torch.Size([1, 1, 5, 5])
Distance Transform#
- kornia.contrib.distance_transform(image, kernel_size=3, h=0.35)#
Approximates the Manhattan distance transform of images using cascaded convolution operations.
The value at each pixel in the output represents the distance to the nearest non-zero pixel in the input image. It uses the method described in [PDP20]. The transformation is applied independently across the channel dimension of the images.
- Parameters:
- Return type:
- Returns:
tensor with shape \((B,C,H,W)\).
Example
>>> tensor = torch.zeros(1, 1, 5, 5)
>>> tensor[:,:, 1, 2] = 1
>>> dt = kornia.contrib.distance_transform(tensor)
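For intuition about what the output should look like, the exact Manhattan distance transform can be computed by brute force in plain Python. This sketch is not the cascaded-convolution approximation from [PDP20] that the function uses; it is a reference computation on a single 2D image, and the helper name is hypothetical.

```python
def manhattan_distance_transform(image):
    """Exact city-block distance to the nearest non-zero pixel (hypothetical helper)."""
    height, width = len(image), len(image[0])
    sources = [(r, c) for r in range(height) for c in range(width) if image[r][c] != 0]
    if not sources:
        raise ValueError("image has no non-zero pixel")
    # Compare every pixel against every source; fine for small images only.
    return [
        [min(abs(r - sr) + abs(c - sc) for sr, sc in sources) for c in range(width)]
        for r in range(height)
    ]


image = [[0] * 5 for _ in range(5)]
image[1][2] = 1  # same non-zero pixel as in the example above
dt = manhattan_distance_transform(image)
print(dt[1][2], dt[0][0], dt[4][4])  # 0 3 5
```

The approximate values returned by the library are expected to differ from these exact integers, in a way that depends on `kernel_size` and `h`.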
- kornia.contrib.diamond_square(output_size, roughness=0.5, random_scale=1.0, random_fn=torch.rand, normalize_range=None, device=None, dtype=None)#
Generates Plasma Fractal Images using the diamond square algorithm.
See: https://en.wikipedia.org/wiki/Diamond-square_algorithm
- Parameters:
output_size (Tuple[int, int, int, int]) – a tuple of integers with the BxCxHxW of the image to be generated.
roughness (Union[float, Tensor], optional) – the scale value to apply at each recursion step. Default: 0.5
random_scale (Union[float, Tensor], optional) – the initial value of the scale for recursion. Default: 1.0
random_fn (Callable[..., Tensor], optional) – the callable function to use to sample a random tensor. Default: torch.rand
normalize_range (Optional[Tuple[float, float]], optional) – whether to min-max normalize the output map. If a range is specified, min-max normalization is applied to that range. Default: None
device (Optional[device], optional) – the torch device on which to place the output map. Default: None
dtype (Optional[dtype], optional) – the torch dtype of the output map. Default: None
- Return type:
- Returns:
A tensor with shape \((B,C,H,W)\) containing the fractal image.
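The roles of `roughness`, `random_scale` and `normalize_range` can be illustrated with a textbook diamond-square sketch in plain Python. This is not kornia's batched tensor implementation: it builds a single (2**n + 1) square grid, draws noise uniformly from [-scale, scale] (an assumption; `torch.rand` samples from [0, 1)), and all function names are hypothetical.

```python
import random


def diamond_square_sketch(n, roughness=0.5, random_scale=1.0, seed=0):
    """Return a (2**n + 1) x (2**n + 1) height map as nested lists."""
    rng = random.Random(seed)
    size = 2 ** n + 1
    grid = [[0.0] * size for _ in range(size)]
    for r in (0, size - 1):  # seed the four corners
        for c in (0, size - 1):
            grid[r][c] = rng.uniform(-random_scale, random_scale)
    step, scale = size - 1, random_scale
    while step > 1:
        half = step // 2
        # Diamond step: square centre = mean of its four corners + noise.
        for r in range(half, size, step):
            for c in range(half, size, step):
                mean = (grid[r - half][c - half] + grid[r - half][c + half]
                        + grid[r + half][c - half] + grid[r + half][c + half]) / 4.0
                grid[r][c] = mean + rng.uniform(-scale, scale)
        # Square step: edge midpoint = mean of in-bounds neighbours + noise.
        for r in range(0, size, half):
            start = half if (r // half) % 2 == 0 else 0
            for c in range(start, size, step):
                candidates = ((r - half, c), (r + half, c), (r, c - half), (r, c + half))
                values = [grid[rr][cc] for rr, cc in candidates
                          if 0 <= rr < size and 0 <= cc < size]
                grid[r][c] = sum(values) / len(values) + rng.uniform(-scale, scale)
        # Halve the step; shrink the noise amplitude by the roughness factor.
        step, scale = half, scale * roughness
    return grid


def min_max_normalize(grid, low, high):
    """Rescale all values linearly into [low, high]."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    return [[low + (v - lo) / (hi - lo) * (high - low) for v in row] for row in grid]


heights = diamond_square_sketch(3, roughness=0.5, random_scale=1.0, seed=0)
normalized = min_max_normalize(heights, 0.0, 1.0)
print(len(heights), len(heights[0]))  # 9 9
```

Smaller `roughness` values damp the noise faster at each level, which gives smoother maps.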
- class kornia.contrib.DistanceTransform(kernel_size=3, h=0.35)#
Module that approximates the Manhattan (city block) distance transform of images using convolutions.
KMeans#
- class kornia.contrib.KMeans(num_clusters, cluster_centers, tolerance=10e-4, max_iterations=0, seed=None)#
Implements the k-means clustering algorithm with Euclidean distance as the similarity measure.
- Parameters:
num_clusters (int) – number of clusters the data has to be assigned to.
cluster_centers (Tensor | None) – tensor of starting cluster centres; can be passed instead of num_clusters.
tolerance (float, optional) – the algorithm terminates if the shift in centers is less than tolerance. Default: 10e-4
max_iterations (int, optional) – number of iterations to run the algorithm for. Default: 0
seed (int | None, optional) – number used to set the torch manual seed for reproducibility. Default: None
Example
>>> kmeans = kornia.contrib.KMeans(3, None, 10e-4, 100, 0)
>>> kmeans.fit(torch.rand((1000, 5)))
>>> predictions = kmeans.predict(torch.rand((10, 5)))
- fit(X)#
Run iterative KMeans clustering until the shift in cluster centers falls below a threshold or the maximum number of iterations is reached.
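The stopping rule described above (center shift below `tolerance`, or `max_iterations` reached) can be sketched with Lloyd's algorithm in plain Python. This is an illustration only, not kornia's tensor implementation; the function names and the way the shift is summed over centers are assumptions. Note that the default `10e-4` in the signature equals 0.001.

```python
import math


def kmeans_fit(points, initial_centers, tolerance=10e-4, max_iterations=100):
    """Lloyd's algorithm with Euclidean distance (hypothetical sketch)."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: label each point with its nearest center.
        labels = [kmeans_predict(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster where it was
        # Assumed convergence measure: total distance moved by all centers.
        shift = sum(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tolerance:
            break
    return centers


def kmeans_predict(point, centers):
    """Index of the nearest center."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans_fit(points, initial_centers=[(0, 0), (10, 10)])
print([[round(v, 2) for v in c] for c in centers])  # [[0.33, 0.33], [10.33, 10.33]]
print(kmeans_predict((9, 9), centers))              # 1
```

Passing explicit starting centers, as `cluster_centers` allows, makes the result deterministic; random initialisation would need a seed for the same effect.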