Table 1: Encoder network configuration

Type          Layer    Kernel size, stride, padding   Output shape (d,w,h)
Conv. block1  Conv2d   3,1,1                          (16,224,224)
              Conv2d   3,1,1                          (16,224,224)
              Conv2d   3,2,2                          (32,112,112)
Conv. block2  Conv2d   3,1,1                          (32,112,112)
              Conv2d   3,1,1                          (32,112,112)
              Conv2d   3,2,1                          (64,56,56)
              Conv2d   3,1,1                          (64,56,56)
              Conv2d   3,1,1                          (64,56,56)
Conv. block3  Conv2d   3,2,1                          (128,28,28)
              Conv2d   3,1,1                          (128,28,28)
              Conv2d   3,1,1                          (128,28,28)
Conv. block4  Conv2d   5,2,2                          (256,14,14)
              Conv2d   3,1,1                          (256,14,14)
              Conv2d   3,1,1                          (256,14,14)
Conv. block5  Conv2d   5,2,2                          (512,7,7)
              Conv2d   3,1,1                          (512,7,7)
              Conv2d   3,1,1                          (512,7,7)
              Conv2d   3,1,1                          (512,7,7)
Conv. block6  Conv2d   3,2,2                          (256,5,5)
              Conv2d   3,2,1                          (256,3,3)
              Conv2d   3,2,1                          (256,2,2)
Linear                 (input, output): (1024, 256)   (256)
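
The listed output shapes can be cross-checked against the kernel, stride, and padding columns. The sketch below is not taken from the source: it assumes the usual floor-rounded convolution size formula with dilation 1 (the convention used by PyTorch's Conv2d, which the layer names suggest) and a 224 x 224 input, which the table implies but does not state. The names `conv_out`, `ROWS`, and `find_mismatches` are hypothetical helpers introduced only for illustration. Each row is checked against the listed output size of the row before it.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding, listed output width) for each Conv2d row of Table 1.
ROWS = [
    (3, 1, 1, 224), (3, 1, 1, 224), (3, 2, 2, 112),
    (3, 1, 1, 112), (3, 1, 1, 112), (3, 2, 1, 56), (3, 1, 1, 56), (3, 1, 1, 56),
    (3, 2, 1, 28), (3, 1, 1, 28), (3, 1, 1, 28),
    (5, 2, 2, 14), (3, 1, 1, 14), (3, 1, 1, 14),
    (5, 2, 2, 7), (3, 1, 1, 7), (3, 1, 1, 7), (3, 1, 1, 7),
    (3, 2, 2, 5), (3, 2, 1, 3), (3, 2, 1, 2),
]


def find_mismatches(rows, input_size=224):
    """Return (row number, computed size, listed size) where the two disagree.

    input_size=224 is an assumption; the table does not give the input shape.
    """
    mismatches = []
    previous = input_size
    for number, (kernel, stride, padding, listed) in enumerate(rows, start=1):
        computed = conv_out(previous, kernel, stride, padding)
        if computed != listed:
            mismatches.append((number, computed, listed))
        previous = listed  # continue from the size the table reports
    return mismatches


if __name__ == "__main__":
    print(find_mismatches(ROWS))
    # Flattened feature count feeding the linear layer: channels * width * height.
    print(256 * 2 * 2)
```

Under these assumptions every row agrees with the table except the third Conv2d row (3,2,2), for which the formula gives 113 rather than the listed 112; a padding of 1 would give 112. Whether that reflects a typo in the padding column or a different padding convention in the original implementation cannot be determined from the table alone. The final (256,2,2) feature map flattens to 1024 values, matching the linear layer's listed input size.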