Fisheye Omnidirectional View in Autonomous Driving

fisheye, omnidirectional, panorama, surround view, autonomous driving, deep learning, transform, distortion, projection, depth, pose, detection, segmentation, spherical, equirectangular, blending


  1. 1. Fisheye/Omnidirectional View in Autonomous Driving Yu Huang, Yu.huang07@gmail.com, Sunnyvale, California
  2. 2. Outline • Graph-Based Classification of Omnidirectional Images • Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery • Spherical CNNs • Scene Understanding Networks for AD based on Around View Monitoring System • Eliminating the Blind Spot: Adapting 3D Object Detection and Mono Depth Estimation to 360◦ Panoramic Imagery • SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • FisheyeMODNet: Moving Object detection on Surround-view Cameras for AD • OmniDRL: Robust Pedestrian Detection using DRL on Omnidirectional Cameras • WoodScape: A multi-task, multi-camera fisheye dataset for AD • FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Mono Fisheye Camera for AD
  3. 3. Graph-Based Classification of Omnidirectional Images • Omnidirectional cameras are widely used in areas such as robotics and virtual reality, as they provide a wide field of view. • Their images are often processed with classical methods, which can unfortunately lead to non-optimal solutions, since these methods are designed for planar images whose geometrical properties differ from those of omnidirectional ones. • Here, image classification takes into account the specific geometry of omnidirectional cameras through graph-based representations, in particular deep learning architectures for data on graphs. • The graph is constructed in a principled way such that convolutional filters respond similarly to the same pattern at different positions of the image, regardless of lens distortions. • Reference: “Graph-based Isometry Invariant Representation Learning”, ICML, 2017
  4. 4. Graph-Based Classification of Omnidirectional Images • Transformation Invariant Graph-based Network (TIGraNet): • It takes as input images that are represented as signals on a grid graph and gives classification labels as output. • Briefly, this approach proposes a network of alternately stacked spectral convolutional and dynamic pooling layers, which creates features that are equivariant to isometric transformations. • Further, the output of the last layer is processed by a statistical layer, which makes the equivariant representation of the data invariant to isometric transformations. • Finally, the resulting feature vector is fed to a number of fully-connected layers and a softmax layer, which outputs the probability distribution of the signal belonging to each of the given classes. • This transformation-invariant classification algorithm is extended to omnidirectional images by incorporating knowledge about the camera lens geometry into the graph structure.
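As a rough illustration of the spectral filtering idea in TIGraNet, the sketch below applies a polynomial filter of the graph Laplacian to a signal on a small graph (NumPy only); the function names and the simple polynomial parameterization are illustrative, not the paper's exact layer.

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2} of a weighted graph."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard isolated nodes
    return np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def spectral_conv(y, L, alpha):
    """Polynomial spectral filter: sum_k alpha_k L^k y.
    Filters that are polynomials of the Laplacian commute with graph isometries,
    which is what makes the filter response independent of where a pattern sits."""
    out = np.zeros_like(y)
    Lk_y = y.copy()
    for a in alpha:
        out += a * Lk_y
        Lk_y = L @ Lk_y          # next power of L applied to the signal
    return out

# Tiny example: 4-node path graph, impulse signal at node 1.
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
y = np.array([0.0, 1.0, 0.0, 0.0])
print(spectral_conv(y, normalized_laplacian(W), alpha=[1.0, 0.5, 0.25]))
```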
  5. 5. Graph-Based Classification of Omnidirectional Images The graph construction method makes the response of the filter similar regardless of the position of the pattern on an image from an omnidirectional camera.
  6. 6. Graph-Based Classification of Omnidirectional Images TIGraNet architecture. The network is composed of an alternation of spectral convolution layers Fl and dynamic pooling layers Pl, followed by a statistical layer H, multiple fully-connected layers (FC) and a softmax operator (SM). The input of the network is an image that is represented as a signal y0 on the grid-graph with Laplacian matrix L. The output of the system is a label that corresponds to the most likely class for the input sample.
  7. 7. Graph-Based Classification of Omnidirectional Images Example of the gnomonic projection. An object from tangent plane Ti is projected to the sphere at tangency point X0,i, which is defined by spherical coordinates φi, θi. The point Xk,i is defined by coordinates (xk,i, yk,i) on the plane. Example of the equirectangular representation of the image. On the left, the figure depicts the original image on the tangent plane Ti; on the right, its projection onto the points of the sphere. To build an equirectangular image, the values of points on the discrete regular grid are often approximated from the values of the projected points by interpolation.
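For concreteness, the inverse gnomonic mapping used here can be sketched as follows (NumPy; these are the standard textbook formulas, with an arbitrary tangency point and kernel footprint chosen for the example):

```python
import numpy as np

def gnomonic_to_sphere(x, y, phi0, lam0):
    """Inverse gnomonic projection: map a point (x, y) on the plane tangent to the
    unit sphere at latitude/longitude (phi0, lam0) back to spherical coordinates."""
    rho = np.hypot(x, y)
    c = np.arctan(rho)                                 # angular distance from the tangency point
    sin_c, cos_c = np.sin(c), np.cos(c)
    safe_rho = np.maximum(rho, 1e-12)                  # avoid 0/0 at the tangency point itself
    phi = np.where(rho == 0, phi0,
                   np.arcsin(cos_c * np.sin(phi0) + y * sin_c * np.cos(phi0) / safe_rho))
    lam = lam0 + np.arctan2(x * sin_c,
                            rho * np.cos(phi0) * cos_c - y * np.sin(phi0) * sin_c)
    return phi, lam

# Example: a small square footprint tangent at 60° latitude; its spherical footprint
# stretches in longitude, which is exactly the distortion discussed above.
xs, ys = np.meshgrid(np.linspace(-0.05, 0.05, 3), np.linspace(-0.05, 0.05, 3))
phi, lam = gnomonic_to_sphere(xs, ys, np.deg2rad(60.0), 0.0)
print(np.rad2deg(phi).round(2), np.rad2deg(lam).round(2))
```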
  8. 8. Graph-Based Classification of Omnidirectional Images a) Choose pattern p0, ..., p4 from an object on the tangent plane Te at the equator (φe = 0, θe = 0) (red points) and then, b) move this object on the sphere by moving the tangent plane Ti to point (φi, θi). c) Thus, the filter localized at tangency point (φi, θi) uses values pi,1, pi,3 (blue points), which can be obtained by interpolation. The goal is to develop a transformation-invariant system which can recognize the same object on different planes Ti that are tangent to S at different points (φi, θi) without any extra training. The challenge in building such a system is to design a proper graph signal representation that allows compensating for the distortion effects that appear at different elevations of S.
  9. 9. Graph-Based Classification of Omnidirectional Images Comparison to the state-of-the-art methods on the ETH-80 dataset. The architectures of the different methods are selected to feature a similar number of convolutional filters and neurons in the fully-connected layers.
  10. 10. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery • While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial. • Convolutional neural networks (CNNs) trained on images from perspective cameras yield “flat” filters, yet 360° images cannot be projected to a single plane without significant distortion. • A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate, but much too computationally intensive for real problems. • Flat2Sphere learns a spherical convolutional network that translates a planar CNN to process 360° imagery directly in its equirectangular projection. • This approach learns to reproduce the flat filter outputs on 360° data, sensitive to the varying distortion effects across the viewing sphere. • The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the ability to leverage powerful pre-trained networks researchers have carefully honed (together with massive labeled image training sets) for perspective images.
  11. 11. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Strategies for applying CNNs to 360° images. Top: The 1st strategy unwraps the 360° input into a single planar image using a global projection (equirectangular), then applies the CNN on the distorted planar image. Bottom: The 2nd strategy samples multiple tangent planar projections to obtain multiple perspective images, to which the CNN is applied independently to obtain local results for the original 360° image. Strategy I is fast but inaccurate; Strategy II is accurate but slow. The approach learns to replicate flat filters on spherical imagery, offering both speed and accuracy.
  12. 12. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Spherical convolution differs from ordinary CNN. (a) The kernel weight in spherical convolution is tied only along each row, and each kernel convolves along the row to generate 1D output. Note that the kernel size differs at different rows and layers, and it expands near the top and bottom of the image. (b) Inverse perspective projections P−1 to equirectangular projections at different polar angles θ. The same square image will distort to different sizes and shapes depending on θ.
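A rough heuristic for why the kernel width must vary with the row is sketched below: in an equirectangular image the horizontal stretch grows roughly as 1/cos(latitude), so a kernel matching a 3-pixel equatorial footprint widens toward the poles. The numbers and the cap are illustrative, not the kernel sizes learned in the paper.

```python
import numpy as np

def row_kernel_width(row, height, base_width=3, max_width=65):
    """Width an equirectangular-row kernel needs to cover the same angular extent
    as a base_width kernel at the equator (capped, kept odd)."""
    lat = (0.5 - (row + 0.5) / height) * np.pi         # latitude of this row
    stretch = 1.0 / max(np.cos(lat), 1e-3)             # equirectangular horizontal stretch
    width = int(np.ceil(base_width * stretch)) | 1     # round up and keep odd
    return min(width, max_width)

print([row_kernel_width(r, 64) for r in (0, 16, 32, 48, 63)])   # widest near the poles
```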
  13. 13. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Object detection examples on 360° PASCAL test images. Images show the top 40% of equirectangular projection; black regions are undefined pixels. Text gives predicted label, multi-class probability, and IoU, resp.
  14. 14. Spherical CNNs • Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. • However, a number of problems of recent interest have created a demand for models that can analyze spherical images. • Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. • A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective. • This work provides the building blocks for constructing spherical CNNs. • It defines a spherical cross-correlation that is both expressive and rotation-equivariant. • The spherical correlation satisfies a generalized Fourier theorem, which allows it to be computed efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm.
  15. 15. Spherical CNNs • The S2 and SO(3) correlations are defined by analogy to the classical planar Z2 correlation. • The planar correlation can be understood as follows: the value of the output feature map at translation x ∈ Z2 is computed as an inner product between the input feature map and a filter, shifted by x. • Similarly, the spherical correlation can be understood as follows: the value of the output feature map evaluated at rotation R ∈ SO(3) is computed as an inner product between the input feature map and a filter, rotated by R. • For functions on the sphere and the rotation group, there is an analogous transform, referred to as the generalized Fourier transform (GFT), together with a corresponding fast algorithm (GFFT).
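As a minimal runnable analogy (NumPy), the sketch below evaluates the correlation restricted to rotations about the polar axis, where rotating an equirectangular map is just a horizontal shift; the full SO(3) correlation additionally sweeps the remaining Euler angles, which is what the generalized FFT accelerates.

```python
import numpy as np

def z_rotation_correlation(f, psi):
    """Correlation of a spherical signal f with a filter psi, restricted to rotations
    about the polar axis: one inner product per longitude shift. Rows are weighted by
    cos(latitude), the equirectangular area element."""
    H, W = f.shape
    lat = (0.5 - (np.arange(H) + 0.5) / H) * np.pi
    area = np.cos(lat)[:, None]
    return np.array([np.sum(np.roll(f, k, axis=1) * psi * area) for k in range(W)])

f = np.random.rand(32, 64)                     # equirectangular signal (lat x lon)
psi = np.zeros_like(f)
psi[14:18, 0:4] = 1.0                          # small filter near the equator
scores = z_rotation_correlation(f, psi)        # response for every azimuthal rotation
```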
  16. 16. Spherical CNNs Spherical correlation in the spectrum. The signal f and the locally-supported filter ψ are Fourier transformed, block-wise tensored, summed over input channels, and finally inverse transformed. Note that because the filter is locally supported, it is faster to use a matrix multiplication (DFT) than an FFT algorithm for it. It parameterizes the sphere using spherical coordinates α, β, and SO(3) with ZYZ-Euler angles α, β, γ.
  17. 17. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System • Modern driver assistance systems rely on a wide range of sensors (RADAR, LIDAR, ultrasound and cameras) for scene understanding and prediction. • These sensors are typically used for detecting traffic participants and scene elements required for navigation. • Relying on camera-based systems, specifically the Around View Monitoring (AVM) system, has great potential to achieve these goals in both parking and driving modes with decreased costs. • This is a new end-to-end solution for delimiting the safe drivable area for each frame by means of identifying the closest obstacle in each direction from the driving vehicle. • It calculates the distance to the nearest obstacles and is incorporated into a unified end-to-end architecture capable of joint object detection, curb detection and safe drivable area detection. • The base architecture is augmented with 3D object detection.
  18. 18. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System This approach for detecting the curb and the free drivable area is inspired by a Stixel representation of the world. Originally, the network takes as input each vertical column of an image. The input columns that the network used had width 24, overlapped over 23 pixels. Each column would then be passed through a convolutional network to output one-of-k labels, with k being the height dimension. As a result, it would learn to classify the position of the bottom pixel of the obstacle corresponding to that column. The union of all columns would build either the curb or the free drivable area of the scene.
  19. 19. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System • In this architecture, due to the overlapping between the columns, more than 95% of the computation is redundant. • Motivated by this observation, the column-wise network implementation is replaced with an end-to-end architecture. • This network encodes the image into a deep feature map using multiple convolutional layers and then uses multiple upsampling layers to generate a feature map with the same resolution as the input image. • Hardcoded regions of the image are cropped, corresponding to the pixel columns augmented with the neighboring area of 23 pixels. • As a result, the regions of interest for cropping the upsampled feature map are 23 pixels wide and 720 (height) pixels tall. • This window is slid horizontally over the image at each x-coordinate. • The resulting crops are then resized to a fixed size (e.g., 7×7) in the ROI pooling layer and are classified into one-of-k classes (k is the height of the image), to ultimately predict the bottom point.
  20. 20. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Bottom prediction architecture using ROI pooling for each column. A single-shot method is used for the final classification layer of the bottom prediction task. Moreover, to make the network more efficient, the decoder part of the network corresponding to the multiple upsampling layers is replaced with a single dense horizontal upsampling layer. The resulting feature map generated by the encoder after applying multiple convolutions with stride > 1 has a resolution of [width/16, height/16], i.e. reduced 16 times relative to the original image size.
  21. 21. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Finally, another fully connected layer is added on top of the horizontal upsampling layer to form a linear combination of each column's input. A softmax is used to classify each of the resulting columns into one-of-k categories, where k is the height of the image being predicted. Each column classification subtask automatically takes into account the pixels in the proximity of the center column being classified and represents the final bottom prediction. Bottom-Net architecture
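A minimal PyTorch sketch of this column-wise head is given below; the channel widths, the transposed-convolution horizontal upsampler, and the input resolution are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BottomHead(nn.Module):
    """Sketch of the Bottom-Net head: the encoder map (downsampled 16x) is upsampled
    only horizontally back to full image width, then a shared fully connected layer
    classifies every column into one-of-`height` classes, i.e. the row of the closest
    obstacle in that column."""
    def __init__(self, enc_channels=256, height=720):
        super().__init__()
        self.height = height
        self.h_upsample = nn.ConvTranspose2d(enc_channels, 64,
                                             kernel_size=(1, 16), stride=(1, 16))
        self.column_fc = nn.Linear(64 * (height // 16), height)

    def forward(self, feat):                      # feat: [B, C, H/16, W/16]
        x = self.h_upsample(feat)                 # [B, 64, H/16, W]
        b, c, h16, w = x.shape
        cols = x.permute(0, 3, 1, 2).reshape(b * w, c * h16)
        return self.column_fc(cols).view(b, w, self.height)   # per-column bottom scores

feat = torch.randn(1, 256, 720 // 16, 1280 // 16)
print(BottomHead()(feat).shape)                   # torch.Size([1, 1280, 720])
```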
  22. 22. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Unified architectures which combine the bottom prediction and the object detection networks usually take advantage of shared computation of the encoder for better training optimization and runtime performance.
  23. 23. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System The final architecture consists of two branches, for object orientation estimation based on angle discretization and for object dimensions regression, respectively. 3D-Net architecture
  24. 24. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Side view detections. (left) left view. (right) right view.
  25. 25. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Captured frame from the high accuracy solution.
  26. 26. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • Omnidirectional cameras offer great benefits over classical cameras wherever a wide field of view is essential, such as in virtual reality applications or in autonomous robots. • Unfortunately, standard convolutional neural networks are not well suited for this scenario, as the natural projection surface is a sphere which cannot be unwrapped to a plane without introducing significant distortions, particularly in the polar regions. • SphereNet is a deep learning framework which encodes invariance against such distortions explicitly into convolutional neural networks. • Towards this goal, SphereNet adapts the sampling locations of the convolutional filters, effectively reversing the distortions, and wraps the filters around the sphere. • By building on regular convolutions, SphereNet enables the transfer of existing perspective convolutional neural network models to the omnidirectional case. • On the tasks of image classification and object detection, it is evaluated on two newly created semi-synthetic and real-world omnidirectional datasets.
  27. 27. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Overview. (a+b) Capturing images with fisheye or 360◦ action camera results in images which are best represented on the sphere. (c) Using regular convolutions (e.g., with 3 × 3 filter kernels) on the rectified equirectangular representation (see Fig. 2b) suffers from distortions of the sampling locations (red) close to the poles. (d) In contrast, our SphereNet kernel exploits projections (red) of the sampling pattern on the tangent plane (blue), yielding filter outputs which are invariant to latitudinal rotations.
  28. 28. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Kernel Sampling Pattern at φ = 0 (blue) and φ = 1.2 (red) in spherical (a) and equirectangular (b) representation. Note the distortion of the kernel at φ = 1.2 in (b).
  29. 29. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Uniform Sphere Sampling. Comparison of an equirectangular sampling grid on the sphere with N = 200 points (a) to an approximation of evenly distributing N = 127 sampling points on a sphere with the Saff-Kuijlaars method (b, c). Note that the sampling points at the poles are much more evenly spaced in the uniform sphere sampling (b) compared to the equirectangular representation (a), which oversamples the image in these regions.
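The Saff-Kuijlaars spiral mentioned in the caption is simple enough to sketch directly (NumPy; the point count matches the figure, everything else is generic):

```python
import numpy as np

def saff_kuijlaars_points(n):
    """Approximately uniform points on the unit sphere via the Saff-Kuijlaars
    generalized spiral; returns polar angles theta and azimuths phi."""
    k = np.arange(1, n + 1)
    h = -1.0 + 2.0 * (k - 1) / (n - 1)
    theta = np.arccos(h)
    phi = np.zeros(n)
    for i in range(1, n - 1):                    # the two pole points keep phi = 0
        phi[i] = (phi[i - 1] + 3.6 / np.sqrt(n) / np.sqrt(1.0 - h[i] ** 2)) % (2 * np.pi)
    return theta, phi

theta, phi = saff_kuijlaars_points(127)          # N = 127 as in the figure
xyz = np.stack([np.sin(theta) * np.cos(phi),
                np.sin(theta) * np.sin(phi),
                np.cos(theta)], axis=1)          # Cartesian sampling locations
```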
  30. 30. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • SphereNet can be integrated into a convolutional neural network for image classification by adapting the sampling locations of the convolution and pooling kernels. • Furthermore, it is straightforward to additionally utilize a uniform sphere sampling, which is compared to nearest neighbor and bilinear interpolation on an equirectangular representation in the experiments. • The integration of SphereNet into an image classification network does not introduce novel model parameters and no changes to the training of the network are required. • In order to perform object detection on the sphere, the Spherical Single Shot MultiBox Detector (Sphere-SSD) adapts the Single Shot MultiBox Detector (SSD) to objects located on tangent planes of a sphere. • SSD exploits a fully convolutional architecture, predicting category scores and box offsets for a set of default anchor boxes of different scales and aspect ratios. • Sphere-SSD uses a weighted sum between a localization loss and confidence loss. • However, in contrast to the original SSD, anchor boxes are now placed on tangent planes of the sphere and are defined in terms of spherical coordinates of their respective tangent plane, the width/height of the box on the tangent plane as well as an in-plane rotation.
  31. 31. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Spherical Anchor Boxes are gnomonic projections of 2D bounding boxes of various scales, aspect ratios and orientations on tangent planes of the sphere. The figure visualizes anchors of the same orientation at different scales and aspect ratios on a 16 × 8 feature map on a sphere (a) and an equirectangular grid (b).
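A sketch of how such anchors can be enumerated over a 16 × 8 feature map is shown below; the scale, aspect-ratio and rotation sets are placeholders, and the gnomonic projection of each box onto the image plane is omitted.

```python
import numpy as np
from itertools import product

def spherical_anchors(grid_w=16, grid_h=8,
                      scales=(0.1, 0.2), aspect_ratios=(1.0, 2.0, 0.5),
                      rotations=(0.0,)):
    """One tangent plane per feature-map cell, each carrying boxes of several
    scales, aspect ratios and in-plane rotations: (theta, phi, w, h, rot)."""
    anchors = []
    for j, i in product(range(grid_h), range(grid_w)):
        theta = (j + 0.5) / grid_h * np.pi            # polar angle of the tangent point
        phi = (i + 0.5) / grid_w * 2.0 * np.pi        # azimuth of the tangent point
        for s, ar, rot in product(scales, aspect_ratios, rotations):
            w, h = s * np.sqrt(ar), s / np.sqrt(ar)   # box extent on the tangent plane
            anchors.append((theta, phi, w, h, rot))
    return np.array(anchors)

print(spherical_anchors().shape)                      # (16 * 8 * 2 * 3 * 1, 5)
```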
  32. 32. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Detection Results on FlyingCars Dataset. The ground truth is shown in green, SphereNet (NN) results in red.
  33. 33. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery • Recent automotive vision work has focused on processing forward-facing cameras. • However, future autonomous vehicles will not be viable without more comprehensive surround sensing, akin to a human driver, as can be provided by 360◦ panoramic cameras. • Here is an approach to adapt contemporary deep network architectures developed on conventional rectilinear imagery to work on equirectangular 360◦ panoramic imagery. • To address the lack of available annotated panoramic automotive datasets, it adapts a contemporary automotive dataset, via style and projection transformations, to facilitate the cross-domain retraining of contemporary algorithms for panoramic imagery. • Following this approach, it retrains and adapts existing architectures to recover scene depth and 3D pose of vehicles from monocular panoramic imagery without any panoramic training labels or calibration parameters. • This approach is evaluated qualitatively on crowd-sourced panoramic images and quantitatively using an automotive environment simulator to provide the first benchmark for such techniques within panoramic imagery.
  34. 34. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Panoramic images are typically represented using an equirectangular projection (A); in contrast, a conventional camera uses a rectilinear projection. In the equirectangular projection, the image-space coordinates are proportional to the latitude and longitude of observed points rather than the usual projection onto a focal plane. The approach is adapted to estimate monocular depth (B) and to recover the full 3D pose of vehicles from panoramic imagery.
  35. 35. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Convolutions are computed seamlessly across horizontal image boundaries using the padding approach.
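A minimal sketch of such seam-aware convolution using horizontal circular padding is shown below (PyTorch); it illustrates the padding idea rather than reproducing the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WrapPadConv2d(nn.Module):
    """Convolution with wrap-around padding along the width so features are computed
    seamlessly across the left/right seam of an equirectangular panorama; the top and
    bottom borders are zero-padded as usual."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.pad = k // 2
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=0)

    def forward(self, x):
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")   # wrap width
        x = F.pad(x, (0, 0, self.pad, self.pad))                    # zero-pad height
        return self.conv(x)

x = torch.randn(1, 3, 128, 256)
print(WrapPadConv2d(3, 16)(x).shape)        # torch.Size([1, 16, 128, 256])
```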
  36. 36. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Monocular depth recovery and 3D object detection with our approach. Left: Real-world images. Right: Synthetic images.
  37. 37. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving • Moving Object Detection is an important task for achieving robust autonomous driving. • An autonomous vehicle has to estimate collision risk with other interacting objects in the environment and calculate an optimal trajectory. • Collision risk is typically higher for moving objects than static ones due to the need to estimate the future states and poses of the objects for decision making. • This is particularly important for near-range objects around the vehicle, which are typically detected by a fisheye surround-view system that captures a 360◦ view of the scene. • This work presents a CNN architecture for moving object detection using fisheye images captured in an autonomous driving environment. • To target embedded deployment, it designs a lightweight encoder sharing weights across sequential images.
  38. 38. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving Images from the surround-view camera network showing near-field sensing and a wide field of view. Four fisheye cameras (marked green) provide a 360◦ surround view.
  39. 39. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving Network architecture adapted from the ShuffleSeg base network. Two sequential images encoding the motion information across time are utilized to train the network end-to-end for MOD.
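A minimal PyTorch sketch of the two-stream idea is given below: the same lightweight encoder (shared weights) is applied to both consecutive frames, the features are fused, and a small decoder predicts a per-pixel moving/static mask. The layer choices are placeholders, not the ShuffleSeg backbone used in the paper.

```python
import torch
import torch.nn as nn

class TwoFrameMODNet(nn.Module):
    """Shared-encoder sketch for moving object detection from two consecutive frames."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(128, 64, 1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # moving vs. static logits
        )

    def forward(self, frame_t, frame_t1):
        f_t, f_t1 = self.encoder(frame_t), self.encoder(frame_t1)  # shared weights
        return self.decoder(self.fuse(torch.cat([f_t, f_t1], dim=1)))

a = torch.randn(1, 3, 128, 256)
print(TwoFrameMODNet()(a, a).shape)     # torch.Size([1, 2, 128, 256])
```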
  40. 40. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras • Pedestrian detection is one of the most explored topics in computer vision and robotics. • Deep Reinforcement Learning has proved to be within the SoA in terms of both detection in perspective cameras and robotics applications. • However, for detection in omnidirectional cameras, the literature is still scarce, mostly because of their high levels of distortion. • This is an efficient technique for robust pedestrian detection in omnidirectional images. • The method uses deep RL and takes advantage of the distortion in the image. • By considering the 3D bounding boxes and their distorted projections into the image, this method is able to provide the pedestrian's position in the world, in contrast to the image positions provided by most SoA methods for perspective cameras. • The method avoids the need for pre-processing steps to remove the distortion, which are computationally expensive.
  41. 41. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras Illustration of the method, using a multi-task network, for pedestrian detection in omnidirectional cameras. The input is an omnidirectional image with an initial state of the bounding box, represented in the world coordinate system. Using this information, a set of possible actions is applied in order to detect the pedestrian in the 3D environment. After the trigger is activated, the line segments of the estimated 3D bounding box are projected onto the omnidirectional image. Then, the IoU between the ground truth and the estimation is computed in image coordinates.
  42. 42. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras Depiction of the scheme of the proposed network, where the first convolutional layers are shared, and then split into branches (DQN and Classification).
  43. 43. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras This figure shows the image formation using unified central catadioptric cameras. (a) the projection of a point R ∈ R3 onto the normalized image plane {i−, i+} (intermediate projection on the unitary sphere {n− , n+ }). (b) the projection of 3D straight line segments for images using this model (x1 and x2 are the edges of the line’s segment).
  44. 44. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving • Fisheye cameras are commonly employed for obtaining a large field of view in surveillance, augmented reality and in particular automotive applications. • In spite of their prevalence, there are few public datasets for detailed evaluation of computer vision algorithms on fisheye images. • This is the first extensive fisheye automotive dataset, WoodScape, named after Robert Wood, who invented the fisheye camera in 1906. • WoodScape comprises four surround-view cameras and nine tasks, including segmentation, depth estimation, 3D bounding box detection and soiling detection. • Semantic annotation of 40 classes at the instance level is provided for over 10,000 images, and annotations for the other tasks are provided for over 100,000 images.
  45. 45. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving WoodScape, the first fisheye image dataset dedicated to autonomous driving. It contains four cameras covering 360°, accompanied by an HD laser scanner, IMU and GNSS. Annotations are made available for nine tasks, notably 3D object detection, depth estimation (overlaid on the front camera) and semantic segmentation.
  46. 46. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Comparison of fisheye models.
  47. 47. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Undistorting the fisheye image: (a) Rectilinear correction; (b) Piecewise linear correction; (c) Cylindrical correction. Left: raw image; Right: undistorted image.
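For reference, a rectilinear correction of this kind can be sketched with OpenCV's fisheye (equidistant) model; WoodScape ships its own polynomial intrinsic calibration, so the camera matrix K and distortion D below are placeholders, not the dataset's parameters.

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients (k1..k4) for the sketch only.
K = np.array([[320.0, 0.0, 640.0],
              [0.0, 320.0, 480.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.05], [-0.01], [0.002], [0.0]])

img = np.full((960, 1280, 3), 128, np.uint8)       # stand-in for a raw fisheye frame
h, w = img.shape[:2]
# Precompute the undistortion maps once, then remap every frame.
map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
rectilinear = cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)
```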
  48. 48. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Segmentation using ENet (top) and Object detection using Faster RCNN (bottom).
  49. 49. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • Fisheye cameras are commonly used in applications like autonomous driving and surveillance to provide a large field of view (> 180◦). • However, they come at the cost of strong non-linear distortion, which requires more complex algorithms. • Here, Euclidean distance estimation on fisheye cameras for automotive scenes is addressed. • Obtaining accurate and dense depth supervision is difficult in practice, but self-supervised learning approaches show promising results and could potentially overcome the problem. • This is a self-supervised scale-aware framework for learning Euclidean distance and ego-motion from raw monocular fisheye videos without applying rectification. • While it is possible to perform a piecewise linear approximation of the fisheye projection surface and apply standard rectilinear models, this has its own set of issues, such as re-sampling distortion and discontinuities in transition regions.
  50. 50. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving Overview: the 1st row shows the ego masks Mt−1, Mt+1, which indicate which pixel coordinates are valid when reconstructing It−1 from It and It from It+1, respectively. The 2nd row indicates the masking of static pixels computed after 2 epochs, where black pixels are filtered out of the photometric loss (i.e. σ = 0). This prevents dynamic objects moving at a speed similar to the ego car, as well as low-texture regions, from contaminating the loss. The masks are computed for the forward and backward sequences from the input sequence S and the reconstructed images. The 3rd row shows the distance estimates corresponding to the input frames. Finally, the vehicle's odometry data is used to resolve the scale factor issue.
  51. 51. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • The overall self-supervised structure-from-motion (SfM) objective consists of a photometric loss term Lp imposed between the reconstructed target image Ît and the target image It, and a distance regularization term Ls ensuring edge-aware smoothing in the distance estimates. • Finally, Ldc is a cross-sequence distance consistency loss derived from the chain of frames in the training sequence S. • To prevent the training objective from getting stuck in local minima due to the gradient locality of the bilinear sampler, 4 scales are used to train the network. • The distance estimation network is mainly based on the U-Net architecture, an encoder-decoder network with skip connections. • After testing different variants of the ResNet family, a ResNet18 was chosen as the encoder. • The key aspect is replacing regular convolutions with deformable convolutions, since regular CNNs are inherently limited in modeling large, unknown geometric distortions due to their fixed structures, such as fixed filter kernels, fixed receptive field sizes, and fixed pooling kernels.
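For concreteness, the photometric term Lp is typically the usual SSIM/L1 blend used in self-supervised depth and distance estimation; a sketch is given below (PyTorch), with the common α = 0.85 weighting assumed, while the smoothness term Ls and consistency term Ldc are omitted.

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, reconstructed, alpha=0.85):
    """SSIM/L1 blend between the target frame I_t and its reconstruction Î_t warped
    from a neighbouring frame; returns a per-pixel loss map [B, 1, H, W]."""
    l1 = (target - reconstructed).abs().mean(dim=1, keepdim=True)
    # Local statistics for SSIM with a 3x3 averaging window.
    mu_x = F.avg_pool2d(target, 3, 1, 1)
    mu_y = F.avg_pool2d(reconstructed, 3, 1, 1)
    sigma_x = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(reconstructed ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(target * reconstructed, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim_dist = torch.clamp((1 - ssim) / 2, 0, 1).mean(dim=1, keepdim=True)
    return alpha * ssim_dist + (1 - alpha) * l1
```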
  52. 52. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • The backbone of the pose estimation network is based on the paper “Digging into self-supervised monocular depth estimation”, which predicts rotation using an Euler angle parameterization. • Normal convolutions are replaced with deformable convolutions in the encoder-decoder setting. • The rotation is predicted using an axis-angle representation, and the rotation and translation outputs are scaled by 0.01. • For monocular training, a sequence length of three frames is used, while the pose network is formed from a ResNet18, modified to accept a pair of color images (i.e. six channels) as input and to predict a single 6-DoF relative pose between It−1→t and It→t−1. • Horizontal flips and the following training augmentations are performed: random brightness, contrast, saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1. • Importantly, the color augmentations are only applied to the images fed to the networks, not to those used to compute the photometric loss term Lp. • All 3 images fed to the pose and depth networks are augmented with the same parameters.
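The augmentation policy above can be sketched as follows (torchvision); concatenating the three frames so that one jitter draw applies identical parameters to all of them is an implementation choice of this sketch, not necessarily the authors' code.

```python
import torch
from torchvision import transforms

# Colour jitter with the ranges stated above; applied only to the copies fed to the
# depth/pose networks, while the un-jittered frames are kept for the photometric loss.
color_jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2,
                                      saturation=0.2, hue=0.1)

def augment_triplet(frames):
    """frames: list of 3 tensors [3, H, W] in [0, 1] for t-1, t, t+1."""
    if torch.rand(1).item() < 0.5:                     # random horizontal flip
        frames = [torch.flip(f, dims=[-1]) for f in frames]
    strip = torch.cat(frames, dim=-1)                  # [3, H, 3W]: one jitter draw for all frames
    jittered = color_jitter(strip)
    aug = list(torch.chunk(jittered, len(frames), dim=-1))
    return aug, frames                                 # (network inputs, loss images)
```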
  53. 53. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving (a) Depth network: U-Net. (b) Pose network: A separate pose network. (c) Per-pixel minimum reprojection: When correspondences are good, the reprojection loss should be low. (d) Full-resolution multi-scale: Upsample depth predictions at intermediate layers and compute all losses at the input resolution, reducing texture-copy artifacts.
  54. 54. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving FisheyeDistanceNet produces sharp distance maps on distorted fisheye images.
