Single Shot MultiBox Detectors

  1. One-Stage Detectors
  2. Contents: 01 SSD, 02 RetinaNet, 03 NAS-FPN, 04 EfficientDet
  3. 01 SSD: Single Shot MultiBox Detector
  4. SSD: Introduction. History of object detection.
  5. SSD: Introduction. Comparison of Faster R-CNN and YOLO.
  6. SSD: Introduction. The state of the art is Faster R-CNN, a two-stage detector that hypothesizes bounding boxes, resamples pixels or features for each box, and then classifies them. This is too computationally intensive for embedded systems: even Faster R-CNN reaches only 7 FPS. Faster detectors such as YOLO gain speed but lose accuracy: Faster R-CNN runs at 7 FPS with 73.2% mAP, while YOLO runs at 45 FPS with 63.4% mAP. SSD is the first deep-network-based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. The goal is to catch both rabbits: speed and accuracy.
  7. SSD: Single Shot Detector. SSD uses multiple default boxes and makes predictions on multiple feature maps. High-level features are well abstracted, so they find large objects well; low-level features carry more accurate location information. The intuition: instead of detecting only on the last feature map, detect on the early, middle, and last feature maps.
  8. SSD: Model. SSD modifies VGG-16 and predicts from several feature maps, extracted at Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2. Classifier: 3x3x; detections: 8732; 74.3 mAP at 59 FPS. YOLO, by contrast, has FC layers in the middle(?), predicts only from the last feature map, and produces 98 detections; 63.4 mAP at 45 FPS(?).
  9. SSD: Model. Multi-scale feature maps for detection: detection is performed on several different feature maps. Lower layers localize objects more precisely and higher layers are better abstracted, so the two are combined. Convolutional predictors for detection: the SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. Detection uses 3x3xp convolutional filters, and each output is either a score for a category (1 value) or a shape offset relative to the default box coordinates (4 values). Default boxes and aspect ratios: the default boxes are similar to the anchor boxes used in Faster R-CNN. As in Faster R-CNN, the default boxes serve as initial boxes and the network learns the offsets (x, y, dw, dh) from them.
  10. SSD: Model. Convolutional predictors for detection, in more detail. Classifier: Conv 3x3x(4x(Classes+4)). Structure: the first box has 4 offsets (dx, dy, dh, dw), 20 class scores (the 20 PASCAL VOC classes), and 1 background score; the same layout repeats for the second, third, and up to the sixth box. Output channels: 150 = 6 x (21 + 4).
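The channel count on slide 10 and the 8732 detections quoted on slide 8 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the list of feature-map sizes and boxes per cell is an assumption (the commonly cited SSD300 layout), and the function names are hypothetical, not taken from the slides.

```python
# Assumed SSD300 layout: (feature-map side length, default boxes per cell).
# These values are not listed in the slides; they are an assumption.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def head_channels(boxes_per_cell, num_classes, with_background=True):
    """Output channels of one prediction layer: boxes x (class scores + 4 offsets)."""
    scores = num_classes + (1 if with_background else 0)
    return boxes_per_cell * (scores + 4)


def total_default_boxes(layers):
    """Total number of default boxes over all prediction layers."""
    return sum(side * side * boxes for side, boxes in layers)


channels = head_channels(boxes_per_cell=6, num_classes=20)
total = total_default_boxes(SSD300_LAYERS)
print(channels, total)
```

With 20 PASCAL VOC classes plus background and 6 boxes per cell, the head has 6 x (21 + 4) = 150 channels, and the assumed layout sums to 8732 boxes.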
  11. SSD: Model. For reference, YOLO v3 looks somewhat similar to SSD(?).
  12. SSD: Training. Matching strategy: among the many default boxes, those that overlap a ground-truth (GT) box strongly are matched to it and the rest are treated as background; the criterion is IoU 0.5. From the paper: "we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5)". Jaccard overlap is the same as IoU. The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. The same kind of assignment is also needed for YOLO and for the region proposal stage of Faster R-CNN.
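The matching rule on slide 12 can be sketched as follows. This is a minimal illustration, assuming boxes are given as (xmin, ymin, xmax, ymax) tuples; the function names are hypothetical. The extra step that forces each ground-truth box onto its best default box follows the SSD paper's description and is not spelled out in the slides.

```python
def iou(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(defaults, gts, threshold=0.5):
    """Return, per default box, the index of its matched GT box or -1 (background)."""
    matches = []
    for d in defaults:
        best_gt, best_iou = -1, threshold
        for g_idx, g in enumerate(gts):
            overlap = iou(d, g)
            if overlap > best_iou:  # only overlaps above the threshold count
                best_gt, best_iou = g_idx, overlap
        matches.append(best_gt)
    # Assumed extra step: every GT box keeps its single best default box,
    # even when that overlap is below the threshold.
    if defaults:
        for g_idx, g in enumerate(gts):
            best_d = max(range(len(defaults)), key=lambda i: iou(defaults[i], g))
            matches[best_d] = g_idx
    return matches


defaults = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
gts = [(0, 0, 2, 2), (9.5, 9.5, 12, 12)]
print(match_default_boxes(defaults, gts))
```

Default boxes left at -1 are the background (negative) samples that the loss later has to balance.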
  13. SSD: Training. Training objective: similar to Faster R-CNN. L_conf: the confidence loss is the softmax loss over multiple class confidences. L_loc: we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h); in other words, the network learns how far to move the default box. Width and height are regressed in log space because their scale can grow large. N is the number of matched default boxes.
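The regression targets described on slide 13 can be illustrated with a small encode/decode pair. This is a sketch under the assumption that boxes are given as (cx, cy, w, h) and that center offsets are normalized by the default box size, as in the SSD paper; the function names are hypothetical.

```python
import math


def encode_offsets(gt, default):
    """Regression targets that move a default box onto a ground-truth box.

    Boxes are (cx, cy, w, h). Center offsets are scaled by the default box
    size; width and height are encoded in log space.
    """
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def decode_offsets(offsets, default):
    """Inverse of encode_offsets: apply predicted offsets to a default box."""
    t_cx, t_cy, t_w, t_h = offsets
    d_cx, d_cy, d_w, d_h = default
    return (
        d_cx + t_cx * d_w,
        d_cy + t_cy * d_h,
        d_w * math.exp(t_w),
        d_h * math.exp(t_h),
    )


default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.55, 0.45, 0.4, 0.1)
targets = encode_offsets(gt_box, default_box)
print(targets, decode_offsets(targets, default_box))
```

A ground-truth box twice as wide as the default box yields a width target of log(2), which keeps large size ratios in a compact numeric range.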
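  14. SSD: Training. The example image contains a cat and a dog; the cat is small and the dog is large. On the 8 x 8 feature map (lower-level features), only default boxes on the cat reach IoU 0.5 or higher, because the dog needs a larger view. On the 4 x 4 feature map (higher-level features), only default boxes on the dog reach IoU 0.5 or higher, because the cat is too small. The area of the original image covered by one cell depends on the feature map. This is how the first figure reads once the matching algorithm and the loss are understood.
  15. SSD: Training. Multiple feature maps each try to find the same object.
  16. SSD: Training. Choosing scales and aspect ratios for default boxes (explanation of the formula that generates the default boxes). m is the number of feature maps from which boxes are drawn. s_min and s_max are constants (0.2 and 0.9). k is the index of the chosen feature map. Example for PASCAL VOC: s_k = 0.1, 0.2, 0.55, 0.725, 0.9. Once s_k is computed, the aspect ratios are chosen from a_r ∈ {1, 2, 3, 1/2, 1/3}, and the box shape is width = s_k √a_r, height = s_k / √a_r. A ratio of 1 gives a square, 2 gives a box that is wider than it is tall, and 1/2 gives a box that is taller than it is wide. This produces 5 boxes with different ratios. Each location uses 6 or 4 default boxes: one extra box is generated from the scale alone, and the 4-box variant drops the ratios 3 and 1/3.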
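The width and height rule on slide 16 is easy to try out numerically. The sketch below assumes the linear scale rule from the SSD paper, s_k = s_min + (s_max - s_min)(k - 1)/(m - 1), which the slide refers to but does not spell out in text; the function names are hypothetical, and the extra sixth box is left out.

```python
import math


def default_scales(m, s_min=0.2, s_max=0.9):
    """Evenly spaced scales s_1..s_m between s_min and s_max (assumed linear rule)."""
    if m == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def box_shapes(s_k, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) per aspect ratio: w = s_k * sqrt(ar), h = s_k / sqrt(ar)."""
    return [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in aspect_ratios]


scales = default_scales(m=6)
shapes = box_shapes(scales[0])
print(scales)
print(shapes)
```

Every shape has the same area s_k squared; only the aspect ratio changes, so a ratio of 2 gives a wide box and 1/2 gives a tall one.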
  17. SSD: Training. Hard negative mining: after the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This is a problem common to all detectors: of the 8732 default boxes, only those with IoU 0.5 or higher are positives, so nearly all of the training data is background. The negatives are therefore sorted using the highest confidence loss for each default box, and only the top ones are kept so that the ratio between the negatives and positives is at most 3:1. Data augmentation: each training image is randomly sampled by one of the following options: use the entire original input image; sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9; or randomly sample a patch. The aspect ratio of a sampled patch is between 1/2 and 2. Each patch is horizontally flipped with probability 0.5, and some photometric distortions are applied.
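The 3:1 selection described on slide 17 can be sketched in a few lines. This is an illustration only, assuming per-box confidence losses and a positive/negative flag are already available as plain lists; the function name and the sample numbers are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Keep all positives plus the highest-loss negatives, at most ratio x positives.

    Returns a boolean mask with one entry per default box; boxes where the
    mask is False are ignored by the confidence loss.
    """
    num_pos = sum(1 for p in is_positive if p)
    negatives = [i for i, p in enumerate(is_positive) if not p]
    # Hardest negatives first: sort by confidence loss, descending.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    kept = set(negatives[: neg_pos_ratio * num_pos])
    return [bool(p) or (i in kept) for i, p in enumerate(is_positive)]


losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
positive = [True, False, False, False, False, False, False]
mask = hard_negative_mining(losses, positive)
print(mask)
```

With one positive box, only the three negatives with the largest loss survive, so easy background boxes no longer dominate the gradient.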
  18. SSD: Experimental Results. Base network: VGG16. We convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, and change pool5 from 2 × 2 − s2 to 3 × 3 − s1. We remove all the dropout layers and the fc8 layer. We fine-tune the resulting model using SGD with initial learning rate 10^-3, 0.9 momentum, 0.0005 weight decay, and batch size 32.
  19. SSD: Experimental Results. Both Fast and Faster R-CNN use input images whose minimum dimension is 600. The two SSD models have exactly the same settings except that they have different input sizes (300×300 vs. 512×512).
  20. SSD: Experimental Results. Key: XS = extra-small; S = small; M = medium; L = large; XL = extra-large. Aspect ratio: XT = extra-tall/narrow; T = tall; M = medium; W = wide; XW = extra-wide. SSD does not detect small objects well, but it copes fairly well with distorted aspect ratios.
  21. SSD: Experimental Results. Sensitivity and impact of different object. The paper tries to address the small-object weakness with data augmentation by adding small-object images to the training data: "we first randomly place an image on a canvas of 16× of the original image size filled with mean values before we do any random crop operation". That is, a canvas 16 times larger than the original image is filled with the image's mean values, and the image is pasted onto it before cropping.
  22. SSD: Experimental Results. Other reasons? The origin of FPN. Small objects are detected on the lower layers, but the lower layers are not abstracted enough, which makes detection hard. The higher layers are abstracted enough but struggle with small objects (they find large objects well). The idea is to propagate the abstraction from the higher layers back down to the lower layers. This is where FPN begins; among FPN-based detectors, RetinaNet is examined next.
  23. 02 RetinaNet: Focal Loss for Dense Object Detection
  24. RetinaNet: Introduction. The state of the art is held by two-stage detectors (Faster R-CNN, ...). Could a simple one-stage detector achieve similar accuracy? One-stage detectors are fast and simple, while two-stage detectors are 10~40% more accurate. The obstacle is class imbalance: there are far too many negatives (background). Faster R-CNN reduces the number of bounding boxes heuristically through the RPN, whereas a single-stage detector proposes too many boxes, most of which are background: YOLOv1 (98 boxes), YOLOv2 (~1k), OverFeat (~1-2k), SSD (~8-26k). More default boxes tend to give better performance. "We propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance": the focal loss, which adds a few terms to cross entropy (CE). It shrinks the loss of easy samples even further so that training focuses on hard samples.
  25. RetinaNet: Introduction. Cross entropy with imbalanced data. With 100000 easy examples and 100 hard examples, the easy examples contribute a 40x bigger total loss. CE is therefore modified slightly: the proposed focal loss adds a few terms to cross entropy (CE) so that the loss of easy samples shrinks further and training focuses on hard samples.
  26. RetinaNet: Focal loss
  27. RetinaNet: Focal loss. We introduce the focal loss starting from the cross entropy (CE) loss for binary classification, where y ∈ {±1} specifies the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class with label y = 1.
  28. RetinaNet: Focal loss. Balanced cross entropy. Focal loss definition: the loss of easy examples is reduced further so that training concentrates on hard samples. For instance, with γ = 2, an example classified with pt = 0.9 would have 100× lower loss compared with CE, and with pt ≈ 0.968 it would have 1000× lower loss.
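The numbers quoted for γ = 2 can be reproduced directly from the focal loss definition, FL(pt) = -(1 - pt)^γ log(pt), which reduces to cross entropy at γ = 0. The sketch below is a scalar illustration rather than a training-ready loss; the function names are hypothetical, and the optional alpha argument stands in for the balanced variant mentioned on the slide.

```python
import math


def p_t(p, y):
    """Probability assigned to the true class; y is +1 or -1."""
    return p if y == 1 else 1.0 - p


def cross_entropy(p, y):
    """Binary cross entropy written in terms of p_t."""
    return -math.log(p_t(p, y))


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Focal loss: cross entropy scaled by the modulating factor (1 - p_t)^gamma."""
    pt = p_t(p, y)
    loss = -((1.0 - pt) ** gamma) * math.log(pt)
    if alpha is not None:  # optional class-balancing weight
        loss *= alpha if y == 1 else 1.0 - alpha
    return loss


def reduction_factor(p, y, gamma=2.0):
    """How many times smaller the focal loss is than plain cross entropy."""
    return cross_entropy(p, y) / focal_loss(p, y, gamma)


print(reduction_factor(0.9, 1), reduction_factor(0.968, 1))
```

The reduction factor is 1 / (1 - pt)^γ, so it is 100 at pt = 0.9 and close to 1000 at pt ≈ 0.968, while a hard example with pt = 0.1 keeps about 81% of its cross-entropy loss.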
  29. RetinaNet: Detector. RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image. The second subnet performs convolutional bounding box regression. We construct a pyramid with levels P3 through P7. In the FPN, the spatial resolution is upsampled by a factor of 2 using nearest neighbor for simplicity, together with 1x1 convolutions. Well-abstracted features are passed down to the lower layers so that small objects are also detected well.
  30. RetinaNet: Detector. Experiments.
  31. RetinaNet: Detector. Further questions. If the backbone is kept as-is and only the FPN part is designed well, would performance improve? Must the FPN be fused top-down only? What is the most efficient way to fuse the features? Since the answer is unclear, try mixing everything with AutoML and test it. This leads to NAS-FPN.
  33. 03 NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection
  34. NAS-FPN: Introduction. Current state-of-the-art convolutional architectures for object detection are manually designed. Here we aim to learn a better architecture of feature pyramid network for object detection. The challenge of designing feature pyramid architecture is in its huge design space. Recently, Neural Architecture Search algorithms have demonstrated promising results on efficiently discovering top-performing architectures for image classification in a huge search space. The key contribution of our work is in designing the search space that covers all possible cross-scale connections to generate multiscale feature representations. The discovered architecture, named NAS-FPN, offers great flexibility in building object detection architecture.
  35. NAS-FPN: Method. RetinaNet with NAS-FPN: the architecture of FPN can be stacked N times for better accuracy. The backbone model and the subnets for class and box predictions follow the original design in RetinaNet.
  36. NAS-FPN: Method. The inputs are 5 scales {C3, C4, C5, C6, C7} with corresponding feature strides of {8, 16, 32, 64, 128} pixels. C6 and C7 are created by simply applying stride 2 and stride 4 max pooling to C5. Merging cell: the paper proposes the merging cell, which selects two feature maps and combines them with a suitable operation. It picks two feature maps, selects an output resolution, and merges them with a binary op. The input feature layers are adjusted to the output resolution by nearest neighbor upsampling or max pooling if needed before applying the binary operation. The merged feature layer is always followed by a ReLU, a 3x3 convolution, and a batch normalization layer. The result is put back into the set of feature maps, and the process repeats N times.
  37. NAS-FPN: Method. Merging cell.
  38. NAS-FPN: Experiments. Architecture search for NAS-FPN. Proxy task: to speed up the training of the RNN controller we need a proxy task. The proxy task trains for 10 epochs instead of 50 epochs and uses a small backbone architecture of ResNet-10 with input 512 × 512 image size. Reward: we reserve a randomly selected 7392 images from the COCO train2017 set as the validation set, which we use to obtain rewards. Controller: the controller is a recurrent neural network (RNN) trained using the Proximal Policy Optimization (PPO) algorithm. Figure: the total number of unique architectures generated by the RNN controller.
  39. NAS-FPN: Experiments. Architecture search for NAS-FPN. Left: the reward is computed as the AP of sampled architectures on the proxy task. Right: the number of sampled unique architectures relative to the total number of sampled architectures. The number of unique FPN architectures converges at roughly 8000. How many TPUs did it take to produce this result (100 TPUs? 1000 TPUs?)
  40. NAS-FPN: Experiments. Scalable feature pyramid architecture: 7 merging cells. RCB: ReLU, Conv, BatchNorm. GP: global pooling. Box regression is performed on the features at the blue nodes (feature maps at different scales).
  41. NAS-FPN: Experiments. Architecture graph of NAS-FPN. Feature layers in the same row have identical resolution, and the resolution decreases in the bottom-up direction. Interpretation: FPN only has connections from low resolution to high resolution. The higher the AP of the architectures NAS finds, the more they tend to connect high resolution to low resolution as well. Performance improves as the architecture generates features that connect the high-resolution features used to detect small objects.
  42. NAS-FPN: Experiments. Detection accuracy.
  43. NAS-FPN: Experiments. Further improvements with DropBlock: we apply DropBlock with block size 3x3 after batch normalization layers in the NAS-FPN layers. Using DropBlock improves performance further.
  44. NAS-FPN: Experiments. Further questions. This is a case of AutoML applied to detection. Running AutoML takes an enormous amount of hardware and time; can the rest of us really do it? Is there a more effective method? Multi-resolution features are combined with a plain sum; is there another way? This is where EfficientDet begins.
  45. 04 EfficientDet: Scalable and Efficient Object Detection
  46. EfficientDet: Introduction. Model efficiency has become increasingly important in computer vision. The state-of-the-art object detectors also become increasingly more expensive: the latest AmoebaNet-based NAS-FPN detector requires 167M parameters and 3045B FLOPS (30x more than RetinaNet). Given these real-world resource constraints, model efficiency becomes increasingly important for object detection. First, we propose a weighted bi-directional feature pyramid network. Second, we propose a compound scaling method (as in EfficientNet). We have developed a new family of object detectors, called EfficientDet.
  47. EfficientDet: Introduction. Although these methods tend to achieve better efficiency, they usually sacrifice accuracy. Most previous works only focus on a specific or a small range of resource requirements, but real-world applications vary from mobile devices to datacenters. A natural question: is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints? This is the question shared by every object detection paper: achieve accuracy and efficiency at the same time.
  48. EfficientDet: Introduction. Challenge 1: efficient multi-scale feature fusion. FPN has been widely used for multiscale feature fusion. PANet, NAS-FPN, and other studies have developed more network structures for cross-scale feature fusion. Most previous works simply sum them up without distinction. We propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN). PANet adds one more bottom-up path to the top-down path used in RetinaNet. The reasoning is that low-level features carry more location information, so passing them upward once more gives the higher-level features more location information, which is expected to improve performance.
  49. EfficientDet: Introduction. Challenge 2: model scaling. Inspired by recent works (EfficientNet), we propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width for all backbone, feature network, box/class prediction network. There are three ways to make a model larger (width, depth, and resolution), and the idea is to scale all three together in a balanced way, applying the EfficientNet method.
  50. EfficientDet: Introduction. Our contributions can be summarized as follows. We proposed BiFPN, a weighted bidirectional feature network for easy and fast multi-scale feature fusion. We proposed a new compound scaling method, which jointly scales up backbone, feature network, box/class network, and resolution, in a principled way. Based on BiFPN and compound scaling, we developed EfficientDet.
  51. EfficientDet: BiFPN. Problem formulation. We proposed BiFPN, a weighted bidirectional feature network for easy and fast multi-scale feature fusion. We proposed a new compound scaling method, which jointly scales up backbone, feature network, box/class network, and resolution, in a principled way. Based on BiFPN and compound scaling, we developed EfficientDet.
  52. EfficientDet: BiFPN. Problem formulation. Formally, we are given a list of multi-scale features P^in, the features used in the feature pyramid. Our goal is to find a transformation f that can effectively aggregate different features and output a list of new features.
  53. EfficientDet: BiFPN. Feature network design.
  54. EfficientDet: BiFPN. Cross-scale connections. We observe that PANet achieves better accuracy than FPN and NAS-FPN. (Really? Then why run NAS at all?) First, we remove those nodes that only have one input edge. Our intuition is simple: if a node has only one input edge with no feature fusion, then it will have less contribution; the result is called Simplified PANet. Second, we add an extra edge from the original input to output node if they are at the same level. Third, unlike PANet that only has one top-down and one bottom-up path, we treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times (N times) to enable more high-level feature fusion.
  55. EfficientDet: BiFPN. Weighted feature fusion. A common way to fuse features of different resolutions is to first resize them to the same resolution and then sum them up. Pyramid attention network introduces global self-attention upsampling to recover pixel localization (similar to SENet). Unbounded fusion: w_i is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). The scalar weight is unbounded, so we resort to weight normalization to bound the value range of each weight.
  56. EfficientDet: BiFPN. Softmax-based fusion: an intuitive idea is to apply softmax to each weight, such that all weights are normalized to be a probability with value range from 0 to 1, representing the importance of each input. However, the extra softmax leads to significant slowdown on GPU hardware. Fast normalized fusion: w_i ≥ 0 is ensured by applying a ReLU after each w_i, and ε = 0.0001 is a small value to avoid numerical instability. This fast fusion approach has very similar learning behavior and accuracy as the softmax-based fusion, but runs up to 30% faster on GPUs.
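The two normalization schemes on slide 56 can be compared with scalar (per-feature) weights. The sketch below assumes the inputs have already been resized to a common shape and are flattened into plain Python lists; the function names are hypothetical, and in BiFPN a convolution would follow the fusion step.

```python
import math


def _weighted_sum(inputs, weights):
    """Element-wise weighted sum of equally sized feature vectors."""
    return [
        sum(w * feat[i] for w, feat in zip(weights, inputs))
        for i in range(len(inputs[0]))
    ]


def softmax_fusion(inputs, weights):
    """Softmax-based fusion: weights become probabilities in (0, 1)."""
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return _weighted_sum(inputs, [e / total for e in exps])


def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fast normalized fusion: ReLU keeps w_i >= 0, then divide by eps + sum(w_j)."""
    relu = [max(0.0, w) for w in weights]
    total = eps + sum(relu)
    return _weighted_sum(inputs, [w / total for w in relu])


features = [[1.0, 2.0], [3.0, 4.0]]
print(softmax_fusion(features, [1.0, 1.0]))
print(fast_normalized_fusion(features, [1.0, 1.0]))
```

Both variants keep every normalized weight between 0 and 1; the fast version replaces the exponentials with a ReLU and a single division, which is where the reported speed-up comes from.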
  57. EfficientDet: BiFPN. Fast normalized fusion. Figure labels: P6^td, P6^out, P5^out.
  59. EfficientDet: Architecture. EfficientDet architecture: EfficientNet as the backbone network, BiFPN as the feature network (repeated n times), and a shared class/box prediction network.
  60. EfficientDet: EfficientNet. EfficientNet: a network can be scaled up by adding channels (width), stacking more layers (depth), or enlarging the input image (resolution); the idea is to scale these up in a balanced way.
  61. EfficientDet: EfficientNet. Compound scaling: we propose a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of backbone network, BiFPN network, class/box network, and resolution. Grid search for all dimensions is prohibitively expensive. Therefore, we use a heuristic-based scaling approach. Backbone network: we reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6.
  62. EfficientDet: EfficientNet. BiFPN network: we exponentially grow BiFPN width W_bifpn (#channels) and linearly increase depth D_bifpn (#layers), i.e. the channel width and the number of layers. Box/class prediction network: we fix their width to be always the same as BiFPN (i.e., W_pred = W_bifpn), but linearly increase the depth (#layers). Input image resolution: since feature levels 3-7 are used in BiFPN, the input resolution must be divisible by 2^7 = 128.
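The scaling rules on slide 62 can be sketched as a function of the compound coefficient φ. Only the shape of the rules comes from the slides (exponential width, linear depth, W_pred = W_bifpn, resolution divisible by 128). Every default constant below is an assumption recalled from the EfficientDet paper and should be checked against it; the function name is hypothetical, and the rounding of widths to hardware-friendly values is omitted.

```python
def efficientdet_config(
    phi,
    base_width=64,         # assumed base number of BiFPN channels
    width_growth=1.35,     # assumed exponential growth factor for width
    base_bifpn_depth=3,    # assumed base number of BiFPN layers
    base_head_depth=3,     # assumed base number of box/class head layers
    base_resolution=512,   # assumed input resolution at phi = 0
    resolution_step=128,   # step of 2^7 keeps the resolution divisible by 128
):
    """Heuristic compound scaling driven by a single coefficient phi."""
    w_bifpn = base_width * (width_growth ** phi)  # exponential width growth
    return {
        "w_bifpn": w_bifpn,
        "w_pred": w_bifpn,                        # head width equals BiFPN width
        "d_bifpn": base_bifpn_depth + phi,        # linear depth growth
        "d_head": base_head_depth + phi // 3,     # slower linear depth growth
        "resolution": base_resolution + phi * resolution_step,
    }


for phi in range(4):
    print(phi, efficientdet_config(phi))
```

Because the resolution starts at a multiple of 128 and grows in steps of 128, every configuration satisfies the divisibility requirement that feature levels 3-7 impose.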
  63. EfficientDet: EfficientNet. Scaling configs for EfficientDet D0-D7: W_pred = W_bifpn, backbones EfficientNet-B0 to B6. The scale-up follows the heuristic-based formulas.
  64. EfficientDet: Experiments. EfficientDet performance on COCO.
  65. EfficientDet: Experiments. Model size and inference latency comparison.
  66. EfficientDet: Conclusion. A weighted bidirectional feature network and a customized compound scaling method improve accuracy and efficiency. EfficientDet-D7 achieves state-of-the-art accuracy, with 3.2x faster inference on GPUs and 8.1x faster on CPU.
  67. The End