
AR/SLAM for end-users


Published on

June 13, 2019, SSII2019 Organized Session: Multimodal 4D sensing. The current state of SLAM technology for end users. Speaker: Tomoyuki Mukasa (Research Scientist, Rakuten Institute of Technology)
https://confit.atlas.jp/guide/event/ssii2019/static/organized#OS2

Published in: Technology

AR/SLAM for end-users

  1. 1. AR/SLAM for end-users June 13, 2019 Tomoyuki Mukasa Rakuten Institute of Technology Rakuten, Inc.
  2. 2. Tomoyuki MUKASA, Ph.D., 3D Vision Researcher. Career timeline (2012, 2015): Ph.D. student, engineer, researcher. Topics: 3D Reconstruction & Motion Analysis, VR for Exhibition, AR for Tourism, AR/VR/HCI for e-commerce
  3. 3. 4 Mission of Our Groups: Create New User Experiences Applicable to Rakuten Services. Contributing to existing businesses; exploring new ideas; increasing tech-brand awareness. Using Computer Vision & Human-Computer Interaction
  4. 4. 5 The 3 R's of Computer Vision. Example annotations on a fashion image: "Woman" (category), "Red Blouse" (attributes)
  5. 5. 6 The 3 Main Points for End-users • Easy access on mobile: AR furniture app; Web AR/SLAM • Improving the experience on mobile: dense 3D reconstruction; occlusion-aware AR; manipulatable AR • Understanding & manipulation of the environment around the user: delivery robots
  6. 6. 7 Easy access on mobile • AR furniture app • Web AR/SLAM
  7. 7. 10 Floor Detection & IMU Fusion Need to be tracked in 3D!
  8. 8. 11 Floor Detection & IMU Fusion Need to be tracked in 3D! Almost solved in ARKit/ARCore…
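The two slides above boil floor placement down to detecting the floor plane and fusing it with IMU data so that the anchor stays fixed in 3D. As a rough illustration of the visual half of that idea, the sketch below fits a ground plane to a sparse point cloud with RANSAC and uses the IMU gravity direction to reject tilted candidates; the thresholds are made up, and this is not how ARKit/ARCore actually implement it.

```python
import numpy as np

def fit_floor_plane(points, gravity_dir, n_iters=200, inlier_thresh=0.02, max_tilt_deg=10.0):
    """RANSAC plane fit over a sparse 3D point cloud (N x 3), keeping only
    candidates whose normal is roughly parallel to gravity.
    Thresholds are illustrative, not tuned values from any shipping SDK."""
    rng = np.random.default_rng(0)
    gravity_dir = gravity_dir / np.linalg.norm(gravity_dir)
    best_plane, best_inliers = None, None
    for _ in range(n_iters):
        # Sample 3 distinct points and compute the plane they span.
        i, j, k = rng.choice(len(points), size=3, replace=False)
        n = np.cross(points[j] - points[i], points[k] - points[i])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / norm
        # Keep only near-horizontal planes: normal aligned with gravity.
        if np.degrees(np.arccos(min(1.0, abs(n @ gravity_dir)))) > max_tilt_deg:
            continue
        d = -n @ points[i]
        dist = np.abs(points @ n + d)          # point-to-plane distances
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, best_inliers

# Toy usage: noisy floor points plus clutter, gravity along -y.
floor = np.random.rand(300, 3); floor[:, 1] = 0.01 * np.random.randn(300)
clutter = np.random.rand(100, 3) + np.array([0.0, 0.5, 0.0])
plane, inliers = fit_floor_plane(np.vstack([floor, clutter]), np.array([0.0, -1.0, 0.0]))
print(plane, inliers.sum())
```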
  9. 9. 12 What is still missing? Merchants’ pages 3D models SLAM w/ scale estimation Advanced visualization w/ inpainting & relighting AR app for everyone E. Zhang, M. F. Cohen, and B. Curless. "Emptying, Refurnishing, and Relighting Indoor Spaces”, SIGGRAPH Asia, 2016.
  10. 10. 13 ARKit / ARCore What is still missing? Merchants’ pages 3D models SLAM w/ scale estimation Advanced visualization w/ inpainting & relighting AR app for everyone E. Zhang, M. F. Cohen, and B. Curless. "Emptying, Refurnishing, and Relighting Indoor Spaces”, SIGGRAPH Asia, 2016.
  11. 11. 14 Pros: • AR without installing a native app • Built with HTML (+Javascript) Cons: • Marker-based • Requires iOS 11 Safari / Android 5 Chrome or later Implementation • AR.js + A-frame Future work • Markerless AR (cf. ARKit, ARCore) • Geolocation-based AR
  12. 12. 15 Simplest Web AR Pros: • No need to install a native app • Easy to create with only HTML (+Javascript) Cons: • Marker-based • Needs a newer environment (iOS 11 Safari, Android 5 Chrome or later) Implementation • AR.js + A-frame
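Slide 12 describes marker-based Web AR built with AR.js + A-frame. At its core, marker-based tracking recovers the camera pose from the four detected marker corners. The sketch below illustrates that step with OpenCV's solvePnP in Python rather than the browser JavaScript stack the slide actually uses; the corner coordinates and intrinsics are made-up example values.

```python
import numpy as np
import cv2  # opencv-python

# Known 3D geometry of a square marker of side 8 cm, centered at the origin.
marker_side = 0.08
obj_pts = np.array([[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]],
                   dtype=np.float64) * (marker_side / 2)

# 2D corner locations in the image; in a real pipeline these come from a marker
# detector (AR.js does this in the browser). Values here are illustrative only.
img_pts = np.array([[310, 200], [420, 205], [415, 318], [305, 312]], dtype=np.float64)

# Assumed pinhole intrinsics (focal length, principal point), no lens distortion.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist)
if ok:
    R, _ = cv2.Rodrigues(rvec)   # rotation of the marker w.r.t. the camera
    print("marker position in camera frame (m):", tvec.ravel())
    # A virtual object anchored to the marker would be rendered with [R | t].
```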
  13. 13. 16 About 1K people tried it at the Spartan Race in Sendai, 2018/12/15 • AR photo booth: 240 groups • AR lottery: 510 people. About 2K people tried it at the Japan Open, 2018/10/02-07
  14. 14. 17 AR message card: trials on Mother's Day & Father's Day, 2019/5/12 and 6/16. AR quiz: trial at Tokyo Dome, 2019/5/16. AR lottery: R-mobile campaigns, 2018 and 2019/3, 4
  15. 15. 18 State-of-the-art Web AR (8th Wall). "8th Wall built their own highly-optimized SLAM engine, and then re-architected it for the mobile web." Augmented reality for the web: JavaScript, WebGL, WebAssembly; six-degrees-of-freedom (6DoF) tracking, point cloud, lighting and surface estimation, image detection. © 2019 8th Wall
  16. 16. 19 Web SLAM: lightweight Web AR + SOTA SLAM. Schneider, Thomas, et al. "Maplab: An Open Framework for Research in Visual-Inertial Mapping and Localization." IEEE Robotics and Automation Letters, 2018.
  17. 17. 20 Results: online recording followed by offline loop closure and optimization (Start / End). Office space mapping and optimization; back-end visualization of the location map
  18. 18. 3D scene initialization. Objects: object detection & recognition, partial view alignment, 3D pose estimation. Room geometry: input image, surface orientation, plane fitting (walls initialized with unknown scale). Result: objects in the 3D scene
  19. 19. 24 Improving the experience on mobile • Dense 3D reconstruction • Occlusion-aware AR • Manipulatable AR
  20. 20. 25 Dense 3D reconstruction • Depth prediction by CNN • SLAM + Depth prediction
  21. 21. 26 Dense Visual Monocular SLAM • Direct method based on photo consistency • Multi-baseline stereo using GPU • Getting easier to run on the latest mobile devices, but still undesirable from the end-user point of view because of energy consumption, etc. R. A. Newcombe, S. J. Lovegrove and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," ICCV, 2011
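DTAM builds a per-pixel cost volume by averaging photometric error over many short-baseline frames for each depth hypothesis. The numpy sketch below shows that multi-baseline photo-consistency cost in its simplest form (pinhole model, nearest-neighbour sampling, no regularization or GPU optimization); it is a conceptual illustration of the direct method, not DTAM itself.

```python
import numpy as np

def photo_consistency_cost(ref_img, ref_K, frames, depths):
    """Cost volume C[d, y, x]: mean absolute intensity difference between each
    reference pixel and its reprojection into the overlapping frames, for every
    depth hypothesis. frames is a list of (img, K, R, t) where (R, t) maps
    reference-camera coordinates into that frame's camera coordinates."""
    h, w = ref_img.shape
    ref_flat = ref_img.astype(float).reshape(-1)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.linalg.inv(ref_K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    cost = np.zeros((len(depths), h, w))
    for di, depth in enumerate(depths):
        pts = rays * depth                              # back-projected 3D points
        per_frame = []
        for img, K, R, t in frames:
            cam = R @ pts + t[:, None]                  # points in the other camera
            proj = K @ cam
            x, y = proj[0] / proj[2], proj[1] / proj[2]
            xi = np.clip(np.round(x).astype(int), 0, w - 1)
            yi = np.clip(np.round(y).astype(int), 0, h - 1)
            diff = np.abs(img.astype(float)[yi, xi] - ref_flat)
            valid = (cam[2] > 0) & (x >= 0) & (x < w) & (y >= 0) & (y < h)
            diff[~valid] = np.nan                       # ignore out-of-view samples
            per_frame.append(diff)
        cost[di] = np.nanmean(per_frame, axis=0).reshape(h, w)
    return cost  # a depth map would be depths[np.nanargmin(cost, axis=0)]
```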
  22. 22. 27 Depth prediction by CNN D. Eigen, C. Puhrsch, and R. Fergus. “Depth map prediction from a single image using a multi-scale deep network.” NIPS, 2014. M. Kaneko, K. Sakurada and K. Aizawa. “MeshDepth: Disconnected Mesh-based Deep Depth Prediction.” ArXiv, 2019. Global Coarse-Scale Network + Local Fine-Scale Network Disconnected mesh representation
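Eigen et al. stack a global coarse-scale network, which captures the overall scene layout, with a local fine-scale network that refines the coarse prediction at full resolution. Below is a minimal PyTorch sketch of that two-stage idea, with toy layer sizes rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseFineDepth(nn.Module):
    """Toy version of a global coarse network + local fine network:
    the fine network sees the image and the upsampled coarse depth."""
    def __init__(self):
        super().__init__()
        # Global coarse scale: aggressive downsampling to capture scene layout.
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )
        # Local fine scale: full resolution, refines the coarse map.
        self.fine = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 1, 5, padding=2),
        )

    def forward(self, img):
        coarse = self.coarse(img)
        coarse_up = F.interpolate(coarse, size=img.shape[-2:], mode="bilinear",
                                  align_corners=False)
        fine = self.fine(torch.cat([img, coarse_up], dim=1))
        return coarse_up + fine  # fine network predicts a residual refinement

depth = CoarseFineDepth()(torch.randn(1, 3, 240, 320))
print(depth.shape)  # torch.Size([1, 1, 240, 320])
```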
  23. 23. 28 SLAM + Depth prediction Semi-dense SLAM + Prediction Compact and optimizable representation of dense geometry K. Tateno, F. Tombari, I. Laina and N. Navab, "CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction," CVPR, 2017. M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger and A. J. Davison. “CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM.” CVPR, 2018.
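One reason to combine monocular SLAM with learned depth, as CNN-SLAM does, is that the CNN prediction carries metric scale while the SLAM reconstruction is only defined up to scale. The sketch below recovers that scale by comparing the two at the tracked feature pixels; the median-ratio estimator is a simplification of mine, not the specific estimator used in the cited papers.

```python
import numpy as np

def estimate_metric_scale(slam_depths, cnn_depths):
    """Estimate the factor that maps up-to-scale SLAM depths to the metric
    depths predicted by a CNN, using the median of per-point ratios.
    slam_depths, cnn_depths: 1D arrays of depths at the same tracked pixels."""
    valid = (slam_depths > 0) & (cnn_depths > 0)
    ratios = cnn_depths[valid] / slam_depths[valid]
    return np.median(ratios)  # median is robust to bad predictions / outliers

# Toy usage: the SLAM reconstruction is the metric scene shrunk by an unknown 0.37.
true_depths = np.random.uniform(1.0, 5.0, size=200)
slam = 0.37 * true_depths
cnn = true_depths * (1.0 + 0.1 * np.random.randn(200))   # noisy metric prediction
scale = estimate_metric_scale(slam, cnn)
print(scale)                                        # ~ 1 / 0.37 ≈ 2.7
print(np.mean(np.abs(slam * scale - true_depths)))  # small residual
```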
  24. 24. 29 Mesh CNN-SLAM (ICCV WS 2017). Client side: image capturing & visualization thread, 2D tracking thread, 3D mapping thread (monocular visual SLAM). Server side: depth prediction thread (depth prediction by CNN), depth fusion thread (3D reconstruction by depth fusion with surface mesh deformation, ARAP, over key-frames t, t+1, ..., t+4)
  25. 25. 30 Mesh CNN-SLAM (ICCV WS 2017) Figure 4. (Top) Distribution of weights w_i for the deformation and (bottom) the corresponding textured mesh; larger intensity values in the top figure indicate higher weights. 4. Experiments: key-frames detected by ORB-SLAM are selected based on visual changes, so we filter them using a spatio-temporal distance criterion similar to other feature-based approaches, e.g., PTAM, and send them to the server. The key-frames are processed on the server and the depth image for each frame is estimated by the CNN architecture. In the fusion process, we convert the depth images to a refined mesh sequence as shown at the bottom of Figure 5. We also make the ground-truth mesh sequence correspond to the refined one from the raw depth maps captured by the depth sensor. We compute residual errors between the refined mesh and the ground truth as shown in Table 2 and Figure 6, and observe that our framework efficiently reduces the residual errors for all sequences: both the average and the median of the residual errors fall within the range from about two thirds to a half. We also evaluate the absolute scale estimated from depth prediction (rightmost column of Table 2); the average error of the estimated scales for our six office scenes is 20% of the ground-truth scale. 5. Conclusion: in this paper, we proposed a framework fusing the result of geometric measurement, i.e., feature-based monocular visual SLAM, with CNN-based depth prediction.
  26. 26. 31 Mesh CNN-SLAM (ICCV WS 2017). Figure 5. Input data for our depth fusion and the reconstructed scenes (Sofa areas 1-3, Desk areas 1-2, Meeting room); from top to bottom row: color images, feature tracking result of SLAM, corresponding ground-truth depth images, depth images estimated by DNN, and 3D reconstruction results on six office scenes, respectively. Table 2 columns: scene; mesh from CNN depth map (mean, median, std dev); refined mesh by our method (mean, median, std dev); scale
  27. 27. 32 Mesh CNN-SLAM (ICCV WS 2017) Table 1. Properties of individual reconstruction methods and of their combination, which retains desirable properties of each:
      • Monocular visual SLAM (feature-based): sparse 3D reconstruction (scene-complexity dependent); low computational complexity (runs on a mobile device); high accuracy; no scale.
      • CNN-based depth prediction: dense (estimated for each pixel); high complexity (a few seconds per frame); medium accuracy (training-data dependent); scale available.
      • Proposed framework: dense (estimated for each pixel); high complexity (but only visual SLAM runs on the mobile device); high accuracy; scale available.
      Curless et al. proposed averaging truncated signed distance functions (TSDF) for depth fusion [3], which is simple yet effective and is used in a large number of reconstruction pipelines, including KinectFusion [21]. Mesh deformation techniques are widely used in graphics and vision; in particular, linear variational mesh deformation techniques were developed for editing detailed high-resolution meshes, like those produced by scanning real-world objects [2], and deformations that are locally as-rigid-as-possible (ARAP) have been proposed for local detail preservation, e.g., the ARAP method by Sorkine et al. [25]. 3.1 Monocular visual SLAM: although our framework is compatible with any type of feature-based monocular visual SLAM method, we employ ORB-SLAM [20] because of its robustness and accuracy. ORB-SLAM incorporates three parallel threads: tracking, mapping, and loop closing. Tracking localizes the camera in every frame and decides when to insert a new key-frame; mapping processes new key-frames and performs local bundle adjustment for reconstruction; loop closing searches for loops with every new key-frame. Each key-frame Kt is associated with a camera pose Tkt at time t, the locations of ORB features p2D(t), and the corresponding ...
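The excerpt on slide 25 mentions filtering ORB-SLAM key-frames with a spatio-temporal distance criterion (similar to PTAM) before sending them to the server for CNN depth prediction. Below is a minimal sketch of such a criterion, with illustrative thresholds rather than the paper's actual parameters.

```python
import numpy as np

def select_keyframes(poses, timestamps, min_translation=0.10,
                     min_rotation_deg=10.0, min_dt=0.5):
    """Keep a key-frame only if it is far enough, in space and time, from the
    previously kept one. poses: list of 4x4 camera-to-world matrices."""
    kept = [0]
    for i in range(1, len(poses)):
        last = poses[kept[-1]]
        rel = np.linalg.inv(last) @ poses[i]            # relative motion
        trans = np.linalg.norm(rel[:3, 3])
        # Rotation angle from the trace of the relative rotation matrix.
        cos_angle = np.clip((np.trace(rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        rot_deg = np.degrees(np.arccos(cos_angle))
        dt = timestamps[i] - timestamps[kept[-1]]
        if (trans > min_translation or rot_deg > min_rotation_deg) and dt > min_dt:
            kept.append(i)                              # send this one to the server
    return kept
```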
  28. 28. 33 Occlusion-aware AR • Optical-flow based depth edge approach • Disparity + Bilateral grid approach • CNN-based approach A. Holynski, J. Kopf. “Fast Depth Densification for Occlusion-aware Augmented Reality.” SIGGRAPH Asia 2018
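Whichever of the three approaches on the following slides produces the dense depth map, the compositing step itself reduces to a per-pixel depth test between the real scene and the rendered virtual content. A small numpy sketch of that final step:

```python
import numpy as np

def composite_with_occlusion(camera_rgb, scene_depth, virtual_rgba, virtual_depth):
    """Per-pixel depth test: the virtual object is drawn only where it is closer
    to the camera than the estimated real-scene depth. All inputs are HxW(xC)
    arrays in the same camera frame; depths in meters, alpha in [0, 1]."""
    visible = (virtual_rgba[..., 3] > 0) & (virtual_depth < scene_depth)
    alpha = np.where(visible, virtual_rgba[..., 3], 0.0)[..., None]
    return alpha * virtual_rgba[..., :3] + (1.0 - alpha) * camera_rgb

# Toy usage: a virtual object at 2 m is hidden where a real object sits at 1.5 m.
h, w = 4, 4
camera = np.ones((h, w, 3)) * 0.5
scene_depth = np.full((h, w), 3.0); scene_depth[:, :2] = 1.5   # close real object
virtual = np.zeros((h, w, 4)); virtual[..., 0] = 1.0; virtual[..., 3] = 1.0
out = composite_with_occlusion(camera, scene_depth, virtual, np.full((h, w), 2.0))
print(out[0, 0], out[0, 3])   # occluded pixel keeps the camera color; visible is red
```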
  29. 29. 34 Optical-flow based depth edge approach (Facebook) A. Holynski, J. Kopf. “Fast Depth Densification for Occlusion-aware Augmented Reality.” SIGGRAPH Asia 2018 (U-Washington + FB)
  30. 30. 35 Disparity + Bilateral grid approach (Google) J. Valentin, et al. “Depth from motion for smartphone AR.” ACM Trans. Graph, 2018.
  31. 31. 36 CNN-based approach (Niantic) C. Godard, O. M. Aodha and G. J. Brostow. “Digging Into Self-Supervised Monocular Depth Estimation.” ArXiv, 2018 C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. • Monodepth: Unsupervised Monocular Depth Estimation with Left-Right Consistency • Monodepth2: Self-Supervised Monocular Depth Estimation
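Monodepth-style self-supervision trains the disparity network with an image-reconstruction loss: the right image of a stereo pair is warped into the left view using the predicted disparity and compared against the real left image. A compact PyTorch sketch of that warping-based photometric term (L1 only; the papers additionally use SSIM, edge-aware smoothness, and an explicit left-right consistency loss):

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    """Sample the right image at x - d to synthesize the left view.
    right: (B,3,H,W); disparity: (B,1,H,W), in pixels, positive."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    xs = xs.unsqueeze(0).expand(b, -1, -1) - disparity[:, 0]   # shift by disparity
    ys = ys.unsqueeze(0).expand(b, -1, -1)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(right, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)

def photometric_loss(left, right, disparity):
    """L1 difference between the real left image and the one reconstructed
    from the right image via the predicted disparity."""
    left_recon = warp_right_to_left(right, disparity)
    return (left - left_recon).abs().mean()

# Toy usage with random tensors standing in for a stereo pair and a prediction.
left = torch.rand(2, 3, 64, 128); right = torch.rand(2, 3, 64, 128)
disp = (torch.rand(2, 1, 64, 128) * 5.0).requires_grad_()
photometric_loss(left, right, disp).backward()   # gradients flow to the disparity
```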
  32. 32. 37 CNN-based approach M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia. “Towards real-time unsupervised monocular depth estimation on cpu.” IROS, 2018. C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. • PyD-Net: Pyramidal features extractor to reduce complexity • Based on Monodepth • Customized for CPU
  33. 33. 38 Manipulatable AR • Disocclusion • Manipulation of the viewpoint & appearance
  34. 34. 39 Spatial consistency for disocclusion P. P. Srinivasan, R. Tucker, J. T. Barron, R. Ramamoorthi, R. Ng and N. Snavely. “Pushing the Boundaries of View Extrapolation with Multiplane Images.” CVPR, 2019. • View synthesis based on Multiplane Image (MPI) Cf. Multiplane camera in animation • Novel view extrapolations with plausible disocclusions • Consistency between rendered views
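A multiplane image represents the scene as fronto-parallel RGBA layers at fixed depths, and a novel view is rendered by warping each layer into the target camera and compositing back to front with the standard "over" operation. A numpy sketch of just the compositing step, assuming the layers have already been warped into the target view:

```python
import numpy as np

def composite_mpi(layers_rgba):
    """Back-to-front 'over' compositing of MPI layers.
    layers_rgba: (D, H, W, 4), ordered from nearest (index 0) to farthest."""
    d, h, w, _ = layers_rgba.shape
    out = np.zeros((h, w, 3))
    for i in range(d - 1, -1, -1):                  # farthest layer first
        rgb, a = layers_rgba[i, ..., :3], layers_rgba[i, ..., 3:4]
        out = a * rgb + (1.0 - a) * out             # 'over' operator
    return out

# Toy usage: a far opaque green layer behind a near, half-transparent red layer.
layers = np.zeros((2, 8, 8, 4))
layers[1, ..., 1] = 1.0; layers[1, ..., 3] = 1.0        # far: opaque green
layers[0, ..., 0] = 1.0; layers[0, ..., 3] = 0.5        # near: 50% red
print(composite_mpi(layers)[0, 0])                      # blend of red over green
```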
  35. 35. 40 Temporal consistency for disocclusion R. Xu, X. Li, B. Zhou and C. C. Loy. “Deep Flow-Guided Video Inpainting.” CVPR, 2019.
  36. 36. 41 Learning Human Depth for Disocclusion Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu and W. T. Freeman. "Learning the Depths of Moving People by Watching Frozen People." CVPR, 2019. Apple also revealed a "people occlusion" feature for ARKit 3 at WWDC 2019
  37. 37. 42 Manipulation of the viewpoint & appearance M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely and R. Martin-Brualla. "Neural Rerendering in the Wild." CVPR, 2019. Total Scene Capture • Encode the 3D structure of the scene, enabling rendering from an arbitrary viewpoint. • Capture all possible appearances of the scene and allow rendering the scene under any of them. • Understand the location and appearance of transient objects in the scene and allow for reproducing or omitting them.
  38. 38. 43 Understanding & manipulation of the environment around the user • Delivery robots
  39. 39. 44 Steps to a fully automated delivery process
