Published on January 5, 2017
HoloSwap: Object Removal and Replacement using Microsoft HoloLens

Yun Suk Chang∗  Deepashree Gurumurthy†  Hannah Wolfe‡

ABSTRACT

Object removal and replacement in augmented reality has applications in interior design, marker hiding, and many other fields. This project explores the viability of object removal and replacement on the Microsoft HoloLens. We implemented gesture-based object selection, video inpainting, tracking, and three different ways to display our results: (1) the display plane and a white Stanford Bunny placed at the inpainted depth, (2) a solid display plane with video inpainting covering the field of view, and (3) a vignetted view of the video inpainting. We found that placing the display plane and a white Stanford Bunny at the inpainted depth gave the most reasonable results. Due to the semi-transparency of the HoloLens display, replacing the object with a white mesh and placing the object against a light background is important. We also tested three different inpainting masks for the update frame. Our results show that the HoloLens can, in certain cases, be a reasonable platform for object replacement in augmented reality.

1 INTRODUCTION

With increasing research activity and popularity, augmented reality (AR) applications have spread across many different areas. One popular AR application is object removal and replacement, which ranges from interior design to replacing a tracking marker in marker-based AR, using diminished reality techniques. There has been much work on diminished reality applications for tablets and other mobile devices [8, 12, 14]. However, little is known about diminished reality on see-through, head-mounted displays.

There are many advantages to bringing diminished reality applications to see-through, head-mounted display devices. First, the user sees the replaced or removed object with the correct projection from their own viewpoint rather than from the camera's projection.
Second, the user gets a better experience, since the actual object is hidden from their view at all times while they are wearing the display, preventing breaks in immersion. Lastly, in see-through displays, only the diminished portion of the real-world view is affected by display and camera distortions, unlike a camera view, where the whole view is affected.

In this work, we perform diminished reality on the Microsoft HoloLens using an inpainting technique to test the viability of object replacement on a see-through, head-mounted display. We test viability in three different ways: (1) placing an inpainted display frame and a replacement object over the object selected for replacement, to test object replacement; (2) running a video-based inpainting algorithm in view, to test object removal under changing viewpoints; and (3) blending an inpainted display frame into the scene, to test how minimal a seam can be achieved.

Through our evaluation, we discovered the limitations of diminished reality on see-through, head-mounted displays, including problems of eye separation and dominance, the limited field of view, and the inability to completely block out the real-world scene. However, our results show that in certain environments, a see-through display device can achieve reasonable object replacement and removal.

∗e-mail: firstname.lastname@example.org
†e-mail: email@example.com
‡e-mail: firstname.lastname@example.org

2 RELATED WORK

2.1 Object Selection in Augmented Reality

There has been much work on selecting objects in 3D space using augmented reality. For example, Olwal et al. used a tracked glove to point and cast a primitive object into the scene to select an area of interest, which is then used for object selection by statistical analysis. Oda et al., on the other hand, used a tracked Wii remote to select an object by moving an adjustable sphere onto the object of interest. More recently, Miksik et al.
presented the Semantic Paintbrush for selecting real-world objects via a ray-casting technique and semantic segmentation. Our drawing method can be considered a pointing or ray-casting technique, almost like using a precise can of spray paint [3, 15, 17].

2.2 Inpainting in Augmented Reality

Enomoto et al. implemented inpainting of objects using multiple cameras. Later, predetermined object removal for augmented reality was implemented by Herling et al. Researchers have also implemented inpainting to cover markers in augmented reality applications [12, 14]. In those papers, objects are not removed; predetermined 2D markers are simply covered with textures. We could not find any examples of video inpainting in an augmented reality head-mounted display.

2.3 Origins of Inpainting

Inpainting was first done digitally in 1995 to fix scratches in film. Harrison proposed a non-hierarchical procedure for re-synthesis of complex textures, in which an output image was generated by adding pixels selected from the input image. This method produced good results but ran slowly. Other researchers tried to accelerate Harrison's algorithm, and while their results were potentially faster, they caused more artifacts [7, 20]. Many of these papers tried to combine texture synthesis, which replicates textures ad infinitum over large regions using exemplar-based techniques, with inpainting, which fills small gaps in images by propagating linear structures via diffusion. "Fragment-based image completion" would leave blurry patches. "Structure and texture filling-in of missing image blocks in wireless transmission and compression applications" would inpaint some sections and use texture synthesis in others, causing inconsistencies. Criminisi et al. were the first group to combine texture synthesis and inpainting effectively and efficiently; they did this by inpainting based on a priority function instead of raster-scan order.
The PatchMatch algorithm proposed by Barnes et al. is the basis for many modern inpainting algorithms. Their algorithm finds similar patches and copies data from surrounding patches to inpaint areas.

2.4 State of the Art Inpainting

Inpainting has been performed on video sequences in recent years. "Video Inpainting of Complex Scenes" produces very good inpainting results; the algorithm stores information from previous frames and can thereby also recreate missing text. "Real time video inpainting using PixMix" also accounted for illumination changes between frames while inpainting. "Exemplar-Based Image Inpainting Using a Modified Priority Definition" produced high-quality output images by propagating patches based on its priority definition. Lately, novel inpainting algorithms have been included in a pipeline for modular real-time diminished reality for indoor applications. That work, however, only proposed a method that could be used on augmented reality devices and did not implement it on any such device.

3 APPROACH

There are many parts to the pipeline of an interactive application that selects, removes, and replaces objects in augmented reality. The object to be removed or replaced is first selected through interaction with the HoloLens. For object removal, we use a method of inpainting based on PatchMatch. The first frame is captured through the HoloLens, and the output for the first frame is an inpainted area at the location of the selected object. We used three different approaches to inpaint the current frame: the first uses information from the entire frame, the second uses only information surrounding the selected object, and the last uses information from the previous inpainted frame plus a small area around the selected object in the current frame.

To achieve effective inpainting of consecutive frames at the object's location in world space, we incorporated object tracking into our algorithm. Tracking helps inpaint the current location of the object in every frame, thereby accounting for slight changes in camera position. Object replacement is implemented by placing a plane in the scene and placing a 3D mesh (the Stanford Bunny) over the location of the selected object.
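To make the inpainting step concrete, the core of PatchMatch can be sketched in a few dozen lines. The following is an illustrative, numpy-only sketch of the nearest-neighbour-field computation (random initialization, scan-order propagation, and a shrinking random search), not the implementation in our DLL; the function names are ours, and the mask handling and hole filling described in Section 3.3 are omitted for brevity.

```python
import numpy as np

def patch_dist(A, B, ax, ay, bx, by, p):
    """Sum of squared differences between the p x p patches whose top-left
    corners are (ax, ay) in A and (bx, by) in B."""
    da = A[ay:ay + p, ax:ax + p].astype(float)
    db = B[by:by + p, bx:bx + p].astype(float)
    return np.sum((da - db) ** 2)

def patchmatch_nnf(A, B, p=3, iters=4, rng=None):
    """Approximate nearest-neighbour field from patches of A to patches of B."""
    rng = np.random.default_rng(0) if rng is None else rng
    ah, aw = A.shape[0] - p + 1, A.shape[1] - p + 1  # valid top-left coords
    bh, bw = B.shape[0] - p + 1, B.shape[1] - p + 1
    # Random initialization of the field (x, y into B) and its distances.
    nnf = np.stack([rng.integers(0, bw, (ah, aw)),
                    rng.integers(0, bh, (ah, aw))], axis=-1)
    dist = np.empty((ah, aw))
    for y in range(ah):
        for x in range(aw):
            dist[y, x] = patch_dist(A, B, x, y, nnf[y, x, 0], nnf[y, x, 1], p)
    for it in range(iters):
        # Alternate scan direction each iteration, as in Barnes et al.
        xs = range(aw) if it % 2 == 0 else range(aw - 1, -1, -1)
        ys = range(ah) if it % 2 == 0 else range(ah - 1, -1, -1)
        step = 1 if it % 2 == 0 else -1
        for y in ys:
            for x in xs:
                # Propagation: try each scan-order neighbour's offset, shifted.
                for nx, ny in ((x - step, y), (x, y - step)):
                    if 0 <= nx < aw and 0 <= ny < ah:
                        cx = min(max(nnf[ny, nx, 0] + (x - nx), 0), bw - 1)
                        cy = min(max(nnf[ny, nx, 1] + (y - ny), 0), bh - 1)
                        d = patch_dist(A, B, x, y, cx, cy, p)
                        if d < dist[y, x]:
                            nnf[y, x] = (cx, cy); dist[y, x] = d
                # Random search in a window that halves each step.
                r = max(bw, bh)
                while r >= 1:
                    cx = int(np.clip(nnf[y, x, 0] + rng.integers(-r, r + 1), 0, bw - 1))
                    cy = int(np.clip(nnf[y, x, 1] + rng.integers(-r, r + 1), 0, bh - 1))
                    d = patch_dist(A, B, x, y, cx, cy, p)
                    if d < dist[y, x]:
                        nnf[y, x] = (cx, cy); dist[y, x] = d
                    r //= 2
    return nnf, dist

# Demo: on a constant image every patch matches perfectly, so all distances are 0.
A = np.full((12, 12), 7, dtype=np.uint8)
nnf, dist = patchmatch_nnf(A, A, p=3, iters=2)
```

In the real algorithm the matched patches are then blended back into the masked region; here only the correspondence search is shown.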
We also display the full screen capture in the field of view, showing the resulting inpainted sequence of frames.

Figure 1: (a) What the user would see while selecting an area. (b) A selected object.

3.1 Object Selection

For object selection, we let the user draw a 3D contour around the object through the HoloLens. For this we designed the Surface-Drawing method, which lets the user draw on the detected real-world surface as follows: the user pinch-and-drags to start drawing on the detected surface and releases the pinch to finish the drawing. To reduce noise in the gesture input, we sample the user's drawing positions at 30 Hz and the finished annotation's path points at 1 point per 1 mm. We define the drawing position as the intersection between the detected surface mesh and a ray cast from the user's head through the fingertip position. Consequently, while drawing annotations, the user can easily verge on the object of interest, since the annotation is displayed at the detected surface. Pilot studies showed the drawing method to be effective.

The completed contour drawing is transformed into the 2D pixel-space coordinate system so that it can be used by the inpainting algorithm.

3.2 Tracking

Once the object was selected, we needed to track it between frames. We used the OpenCV implementations of Shi and Tomasi's "Good Features to Track" and Bouguet's "Pyramidal implementation of the affine Lucas Kanade feature tracker". On the original frame we ran cv::goodFeaturesToTrack to find 1000 features and then culled them to only those in the selected area. In subsequent frames we ran cv::calcOpticalFlowPyrLK on the features to see which ones were still present.
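The culling step and the selection update it feeds can be sketched in a few lines. This is a numpy-only illustration under our assumptions (elliptical selection, helper names ours); in the running system the point lists and survival flags come from cv::goodFeaturesToTrack and the status output of cv::calcOpticalFlowPyrLK.

```python
import numpy as np

def cull_to_selection(points, center, radii):
    """Keep only feature points inside an elliptical selection area given by
    its center and axis radii (matching our elliptical contours)."""
    d = (points - center) / radii
    return points[np.sum(d ** 2, axis=1) <= 1.0]

def update_selection(prev_pts, next_pts, status, center):
    """Shift the selection center by the velocity of the centroid of features
    still present in both frames (status == 1)."""
    ok = status.astype(bool)
    if not ok.any():
        return center  # nothing tracked; leave the selection where it was
    velocity = next_pts[ok].mean(axis=0) - prev_pts[ok].mean(axis=0)
    return center + velocity

# Demo: three tracked features all shift by (3, -2); the selection follows.
prev = np.array([[10.0, 10.0], [20.0, 10.0], [15.0, 20.0]])
nxt = prev + np.array([3.0, -2.0])
new_center = update_selection(prev, nxt, np.array([1, 1, 1]), np.array([15.0, 13.0]))
```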
We then created a velocity vector from the centroid of the features still present in both the original frame and the update frame, and applied it to the original selection area to find the new selection area to inpaint.

Figure 2: PatchMatch algorithm [CC BY-SA 3.0, Alexandre Delesse]

3.3 Inpainting

Our algorithm uses the PatchMatch algorithm for inpainting. PatchMatch aims to find nearest-neighbour matches between image patches. Best matches are found via random sampling, and the coherence of imagery allows matches to propagate smoothly across patches. In the first step of the algorithm, the nearest-neighbour field is initialized with random offset values or with information available earlier. An iterative process is then applied to the nearest-neighbour field: offsets are examined in scan order, and good patches are propagated to adjacent pixels. While propagating, if a coherent mapping was good earlier, the whole mapping is filled into the adjacent pixels of the same coherent region. A random search is then carried out in the neighbourhood of the best offset found. The halting criterion for the iterations depends on the convergence of the nearest-neighbour field, which Barnes et al. found to occur at around 4-5 iterations.

Figure 2 illustrates the basis of PatchMatch. The grey region in the ellipse is the area of the image that needs to be inpainted. The entire image is scanned for the best matches for the patches surrounding the selected region, and these patches are propagated accordingly.

3.3.1 Algorithm for Video Sequence

In our implementation of HoloSwap, this inpainting algorithm is used to "remove" the selected objects in pixel space. We first select the region of interest, i.e. the object in the visible scene, through interaction with the HoloLens, as in Figure 3a. The selected region of interest is then used to create a mask for inpainting. The initial mask is white in the region of interest and black everywhere else, as shown in Figure 3b.

Figure 3: Example masks for inpainting a selection. (a) Cow selected. (b) Full inpainting mask. (c) Thin inpainting mask. (d) Thick inpainting mask.

The first frame of the video sequence is inpainted in pixel space using the initial mask, and patterns from image patches of the entire frame are used to fill in the region of interest. For consecutive frames, the mask is updated to account for slight movements in the position of the web camera on the HoloLens. The initial mask is eroded using the OpenCV implementation of erosion, and the eroded mask is subtracted from the initial mask, leaving a ring-shaped selection area. This ring-shaped mask, as in Figure 3c, is the update mask for the current frame. The update mask also shifts along with the object's location, by tracking the selected object's features, so that the mask remains in the right location with respect to the selected object. This process is repeated for every frame by updating the mask from the previous frame, thereby achieving object removal by inpainting at the location of the object in pixel space.

3.3.2 Inpainting and Mask Updating Methods

Three methods of inpainting were used in an attempt to improve efficiency. The first method inpainted by selecting matching texture patches from the entire frame; the PatchMatch algorithm scans the entire frame for suitable matches. In the second method, we used only the area around the selected object for inpainting, to improve computation time. To further improve performance, we copied the inpainted portion of the first frame into the area enclosed by the initial mask, to maintain consistency between frames, and updated the inpainting using the area around the selected object. In the third method, we used the ring-shaped mask described in Section 3.3.1 to further reduce computation time.
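The ring-shaped update mask can be sketched as follows. The real system calls OpenCV's cv::erode; this numpy-only sketch implements binary erosion directly with array shifts so that the mask update is self-contained, and the function names are ours.

```python
import numpy as np

def erode(mask, iterations):
    """Binary erosion with a 3x3 square structuring element; each iteration
    peels one pixel off the boundary of the True region."""
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1, constant_values=False)
        m = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] &
             p[1:-1, :-2] & p[1:-1, 2:] &
             p[:-2, :-2] & p[:-2, 2:] & p[2:, :-2] & p[2:, 2:])
    return m

def ring_mask(mask, iterations=10):
    """Update mask for later frames: the selection minus its eroded interior,
    leaving a ring along the boundary of the inpainted region. More erosion
    iterations give a thicker ring (cf. the thin vs. thick masks of Fig. 3)."""
    m = mask.astype(bool)
    return m & ~erode(m, iterations)

# Demo: a 10x10 square selection with 2 erosion iterations leaves a
# 2-pixel-wide ring along its inner boundary.
mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 5:15] = True
ring = ring_mask(mask, iterations=2)
```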
The size of the ring was varied by altering the number of iterations in OpenCV's erosion function, to check its effect on inpainting accuracy.

3.4 Display

We tested two forms of displaying the results: video inpainting rendered in the field of view, and object replacement in a still frame. In both approaches, the display image is placed on a plane (hereafter the "display plane") in the scene. To display the image correctly, we get the world-to-camera matrix and projection matrix from the HoloLens webcam when the image is taken. These matrices are used in the shader to calculate the UV texture space for the plane as follows. First, the plane's corner vertex positions are calculated in world space. Second, the world-space vertex positions are transformed to be relative to the physical web camera on the HoloLens. Third, the camera-relative vertices are converted into normalized clip space. Lastly, the x and y components of the vertices are used to define the UV space of the texture, which correctly applies the image texture to the plane.

3.4.1 Video Field of View

For the video field of view, the display plane was placed slightly in front of the user's view, by positioning and rotating it based on the camera-to-world matrix. Once the inpainted display plane is placed, we repeat the procedure: receive the next frame, call the inpainting update function, and display the result. We implemented this both as a solid display plane and with the edges of the display plane vignetted to transparency.

3.4.2 Replacement in Still Frame

Object replacement in a still frame is done in four steps. First, we retrieve the selected object's 2D center position in pixel space and the inpainted frame image from the inpainting algorithm. Second, we unproject a ray from the 2D center position onto the 3D real-world surface mesh to find an intersection where we can place the display plane.
Third, if an intersection is found, we create the display plane with the inpainted image using the steps described previously and place it at the intersection. Lastly, we place the replacement object at the same intersection point.

To prevent the display plane from occluding the replacement object, we render the two objects separately, rendering the replacement object last so that it appears on top of the display plane.

3.5 System Design

We used Microsoft Visual Studio and Unity to create this project. The application was written in two parts: a dynamic-link library (DLL) for inpainting, and the selection and display code on the HoloLens. The inpainting DLL required OpenCV, so we had to build and add OpenCV's pre-alpha DLLs, which were not complete. We also had a series of C# scripts that managed data transfer with the inpainting DLL, taking screenshots, placing display planes correctly, and capturing gestures.

4 RESULTS

We tested three different ways to display inpainting: (1) the display plane and a bunny at the inpainted depth, (2) a solid display plane covering the field of view, and (3) a vignetted display plane covering the field of view. Examples of these displays are shown in Figure 4. In all three cases, having the frames update every 4 seconds was not ideal and led to discontinuity when the user moved too much between frames, causing ghosting until the next frame updated.

We found that object replacement, displaying a plane and a white bunny at the inpainted depth, was the best way to obscure the inpainted object. When this was done, the majority of the inpainted area was covered by the bunny. We chose white for the bunny because it was the least transparent color on the display. We also found that inpainting worked better on a light or white background, though our example images use a black background.
We tested both a solid and a vignetted display plane covering the user's field of view, and found that both display settings had drawbacks. The solid display plane was better at hiding the object, because no portion of the plane was transparent; the issue was that the user could easily see the display plane's edges, which was potentially distracting. The vignetted display plane's edges melted into the background, but its inpainting was semi-transparent and therefore did not cover the object as well.

Figure 4: The original scene and example views of the three ways we tested object removal and replacement. (a) Original scene. (b) Vignetted inpainted field-of-view image overlaid. (c) Simulated image of the inpainted scene with the Stanford Bunny replacing the phaser. (d) Inpainted field-of-view image overlaid.

Table 1: Inpainting runtime. Different masks drawing patches from different-sized areas were tested on a frame. The selected area to inpaint was an ellipse with width and height 1/16th of the frame (1280 x 720).

  Inpainting Mask | Area Inpainted From | Seconds
  Full mask       | Full frame          | 10+
  Full mask       | 1/4 frame           | 4.85
  Thin mask       | 1/4 frame           | 4.05
  Thick mask      | 1/4 frame           | 4.2

5 ASSESSMENT

HoloSwap was assessed on the accuracy and efficiency of inpainting and object replacement. Our main goal was to assess the viability of real-time object removal and replacement on the Microsoft HoloLens, so runtime was an important factor in assessing our results.

5.1 Object Removal in Video Sequence

For the object removal test cases, we used an elliptical contour about 1/16th the size of the frame. For consistency of evaluation, the selected object was centered in the scene in these cases. Table 1 shows the effective runtime for each of the methods used. We first used the initial mask, as in Figure 3b, for inpainting, with patches extracted from the entire 1280x720 frame. With this method, obtaining the inpainted result took about 10 seconds on average. To reduce the runtime, we reduced the area from which patches were extracted to about a quarter of the frame, which gave a runtime of 4.85 seconds on average.
To further improve the runtime, we retained the inpainting information from the first frame and reduced the mask area so that only the edges of the selected object's region were inpainted, to account for slight movements. Using 10 iterations of OpenCV's erosion function for the mask gave a runtime of 4.05 seconds, but this method significantly deteriorated the quality of inpainting. To improve the quality, we used a thicker mask by increasing the number of erosion iterations to 20, which gave a runtime of 4.2 seconds. Though slightly slower than the thin mask, we retained this as our final method due to its higher-quality inpainting. There was a trade-off between inpainting quality and runtime, and we chose the method that gave the best results among all those explored.

5.2 Object Replacement

We used the Stanford Bunny mesh to replace the selected object. Object replacement was performed by inpainting the selected object in pixel space and then placing the bunny at the location of the previously selected object. In most of the test results, the bunny mesh was placed at the right location and completely occluded the previously selected object. Runtime for object replacement was similar to that of inpainting the selected object using the full mask and information from the full frame. The results were also significantly better for selected objects that had sharp contrast with their background. Object replacement produced better results on the augmented reality device than removal of the selected object alone.

6 DISCUSSION

There are some limitations of the HoloLens that affect the user experience with HoloSwap. First, the user can still see the real object after inpainting, since the HoloLens uses a see-through display. However, when HoloSwap is used in an environment where the selected object is surrounded by white background objects and the HoloLens display brightness is at 100%, the real object becomes hard to see because of the display brightness.

Second, every user has a different experience because of differing eye dominance. For many users, the display plane appears slightly shifted, since the HoloLens display does not account for eye dominance.

Third, there is a limit to how much we can inpaint and display at once. Because the HoloLens webcam has a smaller field of view than the human eye, we cannot inpaint the full field of view a human can see. In addition, since the HoloLens display itself has a small field of view, it cannot display all of the content generated by HoloSwap. Consequently, if the selected object is large or close to the user, the user cannot completely swap it with the replacement object.

7 CONCLUSION

We tested the HoloLens's viability for inpainting and object removal and replacement in real time. We achieved near-real-time inpainting on the HoloLens and believe that, with a more efficient algorithm, we could achieve real-time results. We found that a thick ring-shaped mask was the best way to implement video inpainting with consistency between frames.
Among our three test cases, object replacement, object inpainting with a solid plane, and object inpainting with a vignetted plane, we found that object replacement had the most believable results. The HoloLens has certain disadvantages, such as its field of view, eye separation, and see-through display. However, object replacement on see-through, head-mounted displays can be reasonable under certain circumstances, such as with a white or light background and replacement objects.

ACKNOWLEDGEMENTS

The authors wish to thank Matthew Turk and Tobias Höllerer for their support. We would also like to thank Adam and Brandon for letting us use the HoloLens in time of need.

REFERENCES

[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009.
[2] J.-Y. Bouguet. Pyramidal implementation of the affine Lucas Kanade feature tracker: description of the algorithm. Intel Corporation, 5(1-10):4, 2001.
[3] D. A. Bowman, E. Kruijff, J. J. LaViola, and I. Poupyrev. 3D User Interfaces: Theory and Practice. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.
[4] G. Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.
[5] A. Criminisi, P. Perez, and K. Toyama. Object removal by exemplar-based inpainting. In Computer Vision and Pattern Recognition (CVPR 2003), volume 2, pages II-721. IEEE, 2003.
[6] L.-J. Deng, T.-Z. Huang, and X.-L. Zhao. Exemplar-based image inpainting using a modified priority definition. PLoS ONE, 10(10):e0141199, 2015.
[7] I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image completion. In ACM Transactions on Graphics (TOG), volume 22(3), pages 303-312. ACM, 2003.
[8] A. Enomoto and H. Saito. Diminished reality using multiple handheld cameras. In Proc. ACCV, volume 7, pages 130-135, 2007.
[9] P. Harrison. A non-hierarchical procedure for re-synthesis of complex textures. University of West Bohemia, 2001.
[10] J. Herling and W. Broll. Advanced self-contained object removal for realizing real-time diminished reality in unconstrained environments. In Mixed and Augmented Reality (ISMAR), 2010 9th IEEE International Symposium on, pages 207-212. IEEE, 2010.
[11] J. Herling and W. Broll. High-quality real-time video inpainting with PixMix. IEEE Transactions on Visualization and Computer Graphics, 20(6):866-879, 2014.
[12] N. Kawai, M. Yamasaki, T. Sato, and N. Yokoya. AR marker hiding based on image inpainting and reflection of illumination changes. In Mixed and Augmented Reality (ISMAR), 2012 IEEE International Symposium on, pages 293-294. IEEE, 2012.
[13] A. C. Kokaram, R. D. Morris, W. J. Fitzgerald, and P. J. Rayner. Detection of missing data in image sequences. IEEE Transactions on Image Processing, 4(11):1496-1508, 1995.
[14] O. Korkalo, M. Aittala, and S. Siltanen. Light-weight marker hiding for augmented reality. In Mixed and Augmented Reality (ISMAR), 2010 9th IEEE International Symposium on, pages 247-248. IEEE, 2010.
[15] O. Miksik, V. Vineet, M. Lidegaard, R. Prasaath, M. Nießner, S. Golodetz, S. L. Hicks, P. Perez, S. Izadi, and P. H. S. Torr. The Semantic Paintbrush: interactive 3D mapping and recognition in large outdoor spaces. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI), 2015.
[16] A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Pérez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, 7(4):1993-2019, 2014.
[17] B. Nuernberger, K.-C. Lien, T. Höllerer, and M. Turk. Interpreting 2D gesture annotations in 3D augmented reality. In 2016 IEEE Symposium on 3D User Interfaces (3DUI), pages 149-158, March 2016.
[18] O. Oda and S. Feiner. 3D referencing techniques for physical objects in shared augmented reality. In 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 207-215, Nov 2012.
[19] A. Olwal, H. Benko, and S. Feiner. SenseShapes: using statistical geometry for object selection in a multimodal augmented reality system. In Proceedings of the 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR '03), pages 300-, Washington, DC, USA, 2003. IEEE Computer Society.
[20] S. D. Rane, G. Sapiro, and M. Bertalmio. Structure and texture filling-in of missing image blocks in wireless transmission and compression applications. IEEE Transactions on Image Processing, 12(3):296-303, 2003.
[21] J. Shi and C. Tomasi. Good features to track. In Computer Vision and Pattern Recognition (CVPR '94), pages 593-600. IEEE, 1994.
[22] S. Siltanen. Diminished reality for augmented reality interior design. The Visual Computer, pages 1-16, 2015.
[23] S. Siltanen, H. Sarasp, and J. Karvonen. [Demo] A complete interior design solution with diminished reality. In 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 371-372, Sept 2014.
[24] G. Turk and M. Levoy. The Stanford Bunny, 1993.