Virtual Background for Video Conferencing using Machine Learning

Implementation using the DeepLab v3 Image Segmentation model and OpenCV Python

Shaashwat Agrawal
Towards Data Science


Every time our surroundings change, technology evolves with them. The increase in video conferencing during COVID has put the spotlight on virtual backgrounds. Name any video communication service, be it Google Meet, Zoom, or MS Teams, and you will notice this feature. Whether for protecting one's privacy or simply hiding chaotic surroundings, virtual backgrounds can be really helpful.

Curious about how you can implement it with simple coding? This article contains all the information needed to understand and run it on your own system. The concepts of image segmentation and masking are explained with step-by-step code.

Virtual Background

The aim of a virtual background is to let you customize your background for your own reasons. To be able to modify ours, let us understand the details. How do you define a background? In a digital frame, everything except your body can be considered the background. Imagine a frame of your webcam feed and separating your figure from everything else, pixel by pixel.

Image by Simon Barthelmé, from https://dahtah.github.io/imager/foreground_background.html

This image describes the task at hand. The segmented foreground parrot, with its background removed, is the primary focus. It can be superimposed on any background image to form the final composite. Use a simple piece of OpenCV code to import a virtual background image:
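A minimal sketch of that step; the file name background.jpg and the variable name back are placeholders for whatever image you want to use:

```python
import cv2

# load the virtual background image; 'background.jpg' is a placeholder path
back = cv2.imread('background.jpg')
if back is None:
    raise FileNotFoundError('background.jpg not found in the working directory')
```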

Image Segmentation

Image segmentation is a computer vision technique used to divide an image into various segments. The output of segmentation is entirely application-dependent. For a street scene, the segmented image would color-code the cars, humans, roads, stop signs, trees, and other objects present in the scenario.

Image from https://www.anolytics.ai/semantic-segmentation-services/

Our aim is to segment the background and foreground as mentioned before. So if we detect a human figure, we perform its pixel-wise segmentation. This segmentation is implemented using the DeepLab v3 model from TensorFlow. It is among the best open-source implementations to date and is even able to give decent frame rates on video segmentation.

Write the following code to download the model. Once downloaded, copy it to the working directory and delete this block. The model must be downloaded using this code; it cannot be fetched manually through the link.

Reference: https://colab.research.google.com/github/tensorflow/models/blob/master/research/deeplab/deeplab_demo.ipynb#scrollTo=c4oXKmnjw6i_
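A sketch of that download step, adapted from the demo notebook linked above; the tarball name is assumed to be the MobileNetV2 PASCAL VOC checkpoint from the TensorFlow model zoo:

```python
import os
import urllib.request

# DeepLab v3 with a MobileNetV2 backbone, trained on PASCAL VOC
# (tarball name taken from the TF model zoo, as listed in the demo notebook)
MODEL_URL = ('http://download.tensorflow.org/models/'
             'deeplabv3_mnv2_pascal_train_aug_2018_01_29.tar.gz')
TARBALL = 'deeplab_model.tar.gz'

if not os.path.exists(TARBALL):
    urllib.request.urlretrieve(MODEL_URL, TARBALL)
    print('model downloaded to', TARBALL)
```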

Import the necessary libraries. PIL and OpenCV are mainly used for image manipulation after segmentation, while the others are needed to run the DeepLab class. The model can classify the labels listed below. It takes in the downloaded model archive and predicts a segmentation map for each input image.
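The exact import list was embedded as a gist in the original post; a reasonable reconstruction, including the PASCAL VOC labels the pretrained model predicts, looks like this:

```python
import os
import tarfile

import cv2
import numpy as np
import tensorflow as tf
from PIL import Image

# the 21 PASCAL VOC classes the pretrained model can segment;
# index 15 ('person') is the one we care about
LABEL_NAMES = np.asarray([
    'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
    'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
    'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tv'
])
```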

Declare the DeepLab class and the other functions needed to segment the image. The following code segment can be found in the notebook referenced in DeepLab's readme. If you are worrying about the complexity of this code, don't. In general, segmentation models are trained on standard datasets such as PASCAL VOC or COCO. Depending on the architecture and the input and output formats, the following code will vary.
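Since the original gist was embedded, here is a sketch that closely follows the demo notebook's DeepLabModel class, lightly adapted to run on TF 2.x through the compat API:

```python
class DeepLabModel:
    """Loads a frozen DeepLab graph and runs inference on PIL images."""

    INPUT_TENSOR_NAME = 'ImageTensor:0'
    OUTPUT_TENSOR_NAME = 'SemanticPredictions:0'
    INPUT_SIZE = 513
    FROZEN_GRAPH_NAME = 'frozen_inference_graph'

    def __init__(self, tarball_path):
        # extract the frozen inference graph from the downloaded tarball
        self.graph = tf.Graph()
        graph_def = None
        with tarfile.open(tarball_path) as tar_file:
            for tar_info in tar_file.getmembers():
                if self.FROZEN_GRAPH_NAME in os.path.basename(tar_info.name):
                    file_handle = tar_file.extractfile(tar_info)
                    graph_def = tf.compat.v1.GraphDef.FromString(file_handle.read())
                    break
        if graph_def is None:
            raise RuntimeError('Cannot find inference graph in tar archive.')
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name='')
        self.sess = tf.compat.v1.Session(graph=self.graph)

    def run(self, image):
        # resize so the longer side is INPUT_SIZE, then run the frozen graph
        width, height = image.size
        resize_ratio = 1.0 * self.INPUT_SIZE / max(width, height)
        target_size = (int(resize_ratio * width), int(resize_ratio * height))
        resized_image = image.convert('RGB').resize(target_size, Image.LANCZOS)
        batch_seg_map = self.sess.run(
            self.OUTPUT_TENSOR_NAME,
            feed_dict={self.INPUT_TENSOR_NAME: [np.asarray(resized_image)]})
        return resized_image, batch_seg_map[0]
```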

Let us do a status check. We imported the virtual background we need, downloaded the model, loaded it, and defined the DeepLab class. The segmented output can be obtained by calling the run function of the class. Altogether, we have our background and a segmentation that needs to be superimposed on it.
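For instance, assuming the TARBALL path from the download step:

```python
# load the frozen graph once at startup; each run() call then segments one frame
MODEL = DeepLabModel(TARBALL)
print('model loaded successfully!')
```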

OpenCV

The main tasks of OpenCV in our application are to create a mask of the segmentation in the virtual background, resize all the in-use images to the same dimensions, and finally add the masked background and the segmented frame together. Continuing from the loaded model, let us see how each step is performed in OpenCV.

Step 1: Video Capture

Each frame of the video must be extracted, segmented, and added to the virtual background. OpenCV's video capture is used to load the webcam feed and extract every frame. In general, this process would introduce lag, but the lightweight model and minimal preprocessing are able to provide decent frame rates.
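A minimal capture-loop sketch; the per-frame work from Steps 2 to 4 goes inside the loop:

```python
cap = cv2.VideoCapture(0)  # 0 selects the default webcam

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Steps 2-4 (segmentation, map processing, masking) operate on `frame` here

    if cv2.waitKey(1) & 0xFF == ord('q'):  # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```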

Step 2: Frame Segmentation

Every frame is passed through the MODEL.run() function to get the resultant segmentation. The function returns two values: the original image resized, and the segmentation map. The segmentation map is used to form an image outlining the detected boundaries.
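Inside the loop, each BGR frame is converted to an RGB PIL image before being passed to the model; the variable names below are illustrative:

```python
# OpenCV frames are BGR; the model expects an RGB PIL image
pil_frame = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
resized_im, seg_map = MODEL.run(pil_frame)

# continue working at the model's resized resolution, back in BGR for OpenCV
frame = cv2.cvtColor(np.asarray(resized_im), cv2.COLOR_RGB2BGR)
```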

The output from the segmentation mask

Step 3: Segmented Map Processing

The segmented image is used to extract the original posture from each frame. It is first converted into a black-and-white image using the following code. This conversion makes the masking process simpler.
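A sketch of that conversion, assuming the convention used in the masking step below (person pixels black, everything else white):

```python
# PASCAL VOC label 15 is 'person': mark it black (0), everything else white (255)
seg_map = np.where(seg_map == 15, 0, 255).astype(np.uint8)

# replicate to three channels so it aligns with the BGR frame during masking
seg_img = cv2.cvtColor(seg_map, cv2.COLOR_GRAY2BGR)
```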

After seg_img preprocessing

Step 4: Masking

Masking, in simple terms, is laying a cloak over parts of an image. On one hand, we have the black-and-white segmented frame and on the other, the original webcam frame. Wherever there is segmentation (black/0) in seg_img, the respective pixel of the original frame remains the same; everywhere else it is converted to black. This is the mask of the segmentation in the original frame. Using similar logic, the reverse is performed on the virtual background: wherever there is segmentation, the pixels are converted to black. Both masks are sketched below.
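Here back, frame, and seg_img are the background, the resized webcam frame, and the black-and-white map from the previous steps; frame_masked and back_masked are illustrative names:

```python
# resize the virtual background to match the (resized) frame dimensions
back_resized = cv2.resize(back, (frame.shape[1], frame.shape[0]))

# keep the original pixels wherever the person (0) was segmented, black elsewhere
frame_masked = np.where(seg_img == 0, frame, 0).astype(np.uint8)

# the reverse for the background: black out the person-shaped region
back_masked = np.where(seg_img == 0, 0, back_resized).astype(np.uint8)
```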

a. Masked segmentation of the original frame; b. Virtual background mask

Output

We are almost at the finish line. After masking each of the required images, only addition remains. Before showing each frame of the video, add the masked frame and the masked background.
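Putting it together, still inside the capture loop:

```python
# pixel-wise addition of the two masks completes the composite;
# cv2.add saturates at 255 instead of wrapping around
out = cv2.add(frame_masked, back_masked)
cv2.imshow('Virtual Background', out)
```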

Final output

Conclusion

We have implemented image segmentation with the TensorFlow DeepLab v3 model, and the final virtual background is produced with masking and other OpenCV preprocessing. This approach is in line with the current state of the technology. Although the same effect can be achieved with other computer vision and image processing algorithms, including ones without deep learning, they exhibit many flaws. Let us list some observations about ours:

  • Simple image processing algorithms depend on color variation to separate background from foreground. Being a deep learning model, ours instead focuses on identifying the person.
  • The accuracy of the real-time output, especially on video, can be poor at times.
  • On a good system the frame rates are decent, but otherwise the output may lag heavily. To overcome this, the code can either be hosted or run on a CUDA-capable GPU.
  • The code mainly uses TensorFlow and OpenCV, so it can easily be run on any system without many dependency issues.

This concludes the article at large. I hope it was helpful in understanding how a virtual background can be implemented. For the original DeepLab v3 implementation, visit the DeepLab notebook. The resultant accuracy may not be very satisfying, so I look forward to any suggestions for improvement. If you come across any errors or have any doubts, do comment.

About Me

I am a 3rd-year CSE student. My main interests and work lie mostly in deep learning, reinforcement learning, genetic algorithms, and computer vision. If you are interested in these topics, go through my previous blogs and follow me to stay updated. My projects and profile can be viewed on GitHub and LinkedIn.
