The role of data annotation in the development of AR applications

Augmented reality (AR) applications combine the physical and digital worlds by overlaying digital objects on the real world. AI technologies such as computer vision, object recognition, speech recognition, natural language processing (NLP), and translation enhance AR applications and provide a more compelling user experience. In AR, AI powers object recognition, facial recognition, text recognition and translation, among other features.

The foundation these AI technologies operate on is the colossal volume of data used to train them. Consequently, the efficiency and accuracy of AR applications depend on how well the data has been annotated to fit the requirements of the AI technology. Data annotation labels and categorises training data of all kinds: text, audio, image, or video.

So, how is data annotated for AR applications?

Object recognition and facial recognition involve image data annotation. Image data annotation methods have interesting names such as bounding boxes, lines, polygonal segmentation, splines, semantic segmentation, 3D cuboid, and landmark.

Let us take a quick look at each of these methods.

Bounding box annotation labels objects in an image within a rectangle. Retail AR apps use it to identify products and guide customers to them.

Objects that do not fit into a rectangular shape are labelled using polygonal segmentation. A tourism AR app can use polygonal segmentation to recognise statues and provide further details to users.

Semantic segmentation marks each pixel of the object and is therefore accurate and reveals detailed information. A museum AR app can use semantic segmentation to display reconstructed artefacts, such as a dinosaur skeleton overlaid with muscle and skin. It is also one of the data annotation methods for facial recognition, as it provides fine-grained information. Phone apps with face animation use facial recognition.

Landmark annotation marks key points or dots within the image. Counting applications use it to determine the density of the target object in an image, like finding the number of zombies in an AR game. It is also used for gesture and facial identification.

3D cuboid annotation labels objects in 2D images with height, width, and depth to give them a three-dimensional aspect. A furniture retailer's space planning app can use the 3D cuboid method to display how furniture products fit into a customer's home.

Line and spline annotation labels the image with either straight lines or curves. Gaming AR apps can use lines and splines to demarcate pavements, road lanes, and other road markings.
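To make the difference between these methods concrete, here is a minimal sketch of how bounding box, polygon, and landmark annotations might be stored. The class and field names are hypothetical and do not follow any particular annotation tool's format; the polygon area uses the standard shoelace formula.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class BoundingBox:
    """Rectangle annotation: a label plus two opposite corners."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


@dataclass
class Polygon:
    """Polygon annotation: a label plus an ordered list of vertices."""
    label: str
    points: List[Point]

    def area(self) -> float:
        # Shoelace formula over consecutive vertex pairs.
        total = 0.0
        n = len(self.points)
        for i in range(n):
            x1, y1 = self.points[i]
            x2, y2 = self.points[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0

    def to_bounding_box(self) -> BoundingBox:
        # Smallest axis-aligned rectangle that contains every vertex.
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        return BoundingBox(self.label, min(xs), min(ys), max(xs), max(ys))


@dataclass
class Landmarks:
    """Landmark annotation: a label plus key points, e.g. one per object."""
    label: str
    points: List[Point]

    def count(self) -> int:
        return len(self.points)


# A roughly triangular statue outline (illustrative coordinates).
statue = Polygon("statue", [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)])
statue_box = statue.to_bounding_box()

# One key point per zombie in a game frame (illustrative coordinates).
zombies = Landmarks("zombie", [(10.0, 12.0), (40.0, 8.0), (55.0, 30.0)])
```

In this example the polygon covers half the area of its bounding box, which illustrates why non-rectangular objects are usually labelled with polygons rather than rectangles.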

Video data annotation uses the same methods as image data annotation but is applied frame by frame. An AR app that functions as a guide is a use case for video annotation. These apps overlay information on complicated layouts, allowing users to quickly reach their destination in hotels, museums, or malls.
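As a rough illustration of frame-by-frame annotation, the sketch below stores one bounding box per object per frame and groups the boxes into per-object tracks. The `FrameBox` record, the `object_id` field, and the `group_into_tracks` helper are hypothetical names chosen for this example, not part of any specific tool.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class FrameBox(NamedTuple):
    frame: int       # index of the video frame
    object_id: str   # hypothetical identifier that stays constant across frames
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def group_into_tracks(boxes: List[FrameBox]) -> Dict[str, List[FrameBox]]:
    """Collect the per-frame boxes of each object, ordered by frame index."""
    tracks: Dict[str, List[FrameBox]] = defaultdict(list)
    for box in boxes:
        tracks[box.object_id].append(box)
    for track in tracks.values():
        track.sort(key=lambda b: b.frame)
    return dict(tracks)


# Annotations may arrive in any order; the values are illustrative.
annotations = [
    FrameBox(1, "exit-sign", "sign", 12, 20, 52, 40),
    FrameBox(0, "exit-sign", "sign", 10, 20, 50, 40),
    FrameBox(1, "lift-door", "door", 200, 80, 260, 220),
    FrameBox(2, "exit-sign", "sign", 14, 20, 54, 40),
]
tracks = group_into_tracks(annotations)
```

Grouping by object lets a guide app follow the same sign or doorway from one frame to the next instead of treating every frame as an unrelated image.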

Audio data annotation is of the following types: speech-to-text transcription, sound labelling, event tracking, audio classification, natural language utterance, and music classification. Sound labelling involves categorising and labelling sounds like musical tones or spoken words. Event tracking is done for clips with overlapping sound sources, like a city street. Audio classification separates voice from other sounds. Natural language utterance annotates human speech, focusing on nuances like dialect, semantics, tone, and context. Music classification labels the kinds of music, instruments, and types of music groups. AR apps use event tracking and natural language utterance to enable interaction using speech. AR apps can also use speech-to-text transcription to convert users' spoken comments into overlaid text, for example in a remote assistance AR app or a tourist venue AR app.
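Event tracking for overlapping sound sources can be pictured as a set of labelled time segments on the same clip. The following is a minimal sketch under that assumption; `AudioSegment` and `labels_active_at` are hypothetical names, and the timings are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AudioSegment:
    label: str
    start_s: float  # segment start, in seconds from the beginning of the clip
    end_s: float    # segment end (exclusive), in seconds


def labels_active_at(segments: List[AudioSegment], t: float) -> List[str]:
    """Return the labels of all segments that cover time t, sorted by name."""
    return sorted(s.label for s in segments if s.start_s <= t < s.end_s)


# A city street clip in which several sound sources overlap.
street_clip = [
    AudioSegment("traffic", 0.0, 10.0),
    AudioSegment("speech", 2.0, 5.0),
    AudioSegment("siren", 4.0, 8.0),
]

overlapping = labels_active_at(street_clip, 4.5)
background_only = labels_active_at(street_clip, 9.0)
```

Because each segment carries its own start and end time, several labels can be active at once, which is what distinguishes event tracking from assigning a single class to the whole clip.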

Text annotation consists of sentiment, intent, semantic, entity, and linguistic labelling. Sentiment labelling classifies whether the data has a positive, negative, or neutral overtone. Intent labelling categorises the aim behind the text as command, request, or confirmation. Semantic labelling adds metadata to the subject discussed to understand the concept. Entity labelling marks parts of speech, named entities, and key phrases in the text. Linguistic labelling pinpoints grammatical elements to understand the context and the subject being discussed across multiple sentences. AR tutor apps use text annotation to provide further explanations for a passage that the student has highlighted. Gaming and AR maps use text labelling to read street and traffic signs. Text annotation is used in combination with audio annotation in AR for education, remote assistance, and manufacturing.
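The sketch below shows one way a single annotated text record might combine several of these label types: a sentiment, an intent, and entity spans given as character offsets. The class names and the `LOCATION` label are assumptions made for this example; real annotation schemas vary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EntitySpan:
    start: int   # character offset where the entity begins
    end: int     # character offset just past the end of the entity
    label: str   # hypothetical entity type, e.g. "LOCATION"


@dataclass
class TextAnnotation:
    text: str
    sentiment: str   # "positive", "negative", or "neutral"
    intent: str      # "command", "request", or "confirmation"
    entities: List[EntitySpan]

    def entity_texts(self) -> List[Tuple[str, str]]:
        """Return (surface text, label) pairs for every entity span."""
        return [(self.text[e.start:e.end], e.label) for e in self.entities]


utterance = "Show me the route to the Louvre"
# Compute the offsets from the text itself rather than hard-coding them.
start = utterance.index("Louvre")

record = TextAnnotation(
    text=utterance,
    sentiment="neutral",
    intent="command",
    entities=[EntitySpan(start, start + len("Louvre"), "LOCATION")],
)
```

Storing entities as offsets into the original text keeps the annotation separate from the text, so the same passage can carry sentiment, intent, and entity labels at once.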

Augmented reality is maturing into a utility and is no longer a niche solution. As the technology advances, varied use cases are being conceived across industries. Data annotation is a cornerstone of the growth of AR, as the solution's core functionality depends on the quality of the training data. The focus on and demand for data annotation will need to continue; without it, the result is garbage in, garbage out (GIGO).

