NLP labelling: The types of data in NLP
Why must humans learn machine language to communicate with machines? Why can’t machines learn our languages? After all, we created machines! This rather simple (some may say simplistic) question can unearth a whole host of responses. It is something computer scientists have been pondering since the time of Alan Turing, whose famous Turing Test is still invoked today when discussing the language processing capabilities of AI systems.
Linguist Noam Chomsky’s work in the 1950s was invaluable in the development of many early NLP algorithms and models, and his work continues to shape the field to this day. In essence, NLP (Natural Language Processing) is a part of Artificial Intelligence (AI) and computational linguistics that enables computers to understand, interpret, and generate human language. That is, it helps machines get closer to understanding human languages.
Data annotation is a critical part of NLP: we label or tag text data with machine-readable information that tells NLP algorithms how to process the tagged data. Clear and concise annotations mean we train machine learning models on better-quality inputs. They also help us scale NLP operations.
For example, let’s take a data set that we are using to train a model on positive and negative sentiment. Consider:
Original text: "I absolutely loved the movie! The acting was fantastic and the story was engaging."
Annotation: Positive sentiment
Original text: "The movie was terrible. The acting was wooden and the story was predictable."
Annotation: Negative sentiment
The annotation helps the machine understand the sentiment being expressed in the text. Such input annotations can help us train a model to recognise language patterns that express positive or negative sentiment.
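As a minimal sketch, the two annotated reviews above could be stored as text and label pairs before being handed to a training pipeline. The class name `LabelledExample` and the label strings `"positive"` and `"negative"` are assumptions made for illustration, not a standard format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledExample:
    """One annotated training example (hypothetical structure)."""
    text: str
    label: str  # assumed label set: "positive" or "negative"


examples = [
    LabelledExample(
        "I absolutely loved the movie! The acting was fantastic "
        "and the story was engaging.",
        "positive",
    ),
    LabelledExample(
        "The movie was terrible. The acting was wooden "
        "and the story was predictable.",
        "negative",
    ),
]

# Count how many examples carry each label, a common sanity check
# on class balance before training.
label_counts = Counter(example.label for example in examples)
print(label_counts)
```

Keeping the raw text and its label together in one record makes it harder for the two to drift apart as the data set grows.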
As one might expect, there are many types of annotations employed by NLP scientists and software engineers. Here are some popular ones:
- Part of speech (POS) tagging
- Named entity recognition (NER)
- Sentiment analysis
- Text classification
- Event extraction
- Coreference resolution
POS tagging involves labeling each word in a sentence with its grammatical category, such as noun, verb, or adjective. POS tagging is useful in areas such as text parsing and information extraction.
Consider the sentence “Sarla sat on the chair”.
We would annotate it as “Sarla/NNP sat/VBD on/IN the/DT chair/NN”, where NNP marks a proper noun, VBD a verb in the past tense, IN a preposition, DT a determiner, and NN a common noun.
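The slash-separated notation is easy to turn into a data structure. The sketch below assumes Penn Treebank-style tags, in which a proper noun such as "Sarla" is tagged NNP; the helper name `parse_tagged` is hypothetical.

```python
def parse_tagged(sentence: str) -> list[tuple[str, str]]:
    """Split 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in sentence.split():
        # rpartition splits on the last slash, so a token that itself
        # contains a slash keeps everything before the tag.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


tagged = parse_tagged("Sarla/NNP sat/VBD on/IN the/DT chair/NN")

# All noun tags in this scheme start with "NN" (NN, NNP, ...).
nouns = [token for token, tag in tagged if tag.startswith("NN")]
print(tagged)
print(nouns)
```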
NER involves identifying and categorizing entities in text data, such as people, places, and organizations. NER is useful for tasks such as information extraction and entity linking.
For example, in the sentence "I ordered a dozen roses from Ferns N Petals", "Ferns N Petals" could be tagged as an ORGANIZATION.
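Entity annotations are often stored as character spans over the original text rather than as copies of the words. The following is a rough sketch of that idea; the `EntitySpan` class and its field names are assumptions, not a specific tool's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A labelled stretch of text (hypothetical structure)."""
    start: int  # index of the first character of the mention
    end: int    # index one past the last character
    label: str


text = "I ordered a dozen roses from Ferns N Petals"
mention = "Ferns N Petals"

# Locate the mention and record where it sits in the sentence.
start = text.index(mention)
span = EntitySpan(start=start, end=start + len(mention), label="ORGANIZATION")

# Slicing the text with the span recovers the tagged mention.
print(text[span.start:span.end], "->", span.label)
```

Storing offsets instead of strings keeps the annotation unambiguous when the same words appear more than once in a document.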
As we saw in the example in the previous section, sentiment analysis involves categorizing text data according to its emotional tone, such as positive, negative, or neutral. It is useful for tasks such as social media monitoring and customer feedback analysis.
With text classification we assign text data to predefined categories or labels, such as topics or genres. Think of ‘Horror Movie’ as a classification for ‘The Shining’. It is useful for tasks such as content filtering and document classification.
Event extraction involves identifying events or actions that are described in text data, along with their associated entities and attributes. Event extraction is useful for tasks such as news summarization and event detection.
Consider the sentence: "Apple announced that it has acquired a startup."
To extract the event described in this sentence, we would identify the relevant entities and attributes and link them to the event. The resulting event might look something like this:
- Event: Acquisition
- Trigger: announced
- Agent: Apple
- Object: a startup
In this example, the event is an acquisition, which is triggered by the verb "announced". The agent performing the event is "Apple", and the object being acquired is "a startup".
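One possible way to hold such an event annotation in code is a small record with one field per role. This is only a sketch: the `Event` class and its field names are made up for illustration, and the values are the ones from the acquisition example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    """An annotated event with its roles (hypothetical structure)."""
    event_type: str
    trigger: str
    agent: str
    obj: str  # the thing acted upon; "object" shadows a Python builtin


sentence = "Apple announced that it has acquired a startup."
event = Event(
    event_type="Acquisition",
    trigger="announced",
    agent="Apple",
    obj="a startup",
)

# Sanity check: every annotated role should appear verbatim in the sentence.
for value in (event.trigger, event.agent, event.obj):
    assert value in sentence, f"{value!r} not found in sentence"

print(asdict(event))
```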
Coreference resolution involves identifying and linking noun phrases in a text that refer to the same real-world entity. Coreference resolution is useful for tasks such as text summarization and question answering.
Consider the following sentences:
A. Mary went to the store to buy some apples. She paid for them with cash.
B. John saw Mary at the store. He said hello to her.
In these sentences, the name "Mary" and the pronouns ‘She’ and ‘her’ are different mentions of the same entity, while ‘He’ refers to John. Coreference resolution identifies which mentions refer to the same entity. By resolving the pronoun ‘She’ in sentence A and the pronoun ‘her’ in sentence B to Mary, and the pronoun ‘He’ in sentence B to John, the resulting coreference resolution might look like this:
A. Mary went to the store to buy some apples. Mary paid for them with cash.
B. John saw Mary at the store. John said hello to Mary.
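To illustrate what a coreference annotation lets us do, the sketch below applies hand-supplied pronoun links to each sentence pair. It does not detect the links itself; the per-sentence mappings stand in for an annotator's decisions, and the helper name `resolve` is hypothetical. Real annotation schemes usually record mention spans grouped into clusters rather than simple word substitutions.

```python
import re


def resolve(sentence: str, antecedents: dict[str, str]) -> str:
    """Replace each annotated pronoun with the entity it refers to."""
    def substitute(match: re.Match) -> str:
        word = match.group(0)
        # Words without an annotation are left unchanged.
        return antecedents.get(word, word)

    # Match whole words only, so "her" never matches inside "there".
    return re.sub(r"\b\w+\b", substitute, sentence)


resolved_a = resolve(
    "Mary went to the store to buy some apples. She paid for them with cash.",
    {"She": "Mary"},
)
resolved_b = resolve(
    "John saw Mary at the store. He said hello to her.",
    {"He": "John", "her": "Mary"},
)
print(resolved_a)
print(resolved_b)
```

The mappings are kept per passage because the same pronoun can refer to different entities in different contexts.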