Why data quality is a chatbot’s foundation of success

Nina Praß Sept. 6, 2018

A chatbot that magically understands everything from scratch does not exist. Each one needs to be carefully trained and maintained in order to develop an understanding of the input it is given. This means that when we train a chatbot on inconsistent or insufficient data, it will not function well and will give unexpected results. Everything the chatbot does, it was taught to do.

This is why we need to pay extra attention to what we give the chatbot to learn. Here are some key factors to take into account when feeding a chatbot with data.

How to categorize data

A chatbot is usually trained for a specific domain such as recruiting or customer service. This domain is divided into categories, each of which holds many different variations of a question that all lead to one answer. In recruiting, these could be categories about the recruitment process, salaries or the company culture. Every sentence assigned to a category should be adequately answered by the answer provided for that category. Using a language processing algorithm, each new incoming question is compared to the data already stored in all the categories and then matched to the category whose questions appear most similar. This is how a chatbot “learns”. It sounds easy in theory but can be quite tricky in reality.
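
The matching step described above can be sketched in a few lines of Python. This is only a toy illustration, not the algorithm of any particular chatbot product: it assumes a simple word-count (bag-of-words) comparison with cosine similarity, and the category names, example questions and function names are all made up for this example.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical training data: category name -> example questions.
CATEGORIES = {
    "salary": [
        "How much will I earn?",
        "What is the salary for this position?",
    ],
    "recruitment_process": [
        "How long does the application process take?",
        "When will I hear back after my interview?",
    ],
    "company_culture": [
        "What is the working atmosphere like?",
        "Do you offer team events?",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_category(question, categories=CATEGORIES):
    """Return (category, score) for the most similar stored example.

    Returns (None, 0.0) if the question shares no words with any example.
    """
    query = Counter(tokenize(question))
    best = (None, 0.0)
    for name, examples in categories.items():
        for example in examples:
            score = cosine_similarity(query, Counter(tokenize(example)))
            if score > best[1]:
                best = (name, score)
    return best


category, score = match_category("What salary can I expect?")
print(category, round(score, 2))
```

A real system would likely use a more robust language model than raw word overlap, but the principle is the same: the quality of the stored examples directly determines which category wins.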

To avoid spoiling the chatbot right from the start, it is important to pay attention to how the data is structured. Each category should be clear and distinctive, containing questions that are as different as possible from those of all other categories. This way we can ensure that incoming questions are assigned to the right category with high accuracy.

Provide many examples in a few categories

It makes sense that the more examples are available for comparison in a category, the higher the chance that one of them matches. That is why we should try to have as many questions in a category as possible. Putting those examples into a few clearly distinct categories keeps the categories from being confused with one another. Using fewer categories also makes the data set easier to manage. It is helpful to find representative keywords that make a category unique.
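
One rough way to look for such keywords is to list the words that occur in one category but in no other. The sketch below assumes a plain word count; the function name `unique_keywords` and the sample data are hypothetical and only serve as an illustration.

```python
import re
from collections import Counter


def unique_keywords(categories, top_n=3):
    """For each category, list its most frequent words that occur in no other category."""
    counts = {
        name: Counter(
            word
            for example in examples
            for word in re.findall(r"[a-z']+", example.lower())
        )
        for name, examples in categories.items()
    }
    result = {}
    for name, counter in counts.items():
        # Collect every word used by any of the other categories.
        others = set()
        for other_name, other_counter in counts.items():
            if other_name != name:
                others.update(other_counter)
        unique = [word for word, _ in counter.most_common() if word not in others]
        result[name] = unique[:top_n]
    return result


# Hypothetical sample data.
sample = {
    "salary": ["What is the salary?", "Is the salary negotiable?"],
    "culture": ["What is the culture like?", "Is the culture relaxed?"],
}
print(unique_keywords(sample))
```

If a category ends up with no unique words at all, that can be a hint that it overlaps too much with its neighbours and might be merged or reworded.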

Use natural data

Generally speaking, the closer the training data is to the way people naturally phrase their questions, the better. This leads to more accurate detection of incoming inquiries.

On the other hand, when the chatbot is trained on a sentence that is never used in natural language, that data is useless. In the worst case, uncontrolled generation of synthetic data produces noise and makes categories collide.

Use cohesive data

Overly diverse and wild formulations introduce noise, destroy a category's internal coherence and create overlap with other categories. Too few and one-sided examples, on the other hand, lead to worse accuracy. Finding the middle ground is therefore key: you should find multiple ways to express a question while being careful not to phrase it too ambiguously or in a way that could be confused with another category.
Keep in mind that shorter example sentences lead to a better overall result.
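
A quick way to spot examples that sit too close to another category is to compare every pair of examples across categories and flag the ones that share most of their words. The sketch below assumes Jaccard word overlap as the similarity measure and an arbitrary threshold of 0.5; both are illustrative choices, and the function names and sample data are made up.

```python
import re
from itertools import combinations


def words(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_collisions(categories, threshold=0.5):
    """Return cross-category example pairs whose word overlap reaches the threshold.

    Each entry is (category_a, example_a, category_b, example_b, score).
    """
    labelled = [(name, ex) for name, examples in categories.items() for ex in examples]
    collisions = []
    for (cat_a, ex_a), (cat_b, ex_b) in combinations(labelled, 2):
        if cat_a == cat_b:
            continue  # only pairs from different categories are a problem
        score = jaccard(words(ex_a), words(ex_b))
        if score >= threshold:
            collisions.append((cat_a, ex_a, cat_b, ex_b, score))
    return collisions


# Hypothetical sample data.
sample_data = {
    "salary": ["How much do I earn?"],
    "benefits": ["How much vacation do I get?", "Is there a company car?"],
}
for collision in find_collisions(sample_data):
    print(collision)
```

Pairs flagged this way are candidates for rewording, or a sign that the two categories are not distinct enough.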

Avoid biased data

Recently, concerns about biased data have increasingly been shaking trust in AI. A lot of research, for example by IBM, has concluded that the data given to chatbots can be bad data. Bad data can contain implicit racial, gender or ideological biases. It is therefore crucial to pay extra attention to avoiding strongly one-sided terminology and to balance out the data set.

If you keep these few key points in mind when feeding your chatbot with data, chances are that it will become your new favorite recruiting assistant.

If you want to learn more about chatbots, get in touch with us!
