
NOISE IN A DATASET


Updated: Aug 6, 2021

Author – Durgesh Mishra

Noisy data is data that contains a large amount of additional meaningless information, called noise. This includes data that is corrupted or distorted.
 

What is Noise in a Dataset?

While collecting data, humans tend to make mistakes and instruments tend to be inaccurate, so the collected data has some error bound to it. This error is referred to as noise in a dataset. Noisy data can significantly impact the prediction of any meaningful information: algorithms can mistake noise for a pattern and start generalizing from it, which is of course undesirable. Therefore, the problem of identifying and handling noise in prediction applications has drawn considerable attention over the years.


Figure-1

Studies have shown that noisy datasets can lead to dramatically decreased classification accuracy and poor prediction results. Improper procedures for removing noise from a dataset can lead to a false sense of accuracy or to false conclusions. In the real world we do not get to observe the true function f(x) directly; instead we get noisy observations y, where

y = f(x) + ϵ

Here we will assume that ϵ is a random variable distributed according to a zero-mean Gaussian with variance σϵ². Note that because ϵ is a random variable, y is also a random variable. As an example, say that the true function f(x) we want to determine has the following form (though in a real-world scenario we would not know the true function):

f(x) = cos(πx)

Thus the observations y we get to see have the following distribution:

y = cos(πx) + N(0, σϵ²)

Below we have plotted the function f(x) and a few random samples y.

Figure-2

  • Our goal is to characterize the function f(x). Since we do not know its functional form, we must instead estimate some other function g(x) that we believe will provide an accurate approximation to f(x).

  • The function g(x) is called an estimator of f(x).

  • In general, an estimator can capture a wide range of functional forms.
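As a small illustration, the following sketch generates noisy observations of f(x) = cos(πx) and fits a simple polynomial as the estimator g(x). The sample size, noise level, and choice of a degree-3 polynomial are illustrative assumptions, not values from this article.

# Minimal sketch of the noise model y = f(x) + ϵ with f(x) = cos(pi * x).
import numpy as np

rng = np.random.default_rng(42)
sigma_eps = 0.2                                   # assumed noise standard deviation

x = rng.uniform(-1, 1, size=50)                   # inputs
y = np.cos(np.pi * x) + rng.normal(0, sigma_eps, size=x.shape)  # noisy observations

# Estimator g(x): a degree-3 polynomial fitted to the noisy samples.
g = np.poly1d(np.polyfit(x, y, deg=3))

x_test = np.linspace(-1, 1, 5)
print("f(x):", np.round(np.cos(np.pi * x_test), 3))
print("g(x):", np.round(g(x_test), 3))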

We don't want our model to overfit or underfit. To manage noisy data, here are some techniques that are extensively used:

1. Train-Test split:

To begin with, the train-test split is a technique for evaluating the performance of a machine learning algorithm. Splitting the dataset is an essential part of an unbiased evaluation of prediction performance. The procedure involves taking a dataset and dividing it into two subsets:

  • Train Dataset: This is used to fit the machine learning model.

  • Test Dataset: This is used to evaluate the fit machine learning model.

Figure-3

The scikit-learn (sklearn) Machine Learning library in Python provides an implementation of the train-test split evaluation procedure via the train_test_split() function imported from sklearn.model_selection.
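A minimal sketch of this procedure, using a toy synthetic dataset and a logistic regression model purely for illustration:

# Hold out 20% of the rows as a test set and evaluate a simple classifier on it.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))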

2. Cross-Validation (CV):

In cross-validation (CV), we run our modeling process on different subsets of the data to get multiple measures of model quality. In k-fold cross-validation, the data is divided into k subsets; we train our model on k-1 subsets and hold out the last one for testing. This process is repeated k times, so that each of the k subsets is used once as the test set while the other k-1 subsets form the training set. We then average the quality scores across the folds before finalizing our model.

Figure-4

Cross-validation gives a more accurate measure of model quality. However, it takes more time to run, because it fits a model once for each fold.

Trade-offs Between Cross-Validation and Train-Test Split:

On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with a train-test split, so if our dataset is small, we should run cross-validation. For larger datasets, a simple train-test split is usually sufficient: it runs faster, and we may have enough data that there is little need to re-use some of it for the holdout. The simplest way to use cross-validation is to call the cross_val_score helper function from sklearn.model_selection on the estimator and the dataset.
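A minimal sketch of cross_val_score, again on a toy synthetic dataset chosen only for illustration:

# 5-fold cross-validation: fit and score the model once per fold, then average.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Score per fold:", scores)
print("Mean accuracy:", scores.mean())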

3. Regularization:

Let us first understand what the bias-variance tradeoff is:

What is a Bias?

Bias is the difference between a model's average prediction and the true value it is trying to predict. A model with high bias oversimplifies the problem and pays very little attention to the training data. It leads to high error on both the training and the test data.

What is a Variance?

Variance is the amount by which the estimate of the target function would change if a different training dataset were used. A model with high variance pays a lot of attention to the training data and does not generalize to unseen data. As a result, such models perform very well on the training dataset but have high error rates on the test dataset. There is an inverse relationship between bias and variance in machine learning.

  • Increasing the bias will decrease the variance.

  • Increasing the variance will decrease the bias.

Figure-5

Underfitting:

Underfitting occurs when a model is unable to capture the underlying pattern of the data. Such models usually have high bias and low variance.

Overfitting:

Overfitting occurs when our model captures the noise along with the underlying pattern in the data. It happens when we train our model heavily on noisy datasets. Such models have low bias and high variance. We need to find the optimal fit for the dataset: finding the right balance between the bias and variance of the model is called the bias-variance trade-off. The expected prediction error decomposes as

Expected prediction error = Variance + Bias² + Irreducible error

Regularization is used when we think our model is too complex (a model with low bias but high variance). It is a method for "constraining" or "regularizing" the size of the coefficients ("shrinking" them towards zero). In regularization, a penalty term that depends on the size of the weights (parameters) of the algorithm is added to the algorithm's cost function. There are two commonly used regularization techniques:

  • Lasso regression (L1 regularization): In Lasso regression, we minimize,

RSS + λ ∑ |βj|

Lasso regression can shrink some coefficients to exactly zero, thus removing them from the model.

  • Ridge regression (L2 regularization): In Ridge regression, we minimize,

RSS + λ ∑ βj²

Ridge regression shrinks coefficients toward zero, but they rarely reach zero.

  • Where λ is a tuning parameter.

  • A tiny (or zero) λ imposes essentially no penalty on the coefficient size and is equivalent to ordinary linear regression.

  • Increasing λ penalizes the coefficients and thus shrinks them towards zero.
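A minimal sketch of both penalties with scikit-learn, on a toy regression dataset. Note that in scikit-learn the tuning parameter λ is called alpha, and the value used here is only an illustrative assumption:

# L1 (Lasso) vs. L2 (Ridge) regularization on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: some coefficients become exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrink but rarely reach zero

print("Lasso coefficients set to zero:", (lasso.coef_ == 0).sum())
print("Ridge coefficients set to zero:", (ridge.coef_ == 0).sum())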

4. PCA:

Noise can also take the form of unwanted data items, features, or records which do not help in explaining the feature itself or the relationship between features and target. One way to deal with unwanted features is to perform feature selection: the process of selecting the features that contribute most to our model and removing or combining the rest. We can also use an important unsupervised machine learning algorithm for dimensionality reduction called Principal Component Analysis (PCA).

PCA is a statistical procedure that uses an orthogonal transformation that converts a set of correlated variables to a set of uncorrelated variables, thus PCA reduces the number of variables of a dataset while preserving as much information as possible.
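A minimal sketch of PCA with scikit-learn; the digits dataset and the 95% variance threshold are illustrative assumptions:

# Keep enough principal components to explain 95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 64 correlated pixel features

pca = PCA(n_components=0.95)             # fraction of variance to preserve
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)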

5. Collecting more Data:

To handle noisy data, we can collect more data. The more data we collect, the better we can estimate the underlying true function, which eventually reduces the effect of noise in our dataset. Some common origins of noise are listed below, followed by a short cleaning sketch:

  • Outliers: Outliers are extreme values that fall outside the range of what is expected. The interquartile range (IQR) approach is the most commonly used and most trusted way of finding outliers in research. If actual outliers are not removed from the data, they corrupt the results to a small or large degree depending on the circumstances. If valid data is identified as an outlier and removed, that also corrupts the results.

  • Duplicate Data: It is necessary to check whether the dataset has duplicate entries; if present, they should be removed, or else they might lead to a poor model.

  • Missing Data (NaN values): Handling missing values is an essential part of the data cleaning and preparation process, because almost all real-life data comes with some missing values. In general, datasets simply arrive with missing data, either because it exists and was not collected or because it never existed. Some methods for dealing with missing data are removing rows, replacing the NaN values with the mean/median/mode, using algorithms that support missing values such as KNN or random forest, and predicting the missing values.
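The short pandas sketch below touches all three sources of noise; the toy DataFrame and the 1.5 × IQR threshold are illustrative assumptions:

# Drop duplicates, filter outliers with the IQR rule, and impute missing values.
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 25, 25, 31, 29, 120, None, 27],   # 120 is an outlier, None is missing
    "income": [40, 42, 42, 55, 50, 48, 46, None],
})

# Duplicate data: drop exact duplicate rows.
df = df.drop_duplicates()

# Outliers: keep only "age" values within 1.5 * IQR of the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range | df["age"].isna()]        # keep missing values for imputation

# Missing data: fill remaining NaN values with each column's median.
df = df.fillna(df.median(numeric_only=True))

print(df)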

Conclusion:

Every dataset comes with some noise. Every data scientist needs to understand the impact noise can have on a model, so it is very important to take care of noise before applying any machine learning algorithm.

Github Link:

  • https://github.com/durg3sh10/Noisy_dataset

References:

  • https://en.wikipedia.org/wiki/Noisy_data

  • https://www.sciencedirect.com/science/article/pii/S1877050919318575

  • https://magoosh.com/data-science/what-is-deep-learning-ai/

  • https://www.i2tutorials.com/what-do-you-mean-by-noise-in-given-dataset-and-how-can-you-remove-noise-in-dataset/

  • https://wikimili.com/en/Raw_data
