What is a problem with using incorrect training data to train the machine?

Machine Learning is an interesting area. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI.

Source: Google images – Machine Learning

The process of transforming an unstructured data format into a structured one in order to feed it into an ML model is called Data preparation.

Data preparation is often the most difficult and time-consuming step in building a machine learning model. It involves several actions, such as dealing with missing data and dropping unnecessary attributes. The main objective of data preparation is to get the data into a form that an ML algorithm can be applied to.

In this blog, we are going to dive deeper into the data preparation part of the ML pipeline. We will discuss the problems we face with data in the real world and ways to handle insufficient data so that the Machine Learning model still gives accurate results.

After reading this blog, you will know how to handle insufficient data efficiently to get accurate results from ML models. The code examples are written in Python.

Let’s get started!!

What is machine learning?

Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without explicit instructions, relying on patterns instead.

There are 3 types of Machine Learning algorithms:

Supervised (labeled data)
Unsupervised (unlabelled data)
Reinforcement (goal-orientated learning)

Source: Google Images – Machine Learning Types

Machine Learning helps us to:

Work with huge amounts of data
Find patterns in data
Make intelligent decisions

Machine Learning Pipeline

Now let’s take a quick overview of the Machine Learning pipeline 😀

What is the ML pipeline? Machine Learning Pipeline is also called Machine Learning Workflow. It consists of multiple sequential steps that do everything from data extraction and preprocessing to model training and deployment.

Source: Google Images – ML pipeline

As we all know, the data we obtain from a database is often unformatted and dirty, so we cannot simply feed the raw dataset into a machine learning model (doing so leads to inaccurate results). Since the ML model requires well-formatted input, data cleaning is necessary to get accurate predictions from the model.

As we get started by cleaning data, we may encounter many problems like missing values, outliers, and many more. Let us see all the problems associated with data while preparing data for the Machine Learning model.

Problems with data

Insufficient data
Too much data
Non-representative data
Missing data
Duplicate data
Outliers
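
As a rough illustration of three of the problems listed above (duplicate data, missing data, and outliers), here is a minimal sketch using only the Python standard library. The records, the column meanings, and the choice of median filling and the 1.5 × IQR rule are all assumptions made for this example, not a prescribed cleaning recipe.

```python
from statistics import median, quantiles

# Hypothetical toy records: (customer_id, age, income). None marks a missing value.
rows = [
    (1, 34, 52000.0),
    (2, None, 48000.0),   # missing age
    (3, 29, 51000.0),
    (3, 29, 51000.0),     # exact duplicate of the previous row
    (4, 41, 950000.0),    # suspiciously large income
    (5, 38, 47000.0),
    (6, 45, 55000.0),
]

# Duplicate data: drop exact repeats while keeping the original order.
unique_rows = list(dict.fromkeys(rows))

# Missing data: fill a missing age with the median of the known ages.
known_ages = [age for _, age, _ in unique_rows if age is not None]
fill_age = median(known_ages)
filled_rows = [(cid, fill_age if age is None else age, income)
               for cid, age, income in unique_rows]

# Outliers: flag incomes that fall outside 1.5 * IQR beyond the quartiles.
incomes = [income for _, _, income in filled_rows]
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in incomes if x < low or x > high]

print("rows before/after de-duplication:", len(rows), len(unique_rows))
print("age used to fill gaps:", fill_age)
print("income outliers:", outliers)
```

Whether to drop, fill, or keep such values depends on the dataset; the point here is only that each problem can be detected with a few lines before any model is trained.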

So in this blog, we are going to dive deeper into the problem of insufficient data. As we all know, data is the new oil, and lots of it is generated every minute, second, and microsecond. But one may wonder why we have insufficient data when terabytes of it are generated every day!!

LET’S LOOK AT THE ANSWER 🙂

Why do we have insufficient data?

It is a common struggle in the real world. Nowadays every firm stores confidential information in databases that are accessible only to a few executives, not to every employee. For example, in healthcare, patient information is kept confidential and is not shared with anyone for privacy reasons. In such cases, we are given only a small, general subset of the data to make future predictions, which may lead to inaccurate results. Other sectors with similar restrictions include consulting, law, and accounting (bank services).

Disadvantages:

Relevant data may not be available.
The collection process is difficult and time-consuming.

How does it affect our model?

Models trained on insufficient data tend to perform poorly when making predictions. This can show up in two ways:

1. Over-fitting: The model fits the small training set too closely, so it memorizes the patterns in it (including noise) instead of learning general ones. It has low training error and high test error, and does not work well in the real world.

2. Under-fitting: The model is overly simple and is unable to capture the relationships in the data. It has high error on both training and test data, so it has little predictive power.
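
To make the two cases concrete, here is a small sketch that fits polynomials of different degrees to only ten noisy training points and compares training and test error. The target curve, the noise level, and the degrees (1, 3, 9) are arbitrary choices for illustration, and NumPy is assumed to be installed.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Smooth curve the data is drawn from (an arbitrary choice for this sketch)."""
    return np.sin(3.0 * x)

# Deliberately tiny training set: 10 evenly spaced points with noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = target(x_train) + rng.normal(scale=0.1, size=x_train.size)

# Larger held-out set drawn from the same curve.
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = target(x_test) + rng.normal(scale=0.1, size=x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE {train_err[degree]:.4f}, "
          f"test MSE {test_err[degree]:.4f}")
```

The degree-1 line is too simple for the curve, so its error stays high even on the training points (under-fitting). The degree-9 polynomial passes through all ten training points, so its training error is essentially zero, but its error on unseen points is typically much worse (over-fitting).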

1. Model Complexity:

Controlling model complexity means building a simpler model with fewer parameters, which is less susceptible to over-fitting. Examples: Naive Bayes, Linear Regression.
Using the ensemble learning technique: several learners are combined to obtain better performance than any individual learner. It is often used to improve classification and prediction.

Source: Google Images – Ensemble Learning

Implementation:

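# Logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X (feature matrix) and y (labels) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict labels for the held-out test set
y_pred = logreg.predict(X_test)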

Voting Ensemble for Classification

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# X (feature matrix) and y (labels) are assumed to be loaded already
seed = 7  # assumed value; any fixed integer makes the folds reproducible
# shuffle=True is required when random_state is set
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
# mean accuracy of the ensemble across the 10 folds
results = model_selection.cross_val_score(ensemble, X, y, cv=kfold)
print(results.mean())
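
To show what a voting ensemble does with the individual predictions, here is a minimal standard-library sketch of hard (majority) voting, which is the default mode of `VotingClassifier`. The three prediction lists are made-up values for illustration, and the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class labels from several models by hard (majority) voting.

    predictions_per_model: one list of labels per model, all the same length
    (one label per sample).
    """
    combined = []
    # zip(*...) groups the votes that all models cast for the same sample.
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions from three classifiers on five samples.
logistic_preds = [1, 0, 1, 1, 0]
tree_preds     = [1, 1, 0, 1, 0]
svm_preds      = [0, 0, 1, 1, 1]

ensemble_preds = majority_vote([logistic_preds, tree_preds, svm_preds])
print(ensemble_preds)
```

Each model is wrong on a different sample here, and the vote cancels those individual mistakes out. That is the intuition behind combining several learners. An odd number of models avoids ties for binary labels.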

2. Transfer Learning:

Transfer learning is used with deep learning and neural networks. It takes a pre-built model and fine-tunes it on the small dataset that you have.
It can also be defined as the practice of reusing a trained neural network that solves a problem similar to yours, usually leaving the network architecture unchanged and reusing some of the model weights.
It is very useful when the new dataset is small and not sufficient to train the model from scratch.

Source: Google Images – Transfer Learning

Interpretation: In the image above, Data1 is fed into a model in Task 1 and used to make predictions. The knowledge learned there is then transferred to Task 2 to make predictions on another dataset (Data2).

Implementation:

STEP 1 – Installing required packages

#Install Tensorflow via PIP using the command below
pip3 install tensorflow==1.13.1
#Install Keras using the command below
pip3 install keras==2.2.4
#Install OpenCV using the command below
pip3 install opencv-python
#Install Numpy using the command below
pip3 install numpy==1.16.1
#Finally, install ImageAI (v2.0.3) using the command below
pip3 install imageai --upgrade

STEP 2 – Train the model

#imported ImageAI training class
from imageai.Prediction.Custom import ModelTraining

#created an instance of the class and set the Model type to ResNet
trainer = ModelTraining()
trainer.setModelTypeAsResNet()
#specified the path to the folder containing dataset (here it is "fruits")
trainer.setDataDirectory("fruits")
#started the training
trainer.trainModel(num_objects=5, num_experiments=50, enhance_data=True, save_full_model=True, batch_size=32, show_network_summary=True, transfer_from_model="resnet50_weights_tf_dim_ordering_tf_kernels.h5", initial_num_objects=1000, transfer_with_full_training=True)

STEP 3 – Run prediction code

from imageai.Prediction.Custom import CustomImagePrediction
import os
predictor = CustomImagePrediction()
predictor.setModelPath(model_path="transfer_trained_fruits_model_ex-050_acc-0.862500.h5")
predictor.setJsonPath(model_json="model_class.json")
predictor.loadFullModel(num_objects=5)
prediction, probability = predictor.predictImage(image_input=os.path.join(os.getcwd(), "sample.jpg"), result_count=1)
print(prediction, " :", probability)

3. Data Augmentation:

Data augmentation makes slight modifications to existing images to produce new ones.

It takes pre-existing samples and changes them in some way to create new samples, which increases the number of training samples. It is typically used with image data.
Images are perturbed in some way to generate new images, for example by:
- Scaling
- Rotation
- Affine Transforms
These image processing operations are often used as pre-processing techniques to make image classification models built using CNNs more robust.
Implementation:

# Importing necessary functions
from keras.preprocessing.image import (
    ImageDataGenerator, array_to_img, img_to_array, load_img)
# Initialising the ImageDataGenerator class.
# We will pass in the augmentation parameters in the constructor.
datagen = ImageDataGenerator(
        rotation_range = 40,
        shear_range = 0.2,
        zoom_range = 0.2,
        horizontal_flip = True,
        brightness_range = (0.5, 1.5))
# Loading a sample image 
img = load_img('image.jpg') 
# Converting the input sample image to an array
x = img_to_array(img)
# Reshaping the input image
x = x.reshape((1, ) + x.shape) 
# Generating and saving 5 augmented samples
# using the above defined parameters.
# Note: the 'preview' directory must already exist.
i = 0
for batch in datagen.flow(x, batch_size = 1,
                          save_to_dir ='preview',
                          save_prefix ='image', save_format ='jpeg'):
    i += 1
    # stop after exactly 5 generated images
    if i >= 5:
        break
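
For readers who want to see the idea without Keras, here is a minimal NumPy-only sketch of augmentation by flipping and rotating a single image array. The stand-in image and the helper name `augment` are assumptions for illustration; real pipelines usually apply random combinations of such transforms.

```python
import numpy as np

def augment(image):
    """Return simple variants of one image array of shape (height, width, channels)."""
    return [
        np.fliplr(image),        # horizontal flip
        np.flipud(image),        # vertical flip
        np.rot90(image, k=1),    # rotate 90 degrees counter-clockwise
        np.rot90(image, k=2),    # rotate 180 degrees
    ]

# Stand-in "image": a 4x6 RGB array with distinct values so changes are visible.
image = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

variants = augment(image)
for v in variants:
    print(v.shape)
```

One original image yields four additional training samples here. Note that the 90-degree rotation swaps height and width, so non-square images may need resizing before they are fed to a network.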

4. Synthetic Data:

Synthetic data generally refers to artificially generated samples that mimic real-world data (this is done only when we have a good understanding of the features). It may introduce bias into the existing data.

Source: Google Images – Synthetic data

Implementation:

Generating Samples Derived from an Input Dataset

# 1. Get the faces data
# 2. Generate the kernel density model from the data
# 3. Use the kernel density model to generate new samples of data
# 4. Display the original and synthetic faces
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets as dt
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rand_state = 0  # assumed seed; the original value is not given
# Fetch the dataset and store in X
faces = dt.fetch_olivetti_faces()
X = faces.data
# Fit a kernel density model using GridSearchCV to determine the best parameter for bandwidth
bandwidth_params = {'bandwidth': np.arange(0.01, 1, 0.05)}
grid_search = GridSearchCV(KernelDensity(), bandwidth_params)
grid_search.fit(X)
kde = grid_search.best_estimator_
# Generate/sample 8 new faces from this dataset
new_faces = kde.sample(8, random_state=rand_state)
# Show a sample of 8 original face images and 8 generated faces derived from the faces dataset
fig,ax = plt.subplots(nrows=2, ncols=8,figsize=(18,6),subplot_kw=dict(xticks=[], yticks=[]))
for i in np.arange(8):
    ax[0,i].imshow(X[10*i,:].reshape(64,64),cmap=plt.cm.gray)
    ax[1,i].imshow(new_faces[i,:].reshape(64,64),cmap=plt.cm.gray)
ax[0,3].set_title('Original Data',fontsize=20)
ax[1,3].set_title('Synthetic Data',fontsize=20)
fig.subplots_adjust(wspace=.1)
plt.show()
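
Sampling from a Gaussian kernel density estimate amounts to picking a stored data point at random and adding Gaussian noise whose standard deviation equals the bandwidth. The NumPy sketch below spells that out on made-up 2-D data. The function name `sample_gaussian_kde`, the toy data, and the bandwidth value are assumptions for illustration.

```python
import numpy as np

def sample_gaussian_kde(data, bandwidth, n_samples, seed=0):
    """Draw samples from a Gaussian KDE over `data` (shape: n_points x n_features)."""
    rng = np.random.default_rng(seed)
    # Pick stored points uniformly at random (with replacement) ...
    idx = rng.integers(0, data.shape[0], size=n_samples)
    # ... and jitter each one with isotropic Gaussian noise of std = bandwidth.
    noise = rng.normal(scale=bandwidth, size=(n_samples, data.shape[1]))
    return data[idx] + noise

# Hypothetical "real" data: 50 points scattered around (5, -2).
data_rng = np.random.default_rng(42)
real = data_rng.normal(loc=[5.0, -2.0], scale=0.5, size=(50, 2))

synthetic = sample_gaussian_kde(real, bandwidth=0.1, n_samples=200)
print(synthetic.shape)
print("real mean:     ", real.mean(axis=0))
print("synthetic mean:", synthetic.mean(axis=0))
```

Because every synthetic point is a slightly perturbed copy of a real one, the new samples stay close to the original distribution. This is also why synthetic data can carry over, or even amplify, any bias already present in the small dataset.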

Source: Google Images – Synthetic data (images)

References:

1. https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

2. https://www.datacamp.com/community/tutorials/ensemble-learning-python

3. https://www.geeksforgeeks.org/python-data-augmentation/

4. https://medium.com/deepquestai/transfer-learning-with-5-lines-of-code-5e69d0290850

Conclusion:

I hope you enjoyed my article and understood how to handle insufficient data in the real world using the above 4 methods which may be very much useful to get a good Machine Learning model.

And I would like to thank Analytics Vidhya for giving me such a great opportunity to share my knowledge with other people!!

If you have any doubts/suggestions please feel free to contact me on Linkedin / Email.

Once again, THANKS FOR READING :))

About Author:

Hello! This is Priyadharshini, I am currently pursuing M.Sc. in Decision and Computing Sciences. I am very much passionate about Data Science and continuously gaining knowledge from various sources to shine in the Data Science field. I love exploring and analyzing things!!

Thank you!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

What is a problem with using incorrect data to train a machine?

Sample bias. This happens when there's a problem with the data used to train the machine learning model. In this type of bias, the data used is either not large enough or representative enough to teach the system.

What is a problem with using incorrect training data to train a machine multiple choice sample bias prejudice bias measurement bias variance bias?

Sample bias is a problem with training data. It occurs when the data used to train your model does not accurately represent the environment that the model will operate in.

What are some of the potential problems regarding training data?

The model will not perform well if the training data is too small, is noisy with errors and outliers, is not representative (which results in bias), or contains irrelevant features (garbage in, garbage out). The model itself should also be neither too simple (which results in underfitting) nor too complex (which results in overfitting).

What can go wrong with machine learning?

5 Common Machine Learning Problems & How to Solve Them:

1) Understanding Which Processes Need Automation. It's becoming increasingly difficult to separate fact from fiction in terms of Machine Learning today. ...
2) Lack of Quality Data. ...
3) Inadequate Infrastructure. ...
4) Implementation. ...
5) Lack of Skilled Resources.