Machine Learning is an interesting area. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. Show
Source: Google images – Machine Learning The process of transforming an unstructured data format into a structured one in order to feed it into an ML model is called Data preparation. Data preparation is the most difficult and time-consuming process while building a machine learning model. There are several actions in data preparation such as dealing with missing data, dropping unnecessary attributes, etc,..the main objective of the data preparation is to fit the data to apply an ML algorithm. In this blog, we going to dive deeper into the data preparation part of the ml pipeline. we are going to discuss the problems we face with data in the real world and ways to handle insufficient data to get accurate results from the Machine Learning model. After reading this blog, you will get to know How to process insufficient data efficiently to get accurate results from ML models. We used python language for coding. Let’s get started!! What is machine learning? Machine learning is the scientific study of There are 3 types of Machine Learning algorithms:
Source: Google Images – Machine Learning Types Machine Learning helps us to work with,
Machine Learning PipelineNow let’s take a quick overview of the Machine Learning pipeline 😀 What is the ML pipeline? Machine Learning Pipeline is also called Machine Learning Workflow. It consists of multiple sequential steps that do everything from data extraction and preprocessing to model training and deployment. Source: Google Images – ML pipeline As we all know the data we obtain from the database is unformatted and dirty. So, we just cannot feed the raw dataset into the machine learning model (if done, it leads to inaccurate results). Since the ML model requires well-formatted data to be fed into it, the process of data cleaning is necessary to get accurate predictions from the model. As we get started by cleaning data, we may encounter many problems like missing values, outliers, and many more. Let us see all the problems associated with data while preparing data for the Machine Learning model. Problems with data
So in this blog, we are going to dive deeper into the problem of, insufficient data. As we all know, Data is the new oil and lots of data is generated every minute, second, and microsecond. But one may wonder why do we have insufficient when data is generated in terabytes every day!! LET’S LOOK AT THE ANSWER 🙂 Why do we have insufficient data? It is a common struggle in the real world. Nowadays every firm has confidential pieces of information stored in a database that has no access to every employee but some executives. FOR EXAMPLE, in a firm like Healthcare the patient’s information is kept confidential and will not be shared with anyone for privacy reasons. In such cases, we will be provided with a common and small amount of data to make future predictions, which may lead to inaccurate results. Like Healthcare some of the others firms include Consulting, Law, Accounting( Bank services), etc… Disadvantages:
How it affects our Model? Models trained with this type of data perform poorly while making predictions. It may lead to 2 cases: 1. Over-fitting: Here the training model reads the data too much for too little data. this means the training model actually memorizes the patterns. It has low training errors and high test errors. Does not work well in the real world. 2. Under-fitting: 1. Model Complexity:
Source: Google Images – Ensemble Learning Implementation: #Logistic regression from sklearn.linear_model import LogisticRegression X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) logreg = LogisticRegression() logreg.fit(X_train, y_train) y_pred = logreg.predict(X_test) Voting Ensemble for Classification from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.svm import SVC from sklearn.ensemble import VotingClassifier kfold = model_selection.KFold(n_splits=10, random_state=seed) # create the sub models estimators = [] model1 = LogisticRegression() estimators.append(('logistic', model1)) model2 = DecisionTreeClassifier() estimators.append(('cart', model2)) model3 = SVC() estimators.append(('svm', model3)) # create the ensemble model ensemble=VotingClassifier(estimators) results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold) print(results.mean()) 2. Transfer Learning:
Source: Google Images – Transfer Learning
Implementation: STEP 1 – Importing required packages #Install Tensorflow via PIP using the command below pip3 install tensorflow==1.13.1 #Install Keras using the command below pip3 install keras==2.2.4 #Install OpenCV using the command below pip3 install opencv-python #Install Numpy using the command below pip3 install numpy==1.16.1 #Finally, install ImageAI (v2.0.3) using the command below pip3 install imageai --upgrade #imported ImageAI training class from imageai.Prediction.Custom import ModelTraining #created an instance of the class and set the Model type to ResNet trainer = ModelTraining() trainer.setModelTypeAsResNet() #specified the path to the folder containing dataset (here it is "fruits") trainer.setDataDirectory("fruits") #started the training trainer.trainModel(num_objects=5, num_experiments=50, enhance_data=True, save_full_model=True, batch_size=32, show_network_summary=True, transfer_from_model="resnet50_weights_tf_dim_ordering_tf_kernels.h5", initial_num_objects=1000, transfer_with_full_training=True) STEP 3- Run prediction code from imageai.Prediction.Custom import CustomImagePrediction import os predictor = CustomImagePrediction() predictor.setModelPath(model_path="transfer_trained_fruits_model_ex-050_acc-0.862500.h5") predictor.setJsonPath(model_json="model_class.json") predictor.loadFullModel(num_objects=5) prediction, probability = predictor.predictImage(image_input=os.path.join(os.getcwd(), "sample.jpg"), result_count=1) print(prediction, " :", probability) 3. Data Augmentation:Data Augmentation helps to tweak (make slight improvements) to get new images.
Implementation: # Importing necessary functions from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img # Initialising the ImageDataGenerator class. # We will pass in the augmentation parameters in the constructor. datagen = ImageDataGenerator( rotation_range = 40, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = True, brightness_range = (0.5, 1.5)) # Loading a sample image img = load_img('image.jpg') # Converting the input sample image to an array x = img_to_array(img) # Reshaping the input image x = x.reshape((1, ) + x.shape) # Generating and saving 5 augmented samples # using the above defined parameters. i = 0 for batch in datagen.flow(x, batch_size = 1, save_to_dir ='preview', save_prefix ='image', save_format ='jpeg'): i += 1 if i > 5: break 4. Synthetic Data:
Source: Google Images – Synthetic data Implementation: Generating Samples Derived from an Input Dataset # 1.Get the faces data # 2.Generate the kernel density model from data # 3.Use the kernel density to generate new samples of data #4. Display the original and synthetic faces. # Fetch the dataset and store in X faces = dt.fetch_olivetti_faces() X= faces.data # Fit a kernel density model using GridSearchCV to determine the best parameter for bandwidth bandwidth_params = {'bandwidth': np.arange(0.01,1,0.05)} grid_search = GridSearchCV(KernelDensity(), bandwidth_params) grid_search.fit(X) kde = grid_search.best_estimator_ # Generate/sample 8 new faces from this dataset new_faces = kde.sample(8, random_state=rand_state) # Show a sample of 8 original face images and 8 generated faces derived from the faces dataset fig,ax = plt.subplots(nrows=2, ncols=8,figsize=(18,6),subplot_kw=dict(xticks=[], yticks=[])) for i in np.arange(8): ax[0,i].imshow(X[10*i,:].reshape(64,64),cmap=plt.cm.gray) ax[1,i].imshow(new_faces[i,:].reshape(64,64),cmap=plt.cm.gray) ax[0,3].set_title('Original Data',fontsize=20) ax[1,3].set_title('Synthetic Data',fontsize=20) fig.subplots_adjust(wspace=.1) plt.show() Source: Google Images – Synthetic data (images) References: 1.https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8 2. https://www.datacamp.com/community/tutorials/ensemble-learning-python 3. https://www.geeksforgeeks.org/python-data-augmentation/ 4. https://medium.com/deepquestai/transfer-learning-with-5-lines-of-code-5e69d0290850 Conclusion:I hope you enjoyed my article and understood how to handle insufficient data in the real world using the above 4 methods which may be very much useful to get a good Machine Learning model. And I would like to thank Analytics Vidhya for providing me the such a great opportunity to share my knowledge with other people!! If you have any doubts/suggestions please feel free to contact me on Linkedin / Email. Once again, THANKS FOR READING :)) About Author:Hello! This is Priyadharshini, I am currently pursuing M.Sc. in Decision and Computing Sciences. I am very much passionate about Data Science and continuously gaining knowledge from various sources to shine in the Data Science field. I love exploring and analyzing things!! Thank you! The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion. What is a problem with using incorrect data to train a machine?Sample bias.
This happens when there's a problem with the data used to train the machine learning model. In this type of bias, the data used is either not large enough or representative enough to teach the system.
What is a problem with using incorrect training data to train a machine multiple choice sample bias prejudice bias measurement bias variance bias?Sample bias is a problem with training data. It occurs when the data used to train your model does not accurately represent the environment that the model will operate in.
What are some of the potential problems regarding training data?The model will not perform well if training data is small, or noisy with errors and outliers, or if the data is not representative(results in biased), consists of irrelevant features(garbage in, garbage out), and lastly neither too simple(results in underfitting) nor too complex(results in overfitting).
What can go wrong with machine learning?5 Common Machine Learning Problems & How to Solve Them. 1) Understanding Which Processes Need Automation. It's becoming increasingly difficult to separate fact from fiction in terms of Machine Learning today. ... . 2) Lack of Quality Data. ... . 3) Inadequate Infrastructure. ... . 4) Implementation. ... . 5) Lack of Skilled Resources.. |