Chloe McAree (McAteer)

AWS ML Specialty Exam - Data Preparation Techniques


This blog is part of a blog series giving a high-level overview of Machine Learning theory and the AWS services examined on the AWS Machine Learning Specialty exam. To view the whole series, click here

When working on large Machine Learning projects, you need data. However, raw data is rarely in the ideal format for training models, so some form of data preparation is typically needed. Data pre-processing is a crucial step in machine learning, involving cleaning, transforming, and organising data to make it suitable for training models. The quality and suitability of your data can significantly influence the performance of your machine learning algorithms.

Understanding the nature of your data and the requirements of your machine learning algorithms will help you when choosing data pre-processing techniques.

Data Cleaning

Data cleaning involves removing incomplete, incorrect, irrelevant or duplicated data.

A big part of data cleaning involves handling missing values, which can come in the form of null values, NaN, or other placeholders.

When dealing with missing data, it is very important to understand why the data is missing and whether the missingness correlates with anything else.

Types of missing data

Missing Completely at Random (MCAR) → The probability of a data point being missing is unrelated both to the missing value itself and to the values of other variables in the dataset. This means that the missingness occurs randomly and independently of any observed or unobserved data. MCAR is considered the least problematic type of missing data because it doesn't introduce any systematic bias into the analysis.

Missing Not At Random (MNAR) → Occurs when the missingness depends on the missing value itself (i.e. on unobserved data). In other words, the missing data is not random, and the reasons for the data being missing are related to the data itself. Dealing with MNAR can be challenging because the missing data is systematically different from the observed data, potentially introducing bias into the analysis.

Missing At Random (MAR) → The probability of data points being missing is related to some of the observed data but not to the missing values themselves. This means that the missingness depends on observed variables but not on the unobserved data. MAR is a common scenario, and it can often be addressed through various statistical techniques, such as imputation methods.

How to handle missing data

Understanding the type of missing data is crucial when deciding how to handle it. Dealing with MCAR is relatively straightforward because it doesn't introduce bias, but MNAR and MAR require more careful consideration and potentially specialised methods for handling the missing values to ensure that the analysis or modelling results are not unduly affected by the missing data.

  1. Supervised Learning - Takes into consideration the relationships between variables and allows you to predict missing values based on the other features in the dataset. This can give you the best results, but requires the most time and effort to implement.
  2. Mean - Find the average of the non-empty values and replace all the missing values with it. This is a very quick and easy way to fill in missing values. However, it can distort the data distribution, especially if the data has outliers, and it doesn't consider relationships between variables.
  3. Median - Order the non-empty values and fill the missing values with the middle value. This is also quick and easy, and it is robust to outliers as it is not influenced by extreme values. However, it can still distort the data distribution if the data is not symmetric, and it does not consider the relationships between variables.
  4. Mode - Find the most common of the non-empty values and fill the empty values with it. This is a quick and easy way to fill missing data and is suitable for categorical or nominal data. However, it is not as useful for continuous or numerical data.
  5. Dropping rows - Remove rows that contain missing values. This is simple and preserves the integrity of the remaining data. However, it reduces the size of the dataset, which can lead to loss of valuable information, and it is not suitable when missing data is widespread.
  6. Data imputation - Sourcing a substitute value for the missing data and filling it in (the mean, median, and mode approaches above are all forms of imputation). This retains all data points, but might introduce bias if the method is not chosen carefully.
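The statistical fills in options 2-4 (and the row-dropping in option 5) can be sketched in plain Python; the `ages` column and its values below are invented for illustration:

```python
from statistics import mean, median, mode

# Hypothetical "age" column; None marks a missing value
ages = [25, 30, None, 22, None, 30, 41]
observed = [a for a in ages if a is not None]

# Options 2-4: fill missing entries with a summary statistic
mean_filled = [a if a is not None else mean(observed) for a in ages]
median_filled = [a if a is not None else median(observed) for a in ages]
mode_filled = [a if a is not None else mode(observed) for a in ages]

# Option 5: dropping rows simply keeps only the observed values
dropped = observed
```

Note how each strategy produces a different fill value for the same gap, which is exactly why the choice matters.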

When deciding how to handle missing data, it's essential to consider the nature of the data, the reasons for the missingness, and the goals of your analysis or modelling. In practice, a combination of these methods may be used, depending on the specific dataset and problem at hand. It's also important to assess the impact of the chosen method on the overall analysis and model performance to make informed decisions.

Feature Engineering

Features → the variables/attributes in a dataset are sometimes referred to as features.

The number and type of features in a dataset can vary widely depending on the nature of the data and the machine learning problem being addressed. They could be Numeric, Categorical, Text, Image or Audio.

If you are working with a particularly large dataset, you will want to use feature engineering to remove any feature/attribute that will not affect the outcome of the machine learning problem, or to group features together to make them simpler to process. The process of feature selection and engineering is often an iterative one, where you continuously refine the features to improve model accuracy and interpretability.

Principal Component Analysis (PCA)

PCA is an unsupervised learning algorithm that reduces the number of features in a dataset while retaining as much information as possible. It is a dimensionality reduction technique commonly used in statistics, data analysis, and machine learning.

It's important to note that while PCA is a powerful technique, it may not be suitable for all datasets, and the choice of the number of principal components to retain should be made carefully to balance dimensionality reduction with information retention. Additionally, PCA assumes linear relationships between features and is sensitive to outliers.
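As a rough sketch of the idea (not a production implementation), PCA can be written in a few lines of NumPy using the singular value decomposition; the toy dataset below is invented for illustration:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    X_centred = X - X.mean(axis=0)          # PCA assumes centred data
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ Vt[:n_components].T  # reduced representation

# Toy dataset: 5 samples, 3 highly correlated features
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.1, 6.0],
              [3.0, 6.0, 9.2],
              [4.0, 8.1, 12.0],
              [5.0, 10.0, 15.1]])
reduced = pca(X, n_components=1)  # 3 features reduced to 1 component
```

Because the three features move together, a single principal component captures almost all of the variation in this toy data.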

Numeric Feature Engineering

Numeric feature engineering involves creating, transforming, or modifying numerical features to improve the performance of machine learning models.

Binning → Allows you to transform continuous numerical features into categorical features by dividing the range of values into discrete bins or intervals. An example of this is transforming an age column from individual ages into groups such as under 18, 18-25, 26-40 and over 40. Binning is valuable for improving the performance of machine learning models, especially when dealing with large amounts of continuous numerical features that exhibit non-linear relationships or complex patterns. However, binning can result in an uneven distribution of data within bins: some bins may contain very few data points, while others may be densely populated.

Quantile Binning → Divides data into bins based on quantiles, which are specific data percentiles. Each bin is created to have roughly the same number of data points. Quantile binning is more suitable for dealing with data distributions that are skewed or have outliers. It ensures a more even distribution of data among the bins.
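Both styles of binning can be sketched in plain Python; the age cut-offs and sample ages below are invented for illustration:

```python
from statistics import quantiles

# Fixed-interval binning: hand-chosen age groups (illustrative cut-offs)
def age_bin(age):
    if age < 18:
        return "under 18"
    elif age <= 25:
        return "18-25"
    elif age <= 40:
        return "26-40"
    return "over 40"

ages = [12, 19, 25, 33, 47, 52, 23, 61]
fixed_bins = [age_bin(a) for a in ages]

# Quantile binning: cut points chosen so bins hold roughly equal counts
cuts = quantiles(ages, n=4)  # the three quartile boundaries
def quartile_bin(age):
    return sum(age > c for c in cuts)  # bin index 0-3

quantile_bins = [quartile_bin(a) for a in ages]
```

With quartile cuts, each of the four bins ends up with the same number of data points, which is exactly the evenness quantile binning is meant to provide.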

Numeric Feature Scaling

When working with numeric values where there is a big difference between the lowest number and the highest number, feature scaling is the process of transforming values into a common scale or range. This helps ensure that all features have a similar influence on the model and that no single feature dominates the others due to scale differences.

Common methods of numeric scaling include:

  1. Normalisation - Rescales values to a range between 0 and 1. The lowest value becomes 0 and the highest becomes 1, everything else is scaled in between. However, an issue with normalisation is that outliers can throw it off.
  2. Standardisation - Marks the average value as 0 and then uses a z-score to plot the other values around it. It takes into account the average and standard deviation. This method is widely used in various algorithms like regression, clustering, and neural networks.

When should you use which? If your dataset has a substantial range between the lowest and highest values, use normalisation. However, be cautious of outliers, as they can skew normalisation; in such cases, standardisation is your best bet.
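A minimal sketch of both scaling methods in plain Python, using an invented `heights_cm` column:

```python
def normalise(values):
    """Min-max normalisation: rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Z-score standardisation: mean 0, unit standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sd = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sd for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
norm = normalise(heights_cm)   # lowest -> 0.0, highest -> 1.0
std = standardise(heights_cm)  # centred around 0
```

A single extreme outlier would squash all the other normalised values towards 0, which is the weakness noted above; standardisation degrades more gracefully.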

Categorical Feature Engineering

Categorical Feature → Values that are organised into groups or categories. An example of categorical data would be hair color, such as Blonde, Brunette, and Auburn. Categorical data is a qualitative data type because it is not numeric; it is discrete, and its values are limited to the predefined groups. Categorical Data can be Ordinal or Nominal!

Ordinal → When order is important within the data, allowing each value to have a position on a scale. For example, customer satisfaction: unhappy, happy, extremely happy.

Nominal → When order is not important within the data, for example Eye Colour e.g. Blue, Brown and Green, all have no meaningful order between them.

Categorical encoding → Converting categorical values into numeric values using mappings and one-hot techniques.

When to encode categorical data:

When your machine learning algorithm expects numeric data, like linear regression or convolutional neural networks, you need to convert categorical values into numeric ones.


When working with Ordinal categorical data, you can assign numbers with meaning, e.g. numbers that represent the order of each categorical value.

For example, when dealing with a size category that has values like small, medium, and large, you can map them to small as 5, medium as 10, and large as 15. This way, you can still indicate that medium is smaller than large and small is smaller than medium.

However, this encoding technique doesn't work for nominal categorical data, as the algorithm may infer an order even when there is none. This is where one-hot encoding comes into play.

One Hot Encoding

One hot encoding transforms categorical features by generating a new binary column for each unique category. In these new columns, 1 represents TRUE, and 0 represents FALSE.

In the example below, you can see that we have categorical data about colours with three unique values: Blue, Green, and Red. We begin by creating a separate column for each of these values. For every entry in the original column, we assign a one or zero to indicate the category to which it belongs.

For example:
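A minimal one-hot sketch in plain Python, using the Blue/Green/Red column described above:

```python
def one_hot(values):
    """Return one binary column per unique category (sorted for stability)."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

colours = ["Blue", "Green", "Red", "Blue"]
categories, encoded = one_hot(colours)
# categories: ['Blue', 'Green', 'Red']
# encoded:    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each row has exactly one 1, marking the column for its original category.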


Disadvantages of one-hot encoding

If you have loads of categories, a new feature is created for each of them, which can dramatically grow your dataset. If this is a problem, you can use grouping to create fewer categories before you encode.

Cartesian Product Transformation

This transformation is applied to categorical variables and aims to create new features by combining two or more text or categorical values. The resulting features can be used as input for machine learning models.

For instance, if you have a column for shapes with values like 'circle' and 'square' and another column for colours with values such as 'red,' 'green,' and 'blue,' you could opt to combine these columns to form a new one containing all possible combinations.

For Example:

Column Shapes: Circle, Square

Column Colours: Red, Green, Blue

New Combination Values: Red_Circle, Red_Square, Blue_Circle, Blue_Square, Green_Circle, Green_Square
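The same combination can be sketched with `itertools.product` (the exact ordering of the output depends on how you iterate the two columns):

```python
from itertools import product

shapes = ["Circle", "Square"]
colours = ["Red", "Green", "Blue"]

# Cartesian product: every colour paired with every shape
combinations = [f"{c}_{s}" for c, s in product(colours, shapes)]
# ['Red_Circle', 'Red_Square', 'Green_Circle',
#  'Green_Square', 'Blue_Circle', 'Blue_Square']
```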

Text Feature Engineering

Text feature engineering involves transforming or modifying text-based data so that a Machine Learning algorithm can better analyse it.

Stop-word Removal

Stop-words are common words like "the," "and," "in," etc., that often don't carry much information. As part of pre-processing text data, you may choose to remove them to reduce dimensionality and noise in the data.
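A minimal sketch, using a tiny hand-picked stop-word list (real pipelines use much larger curated lists):

```python
# Illustrative stop-word set; libraries such as NLTK ship fuller lists
STOP_WORDS = {"the", "and", "in", "is", "on", "a"}

def remove_stop_words(text):
    """Lower-case, split on white space, and drop stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

tokens = remove_stop_words("The cat is on the mat")
# tokens: ['cat', 'mat']
```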

Stemming and Lemmatisation

These techniques reduce words to their base or root form. For example, "running" and "ran" might both be reduced to "run." This helps in treating similar words as the same word, reducing dimensionality and improving model performance.
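A toy suffix-stripping stemmer sketches the idea; note that irregular forms such as "ran" → "run" are beyond simple suffix rules and need a lemmatiser backed by a dictionary of word forms (e.g. WordNet):

```python
# Toy stemmer for illustration only; real systems use Porter/Snowball
# stemmers or a lemmatiser rather than this handful of suffix rules.
def toy_stem(word):
    for suffix in ("ning", "ing", "ed", "s"):
        # Only strip when enough of a stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

toy_stem("running")  # "run" (via the "ning" rule)
toy_stem("jumped")   # "jump"
toy_stem("ran")      # "ran" - irregular forms need a lemmatiser
```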

Bag of words (BoW)

BoW is a simple technique used in text feature engineering. It represents text data by breaking it on white space into single words (tokens), then treats each document as an unordered collection of those tokens and counts the frequency of each word.

For example:

"The cat is on the mat." becomes ["The", "cat", "is", "on", "the", "mat"]
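Counting the token frequencies can be sketched with `collections.Counter` (lower-casing first is a common, but optional, pre-processing step):

```python
from collections import Counter

document = "The cat is on the mat"
tokens = document.lower().split()  # break on white space
bag = Counter(tokens)
# Counter({'the': 2, 'cat': 1, 'is': 1, 'on': 1, 'mat': 1})
```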


N-grams are an extension of bag of words that produces groups of words of size n. It looks at contiguous sequences of n words from a given text and captures local word patterns for text analysis.

For an example of how this works, let's use the following sentence:

"Machine learning is a really fun topic"

  • 1-grams (unigrams): ["Machine", "learning", "is", "a", "really", "fun", "topic"]
  • 2-grams (bigrams): ["Machine learning", "learning is", "is a", "a really", "really fun", "fun topic"]
  • 3-grams (trigrams): ["Machine learning is", "learning is a", "is a really", "a really fun", "really fun topic"]
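The n-gram lists above can be generated with a small helper function:

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Machine learning is a really fun topic".split()
bigrams = ngrams(tokens, 2)
# ['Machine learning', 'learning is', 'is a', 'a really',
#  'really fun', 'fun topic']
```

A sentence of 7 words yields 7 unigrams, 6 bigrams and 5 trigrams, matching the lists above.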

Orthogonal Sparse Bigram (OSB)

The OSB transformation is an alternative to the bi-gram transformation (where n-grams are created with a window size of 2). OSBs are formed by sliding a window of size 'n' over the text and producing pairs of words that include the first word in the window.

In constructing each OSB, the individual words are connected using an "_" (underscore) and any omitted tokens are marked by adding an additional underscore. As a result, the OSB not only encodes the tokens found within a window but also provides an indication of how many tokens were skipped within that same window.

For example, if we used the following sentence with an OSBs of size 4: "Machine learning is a really fun topic"

"Machine learning is a" = {Machine_learning, Machine__is, Machine___a}

"learning is a really" = {learning_is, learning__a, learning___really}

"is a really fun" = {is_a, is__really, is___fun}

"a really fun topic" = {a_really, a__fun, a___topic}

"really fun topic" = {really_fun, really__topic}

"fun topic" = {fun_topic}
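The OSB construction above can be sketched as a sliding-window function (this reproduces the sets listed for window size 4):

```python
def osb(tokens, window):
    """Orthogonal sparse bigrams: pair the first word of each sliding
    window with every other word in it; extra underscores mark skips."""
    result = []
    for i in range(len(tokens) - 1):
        w = tokens[i:i + window]
        result.append({f"{w[0]}{'_' * j}{w[j]}" for j in range(1, len(w))})
    return result

tokens = "Machine learning is a really fun topic".split()
pairs = osb(tokens, window=4)
# pairs[0] == {'Machine_learning', 'Machine__is', 'Machine___a'}
```

The number of underscores encodes distance: one underscore means adjacent words, two means one token was skipped, and so on.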

Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection of documents. It's used to represent the importance of words in a document and is often used for text classification tasks. It also down-weights words that are common across all documents, such as "the", "and" and "a".

The vectorisation process depends on the number of documents and the number of unique n-grams in your dataset.

How to get the dimensions of the TF–IDF matrix

You may be asked to define the dimensions of a TF-IDF matrix in the exam; to do so, use the formula below:

(Number of documents, number of unique n-grams)

Here is an example of how we can work it out:

  1. This is a test sentence.
  2. Here is an example sentence.

If the question asked for uni-grams this would be the answer:

[This, is, a, test, sentence] [Here, an, example]

(2, 8) - because there are 2 documents and 8 unique words!

If the question asked for bi-grams this would be the answer:

[this is, is a, a test, test sentence] [here is, is an, an example, example sentence]

(2, 8) - because there are 2 documents and 8 unique bi-grams!
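The dimension formula can be checked with a small helper that collects the unique n-grams across the corpus (tokenisation here is a simple lowercased `\w+` split, an assumption that matches the worked example):

```python
import re

def tfidf_shape(documents, n):
    """Dimensions of the TF-IDF matrix: (num documents, num unique n-grams)."""
    vocabulary = set()
    for doc in documents:
        tokens = re.findall(r"\w+", doc.lower())
        vocabulary.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return (len(documents), len(vocabulary))

docs = ["This is a test sentence.", "Here is an example sentence."]
tfidf_shape(docs, 1)  # (2, 8) - unigrams
tfidf_shape(docs, 2)  # (2, 8) - bigrams
```

Both shapes come out as (2, 8) here only by coincidence; in general the unigram and bigram vocabularies differ in size.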