AWS ML Specialty Exam - Types of Data

This post is part of a blog series giving a high-level overview of the Machine Learning theory and AWS services examined on the AWS Machine Learning Specialty exam. To view the whole series click here.

Data is all around us and is an essential part of Machine Learning. Before diving into the different Machine Learning algorithms and services, it is important to understand the type of data you are working with!

Types of data

Qualitative → Non-numeric data.

Quantitative → Numeric data that can be measured or counted.

Discrete → Data that can only take distinct, separate values with nothing in between. It typically takes the form of whole numbers (such as counts) or categories.

Continuous → Measurable numeric data with an infinite number of potential values within a range. It can be subdivided into ever smaller parts, so fractional and decimal values are valid. An example of continuous data would be share prices on the stock market. Continuous data is a type of quantitative data.
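As a rough illustration of the definitions so far, the sketch below sorts a few columns into qualitative, discrete and continuous using nothing more than their Python types. The column names and the `describe_column` helper are hypothetical, and the type check is only a heuristic: whole-number values can still represent a continuous quantity that happens to be rounded.

```python
def describe_column(values):
    """Guess the kind of data in a column from its Python types.

    Heuristic for illustration only: strings are treated as qualitative,
    whole numbers as discrete, and anything with decimals as continuous.
    """
    if all(isinstance(v, str) for v in values):
        return "qualitative"
    if all(isinstance(v, int) for v in values):
        return "quantitative, discrete"
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative, continuous"
    return "mixed"


# Hypothetical columns, made up for this example.
columns = {
    "product_name": ["Kettle", "Toaster", "Blender"],
    "number_of_orders": [0, 3, 12],
    "share_price": [101.25, 99.8, 100.0],
}

summary = {name: describe_column(values) for name, values in columns.items()}

for name, kind in summary.items():
    print(f"{name}: {kind}")
```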

Categorical → Values that are organised into groups or categories. An example of categorical data would be hair colour, such as Blonde, Brunette and Auburn. Categorical data is a qualitative data type because it is not numeric. It is also discrete, because its values are limited to the predefined groups.

Ordinal → Categorical data where order is important, so each value has a position on a scale. For example, customer satisfaction: unhappy, happy, extremely happy.

Nominal → Categorical data where order is not important. For example, eye colour: Blue, Brown and Green have no meaningful order between them.
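The difference between ordinal and nominal data matters most when you turn categories into numbers. The sketch below shows one common approach, not the only one: ordinal values are mapped to their rank on the scale, while nominal values are one-hot encoded so that no order is implied. The helper names `encode_ordinal` and `encode_one_hot` are hypothetical.

```python
# Category values taken from the examples above.
SATISFACTION_ORDER = ["unhappy", "happy", "extremely happy"]  # ordinal: order matters
EYE_COLOURS = ["Blue", "Brown", "Green"]  # nominal: no meaningful order


def encode_ordinal(value, ordered_levels):
    """Map an ordinal value to its rank on the scale (0 = lowest)."""
    return ordered_levels.index(value)


def encode_one_hot(value, categories):
    """Map a nominal value to a one-hot vector so that no order is implied."""
    return [1 if value == category else 0 for category in categories]


satisfaction_codes = [
    encode_ordinal(v, SATISFACTION_ORDER)
    for v in ["happy", "unhappy", "extremely happy"]
]
eye_colour_vectors = [encode_one_hot(v, EYE_COLOURS) for v in ["Brown", "Blue"]]

print(satisfaction_codes)  # [1, 0, 2]
print(eye_colour_vectors)  # [[0, 1, 0], [1, 0, 0]]
```

Encoding eye colour as 0, 1, 2 instead would suggest that Green is somehow "greater than" Blue, which is exactly the ordering that nominal data does not have.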

Structured Data → Data that has a defined schema or standardised format. For example, tabular data with clearly labelled rows and columns.

Unstructured Data → Data that has no defined schema or structural properties. For example, a folder of different Word documents.

Semi-structured Data → Data that is too loosely organised to fit a relational schema, but still has some organisational structure. It is usually stored as CSV, JSON or XML.
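To make the distinction concrete, the sketch below takes a small piece of semi-structured JSON, where records share some keys but not all, and flattens it into a structured table with a fixed set of columns. The records and field names are made up for illustration.

```python
import json

# Hypothetical semi-structured records: each has keys, but not always the same ones.
raw = """
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "tags": ["new", "trial"]}
]
"""

records = json.loads(raw)

# Build a fixed schema (a column list) from every key seen across the records.
columns = sorted({key for record in records for key in record})

# Structured form: every row has the same columns; missing values become None.
rows = [[record.get(column) for column in columns] for record in records]

print(columns)  # ['email', 'id', 'name', 'tags']
for row in rows:
    print(row)
```

The `None` entries show the cost of forcing a schema onto semi-structured data: fields that only some records carry become empty cells in the table.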

Labelled Data → Data that has a meaningful annotation of what it is. For example, in a dataset of emails, some may be labelled “Spam” and others “Not Spam”.

Unlabelled Data → Data that has been collected with no target or annotation.

Ground truth → Factual data that has been observed or measured. This data has been successfully labelled and the label is trusted.
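In practice, a single dataset often contains both labelled and unlabelled records. The sketch below separates the two, using the spam example from above. The email texts are made up, and representing a missing label as `None` is an assumption of this example rather than a standard.

```python
# Hypothetical email dataset; a label of None means the email has not been annotated.
emails = [
    {"text": "Win a free prize now", "label": "Spam"},
    {"text": "Meeting moved to 3pm", "label": "Not Spam"},
    {"text": "Your invoice is attached", "label": None},
]

# Labelled records carry an annotation; unlabelled records do not.
labelled = [email for email in emails if email["label"] is not None]
unlabelled = [email for email in emails if email["label"] is None]

print(len(labelled), len(unlabelled))  # 2 1
```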

Time series → Data points recorded over time, typically at regular intervals, where the order of the observations matters. For example, stock market prices.
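What sets a time series apart is that each value is tied to a timestamp and the order carries meaning. The sketch below, using made-up prices, sorts observations by date and then computes the change from one observation to the next, a calculation that only makes sense once the data is in time order.

```python
from datetime import date

# Hypothetical closing prices, deliberately listed out of order.
observations = [
    (date(2024, 1, 3), 101.5),
    (date(2024, 1, 1), 100.0),
    (date(2024, 1, 2), 102.0),
]

# Order matters in a time series, so sort by timestamp first.
series = sorted(observations, key=lambda point: point[0])

# Change between each observation and the one before it.
daily_changes = [
    (current[0], current[1] - previous[1])
    for previous, current in zip(series, series[1:])
]

for day, change in daily_changes:
    print(day.isoformat(), change)
# 2024-01-02 2.0
# 2024-01-03 -0.5
```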