This post is part of a blog series giving a high-level overview of Machine Learning theory and the AWS services examined on the AWS Machine Learning Specialty exam. To view the whole series click here.

Data visualisations can play a number of different roles within Machine Learning projects. They can be a big part of the Data Exploration process, helping you find insights within the data, understand its structure and spot patterns. They can also aid in identifying relevant features and in detecting anomalies or outliers in the data.

Visualisations for showing relationships

Scatter plots → Used to explore relationships between two variables through the distribution and density of the data points. In the example below you can see a scatter plot that shows the relationship between the size of a house and its price.
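As a rough sketch, a scatter plot like this could be drawn with Matplotlib as follows. The house sizes and prices are made-up values for illustration only, and the variable names are my own.

```python
import matplotlib.pyplot as plt

# Made-up illustrative data: house size (square metres) and price (thousands)
sizes = [50, 65, 80, 95, 110, 130, 150, 175, 200]
prices = [150, 180, 210, 240, 275, 310, 350, 400, 460]

fig, ax = plt.subplots()
ax.scatter(sizes, prices)  # one point per house
ax.set_xlabel("House size (square metres)")
ax.set_ylabel("Price (thousands)")
ax.set_title("House size vs price")
# In a notebook the figure renders inline; in a script, call fig.savefig("scatter.png")
```

An upward drift of the points from left to right is what suggests a positive relationship between the two variables.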

Bubble plots → Ideal for visualising relationships involving three attributes, like home size, age, and price. They resemble scatter plots but use varying bubble sizes or bubble colours to represent the third attribute. The bubble chart in the example below uses different colours:
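A minimal sketch of a bubble plot in Matplotlib, assuming made-up values for home size, price and age. Here the third attribute (age) drives both the bubble area and the colour; in practice you would usually pick one of the two.

```python
import matplotlib.pyplot as plt

# Made-up data: size (square metres), price (thousands), age (years)
sizes = [60, 85, 100, 120, 150, 180]
prices = [170, 230, 250, 320, 360, 450]
ages = [40, 25, 30, 10, 15, 2]

fig, ax = plt.subplots()
bubbles = ax.scatter(
    sizes,
    prices,
    s=[age * 10 for age in ages],  # bubble area scales with age
    c=ages,                        # colour also encodes age
    cmap="viridis",
    alpha=0.7,
)
fig.colorbar(bubbles, ax=ax, label="Age (years)")
ax.set_xlabel("Home size (square metres)")
ax.set_ylabel("Price (thousands)")
```

The scaling factor of 10 is arbitrary and only there to make the bubbles readable.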

Visualisations for showing comparisons

Bar Charts → Typically used to show and compare discrete categories or groups of data. They can be used for value lookup or single-point-in-time lookup. In the example below you can see I am using a bar chart to compare the number of star ratings for movies.
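To make this concrete, here is a small sketch that counts star ratings and plots one bar per rating. The list of ratings is invented for illustration.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Made-up star ratings for a set of movie reviews
ratings = [5, 4, 4, 3, 5, 1, 4, 2, 5, 3, 4, 5]

counts = Counter(ratings)  # rating -> number of reviews
stars = sorted(counts)     # the discrete categories, in order

fig, ax = plt.subplots()
bars = ax.bar([str(s) for s in stars], [counts[s] for s in stars])
ax.set_xlabel("Star rating")
ax.set_ylabel("Number of ratings")
```

Passing the ratings as strings makes Matplotlib treat them as categories rather than positions on a numeric axis.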

Line Charts → Typically used to show how a value changes across a continuous range, usually time or some other continuous variable. In the example below you will see I am tracking average house prices over time.
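A line chart of this kind could be sketched as below. The yearly averages are placeholder numbers, not real market data.

```python
import matplotlib.pyplot as plt

# Placeholder values: average house price (thousands) per year
years = [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
avg_price = [210, 218, 225, 231, 236, 249, 270, 285]

fig, ax = plt.subplots()
(line,) = ax.plot(years, avg_price, marker="o")  # points joined in x order
ax.set_xlabel("Year")
ax.set_ylabel("Average price (thousands)")
ax.set_title("Average house price over time")
```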

Visualisations for showing distributions

Histograms → Show the distribution of a single continuous variable. A histogram divides the data into bins or intervals, and the height of each bar represents the frequency or count of data points falling into that bin. This lets you see the frequency and density of data within different intervals and can help reveal the presence of outliers. Histograms can be used for measures such as amount, frequency, duration and density.
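The binning described above can be sketched with Matplotlib's `hist`, which returns the count per bin along with the bin edges. The data here is randomly generated (a hypothetical "duration in minutes" variable), and 20 bins is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-in for a continuous variable (e.g. duration in minutes)
rng = np.random.default_rng(seed=0)
durations = rng.normal(loc=30, scale=5, size=1000)

fig, ax = plt.subplots()
# counts[i] is the number of values that fall between edges[i] and edges[i + 1]
counts, edges, _ = ax.hist(durations, bins=20)
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Frequency")
```

Changing the number of bins changes how smooth or spiky the distribution looks, so it is worth trying a few values.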

Box Plots → Show the distribution of one or more continuous variables (they can be used for multi-distribution). They let you see things like the lowest and highest values, outliers and where most of the values fall, so a box plot packs a lot of information into one visualisation. In the example below, the top line (whisker) marks the highest value that is not an outlier and the bottom line marks the lowest such value. The pink line in the box represents the median. The circles beyond the whiskers are outliers or extreme values. The box itself spans the first to the third quartile, so the middle half of the data points fall inside it.
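Here is a small sketch of a box plot in Matplotlib using a made-up sample with one extreme value. By default Matplotlib draws the whiskers out to the furthest points within 1.5 times the interquartile range of the box and plots anything beyond that as an individual outlier; other tools may use a different rule.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample with one extreme value (45)
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]

# Quartiles computed by hand, to compare against what the plot shows
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # points above this are drawn as outliers

fig, ax = plt.subplots()
result = ax.boxplot(values)   # dict of the drawn artists (boxes, medians, whiskers, fliers)
ax.set_ylabel("Value")

outliers = list(result["fliers"][0].get_ydata())
```

In this sample only the value 45 sits above the upper fence, so it is the one point drawn as a separate marker.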

Scatter Plots → I previously mentioned that Scatter plots can be used to show relationships in data, but they can also be used to show multi-distribution and clustering within data.

Heat Maps → Particularly effective at displaying the distribution of data across two dimensions. They are commonly used to visualise the concentration, intensity, or density of data points within a grid or matrix. They typically use colour intensity to represent the magnitude of a particular value.
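One way to sketch a density heat map is to count points per grid cell with NumPy and colour each cell by its count. The points are randomly generated, and the 25 by 25 grid and the "hot" colour map are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated points in two dimensions
rng = np.random.default_rng(seed=1)
x = rng.normal(loc=0, scale=1, size=5000)
y = rng.normal(loc=0, scale=1, size=5000)

# Count how many points fall into each cell of a 25 x 25 grid
density, x_edges, y_edges = np.histogram2d(x, y, bins=25)

fig, ax = plt.subplots()
image = ax.imshow(
    density.T,       # transpose so the x values run along the horizontal axis
    origin="lower",
    extent=[x_edges[0], x_edges[-1], y_edges[0], y_edges[-1]],
    cmap="hot",
    aspect="auto",
)
fig.colorbar(image, ax=ax, label="Points per cell")
```

Brighter cells hold more points, so the concentration of data around the centre shows up as a bright patch.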

Visualisations for showing compositions

Pie Charts → Show how individual values contribute as shares of a total, e.g. the breakdown of monthly spend.
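A quick sketch of a pie chart for a monthly spend breakdown. The categories and amounts are invented for illustration.

```python
import matplotlib.pyplot as plt

# Invented monthly spend per category
spend = {"Rent": 1200, "Food": 400, "Transport": 150, "Other": 250}

total = sum(spend.values())
shares = {name: amount / total for name, amount in spend.items()}  # fraction of the whole

fig, ax = plt.subplots()
ax.pie(list(spend.values()), labels=list(spend.keys()), autopct="%1.0f%%")
ax.set_title("Monthly spend")
```

Each wedge's angle is proportional to its share of the total, which is why pie charts only make sense when the values add up to a meaningful whole.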

Stacked Area Charts → Show the measurements of various items over longer periods of time, e.g. showing how new customers reached your website:

Stacked column charts → Also known as stacked bar charts. These graphs show the quantity of various items over shorter periods of time. For example, showing how new customers reached your website:
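The two stacked chart types can be sketched side by side from the same data. The channel names and monthly customer counts below are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical new customers per month, split by acquisition channel
months = ["Jan", "Feb", "Mar", "Apr"]
channels = {
    "Search": [120, 150, 170, 160],
    "Social": [80, 90, 60, 110],
    "Email": [30, 40, 45, 50],
}
x = list(range(len(months)))

fig, (ax_area, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked area: each channel is layered on top of the previous ones
areas = ax_area.stackplot(x, list(channels.values()), labels=list(channels.keys()))
ax_area.set_title("Stacked area")

# Stacked columns: each bar starts where the previous channel's bar ended
bottom = np.zeros(len(months))
for name, values in channels.items():
    ax_bar.bar(x, values, bottom=bottom, label=name)
    bottom = bottom + np.array(values)
ax_bar.set_title("Stacked columns")

for ax in (ax_area, ax_bar):
    ax.set_xticks(x)
    ax.set_xticklabels(months)
    ax.legend()
```

After the loop, `bottom` holds the total height of each column, i.e. the total number of new customers per month.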

AWS services for Visualisations

SageMaker

While SageMaker's primary focus is on machine learning tasks, you can use SageMaker to create data visualisations. You can use Jupyter notebooks within SageMaker to explore and visualise your data before applying preprocessing techniques. Within these notebooks you have access to Python libraries like Matplotlib, Seaborn, and Pandas to create various types of visualisations, such as histograms, scatter plots, box plots, and heat maps.
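As an illustration of the kind of exploration you might do in a notebook cell, here is a minimal sketch using pandas with its Matplotlib-backed plotting helpers. The DataFrame and its column names are hypothetical; in practice you would more likely load your own dataset, for example with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical housing dataset; column names are made up for this sketch
df = pd.DataFrame(
    {
        "size_m2": [50, 65, 80, 95, 110, 130, 150],
        "age_years": [40, 35, 20, 25, 10, 8, 2],
        "price_k": [150, 180, 215, 240, 280, 310, 360],
    }
)

summary = df.describe()   # count, mean, std, min, quartiles and max per column
correlations = df.corr()  # pairwise Pearson correlations between columns

# Quick visual checks before any preprocessing
scatter_ax = df.plot.scatter(x="size_m2", y="price_k", title="Size vs price")
hist_axes = df.hist(bins=5)  # one histogram per numeric column
```

Looking at the summary statistics and correlations alongside the plots is a cheap way to spot skewed columns, outliers or strongly related features early on.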

QuickSight

Amazon QuickSight is a fully managed business intelligence (BI) and data visualisation service that allows organisations to create interactive visual reports, dashboards, and analyses from various data sources, such as Amazon S3, RDS, Redshift, Glue, Athena, and more.

QuickSight also offers data transformation capabilities, including filtering, pivoting, aggregating, and calculated fields.

QuickSight has a serverless architecture and can automatically scale to accommodate tens of thousands of users. It also uses a usage-based pricing model, allowing you to pay only for what you use.

Amazon QuickSight Q

QuickSight Q allows users to ask questions in natural language and receive answers with relevant visualisations, helping them gain insights from the data.