This post is part of a blog series giving a high-level overview of the Machine Learning theory and the AWS services examined on the AWS Machine Learning Specialty exam. To view the whole series, click here.

Now that you have a solid grasp of data pre-processing techniques, let's explore how AWS can assist in this crucial phase of your machine learning project.

AWS Glue

AWS Glue is a fully managed ETL (extract, transform, load) service that allows you to discover, prepare, move and integrate data from different sources. Once Glue has catalogued your data, it can immediately be searched and queried using Athena, EMR and Redshift Spectrum. Glue is often the first step in analytics and Machine Learning projects, where it is used to get data into the correct format for processing.

Glue Crawlers → Programs that connect to your data store, determine a schema for the data and create the corresponding tables in the Data Catalog. A Crawler works through a list of classifiers that read the data in the data store; when a classifier recognizes the format of the data, it generates the schema. Glue has a number of built-in classifiers for formats such as JSON, CSV and many database formats. If your Crawlers finish without determining a schema and no tables are created, you may need to create your own custom classifier.
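As a rough sketch of the custom classifier scenario above, the following builds the request parameters for a custom CSV classifier and for a crawler that uses it, in the shape expected by the boto3 Glue client (`create_classifier`, `create_crawler`, `start_crawler`). The classifier and crawler names, IAM role ARN, database name, S3 path and column headers are all hypothetical placeholders. The AWS calls sit in a separate function because they need boto3 and valid credentials.

```python
def build_classifier_request(name, delimiter, header):
    """Parameters for glue.create_classifier: a custom CSV classifier."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": delimiter,
            # Assumption: the files have no header row, so column names are supplied.
            "ContainsHeader": "ABSENT",
            "Header": header,
        }
    }


def build_crawler_request(name, role_arn, database, s3_path, classifier_name):
    """Parameters for glue.create_crawler: crawl one S3 path into one database."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Custom classifiers are tried before the built-in ones.
        "Classifiers": [classifier_name],
    }


def create_and_start(classifier_request, crawler_request):
    """Send the requests to AWS (requires boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_classifier(**classifier_request)
    glue.create_crawler(**crawler_request)
    glue.start_crawler(Name=crawler_request["Name"])


# Hypothetical example values, for illustration only.
classifier_request = build_classifier_request("pipe-csv", "|", ["id", "amount"])
crawler_request = build_crawler_request(
    "sales-crawler",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "sales_db",
    "s3://example-bucket/sales/",
    "pipe-csv",
)
```

Calling `create_and_start(classifier_request, crawler_request)` would register the classifier, create the crawler and kick off a first crawl.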

Glue Data Catalog → Acts as a persistent metadata store for your datasets. It includes databases, tables, crawlers, classifiers, connections and a schema registry. The Data Catalog is a Hive-compatible metastore.

Key features

Unifies and searches data across different data sources
Automatically infers data schemas during data discovery
Lets you visually transform data
Lets you invoke Glue jobs on a schedule, on demand or in response to an event, allowing you to build complex ETL pipelines
Automatically scales based on the workload
Lets you monitor your jobs and gain insights into them
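To illustrate two of the ways of invoking a Glue job mentioned in the list above, here is a minimal sketch that builds the parameters for a scheduled trigger (`create_trigger`) and for an on-demand run (`start_job_run`) using the boto3 Glue client. The job name, trigger name, schedule and job argument are assumptions made up for the example.

```python
def build_scheduled_trigger(trigger_name, job_name, cron_expression):
    """Parameters for glue.create_trigger: run a job on a schedule."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def build_on_demand_run(job_name, arguments):
    """Parameters for glue.start_job_run: run a job immediately."""
    return {"JobName": job_name, "Arguments": arguments}


def submit(trigger_request, run_request):
    """Send the requests to AWS (requires boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_trigger(**trigger_request)
    return glue.start_job_run(**run_request)["JobRunId"]


# Hypothetical job: run nightly at 02:00 UTC, and once right now.
trigger_request = build_scheduled_trigger(
    "nightly-clean", "clean-sales-data", "cron(0 2 * * ? *)"
)
run_request = build_on_demand_run(
    "clean-sales-data", {"--input_path": "s3://example-bucket/sales/"}
)
```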

Elastic MapReduce (EMR)

EMR is a managed cluster platform for running big data frameworks such as Apache Hadoop across multiple EC2 instances. It is a big data solution for petabyte-scale data processing, and is useful for machine learning using open-source frameworks such as Apache Spark, Apache Hive and Presto.

An EMR cluster is made up of EC2 instances called nodes. Each node has a role within the distributed application (primary, core or task), and each role has different software components installed on it.

EMR allows you to transform and move large volumes of data between AWS data stores and databases, e.g. from S3 to DynamoDB. You can provision as much capacity as you need and scale up or down as requirements change. EMR also monitors the nodes within the cluster and automatically terminates and replaces an instance if it fails.
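As an illustration of what provisioning a cluster can look like, the sketch below builds the parameters for the boto3 EMR client's `run_job_flow` call, with one primary node and a configurable number of core nodes. The cluster name, log bucket, release label and instance types are assumptions chosen for the example, and the two roles are the EMR default role names, which must already exist in the account.

```python
def build_cluster_request(name, log_uri, core_count):
    """Parameters for emr.run_job_flow: a small Spark/Hive/Presto cluster."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one that suits you
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}, {"Name": "Presto"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster running for interactive use; terminate it when done.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(cluster_request):
    """Start the cluster on AWS (requires boto3 and credentials)."""
    import boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**cluster_request)["JobFlowId"]


# Hypothetical cluster with three core nodes.
cluster_request = build_cluster_request(
    "ml-preprocessing", "s3://example-bucket/emr-logs/", core_count=3
)
```

Changing `core_count` is the simplest way to size the cluster in this sketch; scaling a running cluster is done through separate EMR APIs that are not shown here.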

If your machine learning project uses petabyte-scale data and needs to run frameworks such as Apache Spark, Apache Hive or Presto, then you should consider EMR. It can be used to run statistical algorithms and predictive models that uncover hidden market trends and preferences.

Athena

Athena is a serverless interactive query service that allows you to run SQL queries on petabyte-scale data read directly from its source, e.g. S3. It can work with both structured and unstructured data, and often removes the need for complex ETL. You can use Athena with AWS Glue to create more sophisticated Data Catalogs, using features like partition recognition, automated schema inference and a central metadata repository shared across multiple services.
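A minimal sketch of running such a query from Python might look like the following. It uses the boto3 Athena client's `start_query_execution`, `get_query_execution` and `get_query_results` calls; the database, table and results bucket are hypothetical, and the table is assumed to already exist in the Glue Data Catalog.

```python
import time


def build_query_request(sql, database, output_location):
    """Parameters for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, poll_seconds=1.0):
    """Run the query on AWS and wait for it (requires boto3 and credentials)."""
    import boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    # First page of results only; larger result sets need pagination.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


# Hypothetical query against a table catalogued by a Glue Crawler.
query_request = build_query_request(
    "SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
    "sales_db",
    "s3://example-bucket/athena-results/",
)
```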

Using federated queries, Athena can run a single query that analyses data from a range of different sources, including relational and non-relational databases, object storage and on-premises data sources. It is very performant, as it automatically executes queries in parallel.

You can also call Machine Learning models hosted in SageMaker from within an Athena SQL query, allowing you to identify anomalies, group data into cohorts or make predictions.
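To give a feel for what this looks like, the sketch below assembles a query in the `USING EXTERNAL FUNCTION ... SAGEMAKER` form that Athena documents for calling a SageMaker endpoint. The function name, endpoint name, table and columns are hypothetical, all inputs and the output are assumed to be of type DOUBLE, and the exact syntax should be checked against the Athena documentation for your engine version.

```python
def build_ml_query(function_name, endpoint_name, table, feature_columns):
    """Build an Athena SQL query that scores each row with a SageMaker model.

    Values are interpolated directly into the SQL text, so only pass
    trusted identifiers (this sketch does no escaping).
    """
    typed_args = ", ".join(f"{column} DOUBLE" for column in feature_columns)
    call_args = ", ".join(feature_columns)
    return (
        f"USING EXTERNAL FUNCTION {function_name}({typed_args})\n"
        f"RETURNS DOUBLE\n"
        f"SAGEMAKER '{endpoint_name}'\n"
        f"SELECT {call_args}, {function_name}({call_args}) AS prediction\n"
        f"FROM {table}"
    )


# Hypothetical churn model hosted on an endpoint named 'churn-endpoint'.
ml_sql = build_ml_query(
    "predict_churn", "churn-endpoint", "customers", ["tenure", "monthly_spend"]
)
```

The resulting string can be submitted to Athena like any other SQL query.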

Athena also integrates directly with Amazon QuickSight, allowing you to quickly create visualisations of your data.