Chloe McAree (McAteer)
AWS ML Specialty Exam - Data Streaming With Kinesis

This blog is part of a series giving a high-level overview of machine learning theory and the AWS services examined on the AWS Machine Learning Specialty exam. To view the whole series, click here.

Amazon Kinesis is a suite of managed services for collecting, processing, and analysing real-time streaming data. It is designed for ingesting, storing, and processing large volumes of data from various sources, making it useful for applications that require real-time analytics, such as IoT data streams, log data, and clickstreams.

Kinesis can be a valuable component in machine learning solutions, particularly in scenarios where real-time data processing and analysis are required.

Kinesis Data Streams

Kinesis Data Streams allows you to collect and process real-time data streams. It receives data from producers, whose payloads are typically some form of JSON. Streams carry data using shards, and each shard provides a fixed amount of throughput. Multiple consumers can read the same stream and each perform different processing.

Shards

At the core of Amazon Kinesis Data Streams are shards. These shards serve as containers, holding the data you want to send to AWS.

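Every record carries a partition key, which Kinesis hashes to decide which shard the record is written to, so many partition keys can map to the same shard. Within a shard, records are ordered by their sequence numbers.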
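As a rough illustration of how a partition key ends up on a particular shard: Kinesis hashes the key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below assumes the hash space is split into equal ranges, which is a simplification, and the function names are made up for this example.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: the hash space is divided into `shard_count` equal ranges.
    Real streams can have uneven ranges after re-sharding.
    """
    range_size = HASH_SPACE // shard_count
    index = hash_key(partition_key) // range_size
    # Clamp, because integer division can leave a small remainder at the top.
    return min(index, shard_count - 1)


if __name__ == "__main__":
    for key in ["sensor-1", "sensor-2", "sensor-1"]:
        print(key, "-> shard", shard_for_key(key, shard_count=4))
```

The same key always lands on the same shard, which is what preserves ordering per key. It also means a single very busy key can overload one shard.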

Each shard consists of a sequence of data records, which can be ingested at up to 1,000 records per second (or 1 MB per second) per shard. A data record is the fundamental unit of captured data, comprising a sequence number, partition key, and data blob, which can be as large as 1 MB.

Notably, shards are transient data stores: the retention period for data records can be set anywhere from 24 hours to 365 days, with the default being 24 hours.

A data stream can have from 1 to 500 shards by default; the upper limit is a soft account quota that varies by region and can be raised.
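To get a feel for how many shards a workload needs, here is a small back-of-the-envelope calculation. It assumes the documented per-shard write limits of 1,000 records per second and 1 MB per second (treated as 1 MiB here); the function name and the example numbers are illustrative only.

```python
import math

# Per-shard write limits, taken as assumptions from the Kinesis quotas.
MAX_RECORDS_PER_SECOND = 1_000
MAX_BYTES_PER_SECOND = 1024 * 1024  # 1 MB, treated as 1 MiB in this sketch


def shards_needed(records_per_second: float, avg_record_bytes: int) -> int:
    """Estimate the shard count required to absorb a given write load."""
    by_count = math.ceil(records_per_second / MAX_RECORDS_PER_SECOND)
    by_bytes = math.ceil(
        records_per_second * avg_record_bytes / MAX_BYTES_PER_SECOND
    )
    # A stream always has at least one shard; the tighter limit wins.
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # Many small records: the record-count limit dominates.
    print(shards_needed(records_per_second=5_000, avg_record_bytes=100))
    # Fewer large records: the byte-rate limit dominates.
    print(shards_needed(records_per_second=500, avg_record_bytes=10_240))
```

This only covers writes. Read throughput and uneven partition keys can push the real number higher.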

Interacting with Shards

To interact with Kinesis Data Streams, you have several options:

  1. Kinesis Producer Library (KPL): This library simplifies writing data to Kinesis Data Streams. It provides an easy-to-use interface and includes features like automatic and configurable retry mechanisms for reliable data ingestion.
  2. Kinesis Client Library (KCL): The KCL is the consumer-side counterpart to the KPL and integrates with it directly (for example, it understands records aggregated by the KPL), allowing consumer applications to consume and process data from Kinesis Data Streams efficiently.
  3. Kinesis API (AWS SDK): For low-level API operations, the AWS SDK is the go-to choice. It enables you to send records to Kinesis Data Streams but involves more manual handling of stream creation, re-sharding, and record management.
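As a minimal sketch of the SDK route, the function below batches events and sends them with the PutRecords operation. The client is passed in as a parameter so the example can be read without AWS credentials; the stream name, the `device_id` field, and the helper names are assumptions made for this example.

```python
import json
from typing import Any, Dict, List

MAX_BATCH = 500  # PutRecords accepts at most 500 records per request


def to_kinesis_records(
    events: List[Dict[str, Any]], key_field: str
) -> List[Dict[str, Any]]:
    """Turn plain dicts into the record shape PutRecords expects."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


def send_events(
    client: Any,
    stream_name: str,
    events: List[Dict[str, Any]],
    key_field: str = "device_id",  # hypothetical field used as partition key
) -> int:
    """Send events in batches and return how many records were rejected."""
    records = to_kinesis_records(events, key_field)
    failed = 0
    for start in range(0, len(records), MAX_BATCH):
        batch = records[start:start + MAX_BATCH]
        response = client.put_records(StreamName=stream_name, Records=batch)
        # Rejected records are reported in the response, not raised as errors.
        failed += response.get("FailedRecordCount", 0)
    return failed
```

With boto3 installed and credentials configured, the client would typically be created with boto3.client("kinesis"). A real producer would also retry the rejected records, which is one of the things the KPL handles for you.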

It's important to note that Kinesis Data Streams does not deliver data to storage by itself: a consumer has to read the records from the stream and write them to the destination.
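Because the stream does not write to storage itself, something has to read the records out. The sketch below reads one shard with the low-level GetShardIterator and GetRecords operations and hands each record to a callback, which is where a write to storage would go. The client is injected, and the function name, `max_calls`, and the callback are assumptions made for this example.

```python
from typing import Any, Callable, Dict


def drain_shard(
    client: Any,
    stream_name: str,
    shard_id: str,
    handle: Callable[[Dict[str, Any]], None],
    max_calls: int = 10,
) -> int:
    """Read a shard from its oldest record and pass each record to `handle`.

    Stops after `max_calls` requests or when the shard has no next iterator.
    Returns the number of records processed.
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest record
    )["ShardIterator"]

    processed = 0
    for _ in range(max_calls):
        if iterator is None:
            break  # the shard has been closed and fully read
        response = client.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            handle(record)  # e.g. buffer the record and write it to storage
            processed += 1
        iterator = response.get("NextShardIterator")
        # A real consumer would pause here to respect per-shard read limits.
    return processed
```

In practice the KCL takes care of this loop, along with checkpointing and spreading shards across workers.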

When to use Kinesis Data Streams

  • Real-time data needs to be processed by consumers.
  • Real-time analytics are a priority.
  • Data needs to be fed into other services in real time.
  • Actions must be triggered based on incoming data.
  • Storage is optional but data retention is critical.

Amazon Kinesis Data Firehose

Kinesis Data Firehose simplifies the process of loading streaming data into AWS services such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). It can automatically transform and compress data before delivering it to the destination.

If you need a streamlined way to get data from data producers (e.g. EC2 instances or servers) to storage without worrying about shards, consider Kinesis Data Firehose. It allows you to pre-process data using AWS Lambda functions, acting as an ETL service, right before it lands in the data store. This service is primarily used when direct streaming to storage is a priority.
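To make the Lambda pre-processing step more concrete, here is a sketch of a transformation function. Firehose passes the function a batch of base64-encoded records and expects each one back with its recordId, a result of Ok, Dropped, or ProcessingFailed, and the (possibly modified) data. The `temperature` field and the Fahrenheit conversion are invented for this example.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform a batch of Firehose records before delivery."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Not valid base64/JSON: report the record as failed, unchanged.
            output.append({"recordId": record_id,
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue

        # Hypothetical rule: drop anything without a numeric temperature.
        if not isinstance(payload, dict) or not isinstance(
            payload.get("temperature"), (int, float)
        ):
            output.append({"recordId": record_id,
                           "result": "Dropped",
                           "data": record["data"]})
            continue

        # Example enrichment: add the temperature in Fahrenheit.
        payload["temperature_f"] = payload["temperature"] * 9 / 5 + 32
        # A trailing newline keeps records separated in the delivered file.
        data = base64.b64encode(
            (json.dumps(payload) + "\n").encode("utf-8")
        ).decode("utf-8")
        output.append({"recordId": record_id, "result": "Ok", "data": data})

    return {"records": output}
```

Every incoming recordId has to appear in the response, which is why dropped and failed records are still returned rather than skipped.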

Kinesis Video Streams

Kinesis Video Streams is designed for real-time video and audio streaming. It's often used for applications like security cameras, video analytics, and video content delivery.

Data from producers is sent to AWS, where continuous or batch consumers (e.g. EC2 instances) process the data. Data is captured in fragments and frames, making it easier to view, process, and analyse.

Kinesis Data Analytics

Kinesis Data Analytics lets you analyse streaming data in real time. You can use SQL queries to process data, and it's particularly useful for applications like anomaly detection and aggregation of streaming data. It can take either Kinesis Data Streams or Kinesis Data Firehose as its source.
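In Kinesis Data Analytics this kind of aggregation is written in SQL. Purely to illustrate the idea of aggregating a stream over fixed time windows, here is a plain Python sketch of a tumbling-window average. It is not the service's API, and the event shape (timestamp, key, value) is an assumption made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[float, str, float]  # (timestamp in seconds, key, value)


def tumbling_window_averages(
    events: Iterable[Event], window_seconds: int = 60
) -> Dict[Tuple[int, str], float]:
    """Average values per key over fixed, non-overlapping time windows."""
    totals = defaultdict(lambda: [0.0, 0])  # (window, key) -> [sum, count]
    for timestamp, key, value in events:
        # Every event belongs to exactly one window, identified by its start.
        window_start = int(timestamp // window_seconds) * window_seconds
        bucket = totals[(window_start, key)]
        bucket[0] += value
        bucket[1] += 1
    return {k: total / count for k, (total, count) in totals.items()}


if __name__ == "__main__":
    readings = [(0, "a", 10), (30, "a", 20), (61, "a", 30), (59, "b", 5)]
    for window, average in sorted(tumbling_window_averages(readings).items()):
        print(window, average)
```

An unusually high or low window average compared with its neighbours is a simple starting point for the anomaly detection use case mentioned above.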

Which Kinesis service to use?

Amazon Kinesis is a versatile suite of services, each catering to specific real-time data needs. Below are some use cases of when to use each service:

Kinesis Data Firehose → If you need to stream data directly to storage, you should use Kinesis Data Firehose.

Kinesis Video Streams → If you need to stream real-time video, you should use Kinesis Video Streams.

Kinesis Data Streams → If you need to stream and transform large amounts of data, you should use Kinesis Data Streams.

Kinesis Data Analytics → If you need to run SQL queries on real-time data, you should use Kinesis Data Analytics.